What's the best way to write robots.txt for GitHub Pages using multiple repos?


Problem:

I am using GitHub Pages to build my personal website with Jekyll. I have a head site in the username.github.io repo, the project A site in the projectA repo, the project B site in the projectB repo, and so on. I have put a CNAME file in the username.github.io repo so that all of my sites are served under a custom domain name (www.mydomain.com). I have noticed that when the robots.txt file points to the sitemap.txt file in each repo, that sitemap.txt can only contain links for the pages in that one repo. So, I have a couple of questions:

  1. Since my site is structured as www.mydomain.com, www.mydomain.com/projectA, www.mydomain.com/projectB and so on corresponding to the pages in single repos, will the search engine recognize all of my site pages even though the sitemap.txt under the username.github.io head repo only has the page links generated in the single repo?

  2. What is the best way to write the robots.txt file in my case?

Thanks! Qi


Solution:

Standards and disclaimer

The Sitemap: directive in robots.txt is a nonstandard extension, according to Wikipedia. Remember that:

Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Wikipedia also lists allow: as a nonstandard extension.

Multiple sitemaps in robots.txt

You can specify more than one Sitemap file per robots.txt file. When specifying more than one sitemap in robots.txt this is the format:

Sitemap: http://www.example.com/sitemap-host1.xml
Sitemap: http://www.example.com/sitemap-host2.xml
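
For a multi-repo setup like the one in the question, the Sitemap lines can be generated rather than typed by hand. The following is only a minimal sketch: the function name `sitemap_lines` is hypothetical, and it assumes every project repo publishes its sitemap at `/<project>/sitemap.xml` under the shared domain.

```python
def sitemap_lines(base_url, project_names):
    """Return one 'Sitemap:' line for the root site and one per project site."""
    base = base_url.rstrip("/")
    # The user page repo (username.github.io) serves the domain root.
    urls = [f"{base}/sitemap.xml"]
    # Assumption: each project repo is served at /<project>/ and has its own sitemap.xml.
    urls += [f"{base}/{name}/sitemap.xml" for name in project_names]
    return [f"Sitemap: {url}" for url in urls]


lines = sitemap_lines("http://www.example.com", ["projectA", "projectB"])
print("\n".join(lines))
```

The printed lines can be pasted into the robots.txt file of the user page repo.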

An index of sitemaps

There is also a type of sitemap file that is an index of sitemap files.

If you have a Sitemap index file, you can include the location of just that file. You don't need to list each individual Sitemap listed in the index file.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

<lastmod> is optional.
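
A sitemap index of this shape could also be produced with a short script. This is a sketch under stated assumptions, not a required tool: `build_sitemap_index` is a hypothetical helper, the URLs are the example ones used in this answer, and the optional `<lastmod>` element is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a string) listing the given sitemap URLs."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
        # <lastmod> is optional, so it is omitted in this sketch.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


index_xml = build_sitemap_index([
    "http://www.example.com/sitemap.xml",
    "http://www.example.com/projectA/sitemap.xml",
    "http://www.example.com/projectB/sitemap.xml",
])
print(index_xml)
```

The output is written on a single line after the XML declaration, which is still valid XML.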

About excluding content

The Sitemaps protocol enables you to let search engines know what content you would like indexed. To tell search engines the content you don't want indexed, use a robots.txt file or robots meta tag. See robotstxt.org for more information on how to exclude content from search engines.

If there is content you do not want search engines to crawl, list it in the robots.txt file. That file has to sit at the root of the domain, so it belongs in the User Page repository (username.github.io):

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Alternatively, you can use the robots meta tag on individual pages.
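
Before publishing, the Disallow rules can be checked locally with Python's standard `urllib.robotparser` module. This is a minimal sketch: the helper name `allowed` and the tested paths are made up for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path):
    """Return True if a generic crawler ('*') may fetch the given path."""
    return parser.can_fetch("*", "http://www.example.com" + path)


for path in ["/projectA/", "/project_to_disallow/index.html",
             "/projectname/page_to_disallow.html", "/projectname/other.html"]:
    print(path, allowed(path))
```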

Suggestions

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/projectA/sitemap.xml
Sitemap: http://www.example.com/projectB/sitemap.xml

or, if you are using a sitemap index file:

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/siteindex.xml

where http://www.example.com/siteindex.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectA/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectB/sitemap.xml</loc>
   </sitemap>
</sitemapindex>

For info on how to set up robots.txt with GitHub Pages, see my answer here.
