Robots.txt: All You Need to Know

Robots.txt is the standard means by which websites tell search engine spiders and other crawlers which pages are open to scanning and which ones are off-limits. Also known as the robots exclusion standard or robots exclusion protocol, it’s used by most websites today and honored by most web crawlers.

The protocol is often used on websites that are undergoing development or on webpages that are not meant for public access. In search engine optimization (SEO), robots.txt plays a major role in optimizing the crawling and indexing process of search engines.

 

History

The robots.txt protocol was reportedly proposed by internet pioneer and Aliweb creator Martijn Koster. He did so in early 1994 while working for Nexor, a Nottingham-based computer security and defense company. British writer Charlie Stross claims to have pushed Koster into suggesting the protocol when he wrote a badly behaved crawler that caused problems on the latter’s servers.

Thanks to its simplicity and usefulness, many websites and early search engines adopted robots.txt. To this day, modern search engines such as Google, Bing, Yahoo and others still respect the protocol and stay off pages that it forbids access to.

For SEOs, robots.txt became an integral part of the optimization process as the community became more aware of concepts such as link equity flow and crawl budget. Today, experienced SEOs rely on the protocol to hard-block bots from dynamic pages, admin pages, checkout pages and other similar web documents.

Not all crawlers, however, adhere to the standard. Spambots, content scrapers, email scrapers, malware and hacking software all disregard the robots.txt file and its instructions. In some cases, malicious crawlers even prioritize crawling the pages that the robots.txt file forbids.

Archiving organizations such as Archive Team and the Internet Archive also ignore the standard, viewing it as an obsolete protocol oriented more towards search engines. Archiving groups generally aim to preserve a record of how the Internet evolves, and their founders seem to view robots.txt as a hindrance to chronicling Web history.

 

Usage

A robots.txt file is usually uploaded to a website’s root directory. Most bots are programmed to look for it at www.example.com/robots.txt. For most bots, not finding a valid robots.txt file in that location is a signal that every page on the site is open for scanning, even if the file has, in fact, been uploaded to another address.

Creating a robots.txt file is as easy as writing the directives in a plain text editor such as Notepad and saving the file as robots.txt. You will then need to upload it via FTP or cPanel to the root domain directory, and you should be all set. More modern CMS platforms and SEO plugins sometimes create a robots.txt file automatically, so you can simply open it and make edits as needed.
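For reference, here is a minimal sketch of what such a file can look like; example.com is just a placeholder domain, and an empty Disallow value leaves the whole site open to crawling:

```
# Minimal robots.txt, served at www.example.com/robots.txt
User-agent: *
Disallow:
```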

The following are the most common uses of robots.txt files:

  • Indexing Denial – Among all the reasons robots.txt is used, this is the most common. Webmasters will want to prevent search engine bots from crawling and indexing pages that aren’t relevant to the experiences that they want searchers to have. Examples of these pages include staging environments, internal search results, user-generated content, PDFs, filter-generated pages and more.
  • Crawl Budget Conservation – Larger websites with thousands of pages don’t usually get all their pages crawled when Googlebot visits. To increase the chances that the important pages are scanned, some SEOs block crawler access to unimportant pages.

Frequent and regular crawls of organic traffic landing pages mean that the optimizations you apply will have an impact on the SERPs sooner rather than later. They also mean that the pages your landing pages link to, whether internal or external, will benefit from link equity transfer faster.

  • Link Equity Flow Optimization – Springboarding off the previous point, the robots.txt protocol is also useful for optimizing the flow of link equity to a site’s pages. By sealing off crawler access to unimportant pages, internal link equity is retained in organic traffic landing pages in a more potent state. This means that your site’s ranking power becomes more concentrated in the pages that really matter, allowing you to rank higher in the SERPs and draw more organic traffic despite having fewer indexed pages.
  • Sitemap Listing – The robots.txt file can also be used to tell search engines where they can find the site’s XML sitemap. This is optional since you can submit your XML sitemap to Google Search Console and get the same results, but it doesn’t hurt to have it there either.
  • Security – Some pages are not meant to be found by non-administrators of a website, much less get listed in search engines. Login pages to the site’s back end and staging pages are prime examples. The more confidential these pages remain, the lower your risk of being attacked.
  • Specifying Crawl Delays – Large websites such as ecommerce sites and wikis sometimes set content live in bulk. There are instances when bots pick up on the additions quickly and rush to scan all the new URLs. When this happens, the sheer volume of pages being scanned at the same time can stress servers and trigger slowdowns or downtimes. This can be prevented by writing the robots.txt instructions in such a way that bots crawl new pages incrementally, giving the servers some breathing room (see the sketch after this list).
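To tie these uses together, below is a rough sketch of how they might appear in a single file. The blocked paths, the ten-second delay and the sitemap URL are illustrative assumptions, not recommendations for any particular site:

```
User-agent: *
# Indexing denial / crawl budget conservation: keep bots out of
# internal search results and filter-generated pages
Disallow: /search/
Disallow: /filter/

# Crawl delay: ask supporting bots to wait 10 seconds between requests
# (Googlebot ignores this directive; Bing and Yandex honor it)
Crawl-delay: 10

# Sitemap listing: point crawlers to the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```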

 

Writing and Formatting

The Robots exclusion protocol has a very basic “language” that even non-coders can learn in a very short amount of time. It mostly involves specifying which crawlers you are addressing and what restrictions you’re setting on them. These are the basic common terms that you need to be familiar with to write a functional robots.txt file:

  • User-agent: Denotes the name of the web crawler that you’re addressing. That would be Googlebot for Google’s organic crawler, Bingbot for Bing, Rogerbot for Moz and so on. The wildcard character “*” can be used to address all crawlers.
  • Disallow: This instruction is followed by a directory path such as /category to tell bots not to crawl every URL in that section of the website. Individual URLs such as /category/sample-page.html may also be kept out of bot reach with the disallow instruction.
  • Crawl-delay: This tells bots how many seconds they should wait between requests. The value often varies depending on the size of the website and the capacity of its servers.
  • Sitemap: This indicates the file location of the site’s XML sitemap.

 

Suppose you’re the administrator of a WordPress site and you want to make sure that your back-end pages and dynamic pages are never included in any search engine’s SERPs. Your robots.txt file would likely say something like:

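```
User-agent: *
Disallow: /wp-admin/
Disallow: /*?*
```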

The first line addresses all crawlers with the asterisk, while the second line specifies that all pages with URLs that contain /wp-admin are off-limits. The third line, on the other hand, tells bots not to crawl any page with a question mark in its URL. Question marks and equal signs are characters commonly found in dynamic URLs.


Note that you don’t have to include the root domain when specifying pages and directories that you want to block access to. The URL slug or file path will suffice.
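In other words, a path-only rule like the one below is all a crawler needs; the full https://www.example.com/… form is unnecessary (example.com being a placeholder domain):

```
# Correct: the path relative to the root domain is enough
Disallow: /category/sample-page.html
```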

Best Practices and Side Notes

There are several best practices to follow to ensure that your robots.txt configuration has a positive impact on your SEO campaign and overall user experience. Here are some of the most important:

  • Never upload your robots.txt file anywhere except your root directory. You also shouldn’t name it anything other than the default file name. If bots can’t find it in www.example.com/robots.txt, they might not be able to find it at all, leading them to assume that every page on your site is open to crawling.
  • The file name is case sensitive. Most crawlers will treat a file named robots.txt differently than one named Robots.TXT. Make sure to use the all-lowercase version for best results.
  • As mentioned previously, malicious crawlers sometimes prioritize the crawling of URLs that robots.txt files block in hopes of finding entry points to the site’s back end. For security purposes, use the “noindex” meta tag instead of robots.txt to prevent the indexing of pages that you want to keep private.
  • In some cases, webmasters write disallow instructions that inadvertently block bot access to CSS and JavaScript files. These resources need to be crawlable so search engines can render and index the website properly. When blocking off entire directory paths with robots.txt, make sure that these files remain accessible to search engine bots.
  • Pages that are sealed off from bot access will not pass link equity to the internal and external pages they link to. If you want a webpage off the index but you’re intent on letting it pass link equity, use the “noindex,follow” meta directive instead (see the example after this list).
  • Subdomains are viewed by most search engines as entirely different sites. That means the robots.txt file on the root domain will not be followed in its subdomains. In effect, www.glendemands.com must have a separate robots.txt file from a subdomain like blog.glendemands.com.
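As a quick illustration of the meta directives mentioned above, the tags below would go in the <head> of the page itself rather than in robots.txt; this is a generic sketch rather than site-specific markup:

```
<!-- Keep the page out of the index entirely -->
<meta name="robots" content="noindex">

<!-- Keep the page out of the index but let crawlers follow its links,
     so link equity still flows to the pages it points to -->
<meta name="robots" content="noindex,follow">
```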

You can also test your robots.txt file’s health through Google Search Console’s robots.txt Tester tool. Just go to your property and open the robots.txt Tester under the Crawl section. If the file is working as it should, the tool will report that the file is valid, with no errors or warnings.


Glen Dimaandal
Glen Dimaandal is the founder and CEO of SearchWorks.Ph. He has been doing SEO since 2008 and is consistently featured in mainstream media and industry conferences. His core skills include SEO, SEM, data analytics and business development.