A robots.txt file is a text file that webmasters use to instruct search engine robots (often called spiders or crawlers) on how to crawl their website’s pages.
The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that govern how robots crawl the web, access and index content, and serve that content up to users.
The REP is important because it helps webmasters manage how search engine bots (such as Googlebot) access their website’s pages and ultimately helps those same webmasters determine whether their pages appear in the search engine results pages (SERPs).
Search engines have two main jobs:
- Crawling – visiting websites and gathering information about their pages.
- Indexing – adding the content gathered during crawling to the search engine’s index so it can be served in search results.
Robots.txt tells search engine crawlers which pages they are allowed to visit. This helps prevent them from overloading your website with too many requests.
A robots.txt file is used to give web robots instructions about a site; this convention is called the Robots Exclusion Protocol.
The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.
Not all robots cooperate with the standard. Email harvesters, spambots, malware, and robots that scan for vulnerabilities do not follow the standard. Web crawlers mostly follow the protocol; well-behaved web crawlers will honor the directives in a robots.txt file.
In addition, conforming crawlers will usually indicate to the webmaster whether they obeyed or ignored the directives, via their log files or other communication channels. Non-conforming web crawlers may ignore the directives completely.
Robots.txt is important because it helps prevent overloading your website with too many requests. It also helps you control which other pages are indexed by the search engines.
There are three primary reasons to employ a robots.txt file.
- Block Non-Public Pages
- Maximize Crawl Budget
- Prevent Indexing of Resources
You might want to keep specific pages out of the search index. For example, you could have a staging version of a page that you don’t want the public to find. In this instance, robots.txt can be used to discourage crawlers and bots from visiting these pages. (Note that a page blocked in robots.txt can still be indexed if other sites link to it; for a guaranteed block, use the noindex meta tag described later.)
Your website has a crawl budget. This is the number of pages search engine bots will crawl on your site in a given period.
You want to make sure that your website is easy to crawl so that you don’t waste this valuable resource. By using robots.txt, you can tell search engines which pages are most important.
Your website has resources like images and PDFs. You don’t want these to show up in Google search results because they are not web pages.
With robots.txt, you can prevent search engines from crawling these resources, and you can monitor the results in Google Search Console.
When search engines and other crawling robots (such as Facebook’s crawler, Facebot) visit a website, they know to look for a robots.txt file. They’ll only look for it in one place: the main directory (typically the root of your domain).
If a user agent goes to www.example.com/robots.txt and does not discover a robots file there, it will assume the site does not have one and proceed with crawling the page (and perhaps even the whole website) anyway.
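This lookup-and-check behavior can be sketched with Python’s standard-library `urllib.robotparser` module. The rules and URLs below are illustrative only:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; a real crawler would fetch them from
# https://www.example.com/robots.txt before crawling the site.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved crawler checks each URL against the rules first.
print(parser.can_fetch("*", "https://www.example.com/index.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/private/x"))   # False
```

A real crawler would call `parser.set_url(...)` and `parser.read()` to fetch the live file instead of parsing hard-coded lines.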
Always place your robots.txt file in the main directory or root domain to ensure that it is found.
Creating and editing your robots.txt file is easy. You can do it with any text editor, even Notepad on Windows.
Just open up a new document, insert the necessary information (discussed below), and then save the file as “robots.txt” in the root directory of your site.
(The main directory is the top-most folder where all your website’s files are stored. It is also sometimes called the “public_html” or “www” directory.)
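The simplest robots.txt allows everything (an empty Disallow value means nothing is blocked):

```
User-agent: *
Disallow:
```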
The above code tells all robots that they are welcome to visit every page on your site.
If you want to block all robots from your site, you would use the following code:
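```
User-agent: *
Disallow: /
```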
This tells all user agents not to visit any pages on your site.
Of course, you don’t want to block all robots from your site, as that would defeat the purpose of having a website in the first place.
What you really want to do is block only certain user agents or directories that you don’t want to be indexed and initiate a crawl delay.
To block specific user agents, simply replace the * character in the user-agent line with the name of the user agent you want to block. For example, if you wanted to block only Google’s crawler, you would use the following code:
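```
User-agent: Googlebot
Disallow: /
```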
To block specific directories, put the directory’s path after the Disallow directive.
For example, if you wanted to block only the /images/ directory, you would use the following code:
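```
User-agent: *
Disallow: /images/
```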
Remember, you can block multiple user agents and directories by including multiple lines in your robots.txt file.
Just make sure to put each user-agent on its own line, followed by the Disallow directive (or directives) for that user-agent. For example:
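```
User-agent: Googlebot
Disallow: /images/

User-agent: Bingbot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
```

(The user agents and paths here are illustrative. The Crawl-delay directive asks a crawler to wait between requests; not all crawlers, including Googlebot, honor it.)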
Finally, remember that robots.txt is a public file, so don’t include any sensitive information in it that you don’t want to be made public.
For example, don’t put your email address or password in the file, as anyone who views the file will be able to see that information.
You’ll need to generate your robots.txt file first. It’s a plain text file, so you can create one with Windows Notepad. However you create your robots.txt file, the format is always the same:
First, you specify which User-Agent(s) you want to target.
You can use an asterisk as a wildcard to match all user agents.
Next, on the following line, you specify what action to take.
This will usually be “Disallow”, which tells the user agent not to crawl a specific URL, file, or directory.
You may also use “Allow” to override previous “Disallow” directives.
Finally, you end the entry with a blank line.
If you want to make another entry, you simply start the process over again from step 1.
Here’s an example of a robots.txt file:
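```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
```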
In the above example, we’re targeting all user agents with the asterisk wildcard. We then tell the user agent not to index the /cgi-bin/, /tmp/, and /~joe/ directories.
Make Your Robots.txt File Easy to Find
Make sure that your robots.txt file is located in the root directory of your website.
The root directory is also sometimes called the “public_html” or “www” directory.
If you put your robots.txt file anywhere else, search engines will not find it, and your directives will be ignored.
When creating your robots.txt file, make sure to follow the guidelines set forth by the major search engines.
You can find these guidelines at the following links:
While it’s tempting to just block all user agents from indexing your entire site, doing so will result in your site not being indexed at all.
Therefore, only block user agents and directories that you don’t want to be indexed. And even then, be very careful about what you block.
For example, if you accidentally block your home page from being indexed, your entire site will disappear from the search results.
After you create and upload your robots.txt file, make sure to monitor your website’s traffic to ensure that the file is working as intended.
If you notice a sudden drop in traffic after creating or modifying your robots.txt file, it’s likely that you’ve made a mistake in the file.
In that case, you’ll need to edit the file and re-upload it to your server.
In addition to using robots.txt, you can also use the Robots Meta Tag on individual pages of your website to control how search engine spiders index those pages.
The Robots Meta Tag is a piece of HTML code that goes in the <head> section of your web page. It looks like this:
<meta name="robots" content="index, follow">
The “content” attribute tells search engine spiders what to do with that particular page.
In the above example, the page will be indexed (“index”) and all links on the page will be followed (“follow”).
Here are the other values that can go in the “content” attribute:
- noindex – This value tells search engine spiders not to index the page.
- nofollow – This value tells search engine spiders not to follow any links on the page.
- none – This value is the same as “noindex, nofollow”.
- noarchive – This value tells search engine spiders not to save a cached copy of the page.
- nosnippet – This value tells search engine spiders not to show a description of the page in search results.
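For example, to keep a particular page out of search results while still letting crawlers follow its links, you would use:

```
<meta name="robots" content="noindex, follow">
```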
The robots.txt file is a very important part of SEO.
Make sure that you create and upload a robots.txt file to your server if you want search engines to index your site.
And be very careful about what you include in the file.
If you make a mistake, it could result in your entire site being removed from search engine results, cutting off your organic search traffic. Hopefully, our guide has helped you learn everything you need to know about the robots.txt file.
If you have any questions, feel free to post a comment.