What is a Robots.txt File?

The robots.txt file is a text file that webmasters use to instruct search engine robots (also known as spiders or crawler bots) on how to crawl the pages of their website.

The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.

The REP is important because it helps webmasters manage how search engine bots (such as Googlebot) access their website’s pages and ultimately helps those same webmasters determine whether their pages appear in the search engine results pages (SERPs).

How Does Robots.txt Work?

Search engines have two main jobs:

  • Crawling
  • Indexing

Crawling is the process of visiting websites and gathering information about them so that they can be indexed. Indexing is the process of adding that information to the search engine’s index so that it can be shown in search results.

Robots.txt tells search engine crawlers which pages they are allowed to visit. This helps prevent crawlers from overloading your website with too many requests.

What is a robots.txt file used for?

A robots.txt file is used to give web robots instructions about a site; this mechanism is called the Robots Exclusion Protocol.

The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.

Not all robots cooperate with the standard: email harvesters, spambots, malware, and robots that scan for vulnerabilities typically ignore it. Mainstream web crawlers, on the other hand, generally follow the protocol; well-behaved web crawlers will honor the directives in a robots.txt file.

In addition, conforming crawlers will usually let the webmaster see, via server log files or other communication channels, whether they obeyed or ignored the directives. Non-conforming web crawlers may ignore the directives completely.

Why Is Robots.txt Important?

Robots.txt is important because it helps prevent crawlers from overloading your website with too many requests. It also helps you control which pages are crawled and indexed by search engines.

There are three primary reasons to employ a robots.txt file.

  • Block Non-Public Pages
  • Maximize Crawl Budget
  • Prevent Indexing of Resources

1. Block Non-Public Pages:

You might want to keep specific pages out of the index. For example, you could have a staging version of a page that you don’t want anyone outside your team to find. In this instance, robots.txt can be used to prevent crawlers and bots from visiting these pages.
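For instance, a minimal sketch (assuming a hypothetical /staging/ directory that holds your unfinished pages) would look like this:

User-agent: *
Disallow: /staging/

Keep in mind that this only asks compliant crawlers to stay away; truly private pages should also be protected with a password or a noindex meta tag (covered later in this guide).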

2. Maximize Crawl Budget:

Your website has a crawl budget. This is the number of pages search engine bots will crawl and index on your site within a given timeframe.

You want to make sure that your website is easy to crawl so that you don’t waste this valuable resource. By using robots.txt, you can tell search engines which pages are most important.
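As a sketch, suppose your site generates thousands of low-value internal search result pages under a hypothetical /search/ path. Disallowing that path keeps crawlers focused on the content you actually want ranked:

User-agent: *
Disallow: /search/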

3. Prevent Indexing of Resources:

Your website has resources like images and PDFs. You don’t want these to show up in Google search results because they are not web pages.

With robots.txt, you can prevent search engines from crawling these resources, and you can monitor the results in Google Search Console.
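For example, here is a sketch that blocks every PDF on a site, assuming the files end in .pdf and the crawler supports the * and $ wildcards (major crawlers such as Googlebot and Bingbot do):

User-agent: *
Disallow: /*.pdf$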

Where Does Robots.txt Go On a Site?

Whenever they come to a website, search engines and other crawling robots (such as Facebook’s crawler, Facebot) know to look for a robots.txt file. They’ll only look for it in one place: the main directory, i.e. the root of your domain.

If a user agent goes to www.example.com/robots.txt and does not find a robots.txt file, it will assume the site does not have one and proceed to crawl the page (and perhaps even the whole website) anyway.

Always place your robots.txt file in the main directory of your root domain to ensure that it is found.
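To illustrate with the placeholder domain example.com:

https://www.example.com/robots.txt (crawlers will find and obey this file)
https://www.example.com/blog/robots.txt (crawlers never look in subdirectories, so this file is ignored)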

How to Create a Robots.txt File?

Creating and editing your robots.txt file is easy. You can do it with any text editor, even Notepad on Windows.

Just open up a new document, insert the necessary information (discussed below), and then save the file as “robots.txt” in the root directory of your site.

(The root directory is the top-most folder where all your website’s files are stored. It is also sometimes called the “public_html” or “www” directory.)

The simplest robots.txt file looks like this:

User-agent: *
Disallow:

The above code tells all robots that they are welcome to visit every page on your site.

If you want to block all robots from your site, you would use the following code:

User-agent: *
Disallow: /

This tells all user agents not to visit any pages on your site.

Of course, you don’t want to block all robots from your site, as that would defeat the purpose of having a website in the first place.

What you really want to do is block only the specific user agents or directories that you don’t want crawled, and, if needed, set a crawl delay.
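As a sketch, a crawl delay of 10 seconds between requests can be requested like this (note that bots such as Bingbot honor the Crawl-delay directive, while Googlebot ignores it):

User-agent: *
Crawl-delay: 10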

How to Block Specific User-Agents?

To block specific user agents, simply replace the * character in the user-agent line with the name of the user agent you want to block. For example, if you wanted to block only Google’s crawler, you would use the following code:

User-agent: Googlebot
Disallow: /

How to Block Specific Directories?

To block specific directories, simply replace the / in the Disallow line with the path of the directory you want to block.

For example, if you wanted to block only the /images/ directory, you would use the following code:

User-agent: *
Disallow: /images/

Remember, you can block multiple user agents and directories by including multiple lines in your robots.txt file.

Just make sure to put each user-agent on its own line, followed by the Disallow directive (or directives) for that user-agent. For example:

User-agent: Googlebot
Disallow: /folder1/
Disallow: /folder2/file.html

User-agent: Bingbot
Disallow: /folder3/

Finally, remember that robots.txt is a public file, so don’t include any sensitive information in it that you don’t want to be made public.

For example, don’t put your email address or password in the file, as anyone who views the file will be able to see that information.

Best Practices

Creating a Robots.txt File

You’ll need to first generate your robots.txt file. It’s a plain text file, so you can create one with Windows Notepad. No matter how you create your robots.txt file, the format is always the same:

First, you specify which User-Agent(s) you want to target.

You can use an asterisk as a wildcard to match all user agents.

Next, on the following line, you specify what action to take.

This will usually be “Disallow”, which tells the user agent not to crawl a specific URL, file, or directory.

You may also use “Allow” to override previous “Disallow” directives.

Finally, you end the entry with a blank line.

If you want to make another entry, you simply start the process over again from step 1.

Here’s an example of a robots.txt file:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In the above example, we’re targeting all user agents with the asterisk wildcard. We then tell them not to crawl the /cgi-bin/, /tmp/, and /~joe/ directories.
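The “Allow” directive mentioned earlier lets you carve out exceptions to a Disallow rule. As a sketch (the /media/ directory and file name are hypothetical), this blocks a whole directory while still permitting one file inside it:

User-agent: *
Disallow: /media/
Allow: /media/press-kit.pdf

Major crawlers such as Googlebot and Bingbot support Allow, but some older or simpler bots may not.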

Make Your Robots.txt File Easy to Find

Make sure that your robots.txt file is located in the root directory of your website.

The root directory is also sometimes called the “public_html” or “www” directory.

If you put your robots.txt file anywhere else, search engines will not find it, and they will crawl your site as if no robots.txt file existed.

Follow the Guidelines

When creating your robots.txt file, make sure to follow the guidelines set forth by the major search engines.

You can find these guidelines in each search engine’s webmaster documentation, such as Google Search Central and Bing Webmaster Tools.

Be Careful What You Disallow

While it’s tempting to just block all user agents from indexing your entire site, doing so will result in your site not being indexed at all.

Therefore, only block user agents and directories that you don’t want to be indexed. And even then, be very careful about what you block.

For example, if you accidentally block your home page (or the root path /) from being crawled, your entire site can disappear from the search results.

Monitor Your Traffic

After you create and upload your robots.txt file, make sure to monitor your website’s traffic to ensure that the file is working as intended.

If you notice a sudden drop in traffic after creating or modifying your robots.txt file, it’s likely that you’ve made a mistake in the file.

In that case, you’ll need to edit the file and re-upload it to your server.

Meta Robots Tag

In addition to using robots.txt, you can also use the Robots Meta Tag on individual pages of your website to control how search engine spiders index those pages.

The Robots Meta Tag is a piece of HTML code that goes in the <head> section of your web page. It looks like this:

<meta name="robots" content="index, follow">

The “content” attribute tells search engine spiders what to do with that particular page.

In the above example, the page will be indexed (“index”) and all links on the page will be followed (“follow”).

Here are the other values that can go in the “content” attribute:

  • noindex – This value tells search engine spiders not to index the page.
  • nofollow – This value tells search engine spiders not to follow any links on the page.
  • none – This value is the same as “noindex, nofollow”.
  • noarchive – This value tells search engine spiders not to save a cached copy of the page.
  • nosnippet – This value tells search engine spiders not to show a description of the page in search results.
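For example, to keep a page out of the search results entirely and stop spiders from following its links, you would combine two of these values:

<meta name="robots" content="noindex, nofollow">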

Conclusion

The robots.txt file is a very important part of SEO.

Make sure that you create and upload a robots.txt file to your server if you want control over how search engines crawl and index your site.

And be very careful about what you include in the file.

If you make a mistake, it could result in your entire site being removed from search engine results, cutting off the organic search traffic those engines would otherwise send you. Hopefully, our guide has helped you learn everything you need to know about the robots.txt file.

If you have any questions, feel free to post a comment.

Mihael D. Cacic
“Digital Marketing Mad Scientist”

Physicist turned SEO Content Marketer. For the past few years, Mihael worked with many big SaaS and service businesses helping them rank higher and get more customers. Now here to share his secrets on how to make hyper-profitable blogs in hyper-efficient ways.
