A Guide to Robots.txt Files

Estimated reading time: 6 minutes

What Are Robots.txt Files?

When it comes to new websites or pages being uploaded onto the Internet, search engines have two main jobs:

Crawling (also known as “spidering”) the website to discover content.
Indexing (saving) those web pages so that they can be served up in search results.

Robots.txt is a text file that tells web robots (most often search engine crawlers) how to crawl the pages on your website. The directives in the robots.txt file grant or restrict access to sections of your site, and can target specific kinds of web crawlers (such as desktop crawlers vs. mobile crawlers).

Simply put, the robots.txt file dictates which web-crawling software has access to which pages or files on your website. These crawl instructions are specified by “disallowing” or “allowing” the behaviour of web robots.

How Does Robots.txt Work?

When arriving at a website, the first thing a search crawler looks for is the robots.txt file, which tells it how to crawl the site.
Depending on the directives in the robots.txt file, the web spider will crawl certain pages and ignore others.

Here are a few examples of what the instructions within a robots.txt file might look like:

Blocking all web crawlers from all content on your website:
User-agent: *
Disallow: /

Allowing all web crawlers access to all content on your website:
User-agent: *
Disallow:

Blocking a specific web crawler from a specific web folder:
User-agent: Bingbot
Disallow: /example-subfolder/

Blocking a specific web crawler from a specific web page:
User-agent: Googlebot
Disallow: /example-subfolder/blocked-page.html
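
To see how a crawler might interpret rules like the ones above, here is a minimal sketch using Python's standard-library urllib.robotparser module. It parses the Bingbot and Googlebot examples and asks whether a given URL may be fetched. The page URLs under www.example.com and the DuckDuckBot name are assumptions made for illustration, and real search engines use their own parsers, which may behave differently (for instance with wildcards).

```python
from urllib.robotparser import RobotFileParser

# The two crawler-specific examples from above, combined into one file.
RULES = """\
User-agent: Bingbot
Disallow: /example-subfolder/

User-agent: Googlebot
Disallow: /example-subfolder/blocked-page.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from text, no network access needed

base = "http://www.example.com"  # illustrative domain

# Bingbot is blocked from everything under the folder.
bing_folder = parser.can_fetch("Bingbot", base + "/example-subfolder/page.html")

# Googlebot is only blocked from the one named page.
google_folder = parser.can_fetch("Googlebot", base + "/example-subfolder/page.html")
google_blocked = parser.can_fetch(
    "Googlebot", base + "/example-subfolder/blocked-page.html"
)

# A crawler with no matching group (hypothetical name) is not restricted.
other_bot = parser.can_fetch(
    "DuckDuckBot", base + "/example-subfolder/blocked-page.html"
)
```

Python's parser matches rules by simple path prefix, so treat its answers as an approximation of what any particular search engine will do.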

Why Robots.txt Is Important for SEO

You might be wondering why anyone would want to prevent web robots from crawling their site.

Search engine bots like Googlebot have a “crawl budget”. Crawl budget is the number of pages on a website that a search engine bot will crawl and index within a given timeframe. It is determined by crawl rate and crawl demand.

Since search engine bots can only give a website a limited amount of crawling attention, you have to be clever about how that budget is spent. That’s what the robots.txt file is for.

For example, if you have a big website with many pages, Googlebot may not have enough crawl budget to crawl all of them. In this case, you’d want to restrict access to unimportant pages, such as thank-you pages, so that the important ones are prioritised. This helps search engines make the most of their crawl budget.

Robots.txt files are useful if you want to keep search engines from crawling:

• Login or thank-you pages
• Duplicate or broken pages
• Internal search result pages
• Paid search landing pages
• Certain areas of your website or an entire domain
• Certain pictures that you want to keep out of the Google image search
• Certain files on your website such as PDFs
• Websites under construction, not ready for launch
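
As a rough illustration of how a list like this turns into actual rules, the sketch below builds the text of a robots.txt group from a list of paths. The function name build_robots_txt and the example paths (/thank-you/, /search/, /login/) are hypothetical and stand in for whatever URLs your own site uses.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text with one Disallow line per path.

    An empty list produces an empty Disallow rule, which allows everything.
    """
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        lines += [f"Disallow: {path}" for path in disallowed_paths]
    else:
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


# Hypothetical paths for thank-you, internal search, and login pages.
robots_text = build_robots_txt(["/thank-you/", "/search/", "/login/"])
```

Each path becomes a prefix rule, so /search/ here would cover every URL that starts with /search/.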

It’s important to note that robots.txt isn’t meant to hide secure pages such as admin or private pages. Pages with sensitive information shouldn’t be listed in the robots.txt file, because the file is publicly readable and listing them reveals their location to anyone, including other web crawlers. The best way to securely prevent robots from accessing private content on your website is to password protect the area where it is stored.

How to Create a Robots.txt File for SEO

Even if you want all robots to have access to every page on your website, it’s still good practice to add a robots.txt file. You can check whether your website has one by adding /robots.txt immediately after your domain name in the browser address bar, e.g. examplewebsite.com/robots.txt. If your website doesn’t have a robots.txt file, it’s a good idea to create one as soon as possible, following these steps:

  1. Use a text editor to create a new file and name it “robots.txt”

    Choose a text editor that is able to create standard UTF-8 text files. Text editors such as Notepad on Windows PCs or TextEdit on Macs work well. Save the file as plain text, making sure that the file extension is “.txt”.

    Tip: Don’t use a word processor. They often save files in a proprietary format, which can add unexpected characters that can cause problems for crawlers.
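
If you would rather generate the file with a script than a text editor, the following is a minimal sketch in Python. It writes plain UTF-8 text with Unix line endings to a file named robots.txt. The helper name write_robots_txt and the target directory are assumptions for illustration.

```python
from pathlib import Path


def write_robots_txt(directory, content):
    """Write content to <directory>/robots.txt as plain UTF-8 text."""
    path = Path(directory) / "robots.txt"
    # newline="\n" keeps line endings identical on every operating system.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(content)
    return path


# Example call (hypothetical output folder):
# write_robots_txt("site-root", "User-agent: *\nDisallow:\n")
```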

  2. Use the correct robots.txt file syntax

    A robots.txt file consists of one or more groups, and each group consists of one or more rules or directives (instructions), with one directive per line. A group consists of the following information:

    • Who the group applies to (the user agent). Google user agent names are listed in the Google list of user agents.
    • Which directories/files the agent can access, and/or
    • Which directories/files the agent cannot access

    Tip: Paths in rules are case-sensitive. For example:
    Disallow: /example-subfolder applies to http://www.example.com/example-subfolder, but not to http://www.example.com/EXAMPLE-SUBFOLDER.

    Free robots.txt generating tools like SEOptimer are an easy way of creating a robots.txt file without any errors.
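
The group structure described in this step can be sketched as a small parser. This is a simplified illustration, not how any search engine actually reads the file: the Group class and parse_groups function are hypothetical names, and only the User-agent, Allow, and Disallow directives are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    user_agents: list = field(default_factory=list)  # who the group applies to
    rules: list = field(default_factory=list)        # (directive, path) pairs


def parse_groups(text):
    """Split robots.txt text into groups of user agents and their rules."""
    groups = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()  # directive names are not case-sensitive
        if key == "user-agent":
            # A User-agent line that follows rules starts a new group.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
        elif key in ("allow", "disallow") and current is not None:
            # Paths are stored exactly as written, since they are case-sensitive.
            current.rules.append((key, value))
    return groups


SAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/
Allow: /private/public.html

User-agent: *
Disallow:
"""

groups = parse_groups(SAMPLE)
```

Here the first group applies to two user agents and carries two rules, while the second group applies to every other crawler and allows everything.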

  3. Add your XML sitemap to your robots.txt file

    Robots.txt files should also include the location of another very important file: the XML sitemap. Not only will it help search engine bots discover all of your pages, it will also help them understand the importance of your pages in relation to one another and quickly find new content on your website. Adding your XML sitemap to your robots.txt file will also save you the trouble of having to submit your sitemap to every search engine individually.

    A sitemap entry applies to all web crawlers, independently of any user-agent group. Combined with a group that allows everything, it would look like this:
    Sitemap: http://www.example.com/sitemap.xml
    User-agent: *
    Disallow:
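
As a quick check that a Sitemap line is being picked up, the sketch below reads it back with Python's standard-library urllib.robotparser (the site_maps method requires Python 3.8 or later). The example.com URL is the illustrative one from this step, and other crawlers may parse the file differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemap.xml
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
```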

  4. Upload your robots.txt file to the root directory of your website

    This is usually a root-level folder called “htdocs” or “www”. For example, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt).

    Tip: If you use any sub-domains, create a robots.txt file for each sub-domain.
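
The location rule above can be expressed in a few lines: for any page URL, the robots.txt that governs it sits at the root of the same scheme and host. The sketch below uses Python's urllib.parse, and the function name robots_url and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any sub-domain); replace the path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_url("http://www.example.com/pages/about.html")
blog_site = robots_url("https://blog.example.com/post?id=1")
```

Because the host is kept as-is, a sub-domain such as blog.example.com resolves to its own robots.txt, which is why each sub-domain needs a separate file.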

  5. Validate your robots.txt file

    Check the robots.txt file by entering yourdomain.com/robots.txt into the browser address bar. If the file loads, what you see is what crawlers will read. You can also use Google Search Console to test your robots.txt file.

    Important to note: An incorrect robots.txt file can block bots and crawlers from discovering the pages on your site. So make sure the directives within your robots.txt file are correct.
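
Before uploading, a simple script can catch the most common slips, such as a misspelled directive or a rule placed before any User-agent line. The sketch below is a basic check under stated assumptions, not a full validator: the function name lint_robots and the set of accepted directives are choices made for this example, and it does not replace testing in a search engine's own tools.

```python
# Directives this sketch accepts; an assumption, not an official list.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{key}'")
        elif key == "user-agent":
            seen_agent = True
        elif key in ("allow", "disallow"):
            if not seen_agent:
                problems.append(f"line {number}: rule before any User-agent line")
            elif value and not value.startswith(("/", "*")):
                problems.append(f"line {number}: path should start with '/'")
    return problems


BROKEN = """\
Disallow: /private/
User-agent: *
Disalow: /tmp/
Allow: images/
"""

problems = lint_robots(BROKEN)
```

An empty list means none of these particular checks failed. It does not guarantee that the rules do what you intended.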
