Where is the robots.txt file and what does it do?

Creating a robots.txt file is a simple yet effective way to support website optimisation. You can create this file yourself if your website is hosted on your own server, and you can get help creating a robots.txt file if you have shared hosting. If you need a new web hosting plan and help creating a robots.txt file, don’t hesitate to consult your web hosting options.

What is the robots.txt file?

The robots.txt file is a simple text file used to give instructions to web robots, sometimes referred to as spiders or web crawlers, about which pages of your website they may browse. Googlebot is one such crawler and, as its name suggests, is the web robot used by Google. It is programmed to check the robots.txt file before browsing any other files or file locations, in order to determine which pages of your website can be browsed and consequently indexed. The robots.txt file is not mandatory. However, you should consider one if your site has pages or content that you do not want crawlers to visit.

Where is the robots.txt file located?

The robots.txt file is located on your server in the top-level directory, that is, at root level, in the same location as the index.html file. Web robots are programmed to look for the robots.txt file in the top-level directory, so make sure it is always in this location. As stated previously, the robots.txt file is not a mandatory element of website design, meaning you can omit it if you wish to give web crawlers complete access to your website. You can easily find out whether a website contains a robots.txt file.

To find and view the file, simply add /robots.txt to the end of a website name.
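
For example, the short Python sketch below simply downloads and prints a site’s robots.txt file so you can inspect it; www.example.com is a placeholder, so substitute the domain you are curious about.

# Fetch and display a site's robots.txt file.
# "www.example.com" is a placeholder domain used only for illustration.
from urllib.request import urlopen
from urllib.error import URLError

url = "https://www.example.com/robots.txt"  # placeholder: substitute any domain
try:
    with urlopen(url) as response:
        # robots.txt is plain text, so decode the raw bytes before printing
        print(response.read().decode("utf-8", errors="replace"))
except URLError as err:
    # A 404 or connection problem simply means no file could be retrieved.
    print("Could not retrieve", url, "-", err)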

If your website is on shared hosting, the hosting environment may not allow you to manage your website’s own robots.txt file directly. You can, however, ask the administrator of your web hosting environment to help you with the configuration of this file.

What does the robots.txt file look like?

Thankfully, the robots.txt file is easy to create and requires little to no technical expertise, because it uses very simple syntax and is therefore easy to replicate. You can create the file using a plain text editor such as Notepad, or copy and adapt an existing model if it meets your needs. Note that the filename must be in lowercase letters, as robots.txt is the exact filename web robots are programmed to look for. The file consists of keywords and characters that are interpreted by the crawler, such as:

  • User-agent
  • Disallow
  • Allow (recognised by Googlebot and some other major crawlers)

Each block of text represents a different rule. The file can contain as few as two lines, or many lines to account for different types of web robots and any number of directives. Hosting.uk/robots.txt is a good example of a robots.txt file with a detailed list of directives.
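
If you would rather generate the file than type it by hand, the short Python sketch below writes a minimal two-line robots.txt; the /private rule is purely hypothetical, so replace it with rules that suit your own site before uploading the file to your top-level directory.

# Write a minimal robots.txt file; the /private rule is only an example.
rules = (
    "User-agent: *\n"       # the block applies to all web robots
    "Disallow: /private\n"  # hypothetical folder the robots should skip
)

# The filename must be lowercase, and the finished file belongs in the
# top-level (root) directory of the website.
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(rules)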

How does the robots.txt file work?

A web robot such as Googlebot receives instructions for a search and proceeds to scour the web’s content, with the aim of presenting a list of websites or pages that may meet the search request. When deciding which websites to index, the web robot first looks at the robots.txt file to determine whether it has permission to examine and list the pages of that site and, if so, which pages can be listed.
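
In practice, a well-behaved crawler performs this check programmatically before requesting a page. The Python sketch below shows the idea using the standard urllib.robotparser module; the domain, page and crawler name are placeholders, so treat it as an illustration rather than a finished crawler.

# A polite crawler consults robots.txt before fetching a page.
# The domain, page and user-agent name below are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # download and parse the live robots.txt file

page = "https://www.example.com/some-page.html"
if rp.can_fetch("ExampleCrawler", page):
    print("Allowed to crawl", page)
else:
    print("robots.txt asks crawlers to skip", page)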

User-agent

The line containing the keyword “user-agent” indicates whether the instructions which follow are directed at a specific web robot, such as Googlebot, or apply to all web robots.

Example 1
User-agent: *

This means that the block of instructions which follows applies to all web robots. In this case, the asterisk is used to indicate “all web robots”.

Example 2
User-agent: Googlebot

This means that the block of instructions which follows applies specifically to Googlebot; here, the asterisk is replaced by the name of the crawler the rules are aimed at.
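
To see how this targeting works in practice, the sketch below feeds a file containing both kinds of user-agent line to Python’s standard urllib.robotparser module and asks which robots may visit a hypothetical /private page.

# Illustrate user-agent targeting with Python's standard robots.txt parser.
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own block, so /private is off limits to it...
print(rp.can_fetch("Googlebot", "/private/page.html"))  # False
# ...while other robots fall back to the catch-all block and may crawl it.
print(rp.can_fetch("Bingbot", "/private/page.html"))    # True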

Disallow

This instruction, which must be preceded by the line indicating the name of the user-agent, indicates what pages, directories, or even file types the web robot is unauthorised to visit. You can create the instruction to allow all content, to disallow all content, or to create conditional allowances. You might wonder, under what circumstances would anyone consider disallowing all content? A website owner might want to do this while a site is under construction, for example, and may later change the robots.txt file once the site is complete.

The following examples show how the Disallow line can be used in different cases:

Example 3.1
Disallow: /donotcrawl

This line indicates that the folder “donotcrawl”, and by extension any of its content, must not be visited by the web robot or robots in question; that is, disallow only the folder “donotcrawl”.

For a more concrete example of how this line is used, consider the following example:

Example 3.2
User-agent: Googlebot
Disallow: /donotcrawl

This block of instructions applies specifically to the web robot Googlebot. It indicates that Googlebot is not allowed to crawl the folder “donotcrawl” and by extension any of its content.
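
The same block can be checked directly with Python’s standard urllib.robotparser module, confirming that the rule covers the folder and everything below it while leaving the rest of the site crawlable; the page names used here are hypothetical.

# Check which paths the Example 3.2 block actually covers.
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot
Disallow: /donotcrawl
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The folder and everything inside it are off limits to Googlebot...
print(rp.can_fetch("Googlebot", "/donotcrawl/page.html"))  # False
# ...but pages outside that folder can still be crawled.
print(rp.can_fetch("Googlebot", "/index.html"))            # True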

Example 4
Disallow:

This line means that the web robot or robots in question can visit all content, that is, disallow nothing. This has the same result as using no robots.txt file at all or leaving it empty.

Example 5
Disallow: /

This line indicates that no content can be visited by the web robot or robots in question, that is, disallow everything.

Example 6
Disallow: *.png

This line indicates that all files with the extension “.png” should not be visited by the web robot or robots in question, that is, disallow all PNG files.
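
Note that wildcard patterns such as *.png are an extension honoured by major crawlers like Googlebot rather than part of the original robots.txt convention, and simpler crawlers may ignore them. As a rough illustration only, the Python sketch below models the documented behaviour, where “*” matches any run of characters and “$” pins a pattern to the end of the URL path.

# A simplified model of wildcard matching as documented for Googlebot:
# '*' matches any sequence of characters, '$' anchors the pattern to the
# end of the URL path.  Real crawlers may differ in the details.
import re

def rule_matches(pattern, path):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the pattern, then turn the escaped '*' back into "match anything".
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

# "Disallow: *.png" matches any path containing ".png"...
print(rule_matches("*.png", "/images/logo.png"))  # True
# ...but leaves other file types untouched.
print(rule_matches("*.png", "/about.html"))       # False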

Allow

The Allow instruction is interpreted by Googlebot (and some other major crawlers) as an exception to a Disallow instruction. Consider the following example:

Example 7
User-agent: Googlebot
Disallow: /donotcrawl
Allow: /donotcrawl/*.jpg

This block of instructions indicates that the content of the folder “donotcrawl” is not to be visited, with the exception of content with the extension .jpg; that is, disallow everything in the folder “donotcrawl” except the JPG files that the Allow line explicitly permits.
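
When an Allow rule and a Disallow rule both match the same URL, Googlebot is documented to apply the most specific (longest) matching rule, which is why the Allow line above wins for JPG files inside the folder. The Python sketch below is a simplified model of that precedence, reusing the wildcard matching shown earlier; it is an illustration, not a full implementation of any crawler.

# A simplified model of Allow/Disallow precedence: the longest matching
# rule wins, and Allow wins a tie.  Wildcards are handled as in the
# earlier sketch.
import re

def rule_matches(pattern, path):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs, e.g. ("allow", "/a/*.jpg")."""
    matching = [(len(pattern), directive) for directive, pattern in rules
                if rule_matches(pattern, path)]
    if not matching:
        return True  # no rule applies, so crawling is permitted
    # Sort by rule length, breaking ties in favour of "allow".
    matching.sort(key=lambda m: (m[0], m[1] == "allow"))
    return matching[-1][1] == "allow"

rules = [("disallow", "/donotcrawl"), ("allow", "/donotcrawl/*.jpg")]
print(is_allowed(rules, "/donotcrawl/photo.jpg"))  # True: the Allow rule is longer
print(is_allowed(rules, "/donotcrawl/page.html"))  # False: only the Disallow matches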

How does the robots.txt file help with website optimisation?

Website optimisation is not just about improving the content on your website but also its visibility to users. If you went to the trouble of creating a website, it stands to reason that you want people to see it. By creating a robots.txt file, you can improve the visibility of your website, since the file helps web robots determine which parts of your site should be visited and, ultimately, indexed.

Having a robots.txt file therefore helps with website optimisation because it helps you control your website’s visibility. Of course, the robots.txt file should be coupled with other website optimisation techniques such as SEO. Where used, a robots.txt file should be carefully constructed and reviewed to ensure that pages which should be crawled and indexed are not accidentally disallowed.