Robots.txt and Website Crawlers
(Last Updated On: July 27, 2018)


A website’s robots.txt file is more important than a lot of people realise and, in this article, we’ll be explaining everything that you need to know about it.

Most website owners go through a phase of fine-tuning every possible aspect of their site: reducing image file sizes, minimising page load times, and, the key focus of today’s article, creating a robots.txt file. Unfortunately, a lot of people don’t fully understand what the robots.txt file is used for, and this leads to them writing their own poorly.

If you’re one of these people, don’t worry! The robots.txt file can seem extremely complicated if you aren’t familiar with it, but it’s time for that to change.

Robots.txt File – What Exactly is it?

Search engines use bots called “spiders” that crawl the web and discover which pages on a website can be indexed. Indexed pages can contribute to your Google rankings, but any pages you’ve configured as non-indexed won’t (there are actually some benefits to this).

A robots.txt file simply guides search engine bots, telling them which parts of your site they may crawl and which they should leave alone.
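To make this concrete, here’s a minimal sketch of how a crawler consults these rules before fetching a page, using Python’s standard urllib.robotparser module. The rules and paths below are made up purely for illustration.

```python
import urllib.robotparser

# Hypothetical rules, parsed inline rather than fetched over HTTP.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved bot calls can_fetch() before requesting a page.
print(rp.can_fetch("*", "/private/report.html"))  # False: blocked by the rule
print(rp.can_fetch("*", "/blog/welcome.html"))    # True: no rule matches
```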

Benefits of Web Crawlers

Crawlers are incredibly beneficial to both search engines and website owners. Below are a few of the key benefits that come with using crawlers.

  • Website owners can choose to have only the most important pages of a website indexed by the crawler.
  • Website owners can publish as much content as they want, while choosing the specific pages that will impact their search rankings the most, rather than having the entire site indexed.
  • Search engines can ensure that the search results they show are accurate, valuable, and of high overall quality.

We could go on for days about web crawlers but the main thing to keep in mind is that they can be incredibly helpful when it comes to online marketing.

SEO and Robots.txt

If you’re creating (or adjusting) a robots.txt file for your website, you’re likely doing it because you’ve read elsewhere that it will support your SEO rankings. That’s true, but it’s also worth taking the time to understand why.

Elaborating on what we’ve explained so far: if a crawler doesn’t find a robots.txt file, it will simply crawl your entire website. Every page, video, image – everything. The longer bots spend crawling your site, the more of their crawling effort is wasted on unimportant pages, which can hold back your search rankings.

Using robots.txt gives online marketers like yourself a chance to limit how much of the website these bots crawl, which helps keep your rankings from falling.

It’s also worth acknowledging that small adjustments to your website like this can make all the difference, so even just creating a basic robots.txt file will do some good.

Creating a Robots.txt File

First time creating a robots.txt file? Don’t stress; it’s not as difficult as you might think. Plenty of templates are available online, and after a quick look you’ll see that it’s nothing to get too frustrated over. So, how can you create your own? Here’s a very simple, generic robots.txt file.

User-agent: *
Allow: /

To help you better understand this:


  • A “User-agent” names the web crawler that a group of rules applies to. With this entry set to *, the directives that follow apply to all crawlers.
  • “Allow” tells web crawlers which pages may be crawled; assigning a slash (/) here makes every page on the site accessible.
  • It’s worth noting that “Disallow” can be used in place of “Allow” where needed, depending on how you plan to use robots.txt.

This is an incredibly simple robots.txt file but if you’re a beginner, it’s a good place to start. You don’t have to go overboard when writing this file. So, if you want to stick with adding just a few arguments, there’s nothing wrong with that.
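If you’d like to sanity-check a file like this before uploading it, one approach (a sketch using Python’s standard urllib.robotparser module; the bot name is made up) is to parse the two lines above and confirm that every path is crawlable:

```python
import urllib.robotparser

# The exact allow-all file from above, parsed inline.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /",
])

# "AnyBot" is a hypothetical crawler name; the * group applies to every bot.
for path in ["/", "/about", "/images/logo.png"]:
    print(path, rp.can_fetch("AnyBot", path))  # True for every path
```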

Let’s Take Things a Step Further…

However, if you want to take things up a notch, here’s a slightly more advanced robots.txt file to consider.

#GoogleRule
User-agent: Googlebot
Disallow: /video-tutorials/
Disallow: /*.php

#FacebookRule
User-agent: Facebot
Disallow: /images/

Sitemap: https://www.yourwebsite.com/sitemap.xml

Whoa! This example has a few more entries, and likely some things you’re unfamiliar with, but it’s not as complicated as it looks. Here are a few key notes.

  • Anything from a hash (#) to the end of a line is a comment and is ignored by crawlers. Comments are purely there for you to keep brief notes on what’s going on throughout the robots.txt file.
  • Instead of addressing all crawlers, these two sets of rules are specifically for Google’s standard search bot and Facebook’s crawler.
  • The “GoogleRule” prevents Googlebot from accessing the “/video-tutorials/” folder on your website, as well as any URL whose path contains “.php” (the * wildcard is an extension supported by major search engines rather than part of the original robots.txt standard).
  • The “FacebookRule” prevents Facebot from crawling your website’s “/images/” folder.
  • Although not necessary, the “Sitemap” entry can be used to point crawlers towards your website’s sitemap location.
  • Sitemaps “map” out the different pages and content of a website, helping crawlers analyse your site and find every page. Having a sitemap isn’t mandatory, but most search engines recommend it.

Hopefully, these notes will help you to gain a better understanding of what’s going on in the robots.txt above. It is an understandably scary file at first sight (especially for anyone who is inexperienced with it), but once you know the file’s purpose and you’ve got basic knowledge of how it works, you can slowly start to create your own.

“Where do I put my robots.txt file?”

Generally speaking, you should save the robots.txt file in the root folder of your website, i.e. the top-level directory that is served at the path “/”. Crawlers only look for the file in this one location, so after you upload it you’ll be able to view it at:

https://www.yourwebsite.com/robots.txt

Don’t worry; even though anyone can access the file at this address, it contains no information that by itself makes your website vulnerable. Just bear in mind that the file is public, so avoid listing paths you’d rather keep private.

To Conclude…

Reiterating what we previously said, you can stick with a basic robots.txt file like the one we presented before. Going that little bit further will help your online marketing efforts, specifically with SEO, but it’s not always necessary.

So, there you have it. You’ll now be able to write a working robots.txt file and store it in the correct website directory in order to reap the benefits that come with it!