Google is a search engine that follows links. For Google to know about your site, it has to find it by following a link from another site. When Google cames around to crawl any other site after we put the link there, it can discover the existence of my other, new site. And Google indexes it.
The Robots Exclusion Standard was developed in 1994 so that website owners can advise search engines how to crawl your website. It works in a similar way as the robots meta tag which I discussed in great length recently. The main difference being that the robots.txt file will stop search engines from seeing a page or directory, whereas the robots meta tag only controls whether it is indexed.
Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are programs that traverse the Web automatically. Search engines such as Google use them to index the web content, spammers use them to scan for email addresses, and they have many other uses.
Placing a robots.txt file in the root of your domain lets you stop search engines indexing sensitive files and directories. For example, you could stop a search engine from crawling your images folder or from indexing a PDF file that is located in a secret folder.
Major searches will follow the rules that you set. Be aware, however, that the rules you define in your robots.txt file cannot be enforced. Crawlers for malicious software and poor search engines might not comply with your rules and index whatever they want. Thankfully, major search engines follow the standard, including Google, Bing, Yandex, Ask, and Baidu.
Starting with this Article, we would like to show you how to create a robots.txt file and show you what files and directories you may want to hide from search engines for a WordPress website.
The Basic Rules of the Robots Exclusion Standard
A robots.txt file can be created in seconds. All you have to do is open up a text editor and save a blank file as robots.txt. Once you have added some rules to the file, save the file and upload it to the root of your domain i.e. www.yourwebsite.com/robots.txt. Please ensure you upload robots.txt to the root of your domain; even if WordPress is installed in a subdirectory.
We recommend file permissions of 644 for the file. Most hosting setups will set up that file with those permissions after you upload the file. You should also check out the WordPress plugin WP Robots Txt; which allows you to modify the robots.txt file directly through the WordPress admin area. It will save you from having to re-upload your robots.txt file via FTP every time you modify it.
Search engines will look for a robots.txt file at the root of your domain whenever they crawl your website. Please note that a separate robots.txt file will need to be configured for each subdomain and for other protocols such as https://www.yourwebsite.com
It does not take long to get a full understanding of the robots exclusion standard, as there are only a few rules to learn. These rules are usually referred to as directives.
The two main directives of the standard are:
- User-agent – Defines the search engine that a rule applies to;
- Disallow – Advises a search engine not to crawl and index a file, page, or directory;
An asterisk (*) can be used as a wildcard with User-agent to refer to all search engines. For example, you could add the following to your website robots.txt file to block search engines from crawling your whole website.
The above directive is useful if you are developing a new website and do not want search engines to index your incomplete website.
Some websites use the disallow directive without a forward slash to state that a website can be crawled. This allows search engines complete access to your website.
The following code states that all search engines can crawl your website. There is no reason to enter this code on its own in a robots.txt file, as search engines will crawl your website even if you do not define add this code to your robots.txt file. However, it can be used at the end of a robots.txt file to refer to all other user agents.
You can see in the example below that I have specified the images folder using /images/ and not www.yourwebsite.com/images/. This is because robots.txt uses relative paths, not absolute URL paths. The forward slash (/) refers to the root of a domain and therefore applies rules to your whole website. Paths are case sensitive, so be sure to use the correct case when defining files, pages, and directories.
In order to define directives for specific search engines, you need to know the name of the search engine spider (aka the user agent). Googlebot-Image, for example, will define rules for the Google Images spider.
Please note that if you are defining specific user agents, it is important to list them at the start of your robots.txt file. You can then use User-agent: * at the end to match any user agents that were not defined explicitly.
It is not always search engines that crawl your website; which is why the term user agent, robot, or bot, is frequently used instead of the term crawler. The number of internet bots that can potentially crawl your website is huge. The website Bots vs Browsers currently lists around 1.4 million user agents in its database and this number continues to grow every day. The list contains browsers, gaming devices, operating systems, bots, and more.
Bots vs Browsers is a useful reference for checking the details of a user agent that you have never heard of before. You can also reference User-Agents.org and User Agent String. Thankfully, you do not need to remember a long list of user agents and search engine crawlers. You just need to know the names of bots and crawlers that you want to apply specific rules to; and use the * wildcard to apply rules to all search engines for everything else.
Below are some common search engine spiders that you may want to use:
- Bingbot – Bing
- Googlebot – Google
- Googlebot-Image – Google Images
- Googlebot-News – Google News
- Teoma – Ask
4. Create and Configure: Basic Rules of the Robots Exclusion Standard