A robots.txt file is used by webmasters to give web robots instructions about their site. This convention is called the Robots Exclusion Protocol.
What are Robots?
Web robots, also known as spiders or crawlers, are programs that traverse web pages automatically. Search engines such as Google and Bing use them to crawl and index web content. Common uses of robots include indexing, HTML validation, link validation, “What’s New” monitoring, and mirroring.
How Does It Work?
Before crawling a site, search engine robots look for the /robots.txt file, which contains a set of instructions for them. Based on the instructions found in robots.txt, a search engine crawls and indexes the whole site, only parts of it, or none of it.
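The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the rules, the /private/ path, and the robot name ExampleBot are made up for the example, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed robot name; it falls under the * group.
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))         # True
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html"))  # False
```

A robot that honors the protocol would run a check like this before requesting each page and skip any URL for which the answer is False.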
Commonly used Instructions
This instruction allows all robots to visit all directories. The wildcard * matches every robot, and the empty Disallow value blocks nothing:
User-agent: *
Disallow:
Instruction to keep all robots out of the website. Use this if the website is not to be indexed.
User-agent: *
Disallow: /
Instruction to keep one particular robot out of the site:
User-agent: BadBot
Disallow: /
Instruction to allow a single robot and disallow all others
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
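How two groups like these are resolved can be tried out with Python's standard urllib.robotparser. This is a sketch under assumptions: it uses Googlebot, the name Google's crawler identifies itself with, as the allowed robot, and OtherBot is a hypothetical robot name standing in for everyone else.

```python
from urllib.robotparser import RobotFileParser

# One group for the allowed robot, then a catch-all group for every other robot.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

def can_visit(agent: str, url: str, robots_txt: str = RULES) -> bool:
    """Return True if the given robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(can_visit("Googlebot", "http://www.example.com/page.html"))  # True
print(can_visit("OtherBot", "http://www.example.com/page.html"))   # False
```

A robot uses the group that names it and falls back to the * group only when no named group applies, which is why the named robot is not caught by the final Disallow: /.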
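Instruction to let robots read everything except the files in one directory (put the files to be hidden in that directory and leave the rest outside it):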
User-agent: *
Disallow: /~joe/stuff/
Allow directive: to stay compatible with all robots when you want to allow single files inside an otherwise disallowed directory, place the Allow directive(s) first, followed by the Disallow. For example:
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
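The reason ordering matters is that some parsers stop at the first rule that matches a path. Python's standard urllib.robotparser behaves this way in current CPython, so it makes a convenient demonstration. The sketch below compares both orderings; ExampleBot is a hypothetical robot name.

```python
from urllib.robotparser import RobotFileParser

ALLOW_FIRST = """\
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

# Same rules with the order reversed, to show what a first-match parser does.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /folder1/
Allow: /folder1/myfile.html
"""

def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Check a path against robots.txt text using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

print(allowed(ALLOW_FIRST, "/folder1/myfile.html"))     # True
print(allowed(ALLOW_FIRST, "/folder1/other.html"))      # False
print(allowed(DISALLOW_FIRST, "/folder1/myfile.html"))  # False: Disallow matched first
```

Robots that pick the most specific matching rule give the same answer for both orderings, so putting Allow first is the form that works for both kinds of parser.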
Instructions to block files of a specific file type (for example, .gif):
User-agent: *
Disallow: /*.gif$
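Here * stands for any sequence of characters and $ marks the end of the URL path. Not every parser interprets these characters (Python's standard urllib.robotparser, for one, appears to treat them literally), so the matching is sketched by hand below. The helper names rule_to_regex and rule_matches are invented for this illustration.

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule containing * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' wherever a '*' appeared.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(rule: str, path: str) -> bool:
    """Rules match from the start of the path, which is what re.match does."""
    return rule_to_regex(rule).match(path) is not None

print(rule_matches("/*.gif$", "/images/logo.gif"))         # True
print(rule_matches("/*.gif$", "/images/logo.gif?size=2"))  # False: path does not end in .gif
print(rule_matches("/*.gif$", "/images/logo.png"))         # False
```

Without the trailing $, the rule would also match URLs that merely contain .gif somewhere after the leading slash.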
Which robots will follow the instructions in the robots.txt file?
All legitimate search engines and other well-behaved programs follow them. Malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers pay no attention to the file.
Where to put the robots.txt file
Put your robots.txt file in the top-level directory of your web server, next to the index file of your website. For instance, http://www.example.com/robots.txt is a valid location.
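Because the file always sits at the top level, a robot can work out its address from any page URL on the site. A minimal sketch, where the function name robots_url and the sample page URL are made up for the example:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Keep the scheme and host of a page URL and point the path at /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/shop/index.html?page=2"))
# http://www.example.com/robots.txt
```

A robots.txt file placed in a subdirectory, such as /shop/robots.txt, would never be requested this way, which is why robots do not find it there.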