Mark Hobley
ROBOT EXCLUSION PROTOCOL

The Robots Exclusion Protocol is a method of instructing robots not to visit parts of a website.

When a robot visits a website, it looks for a robots.txt file in the root directory.
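As a rough illustration of where a robot looks, the sketch below derives the robots.txt address from any page address on the same site. The function name robots_url and the host www.example.com are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL in the root directory of page_url's site."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical page address used only for illustration.
print(robots_url("http://www.example.com/dir/page.html?x=1"))
```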

If the file exists, it is analysed for entries such as:

User-agent: *
Disallow: /

This entry tells all robots not to visit any part of the site. Entries can also exclude only parts of the site:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

The example above tells all robots not to visit the /cgi-bin/ and /tmp/ directories.
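One way to check how such a record is interpreted is Python's standard urllib.robotparser module. The sketch below feeds it the example above; the robot name ExampleBot and the host www.example.com are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines.
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot and www.example.com are hypothetical names.
cgi_ok = parser.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")
tmp_ok = parser.can_fetch("ExampleBot", "http://www.example.com/tmp/scratch.txt")
index_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print(cgi_ok, tmp_ok, index_ok)
```

The first two results should be False and the last one True, because only /cgi-bin/ and /tmp/ are disallowed.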

Note: A separate "Disallow" line is required for every URL prefix that is to be excluded. (For example, you cannot say "Disallow: /cgi-bin/ /tmp/").

You may not have blank lines in a record, as they are used to delimit multiple records.
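To see why a stray blank line matters, here is a minimal sketch of a reader that splits a file into records at blank lines. It is a simplified illustration, not a complete parser, and the function name split_records is invented for this example.

```python
def split_records(text):
    """Split robots.txt text into records, using blank lines as delimiters."""
    records = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record collected so far.
            if current:
                records.append(current)
                current = []
            continue
        field, _, value = line.partition(":")
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records

good = "User-agent: *\nDisallow: /tmp/\n"
# The blank line here wrongly separates the Disallow line from its User-agent line.
broken = "User-agent: *\n\nDisallow: /tmp/\n"

print(len(split_records(good)), len(split_records(broken)))
```

With the blank line in place, the Disallow line ends up in a record of its own with no User-agent line, so a reader like this one cannot tell which robot it applies to.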

Wildcards and regular expressions are not supported in either the User-agent or Disallow lines. (For example, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".)
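Under the rules described here, each Disallow value is treated as a plain URL prefix. The sketch below shows that reading; the function name is_disallowed is made up, and the code is an interpretation of the text above rather than any particular robot's implementation.

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if path starts with any non-empty Disallow prefix."""
    # An empty Disallow value excludes nothing, so empty prefixes are skipped.
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)

# A plain prefix matches everything beneath it.
plain = is_disallowed("/tmp/file.txt", ["/tmp/"])
# A '*' is compared literally, so these patterns do not behave as wildcards.
star_suffix = is_disallowed("/tmp/file.txt", ["/tmp/*"])
star_gif = is_disallowed("/images/logo.gif", ["*.gif"])

print(plain, star_suffix, star_gif)
```

Only the plain prefix matches. The two patterns containing '*' match nothing here, which is why each URL prefix needs its own Disallow line written out in full.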

The '*' in the user-agent field is a special value meaning "any robot".

Everything not explicitly disallowed may be retrieved by the robot.

Here follow some examples:

To exclude all robots from the entire server:

User-agent: *
Disallow: /

To allow all robots complete access to the entire server:

User-agent: *
Disallow:

To exclude all robots from part of the server:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot:

User-agent: BadBot
Disallow: /

To allow a single robot:

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude a single file:

User-agent: *
Disallow: /dir/private.html
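The "allow a single robot" example above can be tried out with Python's standard urllib.robotparser module. This is only a sketch: the robot name OtherBot and the host www.example.com are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

SINGLE_ROBOT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

single = RobotFileParser()
single.parse(SINGLE_ROBOT.splitlines())

page = "http://www.example.com/page.html"  # hypothetical address
webcrawler_ok = single.can_fetch("WebCrawler", page)
otherbot_ok = single.can_fetch("OtherBot", page)  # OtherBot is a made-up name

print(webcrawler_ok, otherbot_ok)
```

WebCrawler matches its own record, whose empty Disallow line excludes nothing, while every other robot falls through to the "*" record and is excluded from the whole server.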



The field names and robot names of the Robots Exclusion Protocol are not case sensitive, but filenames may be. It is good practice to capitalize the first letter of User-agent and Disallow, and to write each path in exactly the case used on the server.
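The difference between field names and filenames can be observed with Python's standard urllib.robotparser module. In this sketch the directory /Private/, the robot name ExampleBot and the host www.example.com are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Field names are deliberately written in lower case here.
LOWER_CASE_FIELDS = """\
user-agent: *
disallow: /Private/
"""

case_parser = RobotFileParser()
case_parser.parse(LOWER_CASE_FIELDS.splitlines())

# The path is compared exactly, so only the matching case is excluded.
upper_ok = case_parser.can_fetch("ExampleBot", "http://www.example.com/Private/notes.html")
lower_ok = case_parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html")

print(upper_ok, lower_ok)
```

The lower-case field names are still recognised, but /private/ is not covered by a rule written for /Private/.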

See also: Robot Exclusion Metatag