You can use a robots text file to block a search engine spider from crawling your Web site or a part of your site. For instance, you may have a development version of your Web site where you work on changes and additions to test them before they become part of your live Web site. You don't want search engines to index this “in-progress” copy of your Web site because that would cause a duplicate-content conflict with your actual Web site. You also wouldn’t want users to find your in-progress pages. So you need to block the search engines from seeing those pages.
The robots text file’s job is to give the search engines instructions on what not to spider within your Web site. This is a simple text file that you can create using a program like Notepad, and then save with the filename robots.txt. Place the file at the root of your Web site (such as www.yourdomain.com/robots.txt), which is where the spiders expect to find it. In fact, whenever the search engine spiders come to your site, the first thing they look for is your robots text file. This is why you should always have a robots text file on your site, even if it’s blank. You don’t want the spiders’ first impression of your site to be a 404 error (the error that comes up when a file cannot be located).
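For instance, the simplest robots text file you could post, one that welcomes every spider to crawl everything, is just two lines (an empty Disallow: line means nothing is excluded):

User-agent: *
Disallow: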
With a robots text file, you can selectively exclude particular pages, directories, or the entire site. You have to write the commands just so, or the spiders ignore them. The command syntax you need to use comes from the Robots Exclusion Protocol (REP), which is a standard protocol for all Web sites. And it’s very exact; only specific commands are allowed, and they must be written correctly with specific placement, uppercase/lowercase letters, punctuation, and spacing. This file is one place where you don’t want your Webmaster getting creative.
A very simple robots text file could look like this:
User-agent: *
Disallow: /personal/
This robots text file tells all search engine robots that they’re welcome to crawl anywhere on your Web site except for the directory named /personal/.
Before writing a command line (such as Disallow: /personal/), you first have to identify which robot(s) you’re addressing. In this case, the line User-agent: * addresses all robots because it uses an asterisk, the wildcard character that stands in for anything (here, any robot’s name). If you want to give different instructions to different search engines, as many sites do, write separate User-agent lines followed by their specific command lines. In each User-agent: line, you would replace the asterisk (*) character with the name of a specific robot:
User-agent: Googlebot would get Google’s attention.
User-agent: Slurp would address Yahoo!.
User-agent: MSNBot would address Microsoft Live Search.
Note that if your robots text file has User-agent: * instructions as well as another User-agent: line specifying a specific robot, the specific robot follows the commands you gave it individually instead of the more general instructions.
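For example, a robots text file that singles out Google’s spider while giving every other robot a general rule might look like the following sketch (the directory names are only placeholders):

User-agent: Googlebot
Disallow: /test-pages/

User-agent: *
Disallow: /personal/

Here, Googlebot stays out of /test-pages/ only (it ignores the general section), while every other robot stays out of /personal/ only.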
You can type just a few different commands into a robots.txt file:
Excluding the whole site. To exclude the robot from the entire server, you use the command:
Disallow: /
This command blocks the spiders from your entire server, which effectively removes all of your site’s Web pages from the search index, so be careful not to use it unless that is really what you want.
Excluding a directory. (A word of caution — usually, you want to be much more selective than excluding a whole directory.) To exclude a directory (including all of its contents and subdirectories), put it inside slashes:
Disallow: /personal/
Excluding a page. You can write a command to exclude just a particular page. Use a slash only at the beginning, and be sure to include the file extension at the end. Here’s an example:
Disallow: /private-file.htm
Directing the spiders to your site map. In addition to Disallow:, another command that’s useful for your SEO efforts tells the robot where to find your site map, the file that lists links to the pages throughout your site, like a table of contents:
Sitemap: http://www.yourdomain.com/sitemap.xml
In addition to the commands just listed, Google also recognizes an Allow command. Because Allow applies to Google only and may confuse other search engines, you should generally avoid using it.
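For reference only, a Google-specific section using Allow typically reopens a single file inside a directory that is otherwise blocked; the directory and file names below are just placeholders:

User-agent: Googlebot
Disallow: /personal/
Allow: /personal/public-bio.htm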
You should always include a Sitemap: command line at the end of your robots text file. This ensures that the robots find your site map, which helps them navigate more fully through your site so that more of your site gets indexed.
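Putting it all together, a complete robots text file might look something like this (again, the directory, file, and domain names are placeholders for your own):

User-agent: *
Disallow: /personal/
Disallow: /in-progress/
Disallow: /private-file.htm

Sitemap: http://www.yourdomain.com/sitemap.xml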
A few notes about the robots text file syntax:
The commands are case-sensitive, so you need a capital D in Disallow.
There should always be a space following the colon after the command.
To exclude an entire directory, put a forward slash after as well as before the directory name.
If you are running on a UNIX machine, everything is case-sensitive.
All files not specifically excluded are available for spidering and indexing.
To see a complete list of the commands, robot names, and instructions about writing robots text files, go to the Web Robots Pages (www.robotstxt.org).
As a further safeguard, make it part of your weekly site maintenance to check your robots text file. It’s such a powerful on/off switch for your site’s SEO efforts that it merits a regular peek to make sure it’s still “on” and functioning properly.