Robots.txt
From Helpful
The short version is that you can request specific (or all) robots not to visit URLs and directories.
It lets you opt out of robots' basic find-everything behaviour.
Potential uses
The main practical uses for robots.txt are probably:
- preventing unfinished work or relatively private data from appearing in searches - an easier alternative to making sure you have no links to it (though at the same time it's a security issue in that you reveal where your relatively private data sits)
- preventing spiders from wasting bandwidth on things that are pointless to index, such as:
- temporary directories
- caches
- scripts that link to themselves (with different parameter values) and so may easily generate thousands of distinct pages. Note that nofollow is an easy first or additional step for this case.
- very large files (e.g. where you may want to put up downloads, raw originals, and such for only interested people)
- selectively disallowing spiders, such as download programs (BlackWidow, wget, etc.), the Wayback Machine, or search engines.
A spider will commonly take a few days to notice a change in robots.txt and propagate it throughout its (often distributed) setup.
When you want blocks to be temporary - for example, when you're putting something online but want to polish it before a search engine first indexes it - you should look at other options (say, user auth or IP filtering). Spiders expect robots.txt not to change much, so they may notice new content that was linked to somewhere before they notice the new restrictions.
(This slow re-checking is not necessarily a bad thing. The Google spider tries to be nice to your site: if it notices something is barely or never updated, it will re-fetch it less often and so waste less of your bandwidth.)
One alternative is Apache's mod_rewrite (see htaccess), which can do user-agent checks (and other checks, including various ones that robots.txt can't specify), with more control over which URLs they apply to. However, it is more complex, and in some cases may not be allowed by your hoster.
Logic
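The first (applicable_user-agent, applicable_disallow) pair applies, and stops further processing. In other words, a robot takes the first User-agent record that applies to it, and within that record the first Disallow that applies to the URL.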
About Disallow:
- You get to specify one path per Disallow; use multiple Disallows if you want to disallow a list of things
- Wildcards are not supported, but:
- Disallow: / means disallow all
- Disallow: (no value) means allow all
- Strings act as 'starts with' strings, so /index would block both /index/ as a directory and /index.html
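The 'starts with' behaviour of Disallow can be sketched in a few lines of Python. This only illustrates the matching rule and is not a full parser; the function name is_disallowed is made up for this example, and treating an empty value as 'blocks nothing' is an assumption about how best to model the allow-all case.

```python
def is_disallowed(path, disallow_values):
    """Return True if any Disallow value is a prefix of the given URL path.

    Hypothetical helper for illustration only.
    """
    for value in disallow_values:
        if value == "":
            # Assumption: an empty Disallow blocks nothing (allow all)
            continue
        if path.startswith(value):
            # Plain prefix match, no wildcard handling
            return True
    return False


# /index blocks both the directory and the file that start with it
print(is_disallowed("/index/", ["/index"]))      # True
print(is_disallowed("/index.html", ["/index"]))  # True
print(is_disallowed("/about.html", ["/index"]))  # False
# "/" is a prefix of every path, so it disallows everything
print(is_disallowed("/anything", ["/"]))         # True
```

Because the match is a plain prefix test, adding a trailing slash (/index/ rather than /index) is what restricts a rule to a directory.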
About User-agent:
- Wildcards are supported; * applies to all robots, which is nicely restrictive. Name robots individually if you only want to block a few annoying bots (that are still nice enough to respect robots.txt)
- User agent names should be interpreted case-insensitively
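A rough sketch of how a robot might pick the record that applies to it, following the first-match and case-insensitive rules described here. The function pick_record and the list-of-tuples layout are invented for this illustration, and the substring comparison is an assumption; real robots may match names differently.

```python
def pick_record(robot_name, records):
    """Return the Disallow values of the first record that applies.

    records is a list of (agent_names, disallow_values) in file order.
    Hypothetical helper for illustration only.
    """
    name = robot_name.lower()
    for agents, disallows in records:
        for agent in agents:
            # "*" applies to every robot; otherwise compare case-insensitively
            if agent == "*" or agent.lower() in name:
                return disallows
    # No record applies: the default policy is to allow everything
    return []


records = [
    (["Googlebot"], ["/dynamic.html"]),
    (["*"], ["/cache/", "/tmp/"]),
]

print(pick_record("googlebot", records))  # ['/dynamic.html']
print(pick_record("msnbot", records))     # ['/cache/', '/tmp/']
```

Since the first applicable record wins, a catch-all * record placed first in this sketch would shadow every record after it.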
Further notes:
- The default if-no-rules-match policy is to allow, but a catch-all disallow at the end is possible.
- The disallow-all and allow-all values, combined with first-match processing, mean you can both whitelist and blacklist robots.
- Googlebot has some extensions, including path wildcards and an Allow directive, but do not count on other robots understanding them.
- Don't list secrets. People with bad intentions may look at your robots.txt just to find interesting things.
- The robots.txt file is checked only every now and then, so robots do not adapt to it immediately. How fast depends on the bot. Waiting a week or so before you see a change in bot behaviour isn't too strange.
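The whitelist and blacklist patterns mentioned in these notes can be tried out with Python's standard urllib.robotparser module. This is a minimal sketch: the two rule sets, the name SomeOtherBot, and the example.com URL are made up for illustration, and the results show how this one parser interprets the rules, which other robots need not follow exactly.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Whitelist: only Googlebot may visit, everyone else is shut out
WHITELIST = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Blacklist: only BlackWidow is shut out, everyone else may visit
BLACKLIST = """\
User-agent: BlackWidow
Disallow: /

User-agent: *
Disallow:
"""

whitelist = parse_rules(WHITELIST)
blacklist = parse_rules(BLACKLIST)
url = "http://example.com/page.html"  # hypothetical URL

print(whitelist.can_fetch("Googlebot", url))     # True
print(whitelist.can_fetch("SomeOtherBot", url))  # False
print(blacklist.can_fetch("BlackWidow", url))    # False
print(blacklist.can_fetch("Googlebot", url))     # True
```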
Examples
Some user agents (bots):
- Media bots: Googlebot-Image, yahoo-mmcrawler, psbot, etc.
- General bots: googlebot, msnbot, yahoo-slurp, teoma, Scooter, etc.
Some disallows:
    User-agent: Googlebot
    Disallow: /dynamic.html

    #Nothing should index...
    User-agent: *
    #...volatile things,
    Disallow: /cache/
    Disallow: /tmp/
    # dynamic apps when they generate almost infinite links to themselves
    Disallow: /cgi-bin/linker
    #...or development, if accidentally linked to somewhere (lines copied from somewhere)
    Disallow: /_borders/
    Disallow: /_derived/
    Disallow: /_fpclass/
    Disallow: /_overlay/
    Disallow: /_private/
    Disallow: /_themes/
    Disallow: /_vti_bin/
    Disallow: /_vti_cnf/
    Disallow: /_vti_log/
    Disallow: /_vti_map/
    Disallow: /_vti_pvt/
    Disallow: /_vti_txt/
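One way to sanity-check a file like the one above before publishing it is to feed it to Python's standard urllib.robotparser and ask what each robot may fetch. The sketch below uses an abbreviated copy of the example; the example.com host, the tested paths, and the helper name allowed are assumptions made for illustration, and the answers reflect this parser's interpretation only.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the example rules
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /dynamic.html

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /cgi-bin/linker
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parser whether agent may fetch path on a hypothetical host."""
    return parser.can_fetch(agent, "http://example.com" + path)


checks = [
    ("msnbot", "/cache/page.html"),
    ("msnbot", "/dynamic.html"),
    ("Googlebot", "/dynamic.html"),
    ("Googlebot", "/cache/page.html"),
]
for agent, path in checks:
    print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

Under this parser, Googlebot matches its own record and never consults the * record, so /cache/ stays open to it. If the catch-all rules should also bind Googlebot, they would have to be repeated in its record.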

