Robots.txt


The short version is that you can request specific (or all) robots not to visit URLs and directories.

It lets you opt out of robots' basic find-everything behaviour.


Potential uses

The main practical uses for robots.txt are probably:

  • preventing unfinished work or relatively private data from appearing in searches. This is easier than making sure nothing links to it, though it is also a security issue, since the file reveals where that relatively private data sits.
  • preventing spiders from wasting bandwidth on things that are pointless to index, such as:
    • temporary directories
    • caches
    • scripts that link to themselves (with different parameter values) and so can easily generate thousands of distinct pages. Note that nofollow is an easy first or additional step for this case.
    • very large files (e.g. downloads, raw originals, and such that you put up only for interested people)
  • selectively disallowing particular spiders, such as download programs (BlackWidow, wget, etc.), the Wayback Machine, or specific search engines.


A spider will commonly take a few days to notice a change in robots.txt and propagate it throughout its (often distributed) setup.

When you want a block to be temporary (for example, when you're putting something online but want to polish it before search engines first index it), you should look at other options, say user authentication or IP filtering. Spiders expect robots.txt not to change much, so they may notice new content that was linked to somewhere before they notice the new restrictions.


Infrequent re-checking is not necessarily a bad thing. The Google spider tries to be nice to your site: if it notices something is barely or never updated, it re-fetches it less often, and so wastes less of your bandwidth.
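To see why a new restriction can go unnoticed for a while, here is a minimal sketch of a robot that caches robots.txt and only re-reads it after a fixed interval. It is an illustration, not how any particular spider works: the CachedRobots class, the max_age interval, the bot name anybot and the /draft/ path are all made up, and the clock is faked so that no waiting or network access is needed.

```python
import time
from urllib.robotparser import RobotFileParser

class CachedRobots:
    """Keeps a parsed robots.txt and only re-reads it after max_age seconds."""

    def __init__(self, fetch_text, max_age=7 * 24 * 3600, clock=time.time):
        self.fetch_text = fetch_text  # callable returning the robots.txt text
        self.max_age = max_age        # hypothetical refresh interval, in seconds
        self.clock = clock
        self.parser = None
        self.fetched_at = None

    def can_fetch(self, agent, url):
        now = self.clock()
        if self.parser is None or now - self.fetched_at >= self.max_age:
            # Cached copy is missing or too old: read and parse it again.
            self.parser = RobotFileParser()
            self.parser.parse(self.fetch_text().splitlines())
            self.fetched_at = now
        return self.parser.can_fetch(agent, url)

# Simulated site and clock (assumptions for the demo).
site = {"robots": "User-agent: *\nDisallow:\n"}
now = [0.0]
robots = CachedRobots(lambda: site["robots"], max_age=3600, clock=lambda: now[0])

before = robots.can_fetch("anybot", "/draft/")   # nothing is blocked yet
site["robots"] = "User-agent: *\nDisallow: /draft/\n"
stale = robots.can_fetch("anybot", "/draft/")    # cached copy is still in use
now[0] += 3600
fresh = robots.can_fetch("anybot", "/draft/")    # re-read after max_age
print(before, stale, fresh)
```

Between the change and the next refresh, the robot still acts on the old rules, which is the window in which freshly linked content can be fetched despite the new Disallow.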


One alternative is Apache's mod_rewrite (see htaccess), which can do user agent checks (and others, including various ones that robots.txt can't specify), with more control over which URLs they apply to. It is more complex, though, and in some cases may not be allowed by your hoster.

Logic

A robot uses the first User-agent record that applies to it, and within that record the first Disallow that matches the URL. That match decides, and processing stops there.


About Disallow:

  • You get to specify one path per Disallow; use multiple Disallows if you want to disallow a list of things
  • Wildcards are not supported, but:
    • Disallow: /
      means disallow all
    • Disallow:
      (no value) means allow all
    • Values act as 'starts with' prefixes, so /index would block both the directory /index/ and the file /index.html
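The Disallow behaviour described above can be checked with Python's standard urllib.robotparser module. This is a minimal sketch: the parser_for helper, the bot name anybot and the paths other than /index are made up for illustration, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

def parser_for(text):
    """Build a parser from robots.txt text, without fetching anything."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

# A Disallow value is matched as a plain 'starts with' prefix.
prefix_rules = parser_for("User-agent: *\nDisallow: /index\n")
blocked_dir = prefix_rules.can_fetch("anybot", "/index/")       # blocked
blocked_page = prefix_rules.can_fetch("anybot", "/index.html")  # blocked
open_page = prefix_rules.can_fetch("anybot", "/about.html")     # allowed

# 'Disallow: /' blocks everything, an empty 'Disallow:' blocks nothing.
deny_all = parser_for("User-agent: *\nDisallow: /\n")
allow_all = parser_for("User-agent: *\nDisallow:\n")
denied = deny_all.can_fetch("anybot", "/x")
allowed = allow_all.can_fetch("anybot", "/x")

print(blocked_dir, blocked_page, open_page, denied, allowed)
```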


About User-agent:

  • One wildcard is supported: a lone * matches every robot. That is usually all you need, unless you want to single out a few annoying bots (that are still nice enough to respect robots.txt)
  • User agent names should be interpreted case-insensitively
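A short sketch of how user agent matching plays out, again using Python's standard urllib.robotparser. The bot names ImageBot and OtherBot and the paths are hypothetical. Note that this particular parser also strips a version suffix such as /2.0 from the name before comparing, which other implementations may not do.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: ImageBot
Disallow: /photos/

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Names are compared case-insensitively.
lower = rp.can_fetch("imagebot", "/photos/cat.jpg")
upper = rp.can_fetch("IMAGEBOT/2.0", "/photos/cat.jpg")

# A robot without a record of its own falls back to the '*' record.
other = rp.can_fetch("OtherBot", "/photos/cat.jpg")

# ImageBot has its own record, so the '*' record is never consulted for it.
image_tmp = rp.can_fetch("ImageBot", "/tmp/scratch.html")

print(lower, upper, other, image_tmp)
```

The last check is the one that tends to surprise: a robot with its own record is not bound by the rules under *, so anything every robot should avoid has to be repeated in each named record.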


Further notes:

  • The default if-no-rules-match policy is to allow, but a catch-all disallow at the end is possible.
  • The disallow-all and allow-all forms, combined with first-match processing, mean you can both whitelist and blacklist robots.
  • Googlebot supports some extensions, including path wildcards and an Allow directive. Don't assume that robots which follow only the original standard will honour them.
  • Don't list secrets. People with bad intentions may look at your robots.txt just to find interesting things.
  • Robots check robots.txt only every now and then, so they do not adapt to changes immediately. How fast depends on the bot; a week or so before you see a change in bot behaviour isn't too strange.
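Whitelisting and blacklisting can be sketched as two small rule sets and checked with Python's standard urllib.robotparser. GoodBot, BadBot and /page.html are made-up names, and the parse_rules helper is only there to keep the example short.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt text directly, without any network access."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

# Whitelist: one named robot may fetch everything, everyone else nothing.
whitelist = parse_rules("""\
User-agent: GoodBot
Disallow:

User-agent: *
Disallow: /
""")

# Blacklist: one named robot is shut out, everyone else is unrestricted.
blacklist = parse_rules("""\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
""")

results = {
    (name, agent): rp.can_fetch(agent, "/page.html")
    for name, rp in (("whitelist", whitelist), ("blacklist", blacklist))
    for agent in ("GoodBot", "BadBot")
}
for key, allowed in results.items():
    print(key, allowed)
```

In both cases the named record is matched first and decides the outcome; the * record only applies to robots that have no record of their own.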

Examples

Some user agents (bots):

  • Media bots: Googlebot-Image, yahoo-mmcrawler, psbot, etc.
  • General bots: googlebot, msnbot, yahoo-slurp, teoma, Scooter, etc.


Some disallows:

User-agent: Googlebot
Disallow: /dynamic.html

# Nothing should index...
User-agent: *
# ...volatile things,
Disallow: /cache/
Disallow: /tmp/
# ...dynamic apps that generate almost infinite links to themselves,
Disallow: /cgi-bin/linker
# ...or development directories, if accidentally linked to somewhere (lines copied from somewhere)
Disallow: /_borders/
Disallow: /_derived/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_map/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/


See also