File: wget.info,  Node: Robots,  Next: Security Considerations,  Prev: Appendices,  Up: Appendices

Robots
======

   It is extremely easy to make Wget wander aimlessly around a web site,
sucking up all the available data as it goes.  `wget -r SITE', and you're
set.  Great?  Not for the server admin.

   While Wget is retrieving static pages, there's not much of a problem.
But for Wget, there is no real difference between a static page and the
most demanding CGI.  For instance, a site I know has a section handled
by an, uh, "bitchin'" CGI script that converts all the Info files to
HTML.  The script can and does bring the machine to its knees without
providing anything useful to the downloader.

   For such cases, various robot exclusion schemes have been devised as
a means for server administrators and document authors to protect chosen
portions of their sites from wandering robots.

   The more popular mechanism is the "Robots Exclusion Standard", or
RES, written by Martijn Koster et al. in 1994.  It specifies the format
of a text file containing directives that instruct the robots which URL
paths to avoid.  To be found by the robots, the specifications must be
placed in `/robots.txt' in the server root, which the robots are
supposed to download and parse.
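
   As a brief illustration, here is what such a file might look like.
This is a made-up example, not taken from any real site; each record
names a user agent and the URL paths it is asked to avoid:

     User-agent: Wget
     Disallow: /cgi-bin/
     Disallow: /tmp/

     User-agent: *
     Disallow: /private/

   A robot that identifies itself as `Wget' would obey the first record
and skip `/cgi-bin/' and `/tmp/'; any other robot would fall back to the
`*' record and skip `/private/'.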

   Wget supports RES when downloading recursively.  So, when you issue:

     wget -r http://www.server.com/

   First, the index of `www.server.com' will be downloaded.  If Wget
finds that it wants to download more documents from that server, it will
request `http://www.server.com/robots.txt' and, if found, use it for
further downloads.  `robots.txt' is loaded only once per server.

   Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
<http://www.robotstxt.org/wc/norobots.html>.  As of version 1.8, Wget
has supported the additional directives specified in the internet draft
`<draft-koster-robots-00.txt>' titled "A Method for Web Robots
Control".  The draft, which has as far as I know never made to an RFC,
is available at <http://www.robotstxt.org/wc/norobots-rfc.txt>.
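
   One of the additions in that draft is an `Allow' directive, which
lets a record re-open a subtree of an otherwise disallowed path.  A
made-up example:

     User-agent: *
     Allow: /docs/public/
     Disallow: /docs/

   Since the draft matches the lines in order and uses the first match,
listing the `Allow' line first keeps `/docs/public/' reachable while the
rest of `/docs/' is skipped.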

   This manual no longer includes the text of the Robot Exclusion
Standard.

   The second, lesser known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot.  This is achieved using the `META' tag, like this:

     <meta name="robots" content="nofollow">

   This is explained in some detail at
<http://www.robotstxt.org/wc/meta-user.html>.  Wget supports this
method of robot exclusion in addition to the usual `/robots.txt'
exclusion.
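
   For completeness, here is how the tag might sit in a page; this is a
made-up document, and the `content' attribute can combine several
directives, of which `nofollow' is the one that concerns link-following
robots like Wget:

     <html>
       <head>
         <title>Example page</title>
         <meta name="robots" content="noindex,nofollow">
       </head>
       <body>
         ...
       </body>
     </html>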

