Parser for robots.txt
=====================
Accepts as input a list of lines or a URL that refers to a `robots.txt'
file, parses that input, builds a set of rules from it, and answers
questions about the fetchability of other URLs.
This manual section was written by Skip Montanaro <skip@mojam.com>.
This module provides a single class, `RobotFileParser', which answers
questions about whether or not a particular user agent can fetch a URL
on the web site that published the `robots.txt' file. For more details
on the structure of `robots.txt' files, see
<http://info.webcrawler.com/mak/projects/robots/norobots.html>.
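Briefly, a `robots.txt' file consists of groups of `User-agent' and
`Disallow' lines; a `Disallow' line names a URL path prefix that the
matching robots should not fetch.  The short sample below is purely
illustrative (the paths are made up):

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

Such a file asks every robot to avoid URLs whose paths start with
`/cgi-bin/' or `/private/'.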
`RobotFileParser()'
This class provides a set of methods to read, parse and answer
questions about a single `robots.txt' file.
`set_url(url)'
Sets the URL referring to a `robots.txt' file.
`read()'
Reads the `robots.txt' URL and feeds it to the parser.
`parse(lines)'
Parses the LINES argument, which should be a list of lines from
a `robots.txt' file (see the example immediately following this
list).
`can_fetch(useragent, url)'
Returns true if the USERAGENT is allowed to fetch the URL
according to the rules contained in the parsed `robots.txt'
file.
`mtime()'
Returns the time the `robots.txt' file was last fetched.
This is useful for long-running web spiders that need to
check for new `robots.txt' files periodically (the last
example below sketches this pattern).
`modified()'
Sets the time the `robots.txt' file was last fetched to the
current time.
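As a sketch of feeding the parser directly, the following example
passes `parse()' a list of lines instead of calling `read()'.  The
`robots.txt' contents and the host name are made up for illustration.

>>> import robotparser
>>> lines = ["User-agent: *",
...          "Disallow: /private/"]
>>> rp = robotparser.RobotFileParser()
>>> rp.parse(lines)
>>> rp.can_fetch("*", "http://www.example.com/private/page.html")
0
>>> rp.can_fetch("*", "http://www.example.com/index.html")
1

(Newer versions of the module report these results as False and True
rather than 0 and 1.)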
The following example demonstrates basic use of the `RobotFileParser'
class.
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
0
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
1
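For a long-running spider, `mtime()' and `modified()' can be combined
to refresh the cached rules periodically.  The sketch below assumes a
one-hour refresh interval; the helper names `refresh' and
`fetch_allowed' and the interval are illustrative, not part of the
module.

import robotparser
import time

REFRESH_INTERVAL = 60 * 60    # illustrative: re-fetch once an hour

rp = robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")

def refresh():
    # Fetch the robots.txt file and record when we did so.
    rp.read()
    rp.modified()

def fetch_allowed(useragent, url):
    # Re-fetch the rules if the cached copy is older than the
    # refresh interval; mtime() reports the time of the last fetch.
    if time.time() - rp.mtime() > REFRESH_INTERVAL:
        refresh()
    return rp.can_fetch(useragent, url)

refresh()
allowed = fetch_allowed("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")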