robotparser -- Parser for robots.txt
====================================

Accepts as input a list of lines or a URL that refers to a
`robots.txt' file, parses the file, then builds a set of rules from
that list and answers questions about fetchability of other URLs.

This manual section was written by Skip Montanaro <skip@mojam.com>.

This module provides a single class, `RobotFileParser', which answers
questions about whether or not a particular user agent can fetch a URL
on the web site that published the `robots.txt' file.  For more
details on the structure of `robots.txt' files, see
<http://info.webcrawler.com/mak/projects/robots/norobots.html>.

`RobotFileParser()'
     This class provides a set of methods to read, parse and answer
     questions about a single `robots.txt' file.

    `set_url(url)'
          Sets the URL referring to a `robots.txt' file.

    `read()'
          Reads the `robots.txt' URL and feeds it to the parser.

    `parse(lines)'
          Parses the lines argument.

    `can_fetch(useragent, url)'
          Returns true if the USERAGENT is allowed to fetch the URL
          according to the rules contained in the parsed `robots.txt'
          file.

    `mtime()'
          Returns the time the `robots.txt' file was last fetched.
          This is useful for long-running web spiders that need to
          check for new `robots.txt' files periodically.

    `modified()'
          Sets the time the `robots.txt' file was last fetched to the
          current time.

The following example demonstrates basic use of the `RobotFileParser'
class.

     >>> import robotparser
     >>> rp = robotparser.RobotFileParser()
     >>> rp.set_url("http://www.musi-cal.com/robots.txt")
     >>> rp.read()
     >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
     0
     >>> rp.can_fetch("*", "http://www.musi-cal.com/")
     1
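
As a further illustration, a spider that already holds the
`robots.txt' text in memory can feed it to the parser with `parse()',
and can use `modified()' and `mtime()' to track when the rules were
last refreshed.  The user agent name `ExampleBot', the host
`www.example.com' and the rule lines below are invented for this
sketch; the trailing newlines mimic lines as returned by
`readlines()'.

     >>> import robotparser, time
     >>> rp = robotparser.RobotFileParser()
     >>> lines = ["User-agent: *\n", "Disallow: /private/\n"]
     >>> rp.parse(lines)        # build rules from lines already in memory
     >>> rp.modified()          # record the current time as the fetch time
     >>> rp.can_fetch("ExampleBot", "http://www.example.com/private/data.html")
     0
     >>> rp.can_fetch("ExampleBot", "http://www.example.com/public.html")
     1
     >>> time.time() - rp.mtime() < 60   # fetched less than a minute ago
     1

Note that, as in the example above, `can_fetch()' reports its answer
as the integers 1 and 0.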