Info Node: (python2.1-lib.info)robotparser

Parser for robots.txt
=====================

Accepts as input either a list of lines or a URL that refers to a
`robots.txt' file, parses that input, builds a set of rules from it,
and answers questions about whether other URLs may be fetched.

This manual section was written by Skip Montanaro <skip@mojam.com>.

This module provides a single class, `RobotFileParser', which answers
questions about whether or not a particular user agent can fetch a URL
on the web site that published the `robots.txt' file.  For more details
on the structure of `robots.txt' files, see
<http://info.webcrawler.com/mak/projects/robots/norobots.html>.

`RobotFileParser()'
     This class provides a set of methods to read, parse and answer
     questions about a single `robots.txt' file.

    `set_url(url)'
          Sets the URL referring to a `robots.txt' file.

    `read()'
          Reads the `robots.txt' URL and feeds it to the parser.

    `parse(lines)'
          Parses LINES, a list of lines from a `robots.txt' file.

    `can_fetch(useragent, url)'
          Returns true if the USERAGENT is allowed to fetch the URL
          according to the rules contained in the parsed `robots.txt'
          file.

    `mtime()'
          Returns the time the `robots.txt' file was last fetched.
          This is useful for long-running web spiders that need to
          check for new `robots.txt' files periodically.

    `modified()'
          Sets the time the `robots.txt' file was last fetched to the
          current time.

The following example demonstrates basic use of the `RobotFileParser'
class.

     >>> import robotparser
     >>> rp = robotparser.RobotFileParser()
     >>> rp.set_url("http://www.musi-cal.com/robots.txt")
     >>> rp.read()
     >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
     0
     >>> rp.can_fetch("*", "http://www.musi-cal.com/")
     1

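The `parse(lines)' method can also be used without any network access,
which may be convenient for testing.  The sketch below is not part of
the original manual: the rules and the `www.example.com' URLs are
made-up placeholders, and the `try'/`except' import is an assumption
that covers later Python versions, where the module is available as
`urllib.robotparser'.

```python
# Import the parser under whichever name this Python version provides.
try:
    from urllib import robotparser  # Python 3 location of the module
except ImportError:
    import robotparser  # name used in this manual (Python 2)

# Hypothetical robots.txt content, already split into lines.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # build the rule set from the list of lines

# Ask whether any user agent ("*") may fetch each URL.
allowed_home = rp.can_fetch("*", "http://www.example.com/")
allowed_search = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")

print(allowed_home)    # true value: "/" is not under /cgi-bin/
print(allowed_search)  # false value: the path starts with /cgi-bin/
```

Older releases return 1 and 0 here, as in the session above, while
later ones return `True' and `False'; testing the result for truth
works in both cases.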

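For a long-running spider, `mtime()' and `modified()' can be combined
into a simple staleness check.  This is only a sketch under stated
assumptions: the helper `needs_refresh', the constant
`MAX_AGE_SECONDS' and the one-day interval are hypothetical names and
values chosen for illustration, and the rules are parsed from a list
so that the example runs without network access.

```python
import time

try:
    from urllib import robotparser  # Python 3 location of the module
except ImportError:
    import robotparser  # name used in this manual (Python 2)

MAX_AGE_SECONDS = 24 * 60 * 60  # hypothetical refresh interval: one day


def needs_refresh(rp, max_age=MAX_AGE_SECONDS, now=None):
    """Return true if the rules held by rp are older than max_age seconds.

    Hypothetical helper; `now` can be passed in to simulate a later time.
    """
    if now is None:
        now = time.time()
    return now - rp.mtime() > max_age


rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()  # record the current time as the time of the last fetch

fresh = needs_refresh(rp)
# Simulate a check made just over one day after the last fetch.
stale = needs_refresh(rp, now=rp.mtime() + MAX_AGE_SECONDS + 1)

print(fresh)  # False: the rules were recorded a moment ago
print(stale)  # True: the simulated clock is past the refresh interval
```

In a real spider, a true result would typically be followed by another
call to `read()' and then `modified()' to record the new fetch time.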