A parser for XML documents
==========================
This module was written by Sjoerd Mullender
<Sjoerd.Mullender@cwi.nl>, who also wrote this manual section.
_This module is deprecated as of Python 2.0. Use `xml.sax' instead.
The newer XML package includes full support for XML 1.0._
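Because the note above points to `xml.sax', here is a minimal sketch of the same event-driven style in that package. The class name `EventCollector', the helper `collect_events()' and the sample document are hypothetical and chosen only for illustration.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records parse events in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a mapping of attribute names to values.
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Character data may be delivered in more than one chunk.
        self.events.append(("data", content))


def collect_events(document):
    """Parse a complete document and return the recorded events."""
    handler = EventCollector()
    xml.sax.parseString(document, handler)
    return handler.events


events = collect_events(b'<A HREF="http://www.cwi.nl/">CWI</A>')
```

The handler methods play roughly the role that the start tag, end tag and data handlers play in this module, although the two interfaces are not interchangeable.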
_Changed in Python version 1.5.2_
This module defines a class `XMLParser' which serves as the basis for
parsing text files formatted in XML (Extensible Markup Language).
`XMLParser()'
The `XMLParser' class must be instantiated without arguments.(1)
This class provides the following interface methods and instance
variables:
`attributes'
A mapping of element names to mappings. Each inner mapping maps
the attribute names that are valid for the element to the default
value of the attribute, or to `None' if there is no default. The
default value is the empty dictionary. This variable is meant to
be overridden, not extended in place, since the default is shared
by all instances of `XMLParser'.
`elements'
A mapping of element names to tuples. Each tuple contains two
functions, one for handling the start tag and one for handling the
end tag of the element, in that order. Either may be `None', in
which case `unknown_starttag()' or `unknown_endtag()' is called
instead. The default value is the empty dictionary. This variable
is meant to be overridden, not extended in place, since the default
is shared by all instances of `XMLParser'.
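The advice to override rather than extend these mappings follows from how class-level defaults are shared in Python. The sketch below uses a hypothetical stand-in class, `ParserBase', rather than `XMLParser' itself, to show the difference.

```python
class ParserBase:
    """Hypothetical stand-in for a class with a shared default mapping."""

    attributes = {}


class LinkParser(ParserBase):
    # Overriding: this subclass gets its own mapping.
    attributes = {"A": {"HREF": None}}


first = ParserBase()
second = ParserBase()

# Extending in place mutates the one dictionary every instance shares.
first.attributes["IMG"] = {"SRC": None}
leaked = "IMG" in second.attributes

# The override leaves the base class default untouched.
isolated = "A" not in ParserBase.attributes
```

Here `leaked' is true because both instances look up the same class-level dictionary, while the subclass override stays separate from the base class default.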
`entitydefs'
A mapping of entity names to their values. The default value
contains definitions for `'lt'', `'gt'', `'amp'', `'quot'', and
`'apos''.
`reset()'
Reset the instance. Any unprocessed data is lost. This method is
called implicitly at instantiation time.
`setnomoretags()'
Stop processing tags. Treat all following input as literal input
(CDATA).
`setliteral()'
Enter literal mode (CDATA mode). This mode is automatically exited
when the close tag matching the last unclosed open tag is
encountered.
`feed(data)'
Feed some text to the parser. It is processed insofar as it
consists of complete tags; incomplete data is buffered until more
data is fed or `close()' is called.
`close()'
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class
to define additional processing at the end of the input, but the
redefined version should always call the base class `close()'.
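The same feed-then-close pattern is available in `xml.sax', the package recommended in place of this module. The following sketch uses that package, not `XMLParser'; the names `TextCollector' and `parse_in_pieces()' are hypothetical.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_in_pieces(pieces):
    """Feed a document piece by piece and return its text content."""
    parser = xml.sax.make_parser()
    handler = TextCollector()
    parser.setContentHandler(handler)
    for piece in pieces:
        # Incomplete tags are buffered until later data completes them.
        parser.feed(piece)
    # Signal the end of the input so buffered data is processed.
    parser.close()
    return "".join(handler.chunks)


text = parse_in_pieces(["<doc>hel", "lo, wor", "ld</d", "oc>"])
```

The pieces deliberately split both the text and the end tag, so the result is only complete once `close()' has been called.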
`translate_references(data)'
Translate all entity and character references in DATA and return
the translated string.
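As a rough illustration of what translating references involves, the sketch below replaces character references and known entity references in a string. It is not this module's implementation: the regular expression, the `ENTITYDEFS' table (which maps the five predefined names to literal characters) and the choice to leave unknown entities untouched are all assumptions.

```python
import re

# Hypothetical table: the five predefined names mapped to literal characters.
ENTITYDEFS = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# Matches &#NNN;, &#xHHH; and &name; in a single pass.
_REFERENCE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([A-Za-z_:][-\w.:]*));")


def translate_references(data, entitydefs=ENTITYDEFS):
    """Return DATA with character and known entity references replaced."""

    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if decimal is not None:
            return chr(int(decimal))
        if hexadecimal is not None:
            return chr(int(hexadecimal, 16))
        # Unknown names are left untouched in this sketch.
        return entitydefs.get(name, match.group(0))

    return _REFERENCE.sub(replace, data)


translated = translate_references("a &lt; b &amp;&amp; c &#62; d &#x41; &unknown;")
```

Because the substitution is done in one pass, a sequence such as `&amp;lt;' becomes `&lt;' and is not decoded a second time.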
`getnamespace()'
Return a mapping of namespace abbreviations to namespace URIs that
are currently in effect.
`handle_xml(encoding, standalone)'
This method is called when the `<?xml ...?>' declaration is
processed. The arguments are the values of the encoding and
standalone attributes in the declaration. Both attributes are
optional; when they are absent, `handle_xml()' receives `None' and
the string `'no'' respectively.
`handle_doctype(tag, pubid, syslit, data)'
This method is called when the `<!DOCTYPE...>' declaration is
processed. The arguments are the tag name of the root element,
the Formal Public Identifier (or `None' if not specified), the
system identifier, and the uninterpreted contents of the internal
DTD subset as a string (or `None' if not present).
`handle_starttag(tag, method, attributes)'
This method is called to handle start tags for which a start tag
handler is defined in the instance variable `elements'. The TAG
argument is the name of the tag, and the METHOD argument is the
function (method) which should be used to support semantic
interpretation of the start tag. The ATTRIBUTES argument is a
dictionary of attributes, the key being the NAME and the value
being the VALUE of the attribute found inside the tag's `<>'
brackets. Character and entity references in the VALUE have been
interpreted. For instance, for the start tag `<A
HREF="http://www.cwi.nl/">', this method would be called as
`handle_starttag('A', self.elements['A'][0], {'HREF':
'http://www.cwi.nl/'})'. The base implementation simply calls
METHOD with ATTRIBUTES as the only argument.
`handle_endtag(tag, method)'
This method is called to handle end tags for which an end tag
handler is defined in the instance variable `elements'. The TAG
argument is the name of the tag, and the METHOD argument is the
function (method) which should be used to support semantic
interpretation of the end tag. For instance, for the end tag
`</A>', this method would be called as `handle_endtag('A',
self.elements['A'][1])'. The base implementation simply calls
METHOD.
`handle_data(data)'
This method is called to process arbitrary data. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
`handle_charref(ref)'
This method is called to process a character reference of the form
`&#REF;'. REF can either be a decimal number, or a hexadecimal
number when preceded by an `x'. In the base implementation, REF
must be a number in the range 0-255. The base implementation
converts the reference to the corresponding character and calls the
method `handle_data()' with that character as argument. If REF is
invalid or out of range, the method `unknown_charref(REF)' is
called to handle the error. A subclass must override this method
to provide support for character references outside of the ASCII
range.
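The rules for REF described above can be sketched as a small standalone function. This is an illustration only, not the module's code: `decode_charref()' is a hypothetical name, and returning `None' stands in for the call to `unknown_charref()'.

```python
def decode_charref(ref):
    """Return the character for REF, or None if REF is invalid.

    Mirrors the rules described in the text: decimal digits, or
    hexadecimal digits when prefixed with 'x', limited to 0-255.
    """
    try:
        if ref[:1] == "x":
            number = int(ref[1:], 16)
        else:
            number = int(ref, 10)
    except ValueError:
        # Not a number at all.
        return None
    if not 0 <= number <= 255:
        # Out of the range the base implementation accepts.
        return None
    return chr(number)


decimal_result = decode_charref("65")
hex_result = decode_charref("x41")
```

Both calls refer to the same character, once in decimal and once in hexadecimal notation.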
`handle_comment(comment)'
This method is called when a comment is encountered. The COMMENT
argument is a string containing the text between the `<!--' and
`-->' delimiters, but not the delimiters themselves. For example,
the comment `<!--text-->' will cause this method to be called with
the argument `'text''. The default method does nothing.
`handle_cdata(data)'
This method is called when a CDATA section is encountered. The
DATA argument is a string containing the text between the
`<![CDATA[' and `]]>' delimiters, but not the delimiters
themselves. For example, the section `<![CDATA[text]]>' will cause
this method to be called with the argument `'text''. The default
method does nothing, and is intended to be overridden.
`handle_proc(name, data)'
This method is called when a processing instruction (PI) is
encountered. The NAME is the PI target, and the DATA argument is
a string containing the text between the PI target and the closing
delimiter, but not the delimiter itself. For example, the
instruction `<?XML text?>' will cause this method to be called
with the arguments `'XML'' and `'text''. The default method does
nothing. Note that if a document starts with `<?xml ...?>',
`handle_xml()' is called to handle it.
`handle_special(data)'
This method is called when a declaration is encountered. The DATA
argument is a string containing the text between the `<!' and `>'
delimiters, but not the delimiters themselves. For example, the
entity declaration `<!ENTITY text>' will cause this method to be
called with the argument `'ENTITY text''. The default method does
nothing. Note that `<!DOCTYPE ...>' is handled separately if it
is located at the start of the document.
`syntax_error(message)'
This method is called when a syntax error is encountered. The
MESSAGE is a description of what was wrong. The default method
raises a `RuntimeError' exception. If this method is overridden,
it is permissible for it to return. This method is only called
when the error can be recovered from. Unrecoverable errors raise
a `RuntimeError' without first calling `syntax_error()'.
`unknown_starttag(tag, attributes)'
This method is called to process an unknown start tag. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
`unknown_endtag(tag)'
This method is called to process an unknown end tag. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
`unknown_charref(ref)'
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class;
the base class implementation does nothing.
`unknown_entityref(ref)'
This method is called to process an unknown entity reference. It
is intended to be overridden by a derived class; the base class
implementation calls `syntax_error()' to signal an error.
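The relationship between `elements', `handle_starttag()', `handle_endtag()' and the `unknown_*' methods can be modelled in a few lines. The class below is a hypothetical model of the dispatch described above, not the module's implementation; `DispatchSketch', `dispatch_start()', `dispatch_end()' and the `log' attribute are invented for the example.

```python
class DispatchSketch:
    """Hypothetical model of the tag dispatch described in the text."""

    def __init__(self):
        self.log = []
        # Maps element names to (start handler, end handler) tuples.
        self.elements = {"A": (self.start_a, self.end_a)}

    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def handle_starttag(self, tag, method, attributes):
        # Calls METHOD with ATTRIBUTES as the only argument.
        method(attributes)

    def handle_endtag(self, tag, method):
        # Calls METHOD without arguments.
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag, attributes))

    def unknown_endtag(self, tag):
        self.log.append(("unknown_endtag", tag))

    def dispatch_start(self, tag, attributes):
        method = self.elements.get(tag, (None, None))[0]
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def dispatch_end(self, tag):
        method = self.elements.get(tag, (None, None))[1]
        if method is None:
            self.unknown_endtag(tag)
        else:
            self.handle_endtag(tag, method)


sketch = DispatchSketch()
sketch.dispatch_start("A", {"HREF": "http://www.cwi.nl/"})
sketch.dispatch_end("A")
sketch.dispatch_start("B", {})
```

The `A' element has handlers registered and goes through `handle_starttag()' and `handle_endtag()', while the unregistered `B' element falls through to `unknown_starttag()'.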
See also:
`Extensible Markup Language (XML) 1.0'
The XML specification, published by the World Wide Web Consortium
(W3C), defines the syntax and processor requirements for XML.
References to additional material on XML, including translations
of the specification, are available at <http://www.w3.org/XML/>.
`Python and XML Processing'
The Python XML Topic Guide provides a great deal of information on
using XML from Python and links to other sources of information on
XML.
`SIG for XML Processing in Python'
The Python XML Special Interest Group is developing substantial
support for processing XML from Python.
---------- Footnotes ----------
(1) Actually, a number of keyword arguments are recognized which
influence the parser to accept certain non-standard constructs. The
default for all of them is `0' (false), except for the last one, for
which the default is `1' (true). The recognized keyword arguments are
ACCEPT_UNQUOTED_ATTRIBUTES (accept certain attribute values without
requiring quotes), ACCEPT_MISSING_ENDTAG_NAME (accept end tags that
look like `</>'), MAP_CASE (map upper case to lower case in tags and
attributes), ACCEPT_UTF8 (allow UTF-8 characters in input; this is
required according to the XML standard, but Python does not as yet
deal properly with these characters, so this is not the default), and
TRANSLATE_ATTRIBUTE_REFERENCES (translate character and entity
references in attribute values; set it to false to leave them
untranslated).