4.1 Description file

Before we take a closer look at the parser itself, we will describe the format of the description file, also known as the Home Page document (home.html by default, but the name can be changed). On a Unix/Linux system this file is stored by default in $HOME/.plucker.

OS/2 uses the HOME environment variable to find the location of your home directory (the path may include a drive letter). The installer should set the necessary environment variable for you and also add the necessary directories to your system. You can check the location by typing set home at a command prompt.

The description file is an HTML document whose link references may carry extra optional tags, written as attributes of the link:

  • MAXDEPTH=n: This specifies how deep the parser should follow the links embedded in a web page. If MAXDEPTH is not given, the parser defaults to a depth of 1, that is, it downloads only the page itself and does not follow any links in it. To follow the links in that page, use MAXDEPTH=2; to also follow the links in those pages, use MAXDEPTH=3; and so on. High values used without any of the available filtering mechanisms can result in an excessive amount of data.

    Hint: MAXDEPTH=2 can be very useful for a page that contains only headlines linking to the full text of the articles. Many news tickers use this format.

  • NOIMAGES: Use this tag if you are not interested in downloading images. If it is specified, each image is replaced with its ALT text if available, otherwise with [img].

    Hint: NOIMAGES is an effective way to decrease the size of documents.

  • STAYONHOST: Most web sites contain references both to locally stored articles and to articles stored on other hosts, so a MAXDEPTH of 2 or higher could result in a lot of unwanted data. To prevent this, specify the STAYONHOST tag for your link. The parser will then download only content that resides on the same server as the top page. Together with exclusionlist.txt, this is a handy way to prevent the download of links referred to by banners.

  • STAYBELOW=text: Similar to STAYONHOST, this tag tells the parser to fetch only pages whose URL starts with text. For example, it can be used when the articles listed on a page are stored on another server, in which case STAYONHOST would not work properly. It can also be used to pick certain articles out of a large listing, so that you get all the headlines but only the articles on specific subjects (provided the web server offering the information is set up correctly).

    NOTE: If =text is not given, it defaults to the value of the href attribute (the URL you are pointing to).

  • BPP=n: This option is used to specify the bit depth that should be used for images. Valid values are 0 (i.e. no images), 1, 2, 4, and 8.

    NOTE: BPP=8 is currently only supported when the parser is used on a Windows system.

  • MAXWIDTH=width: Used to set the maximum width of images.

  • MAXHEIGHT=height: Used to set the maximum height of images.

A simple example of a description file is:

<HTML>
<HEAD>
  <TITLE>Plucker Home Page</TITLE>
</HEAD>
<BODY>
  <A HREF="http://plucker.gnu-designs.com" MAXDEPTH=2 STAYONHOST NOIMAGES>Plucker home page</A>
</BODY>
</HTML>

This would download the front page of our web site and also follow any links on the page if they are local to the host. No images would be downloaded.
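
The remaining tags can be combined in the same way. The following is only a sketch of how that might look: the example.com URLs, the link texts and the numeric values are made up for illustration, and the quotes around the STAYBELOW value are an assumption (added because the value contains slashes), not something this section prescribes.

```html
<HTML>
<HEAD>
  <TITLE>Plucker Home Page</TITLE>
</HEAD>
<BODY>
  <!-- Hypothetical URL: follow the links on the headline page, but only
       to pages whose URL starts with the STAYBELOW text -->
  <A HREF="http://www.example.com/news/index.html" MAXDEPTH=2 STAYBELOW="http://www.example.com/news/">News headlines</A>

  <!-- Hypothetical URL: no MAXDEPTH, so only the page itself is fetched;
       images use a bit depth of 2 and are limited in width and height
       (the numbers are arbitrary) -->
  <A HREF="http://www.example.com/gallery.html" BPP=2 MAXWIDTH=150 MAXHEIGHT=250>Gallery</A>
</BODY>
</HTML>
```

If the first link used a bare STAYBELOW, the NOTE above implies the parser would instead compare URLs against the HREF value itself, here http://www.example.com/news/index.html.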

The description file (home.html) that is installed when your Plucker directory is set up also contains a few examples.


The Plucker Team