File: gawkinet.info,  Node: WEBGRAB,  Next: STATIST,  Prev: URLCHK,  Up: Some Applications and Techniques

WEBGRAB: Extract Links from a Page
==================================

   Sometimes it is necessary to extract links from web pages.  Browsers
do it, web robots do it, and sometimes even humans do it.  Since we
have a tool like GETURL at hand, we can solve this problem with some
help from the Bourne shell:

     # Split the input at every embedded HTTP link; `RT' holds the match
     BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }
     RT != "" {
        # Print a shell command that retrieves this link into docNR.html
        command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \
                    " > doc" NR ".html")
        print command
     }

   Notice that the regular expression for URLs is rather crude; a
precise regular expression is much more complex, but this one works
rather well.  One problem is that it cannot find internal links within
an HTML document.  Another problem is that `ftp', `telnet', `news',
`mailto', and other kinds of links are not matched at all.  However,
it is straightforward to add them if doing so is necessary for other
tasks.
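
   For example, a record separator that also accepts a few more
schemes might look like the following sketch.  It is only an
illustration and not part of the original program; here it simply
prints each link it finds, and the character class is still crude
(`@' is added so that `mailto' addresses are matched completely):

     # Sketch: accept a few more URL schemes and print each match
     BEGIN {
        RS = "(http|ftp|telnet|news|mailto):" \
             "[#%&\\+\\-\\./0-9\\:;\\?@A-Z_a-z\\~]*"
     }
     RT != "" { print RT }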

   This program reads an HTML file and prints all the HTTP links that
it finds.  It relies on `gawk''s ability to use regular expressions as
record separators. With `RS' set to a regular expression that matches
links, the second action is executed each time a non-empty link is
found.  We can find the matching link itself in `RT'.
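
   To watch this mechanism in isolation, the following one-liner (not
part of the original program) splits a line of sample text at each
embedded HTTP link and prints only the matched separators (the two
URLs, one per line):

     echo 'See http://www.gnu.org/ and http://www.suse.de/ here.' |
         gawk 'BEGIN { RS = "http://[^ ]*" } RT != "" { print RT }'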

   The action could use the `system' function to let another GETURL
retrieve the page, but here we use a different approach.  This simple
program prints shell commands that can be piped into `sh' for
execution.  This way it is possible to first extract the links, wrap
shell commands around them, and collect all the shell commands in a
file.  After editing that file, executing it retrieves exactly the
documents we really need (a sketch of this variant follows the next
command).  If we do not want to edit, we can retrieve all the pages
like this:

     gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh
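
   If we prefer to edit first, the workflow might look like this; the
file name `grab.sh' and the choice of editor are only examples:

     gawk -f geturl.awk http://www.suse.de |
         gawk -f webgrab.awk > grab.sh
     vi grab.sh          # delete the commands we do not want
     sh grab.sh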

   In either case, you will find the contents of all referenced
documents in files named `doc*.html', even if they do not contain HTML
code.  The most annoying thing is that we always have to pass the
proxy to GETURL.  If you do not want the headers of the web pages to
appear on the screen, you can redirect them to `/dev/null'.  Watching
the headers can be quite interesting, because they reveal details such
as which web server each site uses.  It is now clear how clever
marketing people use web robots to determine the market shares of
Microsoft and Netscape in the web server market.
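
   If GETURL writes the response header to standard error, as the
version presented earlier in this chapter does, the headers can be
discarded with two redirections like these:

     gawk -f geturl.awk http://www.suse.de 2> /dev/null |
         gawk -f webgrab.awk | sh 2> /dev/null

   The first redirection discards the header of the page being scanned
for links; the second discards the headers of the pages retrieved by
the generated commands.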

   Port 80 of any web server is like a small hole in a repellent
firewall.  After attaching a browser to port 80, we usually catch a
glimpse of the bright side of the server (its home page). With a tool
like GETURL at hand, we are able to discover some of the more concealed
or even "indecent" services (i.e., lacking conformity to standards of
quality).  It can be exciting to see the fancy CGI scripts that lie
there, revealing the inner workings of the server, ready to be called:

   * With a command such as:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/

     some servers give you a directory listing of the CGI files.
     Knowing the names, you can try to call some of them and watch for
     useful results. Sometimes there are executables in such directories
     (such as Perl interpreters) that you may call remotely. If there
     are subdirectories with configuration data of the web server, this
     can also be quite interesting to read.

   * The well-known Apache web server usually has its CGI files in the
     directory `/cgi-bin'. There you can often find the scripts
     `test-cgi' and `printenv'. Both tell you some things about the
     current connection and the installation of the web server.  Just
     call:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv

   * Sometimes it is even possible to retrieve system files like the web
     server's log file--possibly containing customer data--or even the
     file `/etc/passwd'.  (We don't recommend this!)

   *Caution:* Although this may sound funny or simply irrelevant, we
are talking about severe security holes. Try to explore your own system
this way and make sure that none of the above reveals too much
information about your system.

