
Info Node: (gawkinet.info)GETURL

GETURL: Retrieving Web Pages
============================

   GETURL is a versatile building block for shell scripts that need to
retrieve files from the Internet. It takes a web address as a
command-line parameter and tries to retrieve the contents of this
address. The contents are printed to standard output, while the header
is printed to `/dev/stderr'.  A surrounding shell script could analyze
the contents and extract the text or the links. An ASCII browser could
be written around GETURL. But more interestingly, web robots are
straightforward to write on top of GETURL. On the Internet, you can find
several programs of the same name that do the same job. They are usually
much more complex internally and at least 10 times longer.
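
   As an illustration of the link extraction mentioned above, here is a
minimal sketch of a filter that could sit behind GETURL in a pipeline.
The file name `links.awk' and the function `print_links' are
hypothetical, and the pattern only recognizes lowercase, double-quoted
`href' attributes; it is not a general HTML parser.

```awk
# links.awk: hypothetical filter that prints the href targets found on stdin.
# Assumption: links look like  href="..."  (lowercase, double-quoted).
function print_links(line,    n) {
    n = 0
    while (match(line, /href="[^"]*"/)) {
        # Skip the leading  href="  (6 characters) and drop the closing quote
        print substr(line, RSTART + 6, RLENGTH - 7)
        # Continue searching after the attribute just reported
        line = substr(line, RSTART + RLENGTH)
        n++
    }
    return n    # number of links printed for this line
}

{ print_links($0) }
```

   Assuming GETURL is stored in a file named `geturl.awk', a pipeline
such as `gawk -f geturl.awk URL | gawk -f links.awk' would then print
one link target per line.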

   First, GETURL checks whether it was called with exactly one web
address.  Then it checks whether the user named a proxy server in the
`Proxy' variable (and, optionally, its port in `ProxyPort'). By default,
the local machine (`127.0.0.1', port 80) is assumed to serve as the
proxy. GETURL uses the `GET' method by default to access the web page.
Passing the name of a different method in the `Method' variable (such
as `HEAD') selects a different behavior. With the `HEAD' method, the
user receives the header but not the body of the page:

     BEGIN {
       if (ARGC != 2) {
         print "GETURL - retrieve Web page via HTTP 1.0"
         print "IN:\n    the URL as a command-line parameter"
         print "PARAM(S):\n    -v Proxy=MyProxy"
         print "OUT:\n    the page content on stdout"
         print "    the page header on stderr"
         print "JK 16.05.1997"
         print "ADR 13.08.2000"
         exit
       }
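       URL = ARGV[1]; ARGV[1] = ""   # take the URL off the argument list
       if (Proxy     == "")  Proxy     = "127.0.0.1"
       if (ProxyPort ==  0)  ProxyPort = 80
       if (Method    == "")  Method    = "GET"
       HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
       # A blank line ends both the request and the response header
       ORS = RS = "\r\n\r\n"
       print Method " " URL " HTTP/1.0" |& HttpService
       # The first record of the reply is the complete header
       HttpService                      |& getline Header
       print Header > "/dev/stderr"
       # All remaining records form the body; RS is stripped from each
       while ((HttpService |& getline) > 0)
         printf "%s", $0
       close(HttpService)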
     }
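
   A script built around GETURL, for example a web robot that uses the
`HEAD' method, will often want the status code from the header that
GETURL writes to `/dev/stderr'. The following is a minimal sketch of
one way to extract it; `status_code' is a hypothetical helper and not
part of GETURL, and it assumes the header begins with a status line of
the form `HTTP/1.0 200 OK'.

```awk
# status_code() is a hypothetical helper, not part of GETURL.
# It returns the numeric status from the first line of an HTTP header,
# or -1 if that line does not look like a status line.
function status_code(header,    lines, parts) {
    split(header, lines, /\r?\n/)          # header lines end in CR LF
    if (split(lines[1], parts, " ") >= 2 && parts[1] ~ /^HTTP\//)
        return parts[2] + 0                # force a numeric result
    return -1
}

BEGIN {
    print status_code("HTTP/1.0 200 OK\r\nContent-Type: text/html")   # prints 200
}
```

   Returning -1 for anything unexpected keeps the caller's check to a
single comparison; a stricter script might instead report the
offending line.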

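   This program can be changed as needed, but be careful with the
final `while' loop that prints the body.  Make sure transmission of
binary data is not corrupted by additional line breaks; this is why the
loop uses `printf "%s"' rather than `print', which would append `ORS'.
Even as it is now, the byte sequence `"\r\n\r\n"' would disappear if it
were contained in binary data, because it is the record separator `RS'
and `getline' strips it from every record. Don't get caught in a trap
when trying a quick fix on this one.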
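
   One possible way around the missing separator, offered as a sketch
rather than a complete fix, relies on gawk's `RT' variable, which holds
the text that matched `RS' for the record just read. The function name
`rejoin' and the `printf' command standing in for a server reply are
assumptions made for this example only. Binary data may still be
affected by other factors, such as the locale; gawk's
`--characters-as-bytes' option may be needed as well.

```awk
# Sketch (gawk only): rebuild a stream that was split on RS by appending
# RT, the text that actually terminated each record.
# rejoin() is a hypothetical name, not part of GETURL.
function rejoin(cmd,    out, saved_rs) {
    saved_rs = RS
    RS = "\r\n\r\n"              # the same separator GETURL uses
    out = ""
    while ((cmd | getline) > 0)
        out = out $0 RT          # RT is "" after a final, unterminated record
    close(cmd)
    RS = saved_rs                # restore the caller's separator
    return out
}

BEGIN {
    # Stand-in for a body that happens to contain "\r\n\r\n"
    data = rejoin("printf 'abc\\r\\n\\r\\ndef'")
    print length(data)           # prints 10: no bytes were lost
}
```

   In GETURL itself, the corresponding change would be to print
`$0' followed by `RT' in the body loop, for example with
`printf "%s%s", $0, RT'.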

