GNU Info

Info Node: (gawk.info)Igawk Program

An Easy Way to Use Library Functions
------------------------------------

   Using library functions in `awk' can be very beneficial. It
encourages code reuse and the writing of general functions. Programs are
smaller and therefore clearer.  However, using library functions is
only easy when writing `awk' programs; it is painful when running them,
requiring multiple `-f' options.  If `gawk' is unavailable, then so too
is the `AWKPATH' environment variable and the ability to put `awk'
functions into a library directory (*note Command-Line Options:
Options.).  It would be nice to be able to write programs in the
following manner:

     # library functions
     @include getopt.awk
     @include join.awk
     ...
     
     # main program
     BEGIN {
         while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
             ...
         ...
     }

   The following program, `igawk.sh', provides this service.  It
simulates `gawk''s searching of the `AWKPATH' variable and also allows
"nested" includes; i.e., a file that is included with `@include' can
contain further `@include' statements.  `igawk' makes an effort to only
include files once, so that nested includes don't accidentally include
a library function twice.

   `igawk' should behave just like `gawk' externally.  This means it
should accept all of `gawk''s command-line arguments, including the
ability to have multiple source files specified via `-f', and the
ability to mix command-line and library source files.

   The program is written using the POSIX Shell (`sh') command language.
The way the program works is as follows:

  1. Loop through the arguments, saving anything that doesn't represent
     `awk' source code for later, when the expanded program is run.

  2. For any arguments that do represent `awk' text, put the arguments
     into a temporary file that will be expanded.  There are two cases:

       a. Literal text, provided with `--source' or `--source='.  This
          text is just echoed directly.  The `echo' program
          automatically supplies a trailing newline.

       b. Source file names provided with `-f'.  We use a neat trick
          and echo `@include FILENAME' into the temporary file.  Since
          the file inclusion program works the way `gawk' does, this
          gets the text of the file included into the program at the
          correct point.

  3. Run an `awk' program (naturally) over the temporary file to expand
     `@include' statements.  The expanded program is placed in a second
     temporary file.

  4. Run the expanded program with `gawk' and any other original
     command-line arguments that the user supplied (such as the data
     file names).

   The initial part of the program turns on shell tracing if the first
argument is `debug'.  Otherwise, a shell `trap' statement arranges to
clean up any temporary files on program exit or upon an interrupt.

   The next part loops through all the command-line arguments.  There
are several cases of interest:

`--'
     This ends the arguments to `igawk'.  Anything else should be
     passed on to the user's `awk' program without being evaluated.

`-W'
     This indicates that the next option is specific to `gawk'.  To make
     argument processing easier, the `-W' is prepended to the first of
     the remaining arguments and the loop continues.  (This is an `sh'
     programming trick.  Don't worry about it if you are not familiar
     with `sh'.)

`-v, -F'
     These are saved and passed on to `gawk'.

`-f, --file, --file=, -Wfile='
     The file name is saved to the temporary file `/tmp/ig.s.$$' with an
     `@include' statement.  The `sed' utility is used to remove the
     leading option part of the argument (e.g., `--file=').

`--source, --source=, -Wsource='
     The source text is echoed into `/tmp/ig.s.$$'.

`--version, -Wversion'
     `igawk' prints its version number, runs `gawk --version' to get
     the `gawk' version information, and then exits.
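
   As a small illustration of the `-W' trick and of the `sed'
substitution described above, consider the following sketch.  The
functions `merge_w' and `strip_opt' are hypothetical helpers that are
not part of `igawk'; they only isolate the two idioms so that each can
be tried by itself:

```shell
# Hypothetical helpers, for illustration only; not part of igawk.

# Show what `set -- -W"$@"' does to the argument list.
merge_w() {
    shift            # discard the lone -W
    set -- -W"$@"    # glue -W onto what is now the first argument
    echo "$1"
}

# Remove the leading option part, as igawk does for --file= and -Wfile=.
strip_opt() {
    echo "$1" | sed 's/-.file=//'
}

merge_w -W file=lib.awk data.txt    # prints -Wfile=lib.awk
strip_opt -Wfile=lib.awk            # prints lib.awk
strip_opt --file=lib.awk            # prints lib.awk
```

   The `.' in the `sed' pattern matches either the second `-' of
`--file=' or the `W' of `-Wfile=', which is why one substitution
handles both spellings.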

   If none of the `-f', `--file', `-Wfile', `--source', or `-Wsource'
arguments are supplied, then the first non-option argument should be
the `awk' program.  If there are no command-line arguments left,
`igawk' prints an error message and exits.  Otherwise, the first
argument is echoed into `/tmp/ig.s.$$'.  In any case, after the
arguments have been processed, `/tmp/ig.s.$$' contains the complete
text of the original `awk' program.

   The `$$' in `sh' represents the current process ID number.  It is
often used in shell programs to generate unique temporary file names.
This allows multiple users to run `igawk' without worrying that the
temporary file names will clash.  The program is as follows:

     #! /bin/sh
     # igawk --- like gawk but do @include processing
     if [ "$1" = debug ]
     then
         set -x
         shift
     else
         # cleanup on exit, hangup, interrupt, quit, termination
         trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
     fi
     
     while [ $# -ne 0 ] # loop over arguments
     do
         case $1 in
         --)     shift; break;;
     
         -W)     shift
                 set -- -W"$@"
                 continue;;
     
         -[vF])  opts="$opts $1 '$2'"
                 shift;;
     
         -[vF]*) opts="$opts '$1'" ;;
     
         -f)     echo @include "$2" >> /tmp/ig.s.$$
                 shift;;
     
         -f*)    f=`echo "$1" | sed 's/-f//'`
                 echo @include "$f" >> /tmp/ig.s.$$ ;;
     
         -?file=*)    # -Wfile or --file
                 f=`echo "$1" | sed 's/-.file=//'`
                 echo @include "$f" >> /tmp/ig.s.$$ ;;
     
         -?file)      # get arg, $2
                 echo @include "$2" >> /tmp/ig.s.$$
                 shift;;
     
         -?source=*)  # -Wsource or --source
                 t=`echo "$1" | sed 's/-.source=//'`
                 echo "$t" >> /tmp/ig.s.$$ ;;
     
         -?source)    # get arg, $2
                 echo "$2" >> /tmp/ig.s.$$
                 shift;;
     
         -?version)
                 echo igawk: version 1.0 1>&2
                 gawk --version
                 exit 0 ;;
     
         -[W-]*) opts="$opts '$1'" ;;
     
         *)      break;;
         esac
         shift
     done
     
     if [ ! -s /tmp/ig.s.$$ ]
     then
         if [ -z "$1" ]
         then
             echo igawk: no program! 1>&2
             exit 1
         else
             echo "$1" > /tmp/ig.s.$$
             shift
         fi
     fi
     
     # at this point, /tmp/ig.s.$$ has the program
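
   To see what ends up in the temporary file, it may help to look at a
cut-down version of the argument loop.  The following sketch is not the
real `igawk'; the function `collect' is a hypothetical stand-in that
handles only `-f' and `--source=' and writes to standard output instead
of `/tmp/ig.s.$$':

```shell
# Hypothetical, reduced version of the igawk argument loop.
# Handles only -f and --source=, and prints instead of using a temp file.
collect() {
    while [ $# -ne 0 ]
    do
        case $1 in
        -f)         echo @include "$2"     # file name becomes an @include
                    shift;;
        --source=*) echo "$1" | sed 's/-.source=//' ;;   # literal text
        *)          break;;
        esac
        shift
    done
}

collect -f getopt.awk --source='BEGIN { exit }' -f join.awk
```

   Under these assumptions, the output is an `@include getopt.awk' line,
then the literal `BEGIN { exit }' text, then an `@include join.awk'
line, in the same order as the arguments on the command line.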

   The `awk' program to process `@include' directives reads through the
program, one line at a time, using `getline' (*note Explicit Input with
`getline': Getline.).  The input file names and `@include' statements
are managed using a stack.  As each `@include' is encountered, the
current file name is "pushed" onto the stack and the file named in the
`@include' directive becomes the current file name.  As each file is
finished, the stack is "popped," and the previous input file becomes
the current input file again.  The process is started by making the
original file the first one on the stack.

   The `pathto' function does the work of finding the full path to a
file.  It simulates `gawk''s behavior when searching the `AWKPATH'
environment variable (*note The `AWKPATH' Environment Variable: AWKPATH
Variable.).  If a file name has a `/' in it, no path search is done.
Otherwise, the file name is concatenated with the name of each
directory in the path, and an attempt is made to open the generated
file name.  The only way to test if a file can be read in `awk' is to go
ahead and try to read it with `getline'; this is what `pathto' does.(1)
If the file can be read, it is closed and the file name is returned:

     gawk -- '
     # process @include directives
     
     function pathto(file,    i, t, junk)
     {
         if (index(file, "/") != 0)
             return file
     
         for (i = 1; i <= ndirs; i++) {
             t = (pathlist[i] "/" file)
             if ((getline junk < t) > 0) {
                 # found it
                 close(t)
                 return t
             }
         }
         return ""
     }
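
   The readability test that `pathto' relies on can be tried in
isolation.  The following is only a sketch: `can_read' is a
hypothetical shell function, and it assumes that the `awk' found on
your system behaves like a POSIX `awk':

```shell
# Hypothetical helper: report whether awk can read a line from a file,
# using the same `getline junk < file' test that pathto uses.
can_read() {
    awk -v f="$1" 'BEGIN {
        if ((getline junk < f) > 0) {
            close(f)            # got a line; release the file again
            print "readable"
        } else
            print "not readable"
    }'
}

can_read /no/such/file     # prints "not readable"
```

   Because `getline' returns zero at end of file, this test also
reports an existing but empty file as not readable, so a search done
this way skips empty files.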

   The main program is contained inside one `BEGIN' rule.  The first
thing it does is set up the `pathlist' array that `pathto' uses.  After
splitting the path on `:', null elements are replaced with `"."', which
represents the current directory:

     BEGIN {
         path = ENVIRON["AWKPATH"]
         ndirs = split(path, pathlist, ":")
         for (i = 1; i <= ndirs; i++) {
             if (pathlist[i] == "")
                 pathlist[i] = "."
         }
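
   The treatment of null path elements is easy to check by hand.  In
the following sketch, `expand_path' is a hypothetical shell function
that takes the path as an argument, rather than reading
`ENVIRON["AWKPATH"]', and prints the resulting directory list on one
line:

```shell
# Hypothetical helper: split a colon-separated path the way the BEGIN
# rule does, replacing null elements with "." (the current directory).
expand_path() {
    awk -v path="$1" 'BEGIN {
        n = split(path, dirs, ":")
        for (i = 1; i <= n; i++) {
            if (dirs[i] == "")
                dirs[i] = "."
            printf "%s%s", dirs[i], (i < n ? " " : "\n")
        }
    }'
}

expand_path ":/usr/local/share/awk"    # prints ". /usr/local/share/awk"
```

   Note that an empty path yields no elements at all, so with this
logic a file name without a `/' in it is never found when the path is
empty.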

   The stack is initialized with `ARGV[1]', which will be
`/tmp/ig.s.$$'.  The main loop comes next.  Input lines are read in
succession.  Lines that do not start with `@include' are printed
verbatim.  If the line does start with `@include', the file name is in
`$2'.  `pathto' is called to generate the full path.  If it cannot find
the file, the program prints an error message and continues with the
next line.

   The next thing to check is whether the file has already been
included.  The `processed' array is indexed by the full file name of
each included file and it tracks this information for us.  If the file
is seen again, a warning message is printed.  Otherwise, the new file
name is pushed onto the stack and processing continues.

   Finally, when `getline' encounters the end of the input file, the
file is closed and the stack is popped.  When `stackptr' is less than
zero, the program is done:

         stackptr = 0
         input[stackptr] = ARGV[1] # ARGV[1] is first file
     
         for (; stackptr >= 0; stackptr--) {
             while ((getline < input[stackptr]) > 0) {
                 if (tolower($1) != "@include") {
                     print
                     continue
                 }
                 fpath = pathto($2)
                 if (fpath == "") {
                     printf("igawk:%s:%d: cannot find %s\n",
                         input[stackptr], FNR, $2) > "/dev/stderr"
                     continue
                 }
                 if (! (fpath in processed)) {
                     processed[fpath] = input[stackptr]
                     input[++stackptr] = fpath  # push onto stack
                 } else
                     print $2, "included in", input[stackptr],
                         "already included in",
                         processed[fpath] > "/dev/stderr"
             }
             close(input[stackptr])
         }
     }' /tmp/ig.s.$$ > /tmp/ig.e.$$

   The last step is to call `gawk' with the expanded program, along
with the original options and command-line arguments that the user
supplied.  `gawk''s exit status is passed back on to `igawk''s calling
program:

     eval gawk -f /tmp/ig.e.$$ $opts -- "$@"
     
     exit $?
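
   The `eval' is there because the saved options in `opts' carry their
own single quotes.  The following sketch shows the difference; the
function `count_args' and the sample option value are hypothetical and
serve only to make the effect visible:

```shell
# Hypothetical demonstration of why igawk runs gawk through eval.
count_args() { echo $#; }

# Same form that igawk builds for -v and -F: option, then quoted value.
opts="-v 'msg=hello world'"

count_args $opts         # prints 3: quotes are literal, value is split
eval count_args $opts    # prints 2: eval re-parses the quotes
```

   Without `eval', the value would reach `gawk' split at the space and
with the quote characters still attached.  One limitation of this
simple quoting scheme is that an option value which itself contains a
single quote is not handled correctly.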

   This version of `igawk' represents my third attempt at this program.
There are three key simplifications that make the program work better:

   * Using `@include' even for the files named with `-f' makes building
     the initial collected `awk' program much simpler; all the
     `@include' processing can be done once.

   * The `pathto' function doesn't try to save the line read with
     `getline' when testing for the file's accessibility.  Trying to
     save this line for use with the main program complicates things
     considerably.

   * Using a `getline' loop in the `BEGIN' rule does it all in one
     place.  It is not necessary to call out to a separate loop for
     processing nested `@include' statements.

   Also, this program illustrates that it is often worthwhile to combine
`sh' and `awk' programming.  You can usually accomplish quite
a lot, without having to resort to low-level programming in C or C++,
and it is frequently easier to do certain kinds of string and argument
manipulation using the shell than it is in `awk'.

   Finally, `igawk' shows that it is not always necessary to add new
features to a program; they can often be layered on top.  With `igawk',
there is no real reason to build `@include' processing into `gawk'
itself.

   As an additional example of this, consider the idea of having two
files in a directory in the search path:

`default.awk'
     This file contains a set of default library functions, such as
     `getopt' and `assert'.

`site.awk'
     This file contains library functions that are specific to a site or
     installation; i.e., locally developed functions.  Having a
     separate file allows `default.awk' to change with new `gawk'
     releases, without requiring the system administrator to update it
     each time by adding the local functions.

   One user suggested that `gawk' be modified to automatically read
these files upon startup.  Instead, it would be very simple to modify
`igawk' to do this. Since `igawk' can process nested `@include'
directives, `default.awk' could simply contain `@include' statements
for the desired library functions.

   ---------- Footnotes ----------

   (1) On some very old versions of `awk', the test `getline junk < t'
can loop forever if the file exists but is empty.  Caveat emptor.

