Implementing `nextfile' as a Function
-------------------------------------
The `nextfile' statement presented in Note:Using `gawk''s
`nextfile' Statement, is a `gawk'-specific
extension--it is not available in most other implementations of `awk'.
This minor node shows two versions of a `nextfile' function that you
can use to simulate `gawk''s `nextfile' statement if you cannot use
`gawk'.
A first attempt at writing a `nextfile' function is as follows:
# nextfile --- skip remaining records in current file
# this should be read in before the "main" awk program
function nextfile() { _abandon_ = FILENAME; next }
_abandon_ == FILENAME { next }
Because it supplies a rule that must be executed first, this file
should be included before the main program. This rule compares the
current data file's name (which is always in the `FILENAME' variable) to
a private variable named `_abandon_'. If the file name matches, then
the action part of the rule executes a `next' statement to go on to the
next record. (The use of `_' in the variable name is a convention. It
is discussed more fully in Note:Naming Library Function Global
Variables.)
The use of the `next' statement effectively creates a loop that reads
all the records from the current data file. The end of the file is
eventually reached and a new data file is opened, changing the value of
`FILENAME'. Once this happens, the comparison of `_abandon_' to
`FILENAME' fails and execution continues with the first rule of the
"real" program.
The `nextfile' function itself simply sets the value of `_abandon_'
and then executes a `next' statement to start the loop.
This initial version has a subtle problem. If the same data file is
listed _twice_ on the commandline, one right after the other or even
with just a variable assignment between them, this code skips right
through the file, a second time, even though it should stop when it
gets to the end of the first occurrence. A second version of
`nextfile' that remedies this problem is shown here:
# nextfile --- skip remaining records in current file
# correctly handle successive occurrences of the same file
# this should be read in before the "main" awk program
function nextfile() { _abandon_ = FILENAME; next }
_abandon_ == FILENAME {
if (FNR == 1)
_abandon_ = ""
else
next
}
The `nextfile' function has not changed. It makes `_abandon_' equal
to the current file name and then executes a `next' statement. The
`next' statement reads the next record and increments `FNR' so that
`FNR' is guaranteed to have a value of at least two. However, if
`nextfile' is called for the last record in the file, then `awk' closes
the current data file and moves on to the next one. Upon doing so,
`FILENAME' is set to the name of the new file and `FNR' is reset to
one. If this next file is the same as the previous one, `_abandon_' is
still equal to `FILENAME'. However, `FNR' is equal to one, telling us
that this is a new occurrence of the file and not the one we were
reading when the `nextfile' function was executed. In that case,
`_abandon_' is reset to the empty string, so that further executions of
this rule fail (until the next time that `nextfile' is called).
If `FNR' is not one, then we are still in the original data file and
the program executes a `next' statement to skip through it.
An important question to ask at this point is: given that the
functionality of `nextfile' can be provided with a library file, why is
it built into `gawk'? Adding features for little reason leads to
larger, slower programs that are harder to maintain. The answer is
that building `nextfile' into `gawk' provides significant gains in
efficiency. If the `nextfile' function is executed at the beginning of
a large data file, `awk' still has to scan the entire file, splitting
it up into records, just to skip over it. The built-in `nextfile' can
simply close the file immediately and proceed to the next one, which
saves a lot of time. This is particularly important in `awk', because
`awk' programs are generally I/O-bound (i.e., they spend most of their
time doing input and output, instead of performing computations).