Noting Data File Boundaries
---------------------------
The `BEGIN' and `END' rules are each executed exactly once, at the
beginning and end of your `awk' program, respectively (see The
`BEGIN' and `END' Special Patterns). We (the `gawk'
authors) once had a user who mistakenly thought that the `BEGIN' rule
is executed at the beginning of each data file and the `END' rule is
executed at the end of each data file. When informed that this was not
the case, the user requested that we add new special patterns to
`gawk', named `BEGIN_FILE' and `END_FILE', that would have the desired
behavior. He even supplied us the code to do so.
Adding these special patterns to `gawk' wasn't necessary; the job
can be done cleanly in `awk' itself, as illustrated by the following
library program. It arranges to call two user-supplied functions,
`beginfile' and `endfile', at the beginning and end of each data file.
Besides solving the problem in only nine(!) lines of code, it does so
_portably_; this works with any implementation of `awk':

     # transfile.awk
     #
     # Give the user a hook for filename transitions
     #
     # The user must supply functions beginfile() and endfile()
     # that each take the name of the file being started or
     # finished, respectively.

     FILENAME != _oldfilename \
     {
         if (_oldfilename != "")
             endfile(_oldfilename)
         _oldfilename = FILENAME
         beginfile(FILENAME)
     }

     END   { endfile(FILENAME) }

This file must be loaded before the user's "main" program, so that
the rule it supplies is executed first.
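As a rough illustration of the load order, the following shell sketch
writes `transfile.awk' to a scratch directory and loads it ahead of a
small main program. The main program `count.awk', its variable `n', and
the data files `one.txt' and `two.txt' are all invented for this
example; they are not part of the library.

```sh
#!/bin/sh
# Sketch only: count.awk, one.txt and two.txt are hypothetical names.
# Work in a scratch directory (left behind afterwards for inspection).
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# The library program from the text.
cat > transfile.awk <<'EOF'
FILENAME != _oldfilename \
{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END   { endfile(FILENAME) }
EOF

# A hypothetical main program: count the records in each data file.
cat > count.awk <<'EOF'
function beginfile(name) { n = 0 }
function endfile(name)   { print name ": " n }
{ n++ }
EOF

printf 'a\nb\n' > one.txt
printf 'c\n'    > two.txt

# The library is named first, so its rule runs before the main rule.
out=$(awk -f transfile.awk -f count.awk one.txt two.txt)
printf '%s\n' "$out"
```

With these two data files the script prints `one.txt: 2' followed by
`two.txt: 1'. If the `-f' options were given in the opposite order, the
counting rule would run before `beginfile' reset `n' on the first
record of each file, and the counts would be off by one.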
This rule relies on `awk''s `FILENAME' variable that automatically
changes for each new data file. The current file name is saved in a
private variable, `_oldfilename'. If `FILENAME' does not equal
`_oldfilename', then a new data file is being processed and it is
necessary to call `endfile' for the old file. Because `endfile' should
only be called if a file has been processed, the program first checks
to make sure that `_oldfilename' is not the null string. The program
then assigns the current file name to `_oldfilename' and calls
`beginfile' for the file. Because, like all `awk' variables,
`_oldfilename' is initialized to the null string, this rule executes
correctly even for the first data file.
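To see the values that such a rule has to work with, the short shell
sketch below prints `FILENAME' for every record, together with `FNR',
the record number within the current file. The data files `one.txt' and
`two.txt' are made up for the illustration.

```sh
#!/bin/sh
# Sketch only: one.txt and two.txt are hypothetical data files.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

printf 'a\nb\n' > one.txt
printf 'c\n'    > two.txt

# Show the current file name and the per-file record number
# for every input record.
out=$(awk '{ print FILENAME, FNR }' one.txt two.txt)
printf '%s\n' "$out"
```

The output is `one.txt 1', `one.txt 2', and `two.txt 1': `FILENAME'
changes exactly when `awk' moves on to the next data file, and `FNR'
starts again at 1 at the same moment.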
The program also supplies an `END' rule to do the final processing
for the last file. Because this `END' rule comes before any `END' rules
supplied in the "main" program, `endfile' is called first. Once again
the value of multiple `BEGIN' and `END' rules should be clear.
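The ordering of `END' rules can be checked with a tiny experiment. In
this sketch the two rules stand in for the library's `END' rule and the
main program's `END' rule; the strings they print are placeholders.

```sh
#!/bin/sh
# Two END rules in one program run in the order in which they appear.
# /dev/null supplies empty input, so only the END rules do any work.
out=$(awk 'END { print "library END" }
           END { print "main END" }' /dev/null)
printf '%s\n' "$out"
```

This prints `library END' and then `main END', which is why loading the
library first guarantees that `endfile' runs before the main program's
own end-of-run processing.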
This version has the same problem as the first version of `nextfile'
(see Implementing `nextfile' as a Function). If
the same data file occurs twice in a row on the command line, then
`endfile' and `beginfile' are not executed at the end of the first pass
and at the beginning of the second pass. The following version solves
the problem:

     # ftrans.awk --- handle data file transitions
     #
     # user supplies beginfile() and endfile() functions

     FNR == 1 {
         if (_filename_ != "")
             endfile(_filename_)
         _filename_ = FILENAME
         beginfile(FILENAME)
     }

     END  { endfile(_filename_) }

See Counting Things, which shows how this library function
can be used and how it simplifies writing the main program.
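The difference between the two versions shows up when the same file is
named twice in a row. The following shell sketch runs both library
programs against such a command line; the tracing functions in
`trace.awk' and the data file `data.txt' are invented for this
comparison.

```sh
#!/bin/sh
# Sketch only: trace.awk and data.txt are hypothetical names.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# First version: detects a transition by comparing file names.
cat > transfile.awk <<'EOF'
FILENAME != _oldfilename \
{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END   { endfile(FILENAME) }
EOF

# Second version: detects a transition when FNR restarts at 1.
cat > ftrans.awk <<'EOF'
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END  { endfile(_filename_) }
EOF

# Hypothetical user functions that just report each call.
cat > trace.awk <<'EOF'
function beginfile(name) { print "begin " name }
function endfile(name)   { print "end " name }
EOF

printf 'x\ny\n' > data.txt

old=$(awk -f transfile.awk -f trace.awk data.txt data.txt)
new=$(awk -f ftrans.awk    -f trace.awk data.txt data.txt)

echo "transfile.awk:"; printf '%s\n' "$old"
echo "ftrans.awk:";    printf '%s\n' "$new"
```

With `transfile.awk' the trace shows a single `begin data.txt' and
`end data.txt' pair, because the file name never changes. With
`ftrans.awk' the pair appears twice, once for each pass over the file.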