GNU Info

Info Node: (gawk.info)Wc Program


Counting Things
---------------

   The `wc' (word count) utility counts lines, words, and characters in
one or more input files. Its usage is as follows:

     wc [-lwc] [ FILES ... ]

   If no files are specified on the command line, `wc' reads its
standard input. If there are multiple files, it also prints total
counts for all the files.  The options and their meanings are shown in
the following list:

`-l'
     Only count lines.

`-w'
     Only count words.  A "word" is a contiguous sequence of
     non-whitespace characters, separated by spaces and/or tabs.
     Happily, this is the normal way `awk' separates fields in its
     input data.

`-c'
     Only count characters.

   Implementing `wc' in `awk' is particularly elegant, since `awk' does
a lot of the work for us; it splits lines into words (i.e., fields) and
counts them, it counts lines (i.e., records), and it can easily tell us
how long a line is.
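
   As a quick illustration of those three facilities, the following
shell sketch feeds two lines to `awk' and prints, for each record, the
record number (`NR'), the field count (`NF'), and the record length.
The sample input and the shell variable `out' are made up for this
example; they are not part of `wc.awk':

```shell
# Two sample lines; the second mixes spaces and a tab between words.
out=$(printf 'one two three\nfour \t five\n' |
      awk '{ print NR, NF, length($0) }')
echo "$out"
```

   The second line still reports two fields, because runs of spaces
and tabs count as a single separator under the default field splitting.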

   This program uses the `getopt' library function (Note: Processing
Command-Line Options.) and the file transition functions (Note: Noting
Data File Boundaries.).

   This version has one notable difference from traditional versions of
`wc': it always prints the counts in the order lines, words, and
characters.  Traditional versions note the order of the `-l', `-w', and
`-c' options on the command line, and print the counts in that order.

   The `BEGIN' rule does the argument processing.  The variable
`print_total' is true if more than one file is named on the command
line:

     # wc.awk --- count lines, words, characters
     
     # Options:
     #    -l    only count lines
     #    -w    only count words
     #    -c    only count characters
     #
     # Default is to count lines, words, characters
     #
     # Requires getopt and file transition library functions
     
     BEGIN {
         # let getopt print a message about
         # invalid options. we ignore them
         while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
             if (c == "l")
                 do_lines = 1
             else if (c == "w")
                 do_words = 1
             else if (c == "c")
                 do_chars = 1
         }
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""
     
         # if no options, do all
         if (! do_lines && ! do_words && ! do_chars)
             do_lines = do_words = do_chars = 1
     
         # i == Optind here, so ARGC - i is the number of files
         print_total = (ARGC - i > 1)
     }
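
   The file-counting arithmetic can be checked on its own.  The sketch
below is only an illustration, under the assumption that no options
were given, so that `i = 1' stands in for `Optind'.  The helper name
`needs_total' and the file names are hypothetical; because the program
has only a `BEGIN' rule, `awk' never tries to open the files:

```shell
# Hypothetical helper: prints 1 if a total line would be needed, else 0.
# i = 1 stands in for Optind when no options were given.
needs_total() {
    awk 'BEGIN { i = 1; nfiles = ARGC - i; print (nfiles > 1) }' "$@"
}
one=$(needs_total a.txt)
two=$(needs_total a.txt b.txt)
echo "$one $two"
```

   `ARGC' counts the program name in `ARGV[0]' as well, so subtracting
the index of the first file operand leaves the number of files.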

   The `beginfile' function is simple; it just resets the counts of
lines, words, and characters to zero, and saves the current file name in
`fname':

     function beginfile(file)
     {
         chars = lines = words = 0
         fname = FILENAME
     }

   The `endfile' function adds the current file's numbers to the running
totals of lines, words, and characters.  It then prints out those
numbers for the file that was just read. It relies on `beginfile' to
reset the numbers for the following data file:

     function endfile(file)
     {
         tchars += chars
         tlines += lines
         twords += words
         if (do_lines)
             printf "\t%d", lines
         if (do_words)
             printf "\t%d", words
         if (do_chars)
             printf "\t%d", chars
         printf "\t%s\n", fname
     }
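
   To see when `beginfile' and `endfile' are called, here is a minimal,
self-contained sketch.  It does not load the library file; instead it
inlines a simplified stand-in for the file transition rules, keyed on
`FNR == 1', and it counts only lines.  The `started' flag, the
temporary directory, and the file names are assumptions made for this
example:

```shell
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\n' > "$dir/two"

out=$(cd "$dir" && awk '
function beginfile(file) { lines = 0; fname = file }
function endfile(file)   { tlines += lines; printf "\t%d\t%s\n", lines, fname }

# Simplified stand-in for the file transition library rules.
FNR == 1 {
    if (started)
        endfile(fname)
    started = 1
    beginfile(FILENAME)
}
{ lines++ }
END {
    if (started)
        endfile(fname)
    printf "\t%d\ttotal\n", tlines
}' one two)
echo "$out"
rm -rf "$dir"
```

   Note that an empty file never produces a record, so a rule keyed on
`FNR == 1' does not fire for it in this simplified form.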

   There is one rule that is executed for each line. It adds the length
of the record, plus one, to `chars'.  The extra one is needed because
the newline character separating records (the value of `RS') is not
part of the record itself, and thus is not included in its length.
Next, `lines' is incremented for each line read, and `words' is
incremented by the value of `NF', which is the number of "words" on
this line:(1)

     # do per line
     {
         chars += length($0) + 1    # get newline
         lines++
         words += NF
     }
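
   The "plus one" arithmetic can be compared against the system `wc'.
This is a sketch only: it assumes plain ASCII input, so that bytes and
characters coincide, and the shell variable names are made up for the
example:

```shell
# Sum record lengths plus one newline per record.
awk_chars=$(printf 'hello world\nbye\n' |
            awk '{ chars += length($0) + 1 } END { print chars + 0 }')
# Reference count from the system wc; tr strips any padding spaces.
wc_chars=$(printf 'hello world\nbye\n' | wc -c | tr -d ' ')
echo "$awk_chars $wc_chars"
```

   If the last line of the input lacks a trailing newline, this
arithmetic counts one character too many.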

   Finally, the `END' rule prints the totals for all the files, but
only if `print_total' is true:

     END {
         if (print_total) {
             if (do_lines)
                 printf "\t%d", tlines
             if (do_words)
                 printf "\t%d", twords
             if (do_chars)
                 printf "\t%d", tchars
             print "\ttotal"
         }
     }

   ---------- Footnotes ----------

   (1) `wc' can't just use the value of `FNR' in `endfile'.  If you
examine the code in Note: Noting Data File Boundaries, you will see
that `FNR' has already been reset by the time `endfile' is called.

