GNU Info

Info Node: (gawk.info)Word Sorting

Generating Word Usage Counts
----------------------------

   The following `awk' program prints the number of occurrences of each
word in its input.  It illustrates the associative nature of `awk'
arrays by using strings as subscripts.  It also demonstrates the `for
INDEX in ARRAY' mechanism.  Finally, it shows how `awk' is used in
conjunction with other utility programs to do a useful task of some
complexity with a minimum of effort.  Some explanations follow the
program listing:

     # Print list of word frequencies
     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }
     
     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

   This program has two rules.  The first rule, because it has an empty
pattern, is executed for every input line.  It uses `awk''s
field-accessing mechanism (Note: Examining Fields) to pick out the
individual words from the line, and the built-in variable `NF'
(Note: Built-in Variables) to know how many fields are available.
For each input word, it increments an element of the array `freq' to
record that the word has been seen one more time.
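
   To see the field mechanism in isolation, here is a minimal sketch
that is not part of the original program.  The sample line and the
shell variable `fields' are made up for illustration; the sketch prints
`NF' and then each field with its index:

```shell
# Hypothetical one-line input; the double space shows that a run of
# whitespace separates fields just as a single blank does.
fields=$(printf 'the quick  brown fox\n' |
    awk '{
        printf "NF=%d\n", NF
        for (i = 1; i <= NF; i++)
            printf "%d:%s\n", i, $i
    }')
printf '%s\n' "$fields"
```

   The loop has the same shape as the one in the first rule; only the
body differs, printing `$i' instead of using it as a subscript of
`freq'.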

   The second rule, because it has the pattern `END', is not executed
until the input has been exhausted.  It prints out the contents of the
`freq' table that has been built up inside the first action.  This
program has several problems that would prevent it from being useful by
itself on real text files:

   * Words are detected using the `awk' convention that fields are
     separated just by whitespace.  Other characters in the input
     (except newlines) don't have any special meaning to `awk'.  This
     means that punctuation characters count as part of words.

   * The `awk' language considers upper- and lowercase characters to be
     distinct.  Therefore, "bartender" and "Bartender" are not treated
     as the same word.  This is undesirable, since in normal text, words
     are capitalized if they begin sentences, and a frequency analyzer
     should not be sensitive to capitalization.

   * The output does not come out in any useful order.  You're more
     likely to be interested in which words occur most frequently or in
     having an alphabetized table of how frequently each word occurs.
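
   The first two problems are easy to reproduce.  The following sketch
feeds the program a made-up sentence (the input and the shell variable
`counts' are assumptions for illustration) and sorts the result only to
make the listing repeatable:

```shell
# Made-up input in which the same word appears with different
# capitalization and with trailing punctuation attached.
counts=$(printf 'Bartender, call the bartender.\n' |
    awk '{
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' |
    LC_ALL=C sort)
printf '%s\n' "$counts"
```

   The table has four entries, each with a count of one: `Bartender,'
and `bartender.' are kept apart both by case and by the punctuation
that clings to them.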

   The way to solve these problems is to use some of `awk''s more
advanced features.  First, we use `tolower' to remove case
distinctions.  Next, we use `gsub' to remove punctuation characters.
Finally, we use the system `sort' utility to process the output of the
`awk' script.  Here is the new version of the program:

     # wordfreq.awk --- print list of word frequencies
     
     {
         $0 = tolower($0)    # remove case distinctions
         # remove punctuation
         gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }
     
     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }
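
   The effect of the two cleanup steps can be checked on a single line.
This sketch assumes an `awk' that supports POSIX character classes such
as `[:alnum:]' (some older implementations do not); the sample text and
the shell variable `cleaned' are made up for illustration:

```shell
# Apply the same two cleanup steps as wordfreq.awk to one sample line,
# then print the resulting field count and the rewritten record.
cleaned=$(printf 'Well-known: the END, the end.\n' |
    awk '{
        $0 = tolower($0)                          # "END" becomes "end"
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # drop "-", ":", ",", "."
        print NF ": " $0
    }')
printf '%s\n' "$cleaned"
```

   Note that punctuation is deleted rather than replaced by a blank, so
a hyphenated word such as `Well-known' is counted as the single word
`wellknown'.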

   Assuming we have saved this program in a file named `wordfreq.awk',
and that the data is in `file1', the following pipeline:

     awk -f wordfreq.awk file1 | sort -k 2nr

produces a table of the words appearing in `file1' in order of
decreasing frequency.  The `awk' program suitably massages the data and
produces a word frequency table, which is not ordered.

   The `awk' script's output is then sorted by the `sort' utility and
printed on the terminal.  The options given to `sort' specify a sort
that uses the second field of each input line (`-k 2'), that the sort
keys should be treated as numeric quantities (`n'; otherwise `15' would
come before `5'), and that the sorting should be done in descending
(reverse) order (`r').  The obsolete spelling `+1 -nr', which asks for
the same key by skipping one field, is no longer part of POSIX, and
many current `sort' implementations reject it.
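
   The difference between a textual and a numeric key can be seen on a
tiny table.  This is only a sketch: the three entries and the shell
variable names are made up, and it uses the POSIX `-k' key syntax of
`sort':

```shell
# Made-up frequency table: word, tab, count.
table=$(printf 'apple\t5\npear\t15\nfig\t100\n')

# Descending sort on field 2, compared character by character.
lexical=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2r)

# Descending sort on field 2, compared as numbers.
numeric=$(printf '%s\n' "$table" | sort -k 2nr)

printf 'textual:\n%s\nnumeric:\n%s\n' "$lexical" "$numeric"
```

   With the textual key, `5' sorts ahead of `15' and `100' because only
the first differing character matters; with the numeric key the counts
come out as 100, 15, 5.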

   The `sort' could even be done from within the program, by changing
the `END' action to:

     END {
         sort = "sort -k 2nr"    # second field, numeric, reversed
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)
     }

   This way of sorting must be used on systems that do not have true
pipes at the command-line (or batch-file) level.  See the general
operating system documentation for more information on how to use the
`sort' program.
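
   As a sketch of the whole thing in one piece, the following runs the
cleanup rule together with an `END' rule that pipes into `sort'.  The
input sentence and the shell variable `sorted' are made up for
illustration, the `sort' command is written with the POSIX `-k 2nr' key
syntax, and an `awk' with POSIX character classes is assumed:

```shell
# Count words in one made-up line; awk itself starts sort and closes
# the pipe, so no shell-level pipeline follows the awk command.
sorted=$(printf 'The cat saw the other cat. The end.\n' |
    awk '{
        $0 = tolower($0)
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        sort = "sort -k 2nr"
        for (word in freq)
            printf "%s\t%d\n", word, freq[word] | sort
        close(sort)
    }')
printf '%s\n' "$sorted"
```

   The table starts with `the' (3) and `cat' (2); the order of the
remaining words, which all occur once, depends on how `sort' breaks
ties.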

