(gawk.info)Dupword Program


Finding Duplicated Words in a Document
--------------------------------------

   A common error when writing large amounts of prose is to accidentally
duplicate words.  Typically you will see this in text as something like
"the the program does the following ...."  When the text is online,
often the duplicated words occur at the end of one line and the
beginning of another, making them very difficult to spot.

   This program, `dupword.awk', scans through a file one line at a time
and looks for adjacent occurrences of the same word.  It also saves the
last word on a line (in the variable `prev') for comparison with the
first word on the next line.

   The first statement makes sure that the line is all lowercase,
so that, for example, "The" and "the" compare equal to each other.  The
next statement replaces non-alphanumeric and non-whitespace characters
with spaces, so that punctuation does not affect the comparison either.
The characters are replaced with spaces so that formatting controls
don't create nonsense words (e.g., the Texinfo `@code{NF}' becomes
`codeNF' if punctuation is simply deleted).  The record is then
re-split into fields, yielding just the actual words on the line, and
ensuring that there are no empty fields.
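
   As a rough illustration of the normalization just described, the
following shell sketch feeds one made-up line through the same three
statements and prints the resulting fields separated by single spaces.
The function name `normalize' and the sample text are assumptions for
this example only; they are not part of `dupword.awk'.

```sh
# Sketch only: show what one line looks like after normalization.
# "normalize" is a hypothetical helper name, not part of dupword.awk.
normalize() {
    awk '{
        $0 = tolower($0)                      # fold case
        gsub(/[^[:alnum:][:blank:]]/, " ")    # punctuation becomes spaces
        $0 = $0                               # re-split into fields
        for (i = 1; i <= NF; i++)
            printf("%s%s", $i, (i < NF ? " " : "\n"))
    }'
}

printf '%s\n' 'The Texinfo @code{NF} variable.' | normalize
```

   With an awk that supports POSIX character classes, this should print
`the texinfo code nf variable': `code' and `nf' stay separate words
because the braces were replaced with spaces rather than deleted.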

   If there are no fields left after removing all the punctuation, the
current record is skipped.  Otherwise, the program loops through each
word, comparing it to the previous one:

     # dupword.awk --- find duplicate words in text
     {
         $0 = tolower($0)                     # fold case: "The" matches "the"
         gsub(/[^[:alnum:][:blank:]]/, " ")   # punctuation becomes spaces
         $0 = $0                              # re-split
         if (NF == 0)
             next
         if ($1 == prev)
             printf("%s:%d: duplicate %s\n",
                 FILENAME, FNR, $1)
         for (i = 2; i <= NF; i++)
             if ($i == $(i-1))
                 printf("%s:%d: duplicate %s\n",
                     FILENAME, FNR, $i)
         prev = $NF
     }
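
   One way to try the program is sketched below.  It assumes a POSIX
shell with `mktemp' available; the directory variable `workdir', the
file name `sample.txt', and the sample text are all made up for the
example.  The program text is written to a scratch file so that the
sketch is self-contained.

```sh
# Sketch only: run the dupword logic on a two-line sample file.
# workdir and sample.txt are hypothetical names chosen for this example.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT     # remove the scratch directory on exit

cat > "$workdir/dupword.awk" <<'EOF'
{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ")
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
}
EOF

# One duplicate inside a line, and one that spans the line break.
printf '%s\n' \
    'This is the the first line, and it ends with' \
    'with a repeated word.' > "$workdir/sample.txt"

(cd "$workdir" && awk -f dupword.awk sample.txt)
```

   For this sample the program should report `sample.txt:1: duplicate
the' for the repeat within the first line, and `sample.txt:2: duplicate
with' for the repeat that spans the line break, which is the case that
the variable `prev' exists to catch.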
