Info Node: (gawk.info)Uniq Program

Printing Non-Duplicated Lines of Text
-------------------------------------

   The `uniq' utility reads sorted lines of data on its standard input,
and by default removes duplicate lines.  In other words, it only prints
unique lines--hence the name.  `uniq' has a number of options. The
usage is as follows:

     uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]

   The option meanings are:

`-d'
     Only print repeated lines.

`-u'
     Only print non-repeated lines.

`-c'
     Count lines. This option overrides `-d' and `-u'.  Both repeated
     and non-repeated lines are counted.

`-N'
     Skip N fields before comparing lines.  The definition of fields is
     similar to `awk''s default: non-whitespace characters separated by
     runs of spaces and/or tabs.

`+N'
     Skip N characters before comparing lines.  Any fields specified
     with `-N' are skipped first.

`INPUT FILE'
     Data is read from the input file named on the command line,
     instead of from the standard input.

`OUTPUT FILE'
     The generated output is sent to the named output file, instead of
     to the standard output.

   Normally `uniq' behaves as if both the `-d' and `-u' options are
provided.

   `uniq' uses the `getopt' library function (see Processing
Command-Line Options) and the `join' library function (see Merging an
Array into a String).

   The program begins with a `usage' function and then a brief outline
of the options and their meanings in a comment.  The `BEGIN' rule deals
with the command-line arguments and options. It uses a trick to get
`getopt' to handle options of the form `-25', treating such an option
as the option letter `2' with an argument of `5'. If indeed two or more
digits are supplied (`Optarg' looks like a number), `Optarg' is
concatenated with the option digit and then the result is added to zero
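to make it into a number.  If there is only one digit in the option,
then `Optarg' is not needed; it holds the following command-line
argument, which `getopt' consumed as the option's argument.  `Optind'
must therefore be decremented so that `getopt' processes that argument
next time.  This code is admittedly a bit tricky.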
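
   The digit handling can be tried in isolation.  The following is a
minimal sketch, not part of `uniq.awk': the function name
`digits_to_count' is hypothetical, and its two parameters stand in for
the values that `getopt' would supply in `c' and `Optarg'.  The real
program also decrements `Optind' in the single-digit case, which the
sketch leaves out:

```awk
# Hypothetical helper for illustration; not part of uniq.awk.
# c      - the option "letter" returned by getopt (a single digit)
# optarg - the argument getopt attached to that option
function digits_to_count(c, optarg)
{
    # More digits followed the first one: glue them together
    # and add zero to force a numeric value.
    if (optarg ~ /^[0-9]+$/)
        return (c optarg) + 0

    # Otherwise the "argument" was really the next command-line
    # word, so only the single digit counts.
    return c + 0
}

BEGIN {
    print digits_to_count("2", "5")       # as for -25
    print digits_to_count("5", "data")    # as for -5 data
    print digits_to_count("5", "-c")      # as for -5 -c
}
```

   Run with `awk -f', this prints 25, 5, and 5 on separate lines.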

   If no options are supplied, then the default is taken, to print both
repeated and non-repeated lines.  The output file, if provided, is
assigned to `outputfile'.  Early on, `outputfile' is initialized to the
standard output, `/dev/stdout':

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt and join library functions
     function usage(    e)
     {
         e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
         print e > "/dev/stderr"
         exit 1
     }
     
     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only non-repeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first
     
     BEGIN   \
     {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[0-9]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }
     
         if (ARGV[Optind] ~ /^\+[0-9]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }
     
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""
     
         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1
     
         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

   The following function, `are_equal', compares the current line,
`$0', to the previous line, `last'.  It handles skipping fields and
characters.  If no field count and no character count are specified,
`are_equal' simply returns one or zero depending upon the result of a
simple string comparison of `last' and `$0'.  Otherwise, things get more
complicated.  If fields have to be skipped, each line is broken into an
array using `split' (see String Manipulation Functions); the desired
fields are then joined back into a line using `join'.  The joined lines
are stored in `clast' and `cline'.  If no
fields are skipped, `clast' and `cline' are set to `last' and `$0',
respectively.  Finally, if characters are skipped, `substr' is used to
strip off the leading `charcount' characters in `clast' and `cline'.
The two strings are then compared and `are_equal' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)
     
         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }
     
         return (clast == cline)
     }
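
   To see what `are_equal' actually compares, the key computation can
be pulled out into a helper.  This is a sketch under assumptions:
`comparison_key' is a hypothetical name, the field and character counts
are passed as parameters rather than read from the globals `fcount' and
`charcount', and the `join' shown here is a simplified stand-in for the
library function that always separates fields with a single space:

```awk
# Simplified stand-in for the library join function (assumption:
# fields are always rejoined with one space).
function join(array, start, end,    result, i)
{
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result " " array[i]
    return result
}

# Hypothetical helper: return the part of LINE that would take
# part in the comparison after skipping fields, then characters.
function comparison_key(line, nfields, nchars,    n, parts, key)
{
    key = line
    if (nfields > 0) {
        n = split(line, parts)
        key = join(parts, nfields + 1, n)
    }
    if (nchars)
        key = substr(key, nchars + 1)
    return key
}

BEGIN {
    print comparison_key("10:01 alice login", 1, 0)
    print comparison_key("xxabc", 0, 2)
    print comparison_key("id1 xxabc def", 1, 2)
}
```

   The three calls print `alice login', `abc', and `abc def'.  Two
lines are treated as duplicates exactly when their keys are equal.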

   The following two rules are the body of the program.  The first one
is executed only for the very first line of data.  It sets `last' equal
to `$0', so that subsequent lines of text have something to be compared
to.

   The second rule does the work. The variable `equal' is one or zero,
depending upon the results of `are_equal''s comparison. If `uniq' is
counting lines (`-c') and the lines are equal, then it increments
the `count' variable.  Otherwise it prints the count and the previous
line and resets `count', since the two lines are not equal.

   If `uniq' is not counting, and if the lines are equal, `count' is
incremented.  Nothing is printed, since the point is to remove
duplicates.  Otherwise, if `uniq' is printing repeated lines and more
than one line is seen, or if `uniq' is printing non-repeated lines and
only one line is seen, then the previous line is printed.  Either way,
`last' is set to the current line and `count' is reset.

   Finally, similar logic is used in the `END' rule to print the final
line of input data:

     NR == 1 {
         last = $0
         next
     }
     
     {
         equal = are_equal()
     
         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }
     
         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }
     
     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
     }
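
   The test that decides whether a group of equal lines is printed
appears in both the second rule and the `END' rule.  As an
illustration, it can be isolated in a function; `should_print' is a
hypothetical name, and the flags are passed as parameters here instead
of being read from the globals `repeated_only' and `non_repeated_only':

```awk
# Hypothetical helper: should a group of COUNT equal lines be printed?
function should_print(count, rep_only, non_rep_only)
{
    return ((rep_only && count > 1) ||
            (non_rep_only && count == 1))
}

BEGIN {
    # Each line shows the result for a group of 1 and a group of 3.

    # Default: both flags set, so every group is printed once.
    print should_print(1, 1, 1), should_print(3, 1, 1)

    # -d: only groups with repeats are printed.
    print should_print(1, 1, 0), should_print(3, 1, 0)

    # -u: only lines that occur once are printed.
    print should_print(1, 0, 1), should_print(3, 0, 1)
}
```

   This prints `1 1', `0 1', and `1 0', matching the default, `-d',
and `-u' behaviors described earlier.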

