Info Node: (gawk.info)Uniq Program


Printing Non-Duplicated Lines of Text
-------------------------------------

The `uniq' utility reads sorted lines of data on its standard input, and by default removes duplicate lines. In other words, it only prints unique lines--hence the name. `uniq' has a number of options. The usage is as follows:

     uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]

The option meanings are:

`-d'
     Only print repeated lines.

`-u'
     Only print non-repeated lines.

`-c'
     Count lines. This option overrides `-d' and `-u'. Both repeated and non-repeated lines are counted.

`-N'
     Skip N fields before comparing lines. The definition of fields is similar to `awk''s default: non-whitespace characters separated by runs of spaces and/or tabs.

`+N'
     Skip N characters before comparing lines. Any fields specified with `-N' are skipped first.

`INPUT FILE'
     Data is read from the input file named on the command line, instead of from the standard input.

`OUTPUT FILE'
     The generated output is sent to the named output file, instead of to the standard output.

Normally `uniq' behaves as if both the `-d' and `-u' options are provided.

`uniq' uses the `getopt' library function (Note: Processing Command-Line Options.) and the `join' library function (Note: Merging an Array into a String.).

The program begins with a `usage' function and then a brief outline of the options and their meanings in a comment. The `BEGIN' rule deals with the command-line arguments and options. It uses a trick to get `getopt' to handle options of the form `-25', treating such an option as the option letter `2' with an argument of `5'. If indeed two or more digits are supplied (`Optarg' looks like a number), `Optarg' is concatenated with the option digit and then the result is added to zero to make it into a number. If there is only one digit in the option, then `Optarg' is not needed. `Optind' must be decremented so that `getopt' processes it next time. This code is admittedly a bit tricky.
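The digit-option trick described above can be tried in isolation. The following is only a sketch: `field_count' is a hypothetical helper that does not appear in `uniq.awk', and its two arguments stand in for the option letter and the `Optarg' value that `getopt' would hand back. It shows just the string-to-number conversion; the real program must also decrement `Optind' in the single-digit case, which this sketch does not model.

```shell
# Sketch only: field_count is a hypothetical helper, not part of uniq.awk.
# c is the option "letter" (a digit); optarg is what getopt handed back.
result=$(awk '
function field_count(c, optarg) {
    if (optarg ~ /^[0-9]+$/)
        return (c optarg) + 0    # "-25": c = "2", optarg = "5" -> 25
    return c + 0                 # "-5 file": optarg is not a number -> 5
}
BEGIN {
    print field_count("2", "5")
    print field_count("5", "data.txt")
    print field_count("1", "07")
}')
printf '%s\n' "$result"
```

This prints 25, 5, and 107, one per line. Concatenating first and adding zero afterwards keeps the remaining digits in their original positions before the string becomes a number.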
If no options are supplied, then the default is taken, to print both repeated and non-repeated lines. The output file, if provided, is assigned to `outputfile'. Early on, `outputfile' is initialized to the standard output, `/dev/stdout':

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt and join library functions

     function usage(    e)
     {
         e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
         print e > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only non-repeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first

     BEGIN   \
     {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[0-9]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }

         if (ARGV[Optind] ~ /^\+[0-9]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

The following function, `are_equal', compares the current line, `$0', to the previous line, `last'. It handles skipping fields and characters. If no field count and no character count are specified, `are_equal' simply returns one or zero depending upon the result of a simple string comparison of `last' and `$0'.

Otherwise, things get more complicated. If fields have to be skipped, each line is broken into an array using `split' (Note: String Manipulation Functions.); the desired fields are then joined back into a line using `join'. The joined lines are stored in `clast' and `cline'. If no fields are skipped, `clast' and `cline' are set to `last' and `$0', respectively.
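The field-skipping step can be sketched on its own. This is an illustration under stated assumptions, not the program's code: `join_fields' is a simplified stand-in for the `join' library function, and `skip_fields' and the sample lines are made up for the example.

```shell
# Sketch only: join_fields and skip_fields are hypothetical stand-ins
# (join_fields loosely mimics the join library function).
out=$(awk '
function join_fields(arr, start, end,    result, i) {
    result = arr[start]
    for (i = start + 1; i <= end; i++)
        result = result " " arr[i]
    return result
}
function skip_fields(line, fcount,    parts, n) {
    n = split(line, parts)            # default splitting on blanks and tabs
    return join_fields(parts, fcount + 1, n)
}
BEGIN {
    a = skip_fields("10:01  alice   login", 1)
    b = skip_fields("10:07 alice login", 1)
    print a
    print ((a == b) ? "equal" : "different")
}')
printf '%s\n' "$out"
```

With the first field skipped, both sample lines reduce to `alice login' and compare equal. Because the pieces are rejoined with single spaces, differences in the amount of whitespace between fields no longer affect the comparison.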
Finally, if characters are skipped, `substr' is used to strip off the leading `charcount' characters in `clast' and `cline'. The two strings are then compared and `are_equal' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }

The following two rules are the body of the program. The first one is executed only for the very first line of data. It sets `last' equal to `$0', so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable `equal' is one or zero, depending upon the results of `are_equal''s comparison. If `uniq' is counting lines (`-c') and the lines are equal, then it increments the `count' variable. Otherwise, it prints the count and the previous line and resets `count', since the two lines are not equal.

If `uniq' is not counting, and if the lines are equal, `count' is incremented. Nothing is printed, since the point is to remove duplicates. Otherwise, if `uniq' is printing repeated lines and the previous line was seen more than once, or if `uniq' is printing non-repeated lines and the previous line was seen only once, then that line is printed, and `count' is reset.
Finally, similar logic is used in the `END' rule to print the final line of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
     }
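To watch the counting logic run without the `getopt' and `join' libraries, the `-c' branch can be reduced to a few lines. This is a stripped-down sketch, not the program above: a plain string comparison stands in for `are_equal', there is no option handling, the sample input is made up, and the `NR > 0' guard in the `END' rule is an addition so that empty input prints nothing.

```shell
# Sketch only: a stripped-down version of the -c branch, with plain string
# comparison standing in for are_equal and no option handling.
counted=$(printf '%s\n' apple apple banana cherry cherry cherry | awk '
NR == 1 { last = $0; count = 1; next }
{
    if ($0 == last)
        count++
    else {
        printf("%4d %s\n", count, last)
        last = $0
        count = 1
    }
}
END { if (NR > 0) printf("%4d %s\n", count, last) }')
printf '%s\n' "$counted"
```

The sample input yields `   2 apple', `   1 banana', and `   3 cherry'. Each line is printed only when the next, different line arrives, which is why the `END' rule is needed to flush the last group.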
