Copyright (C) 2000-2012
GNU Info (gawk.info)Uniq Program

Printing Non-Duplicated Lines of Text
-------------------------------------

The `uniq' utility reads sorted lines of data on its standard input,
and by default removes duplicate lines.  In other words, it only prints
unique lines--hence the name.  `uniq' has a number of options.  The
usage is as follows:

     uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]

The option meanings are:

`-d'
     Only print repeated lines.

`-u'
     Only print non-repeated lines.

`-c'
     Count lines.  This option overrides `-d' and `-u'.  Both repeated
     and non-repeated lines are counted.

`-N'
     Skip N fields before comparing lines.  The definition of fields
     is similar to `awk''s default: non-whitespace characters separated
     by runs of spaces and/or tabs.

`+N'
     Skip N characters before comparing lines.  Any fields specified
     with `-N' are skipped first.

`INPUT FILE'
     Data is read from the input file named on the command line,
     instead of from the standard input.

`OUTPUT FILE'
     The generated output is sent to the named output file, instead of
     to the standard output.

Normally `uniq' behaves as if both the `-d' and `-u' options are
provided.

`uniq' uses the `getopt' library function (Note: Processing
Command-Line Options.) and the `join' library function (Note: Merging
an Array into a String.).

The program begins with a `usage' function and then a brief outline of
the options and their meanings in a comment.  The `BEGIN' rule deals
with the command-line arguments and options.  It uses a trick to get
`getopt' to handle options of the form `-25', treating such an option
as the option letter `2' with an argument of `5'.  If indeed two or
more digits are supplied (`Optarg' looks like a number), `Optarg' is
concatenated with the option digit and then the result is added to zero
to make it into a number.  If there is only one digit in the option,
then `Optarg' is not needed.  `Optind' must be decremented so that
`getopt' processes it next time.  This code is admittedly a bit tricky.

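The digit handling described above can be tried without the `getopt'
library function.  The following is a minimal sketch, not part of
`uniq.awk': the shell function `field_count' is a hypothetical name, and
it is simply handed the option digit and the value that `getopt' would
have left in `Optarg'.

```shell
# Hypothetical helper (not in uniq.awk): combine an option digit with
# the value getopt would have stored in Optarg, as the BEGIN rule does.
field_count() {
    awk -v c="$1" -v Optarg="$2" 'BEGIN {
        if (Optarg ~ /^[0-9]+$/)
            fcount = (c Optarg) + 0    # "2" and "5" -> "25" -> 25
        else
            fcount = c + 0             # Optarg is really the next argument
        print fcount
    }'
}

field_count 2 5       # models the option -25
field_count 3 "+4"    # models -3 followed by +4
```

The first call prints 25 and the second prints 3.  The second call is the
case in which the real program also decrements `Optind', so that the
argument swallowed by `getopt' is examined again.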

If no options are supplied, then the default is taken, to print both
repeated and non-repeated lines.  The output file, if provided, is
assigned to `outputfile'.  Early on, `outputfile' is initialized to the
standard output, `/dev/stdout':

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt and join library functions

     function usage(    e)
     {
         e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
         print e > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only non-repeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first

     BEGIN   \
     {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[0-9]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }

         if (ARGV[Optind] ~ /^\+[0-9]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

The following function, `are_equal', compares the current line, `$0',
to the previous line, `last'.  It handles skipping fields and
characters.  If no field count and no character count are specified,
`are_equal' simply returns one or zero depending upon the result of a
simple string comparison of `last' and `$0'.  Otherwise, things get
more complicated.  If fields have to be skipped, each line is broken
into an array using `split' (Note: String Manipulation Functions.); the
desired fields are then joined back into a line using `join'.  The
joined lines are stored in `clast' and `cline'.  If no fields are
skipped, `clast' and `cline' are set to `last' and `$0', respectively.
Finally, if characters are skipped, `substr' is used to strip off the
leading `charcount' characters in `clast' and `cline'.  The two strings
are then compared and `are_equal' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }

The following two rules are the body of the program.  The first one is
executed only for the very first line of data.  It sets `last' equal to
`$0', so that subsequent lines of text have something to be compared
to.

The second rule does the work.  The variable `equal' is one or zero,
depending upon the results of `are_equal''s comparison.  If `uniq' is
counting repeated lines, and the lines are equal, then it increments
the `count' variable.  Otherwise it prints the line and resets `count',
since the two lines are not equal.

If `uniq' is not counting, and if the lines are equal, `count' is
incremented.  Nothing is printed, since the point is to remove
duplicates.  Otherwise, if `uniq' is counting repeated lines and more
than one line is seen, or if `uniq' is counting non-repeated lines and
only one line is seen, then the line is printed, and `count' is reset.

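As a side illustration of what `are_equal' actually compares, the
following sketch prints the portion of each input line that is left
after skipping fields and then characters.  It is an assumption-laden
stand-in rather than the program's own code: `compare_key' is a
hypothetical name, and a plain loop over the fields replaces the
`split' and `join' pair so that the example needs no library functions.

```shell
# Hypothetical helper (not in uniq.awk): print the text that would be
# compared after skipping fcount fields and then charcount characters.
compare_key() {
    awk -v fcount="$1" -v charcount="$2" '{
        key = $0
        if (fcount > 0) {
            # Rebuild the line from fields fcount+1 .. NF, joined by
            # single spaces, standing in for split followed by join.
            key = ""
            for (i = fcount + 1; i <= NF; i++)
                key = key (i > fcount + 1 ? " " : "") $i
        }
        if (charcount > 0)
            key = substr(key, charcount + 1)
        print key
    }'
}

printf '%s\n' '10:01 alice login' '10:02 alice login' | compare_key 1 0
```

Both sample lines reduce to `alice login', so with one field skipped
they would be treated as duplicates even though their first fields
differ.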

Finally, similar logic is used in the `END' rule to print the final
line of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
     }

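To see the effect of the counting and selection logic on a small input,
the rules can be condensed into a version that compares whole lines
only.  This is a sketch under stated assumptions, not a replacement for
the program above: `mini_uniq' and its single `mode' argument (`d',
`u', `c', or empty for the default) are hypothetical, there is no field
or character skipping, and the `NR > 0' guard in the `END' rule is an
addition of this sketch.

```shell
# Hypothetical condensed version (not uniq.awk): whole-line comparison.
# mode: "d" repeated only, "u" non-repeated only, "c" count, "" default.
mini_uniq() {
    awk -v mode="$1" '
    # Print the finished run of identical lines according to mode.
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if (mode == "" ||
                 (mode == "d" && count > 1) ||
                 (mode == "u" && count == 1))
            print last
    }
    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               { emit(); last = $0; count = 1 }
    END        { if (NR > 0) emit() }   # guard added for empty input
    '
}

printf '%s\n' a a b c c c | mini_uniq d
```

With the sample input shown, mode `d' prints `a' and `c', mode `u'
prints only `b', and the default mode prints `a', `b', and `c'.  In
mode `c' the first output line is `   2 a', using the same `%4d %s'
format as the full program.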