GNU Info

Info Node: (gawk.info)Cut Program

Cutting out Fields and Columns
------------------------------

   The `cut' utility selects, or "cuts," characters or fields from its
standard input and sends them to its standard output.  Fields are
separated by tabs by default, but you may supply a command-line option
to change the field "delimiter" (i.e., the field separator character).
`cut''s definition of fields is less general than `awk''s.
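
   As a rough illustration of field selection in `awk' terms, the
following sketch mimics `cut -d: -f1,3' for a single hard-coded record.
The sample record and the choice of `:' as the delimiter are
assumptions made for this example only; they are not part of the
program developed in this node:

     BEGIN {
         FS = ":"             # single-character delimiter, as with -d
         OFS = FS             # output uses the same separator
         $0 = "root:x:0:0"    # assumed sample record, resplit using FS
         print $1, $3         # fields 1 and 3, joined by OFS
     }

This should print `root:0'.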

   A common use of `cut' might be to pull out just the login name of
logged-on users from the output of `who'.  For example, the following
pipeline generates a sorted, unique list of the logged-on users:

     who | cut -c1-8 | sort | uniq

   The options for `cut' are:

`-c LIST'
     Use LIST as the list of characters to cut out.  Items within the
     list may be separated by commas, and ranges of characters can be
     separated with dashes.  The list `1-8,15,22-35' specifies
     characters 1 through 8, 15, and 22 through 35.

`-f LIST'
     Use LIST as the list of fields to cut out.

`-d DELIM'
     Use DELIM as the field separator character instead of the tab
     character.

`-s'
     Suppress printing of lines that do not contain the field delimiter.

   The `awk' implementation of `cut' uses the `getopt' library function
(Note: Processing Command-Line Options.) and the `join' library
function (Note: Merging an Array into a String.).

   The program begins with a comment describing the options, the library
functions needed, and a `usage' function that prints out a usage
message and exits.  `usage' is called if invalid arguments are supplied:

     # cut.awk --- implement cut in awk
     # Options:
     #    -f list     Cut fields
     #    -d c        Field delimiter character
     #    -c list     Cut characters
     #
     #    -s          Suppress lines without the delimiter
     #
     # Requires getopt and join library functions
     
     function usage(    e1, e2)
     {
         e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
         e2 = "usage: cut [-c list] [files...]"
         print e1 > "/dev/stderr"
         print e2 > "/dev/stderr"
         exit 1
     }

The variables `e1' and `e2' are used so that the function fits nicely
on the screen.

   Next comes a `BEGIN' rule that parses the command-line options.  It
sets `FS' to a single tab character, because that is `cut''s default
field separator.  The output field separator is also set to be the same
as the input field separator.  Then `getopt' is used to step through
the command-line options.  One or the other of the variables
`by_fields' or `by_chars' is set to true, to indicate that processing
should be done by fields or by characters, respectively.  When cutting
by characters, the output field separator is set to the null string.

     BEGIN    \
     {
         FS = "\t"    # default
         OFS = FS
         while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
             if (c == "f") {
                 by_fields = 1
                 fieldlist = Optarg
             } else if (c == "c") {
                 by_chars = 1
                 fieldlist = Optarg
                 OFS = ""
             } else if (c == "d") {
                 if (length(Optarg) > 1) {
                     printf("Using first character of %s" \
                     " for delimiter\n", Optarg) > "/dev/stderr"
                     Optarg = substr(Optarg, 1, 1)
                 }
                 FS = Optarg
                 OFS = FS
                 if (FS == " ")    # defeat awk semantics
                     FS = "[ ]"
             } else if (c == "s")
                 suppress++
             else
                 usage()
         }
     
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

   Special care is taken when the field delimiter is a space.  Using a
single space (`" "') for the value of `FS' is incorrect--`awk' would
separate fields with runs of spaces, tabs, and/or newlines, and we want
them to be separated with individual spaces.  Also, note that after
`getopt' is through, we have to clear out all the elements of `ARGV'
from 1 up to, but not including, `Optind', so that `awk' does not try
to process the command-line options as file names.
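
   The difference between the two separators can be seen with `split',
which follows the same rules as `FS'.  This is only a small sketch; the
sample string and the variable names are assumptions chosen for the
demonstration:

     BEGIN {
         s = "a  b c"    # assumed sample: two spaces between a and b
         n1 = split(s, x, " ")      # awk semantics: runs of whitespace
         n2 = split(s, y, "[ ]")    # each single space separates
         print n1, n2
     }

This should print `3 4': with `"[ ]"' the two adjacent spaces enclose
an empty field, which is what `cut -d " "' would produce.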

   After dealing with the command-line options, the program verifies
that the options make sense.  Only one or the other of `-c' and `-f'
should be used, and both require a field list.  Then the program calls
either `set_fieldlist' or `set_charlist' to pull apart the list of
fields or characters:

         if (by_fields && by_chars)
             usage()
     
         if (by_fields == 0 && by_chars == 0)
             by_fields = 1    # default
     
         if (fieldlist == "") {
             print "cut: needs list for -c or -f" > "/dev/stderr"
             exit 1
         }
     
         if (by_fields)
             set_fieldlist()
         else
             set_charlist()
     }

   `set_fieldlist' splits the field list apart at the commas into an
array.  Then, for each element of the array, it looks to see if it is
actually a range, and if so, splits it apart.  The range is verified to
make sure the first number is smaller than the second.  Each number in
the list is added to the `flist' array, which simply lists the fields
that will be printed.  Normal field splitting is used, so the program
lets `awk' do the work of splitting each record into fields:

     function set_fieldlist(        n, m, i, j, k, f, g)
     {
         n = split(fieldlist, f, ",")
         j = 1    # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # a range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("bad field list: %s\n",
                                       f[i]) > "/dev/stderr"
                     exit 1
                 }
                 for (k = g[1]; k <= g[2]; k++)
                     flist[j++] = k
             } else
                 flist[j++] = f[i]
         }
         nfields = j - 1
     }
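
   To see what ends up in the list, the range expansion can be tried
in isolation.  The sketch below is not part of `cut.awk': `expand_list'
is a hypothetical helper that takes the list and the output array as
arguments, and it omits the range validation shown above:

     # expand_list --- hypothetical helper, for illustration only
     function expand_list(list, out,    n, i, j, k, f, g)
     {
         n = split(list, f, ",")
         j = 0
         for (i = 1; i <= n; i++) {
             if (split(f[i], g, "-") == 2) {    # a range
                 for (k = g[1] + 0; k <= g[2] + 0; k++)
                     out[++j] = k
             } else
                 out[++j] = f[i] + 0
         }
         return j    # number of fields listed
     }

     BEGIN {
         n = expand_list("1,3-5", fields)
         for (i = 1; i <= n; i++)
             printf "%s%s", fields[i], (i < n ? " " : "\n")
     }

For the list `1,3-5' this should print `1 3 4 5'.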

   The `set_charlist' function is more complicated than `set_fieldlist'.
The idea here is to use `gawk''s `FIELDWIDTHS' variable (Note: Reading
Fixed-Width Data.), which describes constant-width input.  When using a
character list, that is exactly what we have.

   Setting up `FIELDWIDTHS' is more complicated than simply listing the
fields that need to be printed.  We have to keep track of the fields to
print and also the intervening characters that have to be skipped.  For
example, suppose you wanted characters 1 through 8, 15, and 22 through
35.  You would use `-c 1-8,15,22-35'.  The necessary value for
`FIELDWIDTHS' is `"8 6 1 6 14"'.  This yields five fields, and the
fields to print are `$1', `$3', and `$5'.  The intermediate fields are
"filler": the characters in between the desired data.  `flist' lists
the fields to print, and `t' tracks the complete field list, including
filler fields:

     function set_charlist(    field, i, j, f, g, n, m, t,
                               filler, last, len)
     {
         field = 1   # count total fields
         n = split(fieldlist, f, ",")
         j = 1       # index in flist
         for (i = 1; i <= n; i++) {
             if (index(f[i], "-") != 0) { # range
                 m = split(f[i], g, "-")
                 if (m != 2 || g[1] >= g[2]) {
                     printf("bad character list: %s\n",
                                    f[i]) > "/dev/stderr"
                     exit 1
                 }
                 len = g[2] - g[1] + 1
                 if (g[1] > 1)  # compute length of filler
                     filler = g[1] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = len  # length of field
                 last = g[2]
                 flist[j++] = field - 1
             } else {
                 if (f[i] > 1)
                     filler = f[i] - last - 1
                 else
                     filler = 0
                 if (filler)
                     t[field++] = filler
                 t[field++] = 1
                 last = f[i]
                 flist[j++] = field - 1
             }
         }
         FIELDWIDTHS = join(t, 1, field - 1)
         nfields = j - 1
     }
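
   The effect of the computed value can be checked directly in `gawk'.
The following sketch hard-codes the `FIELDWIDTHS' value from the
`-c 1-8,15,22-35' example; the sample record is an assumption made up
for the demonstration, and the program needs `gawk' because other
`awk' implementations may not support `FIELDWIDTHS':

     BEGIN {
         FIELDWIDTHS = "8 6 1 6 14"    # fields 2 and 4 are filler
         # assumed sample record; assigning $0 resplits by widths
         $0 = "abcdefghijklmnopqrstuvwxyz0123456789"
         print $1 "|" $3 "|" $5
     }

This should print `abcdefgh|o|vwxyz012345678', that is, characters 1
through 8, 15, and 22 through 35 of the record.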

   Next is the rule that actually processes the data.  If the `-s'
option is given, then `suppress' is true.  The first `if' statement
makes sure that the input record does have the field separator.  If
`cut' is processing fields, `suppress' is true, and the field separator
character is not in the record, then the record is skipped.

   If the record is valid, then `gawk' has split the data into fields,
either using the character in `FS' or using fixed-length fields and
`FIELDWIDTHS'.  The loop goes through the list of fields that should be
printed.  The corresponding field is printed if it contains data.  If
the next field also has data, then the separator character is written
out between the fields:

     {
         if (by_fields && suppress && index($0, FS) == 0)
             next
     
         for (i = 1; i <= nfields; i++) {
             if ($flist[i] != "") {
                 printf "%s", $flist[i]
                 if (i < nfields && $flist[i+1] != "")
                     printf "%s", OFS
             }
         }
         print ""
     }

   This version of `cut' relies on `gawk''s `FIELDWIDTHS' variable to
do the character-based cutting.  While it is possible in other `awk'
implementations to use `substr' (Note: String Manipulation Functions.),
it is also extremely painful.  The `FIELDWIDTHS' variable supplies an
elegant solution to the problem of picking the input line apart by
characters.
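
   For comparison, the core of a `substr'-based approach might look
like the sketch below.  It is not part of `cut.awk': `cut_chars' is a
hypothetical function, it does no validation, and it assumes the list
is sorted and non-overlapping.  A complete version would also have to
handle the error checking done in `set_charlist':

     # cut_chars --- hypothetical helper, for illustration only
     function cut_chars(line, list,    n, i, f, g, out)
     {
         n = split(list, f, ",")
         out = ""
         for (i = 1; i <= n; i++) {
             if (split(f[i], g, "-") == 2)    # a range
                 out = out substr(line, g[1], g[2] - g[1] + 1)
             else                             # a single character
                 out = out substr(line, f[i], 1)
         }
         return out
     }

     BEGIN {
         print cut_chars("abcdefghijklmnopqrstuvwxyz", "1-3,5,24-26")
     }

This should print `abcexyz'.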

