GNU Info

Info Node: (gawk.info)Regexp Field Splitting

Using Regular Expressions to Separate Fields
--------------------------------------------

   The previous node discussed the use of single characters or simple
strings as the value of `FS'.  More generally, the value of `FS' may be
a string containing any regular expression.  In this case, each match
in the record for the regular expression separates fields.  For
example, the assignment:

     FS = ", \t"

makes every sequence in an input line that consists of a comma followed
by a space and a tab into a field separator.  (`\t' is an "escape
sequence" that stands for a tab; see Escape Sequences for the complete
list of similar escape sequences.)
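
   As a minimal sketch of the `", \t"' separator, the following shell
commands run `awk' on two sample lines.  The input text and the
variable names `second' and `count' are made up for illustration;
`printf' is used so that `\t' in the input becomes a real tab.

```shell
# Every comma-space-tab sequence separates fields, so $2 is "b".
second=$(printf 'a, \tb, \tc\n' | awk 'BEGIN { FS = ", \t" } { print $2 }')
echo "$second"

# A comma and space with no tab is not a separator, so "a, b" stays
# together as one field and the record has only two fields.
count=$(printf 'a, b, \tc\n' | awk 'BEGIN { FS = ", \t" } { print NF }')
echo "$count"
```

The first command should print `b' and the second `2'.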

   For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used.  `FS' can be
set to `"[ ]"' (left bracket, space, right bracket).  This regular
expression matches a single space and nothing else (see Regular
Expressions).
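
   To see what single-space separation means in practice, here is a
short sketch that counts fields in a sample line with two adjacent
spaces.  The input and the variable names `nf_single' and `nf_default'
are assumptions chosen for illustration.

```shell
# Two spaces sit between "a" and "b".  With FS = "[ ]" each space is
# its own separator, so an empty field appears between them.
nf_single=$(echo 'a  b c' | awk 'BEGIN { FS = "[ ]" } { print NF }')
echo "$nf_single"

# With the default FS, the run of two spaces counts as one separator.
nf_default=$(echo 'a  b c' | awk '{ print NF }')
echo "$nf_default"
```

The first count should be 4 (`a', an empty field, `b', `c') and the
second 3.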

   There is an important difference between the two cases of `FS = " "'
(a single space) and `FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, tabs, or newlines).  For both values of `FS',
fields are separated by "runs" (multiple adjacent occurrences) of
spaces, tabs, and/or newlines.  However, when the value of `FS' is
`" "', `awk' first strips leading and trailing whitespace from the
record and then decides where the fields are.  For example, the
following pipeline prints `b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints `a' (note the extra spaces around each
letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
     >                                  { print $2 }'
     -| a

In this case, the leading space matches the regular expression and so
acts as a separator; the first field is "null" or empty.
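
   The difference between the two settings can also be seen by counting
fields on the same record.  This is only a sketch; the variable names
`line', `nf_space', and `nf_regex' are hypothetical.

```shell
# The same record, with leading and trailing spaces, under both settings.
line=' a  b  c  d '

# Default FS (" "): leading and trailing whitespace is stripped first.
nf_space=$(echo "$line" | awk '{ print NF }')

# Regexp FS: nothing is stripped, so the leading and trailing spaces
# each produce an empty field at the ends of the record.
nf_regex=$(echo "$line" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')

echo "$nf_space $nf_regex"
```

With the default `FS' the record should have 4 fields; with the regular
expression it should have 6, the first and last being empty.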

   The stripping of leading and trailing whitespace also comes into
play whenever `$0' is recomputed.  For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first `print' statement prints the record as it was read, with
leading whitespace intact.  The assignment to `$2' rebuilds `$0' by
concatenating `$1' through `$NF' together, separated by the value of
`OFS'.  Because the leading whitespace was ignored when finding `$1',
it is not part of the new `$0'.  Finally, the last `print' statement
prints the new `$0'.
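
   The role of `OFS' in rebuilding `$0' is easier to see when it is set
to something other than a space.  The sketch below reuses the input
from the pipeline above; the choice of `-' for `OFS' and the variable
name `rebuilt' are assumptions made for illustration.

```shell
# Assigning $2 to itself forces awk to rebuild $0 from $1 through $NF,
# joined by OFS.  The leading spaces are not part of any field, so they
# do not appear in the rebuilt record.
rebuilt=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }')
echo "$rebuilt"
```

This should print `a-b-c-d'.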

