Using Regular Expressions to Separate Fields
--------------------------------------------
The previous node discussed the use of single characters or simple
strings as the value of `FS'. More generally, the value of `FS' may be
a string containing any regular expression. In this case, each match
in the record for the regular expression separates fields. For
example, the assignment:
FS = ", \t"
makes every area of an input line that consists of a comma followed by a
space and a tab into a field separator. (`\t' is an "escape sequence"
that stands for a tab; see Escape Sequences for the complete list
of similar escape sequences.)
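As a minimal sketch of this assignment in action, the following shell
fragment splits one line on the comma-space-tab sequence. The sample
words and the shell variable names `line' and `second' are made up for
illustration; a POSIX shell with `printf' is assumed.

```shell
# Hypothetical input: three words, each pair separated by a comma,
# a space, and a tab (printf expands \t to a real tab character).
line=$(printf 'alpha, \tbeta, \tgamma')

# FS is longer than one character, so awk treats it as a regular
# expression matching exactly comma, space, tab.
second=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ", \t" } { print $2 }')

echo "$second"    # prints: beta
```

A comma followed by a space alone would not match, so such a line
would remain a single field.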
For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used. `FS' can be
set to `"[ ]"' (left bracket, space, right bracket). This regular
expression matches a single space and nothing else (see Regular
Expressions).
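A short sketch may make the consequence clearer: with `"[ ]"', every
single space is its own separator, so two adjacent spaces enclose an
empty field. The input text and the shell variable names are
assumptions chosen for illustration.

```shell
# Hypothetical input with two spaces between "a" and "b".
input='a  b c'

# Each single space separates fields: "a", "", "b", "c".
single_nf=$(echo "$input" | awk 'BEGIN { FS = "[ ]" } { print NF }')

# The default FS (" ") treats the run of spaces as one separator.
default_nf=$(echo "$input" | awk '{ print NF }')

echo "$single_nf $default_nf"    # prints: 4 3
```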
There is an important difference between the two cases of `FS = " "'
(a single space) and `FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, tabs, or newlines). For both values of `FS',
fields are separated by "runs" (multiple adjacent occurrences) of
spaces, tabs, and/or newlines. However, when the value of `FS' is
`" "', `awk' first strips leading and trailing whitespace from the
record and then decides where the fields are. For example, the
following pipeline prints `b':
$ echo ' a b c d ' | awk '{ print $2 }'
-| b
However, this pipeline prints `a' (note the extra spaces around each
letter):
$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
-| a
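In this case, the first field is "null" or empty, because the leading
whitespace matches the regular expression and so acts as a separator
before `a'.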
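The difference between the two settings also shows up in `NF'. The
sketch below counts the fields each setting produces for the same
record; the shell variable names are illustrative only.

```shell
# Same record for both runs: leading and trailing space included.
input=' a b c d '

# Default FS (" "): leading and trailing whitespace is stripped first,
# leaving the four fields a, b, c, d.
default_nf=$(echo "$input" | awk '{ print NF }')

# Regexp FS: the leading and trailing spaces count as separators,
# adding an empty field at each end of the record.
regex_nf=$(echo "$input" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')

echo "$default_nf $regex_nf"    # prints: 4 6
```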
The stripping of leading and trailing whitespace also comes into
play whenever `$0' is recomputed. For instance, study this pipeline:
$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d
The first `print' statement prints the record as it was read, with
leading whitespace intact. The assignment to `$2' rebuilds `$0' by
concatenating `$1' through `$NF' together, separated by the value of
`OFS'. Because the leading whitespace was ignored when finding `$1',
it is not part of the new `$0'. Finally, the last `print' statement
prints the new `$0'.
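Setting `OFS' to something other than a space makes the rebuilding
easier to see. This is only a sketch; the choice of `-' for `OFS' and
the shell variable name `rebuilt' are arbitrary.

```shell
# Assigning $2 to itself changes no field but forces awk to rebuild $0
# from $1 through $NF, joined by OFS. The leading spaces are dropped.
rebuilt=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }')

echo "$rebuilt"    # prints: a-b-c-d
```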