Regular Expression Operators
============================
You can combine regular expressions with special characters, called
"regular expression operators" or "metacharacters", to increase the
power and versatility of regular expressions.
The escape sequences described in Note:Escape Sequences are
valid inside a regexp. They are introduced by a `\' and are
recognized and converted into the corresponding real characters as the
very first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape
sequences and that are not listed here stand for themselves:
`\'
This is used to suppress the special meaning of a character when
matching. For example, `\$' matches the character `$'.
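As a minimal sketch of the escaping rule, the following command feeds
two made-up records to `awk' and prints only the one that contains a
literal dollar sign. The input text and the shell variable name are
assumptions chosen for illustration:

```shell
# The backslash removes the anchor meaning of the dollar sign,
# so the regexp matches a literal dollar sign anywhere in the record.
escaped=$(printf '%s\n' 'price: $5' 'price: 5' | awk '/\$/ { print }')
echo "$escaped"
```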
`^'
This matches the beginning of a string. For example, `^@chapter'
matches `@chapter' at the beginning of a string, and can be used
to identify chapter beginnings in Texinfo source files. The `^'
is known as an "anchor", because it anchors the pattern to match
only at the beginning of the string.
It is important to realize that `^' does not match the beginning of
a line embedded in a string. The condition is not true in the
following example:
if ("line1\nLINE 2" ~ /^L/) ...
`$'
This is similar to `^' but it matches only at the end of a string.
For example, `p$' matches a record that ends with a `p'. The `$'
is an anchor and does not match the end of a line embedded in a
string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /1$/) ...
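The behavior of both anchors on a string with an embedded newline can
be checked with a short sketch such as the following. The variable
names are illustrative only; the test string is the one used in the
conditions above:

```shell
# Each print reports whether the anchored regexp matches the string.
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    print ((s ~ /^l/) ? "yes" : "no")   # "l" begins the string
    print ((s ~ /^L/) ? "yes" : "no")   # "L" only follows the embedded newline
    print ((s ~ /2$/) ? "yes" : "no")   # "2" ends the string
    print ((s ~ /1$/) ? "yes" : "no")   # "1" only precedes the embedded newline
}')
echo "$anchors"
```

Only the first and third tests succeed, because the anchors apply to
the string as a whole and not to the lines inside it.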
`.'
This matches any single character, _including_ the newline
character. For example, `.P' matches any single character
followed by a `P' in a string. Using concatenation, we can make a
regular expression such as `U.A', that matches any three-character
sequence that begins with `U' and ends with `A'.
In strict POSIX mode (Note:Command-Line Options), `.'
does not match the NUL character, which is a character with all
bits equal to zero. Otherwise, NUL is just another character.
Other versions of `awk' may not be able to match the NUL character.
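The following sketch tries `U.A' against three made-up strings,
including one where the middle character is a newline. The strings and
the variable name are assumptions for illustration:

```shell
# The dot must consume exactly one character, and a newline qualifies.
dot=$(awk 'BEGIN {
    print (("UxA" ~ /U.A/) ? "yes" : "no")    # ordinary middle character
    print (("U\nA" ~ /U.A/) ? "yes" : "no")   # newline as the middle character
    print (("UA" ~ /U.A/) ? "yes" : "no")     # nothing for the dot to match
}')
echo "$dot"
```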
`[...]'
This is called a "character list".(1) It matches any _one_ of the
characters that are enclosed in the square brackets. For example,
`[MVX]' matches any one of the characters `M', `V', or `X', in a
string. A full discussion of what can be inside the square
brackets of a character list is given in Note:Using Character
Lists.
`[^ ...]'
This is a "complemented character list". The first character after
the `[' _must_ be a `^'. It matches any one character _except_ those
in the square brackets. For example, `[^awk]' matches any
character that is not an `a', a `w', or a `k'.
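A small sketch can contrast a character list with a complemented one.
The three input records and the output labels are made up for
illustration:

```shell
# A record is printed once for each rule whose regexp matches it.
lists=$(printf '%s\n' 'MIX' 'milk' 'wawa' | awk '
    /[MVX]/  { print "list: " $0 }         # contains M, V, or X
    /[^awk]/ { print "complement: " $0 }   # contains some other character than a, w, k
')
echo "$lists"
```

The record `wawa' is never printed: it has no `M', `V', or `X', and
every one of its characters is an `a' or a `w'.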
`|'
This is the "alternation operator" and it is used to specify
alternatives. The `|' has the lowest precedence of all the regular
expression operators. For example, `^P|[[:digit:]]' matches any
string that matches either `^P' or `[[:digit:]]'. This means it
matches any string that starts with `P' or contains a digit.
The alternation applies to the largest possible regexps on either
side.
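To see the low precedence of `|' in practice, the following sketch
runs a regexp of this shape against three made-up records. It writes
`[0-9]' in place of `[[:digit:]]'; that substitution is an assumption
made so the command also runs on `awk' versions without POSIX
character classes:

```shell
# The anchor belongs only to the left alternative: the regexp reads as
# "starts with P" OR "contains a digit anywhere".
alt=$(printf '%s\n' 'Perl' 'awk 3' 'gawk P' | awk '/^P|[0-9]/ { print }')
echo "$alt"
```

The record `gawk P' is not printed, because its `P' is not at the
beginning and it contains no digit.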
`(...)'
Parentheses are used for grouping in regular expressions, as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, `|'. For example,
`@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'.
(These are Texinfo formatting control sequences.)
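The grouping example can be tried directly. In this sketch the input
records are made up; the last two are included only to show what the
regexp rejects:

```shell
# The group limits the alternation to the command name; the braces are
# escaped so that they stand for themselves.
group=$(printf '%s\n' '@code{foo}' '@samp{bar}' '@var{baz}' '@code{}' |
        awk '/@(samp|code)\{[^}]+\}/ { print }')
echo "$group"
```

`@var{baz}' fails because `var' is not one of the alternatives, and
`@code{}' fails because `[^}]+' needs at least one character between
the braces.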
`*'
This symbol means that the preceding regular expression should be
repeated as many times as necessary to find a match. For example,
`ph*' applies the `*' symbol to the preceding `h' and looks for
matches of one `p' followed by any number of `h's. This also
matches just `p' if no `h's are present.
The `*' repeats the _smallest_ possible preceding expression.
(Use parentheses if you want to repeat a larger expression.) It
finds as many repetitions as possible. For example:
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in `sample' containing a string of the form
`(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of
the parentheses by preceding them with backslashes.
`+'
This symbol is similar to `*' except that the preceding expression
must be matched at least once. This means that `wh+y' would match
`why' and `whhy', but not `wy', whereas `wh*y' would match all
three of these strings. The following is a simpler way of writing
the last `*' example:
awk '/\(c[ad]+r x\)/ { print }' sample
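As a quick check that the `*' and `+' forms select the same records,
the sketch below replaces the `sample' file with three made-up records
supplied on standard input. The variable names are illustrative:

```shell
# Three candidate records; the last has no "a" or "d" between c and r.
input=$(printf '%s\n' '(car x)' '(cadr x)' '(cr x)')

# One required [ad] followed by zero or more: the form using the star.
star=$(printf '%s\n' "$input" | awk '/\(c[ad][ad]*r x\)/ { print }')

# One or more [ad]: the shorter form using the plus sign.
plus=$(printf '%s\n' "$input" | awk '/\(c[ad]+r x\)/ { print }')

echo "$star"
echo "$plus"
```

Both commands print `(car x)' and `(cadr x)' and skip `(cr x)'.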
`?'
This symbol is similar to `*' except that the preceding expression
can be matched either once or not at all. For example, `fe?d'
matches `fed' and `fd', but nothing else.
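A short sketch of the optional match, using three made-up records:

```shell
# The "e" may appear once or not at all between "f" and "d".
opt=$(printf '%s\n' 'fed' 'fd' 'feed' | awk '/fe?d/ { print }')
echo "$opt"
```

The record `feed' is not printed, because two `e's separate the `f'
from the `d'.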
`{N}'
`{N,}'
`{N,M}'
One or two numbers inside braces denote an "interval expression".
If there is one number in the braces, the preceding regexp is
repeated N times. If there are two numbers separated by a comma,
the preceding regexp is repeated N to M times. If there is one
number followed by a comma, then the preceding regexp is repeated
at least N times:
`wh{3}y'
Matches `whhhy', but not `why' or `whhhhy'.
`wh{3,5}y'
Matches `whhhy', `whhhhy', or `whhhhhy', only.
`wh{2,}y'
Matches `whhy' or `whhhy', and so on.
Interval expressions were not traditionally available in `awk'.
They were added as part of the POSIX standard to make `awk' and
`egrep' consistent with each other.
However, because old programs may use `{' and `}' in regexp
constants, by default `gawk' does _not_ match interval expressions
in regexps. If either `--posix' or `--re-interval' is specified
(Note:Command-Line Options), then interval expressions
are allowed in regexps.
For new programs that use `{' and `}' in regexp constants, it is
good practice to always escape them with a backslash. Then the
regexp constants are valid and work the way you want them to, using
any version of `awk'.(2)
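The advice about escaping braces can be illustrated with a small
sketch. The two input records are made up, and the claim that this
behaves the same everywhere is an expectation based on the text above
rather than something verified against every `awk' version:

```shell
# With both braces escaped, the regexp asks for the literal text a{2},
# whether or not the awk in use supports interval expressions.
braces=$(printf '%s\n' 'a{2}' 'aa' | awk '/a\{2\}/ { print }')
echo "$braces"
```

Only the record `a{2}' is printed; `aa' would match only if the braces
were left unescaped and interval expressions were enabled.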
In regular expressions, the `*', `+', and `?' operators, as well as
the braces `{' and `}', have the highest precedence, followed by
concatenation, and finally by `|'. As in arithmetic, parentheses can
change how operators are grouped.
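The precedence rules can be observed with a sketch like the following,
which compares regexps with and without parentheses. All of the test
strings are made up for illustration:

```shell
# Repetition binds tighter than concatenation, which binds tighter
# than alternation; parentheses override both.
prec=$(awk 'BEGIN {
    print (("abab" ~ /^(ab)+$/) ? "yes" : "no")   # plus repeats the group "ab"
    print (("abab" ~ /^ab+$/) ? "yes" : "no")     # plus repeats only "b"
    print (("abb" ~ /^ab+$/) ? "yes" : "no")      # "a" followed by two "b"
    print (("xcd" ~ /^ab|cd$/) ? "yes" : "no")    # read as (^ab) or (cd$)
    print (("xcd" ~ /^(ab|cd)$/) ? "yes" : "no")  # whole string must be ab or cd
}')
echo "$prec"
```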
In POSIX `awk' and `gawk', the `*', `+', and `?' operators stand for
themselves when there is nothing in the regexp that precedes them. For
example, `/+/' matches a literal plus sign. However, many other
versions of `awk' treat such a usage as a syntax error.
If `gawk' is in compatibility mode (Note:Command-Line Options),
POSIX character classes and interval expressions are not
available in regular expressions.
---------- Footnotes ----------
(1) In other literature, you may see a character list referred to as
a "character set", a "character class", or a "bracket expression".
(2) Use two backslashes if you're using a string constant with a
regexp operator or function.