`gawk'-Specific Regexp Operators
================================
GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
section and are specific to `gawk'; they are not available in other
`awk' implementations. Most of the additional operators deal with word
matching. For our purposes, a "word" is a sequence of one or more
letters, digits, or underscores (`_'):
`\w'
Matches any word-constituent character--that is, it matches any
letter, digit, or underscore. Think of it as shorthand for
`[[:alnum:]_]'.
`\W'
Matches any character that is not word-constituent. Think of it
as shorthand for `[^[:alnum:]_]'.
`\<'
Matches the empty string at the beginning of a word. For example,
`/\<away/' matches `away' but not `stowaway'.
`\>'
Matches the empty string at the end of a word. For example,
`/stow\>/' matches `stow' but not `stowaway'.
`\y'
Matches the empty string at either the beginning or the end of a
word (i.e., the word boundar*y*). For example, `\yballs?\y'
matches either `ball' or `balls', as a separate word.
`\B'
Matches the empty string that occurs between two word-constituent
characters. For example, `/\Brat\B/' matches `crate' but it does
not match `dirty rat'. `\B' is essentially the opposite of `\y'.
There are two other operators that work on buffers. In Emacs, a
"buffer" is, naturally, an Emacs buffer. For other programs, `gawk''s
regexp library routines treat the entire string being matched as the
buffer.
`\`'
Matches the empty string at the beginning of a buffer (string).
`\''
Matches the empty string at the end of a buffer (string).
Because `^' and `$' always work in terms of the beginning and end of
strings, these operators don't add any new capabilities for `awk'.
They are provided for compatibility with other GNU software.
In other GNU software, the word-boundary operator is `\b'. However,
that conflicts with the `awk' language's definition of `\b' as
backspace, so `gawk' uses a different letter. An alternative method
would have been to require two backslashes in the GNU operators, but
this was deemed too confusing. The current method of using `\y' for the
GNU `\b' appears to be the lesser of two evils.
The various command-line options (*note Command-Line Options:
Options.) control how `gawk' interprets characters in regexps:
No options
In the default case, `gawk' provides all the facilities of POSIX
regexps and the GNU regexp operators described in *note Regular
Expression Operators: Regexp Operators. However, interval
expressions are not supported.
`--posix'
Only POSIX regexps are supported; the GNU operators are not special
(e.g., `\w' matches a literal `w'). Interval expressions are
allowed.
`--traditional'
Traditional Unix `awk' regexps are matched. The GNU operators are
not special, interval expressions are not available, nor are the
POSIX character classes (`[[:alnum:]]' and so on). Characters
described by octal and hexadecimal escape sequences are treated
literally, even if they represent regexp metacharacters.
`--re-interval'
Allow interval expressions in regexps, even if `--traditional' has
been provided.