Copyright (C) 2000-2012
`gawk'-Specific Regexp Operators
================================

GNU software that deals with regular expressions provides a number of
additional regexp operators.  These operators are described in this
minor node and are specific to `gawk'; they are not available in other
`awk' implementations.  Most of the additional operators deal with word
matching.  For our purposes, a "word" is a sequence of one or more
letters, digits, or underscores (`_'):

`\w'
     Matches any word-constituent character--that is, it matches any
     letter, digit, or underscore.  Think of it as short-hand for
     `[[:alnum:]_]'.

`\W'
     Matches any character that is not word-constituent.  Think of it
     as short-hand for `[^[:alnum:]_]'.

`\<'
     Matches the empty string at the beginning of a word.  For example,
     `/\<away/' matches `away' but not `stowaway'.

`\>'
     Matches the empty string at the end of a word.  For example,
     `/stow\>/' matches `stow' but not `stowaway'.

`\y'
     Matches the empty string at either the beginning or the end of a
     word (i.e., the word boundar*y*).  For example, `\yballs?\y'
     matches either `ball' or `balls', as a separate word.

`\B'
     Matches the empty string that occurs between two word-constituent
     characters.  For example, `/\Brat\B/' matches `crate' but it does
     not match `dirty rat'.  `\B' is essentially the opposite of `\y'.

There are two other operators that work on buffers.  In Emacs, a
"buffer" is, naturally, an Emacs buffer.  For other programs, `gawk''s
regexp library routines consider the entire string to match as the
buffer.

`\`'
     Matches the empty string at the beginning of a buffer (string).

`\''
     Matches the empty string at the end of a buffer (string).

Because `^' and `$' always work in terms of the beginning and end of
strings, these operators don't add any new capabilities for `awk'.
They are provided for compatibility with other GNU software.

In other GNU software, the word-boundary operator is `\b'.
However, that conflicts with the `awk' language's definition of `\b'
as backspace, so `gawk' uses a different letter.  An alternative method
would have been to require two backslashes in the GNU operators, but
this was deemed too confusing.  The current method of using `\y' for
the GNU `\b' appears to be the lesser of two evils.

The various command-line options (*note Command-Line Options:
Options.) control how `gawk' interprets characters in regexps:

No options
     In the default case, `gawk' provides all the facilities of POSIX
     regexps and the GNU regexp operators described in Note: Regular
     Expression Operators.  However, interval expressions are not
     supported.

`--posix'
     Only POSIX regexps are supported; the GNU operators are not
     special (e.g., `\w' matches a literal `w').  Interval expressions
     are allowed.

`--traditional'
     Traditional Unix `awk' regexps are matched.  The GNU operators are
     not special, interval expressions are not available, nor are the
     POSIX character classes (`[[:alnum:]]' and so on).  Characters
     described by octal and hexadecimal escape sequences are treated
     literally, even if they represent regexp metacharacters.

`--re-interval'
     Allow interval expressions in regexps, even if `--traditional' has
     been provided.