Backslash Constructs in Regular Expressions
...........................................
For the most part, `\' followed by any character matches only that
character. However, there are several exceptions: certain
two-character sequences starting with `\' that have special meanings.
(The character after the `\' in such a sequence is always ordinary when
used on its own.) Here is a table of the special `\' constructs.
`\|'
specifies an alternative. Two regular expressions A and B with
`\|' in between form an expression that matches anything that
either A or B matches.
Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
`\|' applies to the largest possible surrounding expressions.
Only a surrounding `\( ... \)' grouping can limit the grouping
power of `\|'.
Full backtracking capability exists to handle multiple uses of
`\|', if you use the POSIX regular expression functions (see POSIX Regexps).
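As a small illustration of the scope of `\|', the following sketch uses the standard function `string-match-p', which returns the index of the first match or `nil'. Note that in Lisp source each `\' of the regular expression must be doubled inside the string. The sample strings are chosen only for this example.

```elisp
;; `\|' splits the whole expression into two alternatives.
(string-match-p "foo\\|bar" "bar")        ; => 0
(string-match-p "foo\\|bar" "a bar")      ; => 2

;; Inside `\( ... \)' the alternation covers only `o' or `b'.
(string-match-p "fo\\(o\\|b\\)ar" "fobar") ; => 0
(string-match-p "fo\\(o\\|b\\)ar" "foo")   ; => nil
(string-match-p "fo\\(o\\|b\\)ar" "bar")   ; => nil
```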
`\{M\}'
is a postfix operator that repeats the previous pattern exactly M
times. Thus, `x\{5\}' matches the string `xxxxx' and nothing
else. `c[ad]\{3\}r' matches strings such as `caaar', `cdddr',
`cadar', and so on.
`\{M,N\}'
is a more general postfix operator that specifies repetition with a
minimum of M repeats and a maximum of N repeats. If M is omitted,
the minimum is 0; if N is omitted, there is no maximum.
For example, `c[ad]\{1,2\}r' matches the strings `car', `cdr',
`caar', `cadr', `cdar', and `cddr', and nothing else.
`\{0,1\}' or `\{,1\}' is equivalent to `?'.
`\{0,\}' or `\{,\}' is equivalent to `*'.
`\{1,\}' is equivalent to `+'.
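The interval operators can be tried out as follows. This is only a sketch: the patterns are wrapped in `\`' and `\'' so that the whole string must match, and the test strings are illustrative.

```elisp
;; Exactly five repetitions.
(string-match-p "\\`x\\{5\\}\\'" "xxxxx")        ; => 0
(string-match-p "\\`x\\{5\\}\\'" "xxxx")         ; => nil

;; Exactly three characters from the set [ad].
(string-match-p "\\`c[ad]\\{3\\}r\\'" "cadar")   ; => 0

;; Between one and two repetitions.
(string-match-p "\\`c[ad]\\{1,2\\}r\\'" "cadr")  ; => 0
(string-match-p "\\`c[ad]\\{1,2\\}r\\'" "caddr") ; => nil
(string-match-p "\\`c[ad]\\{1,2\\}r\\'" "cr")    ; => nil

;; Minimum of two, no maximum.
(string-match-p "\\`a\\{2,\\}\\'" "aaaa")        ; => 0
(string-match-p "\\`a\\{2,\\}\\'" "a")           ; => nil
```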
`\( ... \)'
is a grouping construct that serves three purposes:
1. To enclose a set of `\|' alternatives for other operations.
Thus, the regular expression `\(foo\|bar\)x' matches either
`foox' or `barx'.
2. To enclose a complicated expression for the postfix operators
`*', `+' and `?' to operate on. Thus, `ba\(na\)*' matches
`ba', `bana', `banana', `bananana', etc., with any number
(zero or more) of `na' strings.
3. To record a matched substring for future reference with
`\DIGIT' (see below).
This last application is not a consequence of the idea of a
parenthetical grouping; it is a separate feature that was assigned
as a second meaning to the same `\( ... \)' construct because, in
practice, there was usually no conflict between the two meanings.
But occasionally there is a conflict, and that led to the
introduction of shy groups.
`\(?: ... \)'
is the "shy group" construct. A shy group serves the first two
purposes of an ordinary group (controlling the nesting of other
operators), but it does not get a number, so you cannot refer back
to its value with `\DIGIT'.
Shy groups are particularly useful for mechanically-constructed
regular expressions because they can be added automatically
without altering the numbering of any ordinary, non-shy groups.
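The effect of a shy group on group numbering can be seen in a sketch like the following. The helper `my-group-texts' is a hypothetical name introduced only for this illustration; it relies on the standard functions `string-match' and `match-string'.

```elisp
(defun my-group-texts (regexp string)
  "Match REGEXP against STRING; return the text of groups 1 and 2 as a list.
MY-GROUP-TEXTS is a hypothetical helper used only for this illustration."
  (when (string-match regexp string)
    (list (match-string 1 string)
          (match-string 2 string))))

;; Ordinary groups: the alternation is group 1, the digits are group 2.
;; A repeated group records the text of its last repetition.
(my-group-texts "\\(foo\\|bar\\)+-\\([0-9]+\\)" "foobar-42")
;; => ("bar" "42")

;; Shy group: the alternation gets no number, so the digits become
;; group 1 and there is no group 2.
(my-group-texts "\\(?:foo\\|bar\\)+-\\([0-9]+\\)" "foobar-42")
;; => ("42" nil)
```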
`\DIGIT'
matches the same text that matched the DIGITth occurrence of a
grouping (`\( ... \)') construct.
In other words, after the end of a group, the matcher remembers the
beginning and end of the text matched by that group. Later on in
the regular expression you can use `\' followed by DIGIT to match
that same text, whatever it may have been.
The strings matching the first nine grouping constructs appearing
in the entire regular expression passed to a search or matching
function are assigned numbers 1 through 9 in the order that the
open parentheses appear in the regular expression. So you can use
`\1' through `\9' to refer to the text matched by the
corresponding grouping constructs.
For example, `\(.*\)\1' matches any newline-free string that is
composed of two identical halves. The `\(.*\)' matches the first
half, which may be anything, but the `\1' that follows must match
the same exact text.
If a particular grouping construct in the regular expression was
never matched--for instance, if it appears inside of an
alternative that wasn't used, or inside of a repetition that
repeated zero times--then the corresponding `\DIGIT' construct
never matches anything. To use an artificial example,
`\(foo\(b*\)\|lose\)\2' cannot match `lose': the second
alternative inside the larger group matches it, but then `\2' is
undefined and can't match anything. But it can match `foobb',
because the first alternative matches `foob' and `\2' matches `b'.
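The back-reference examples above can be checked with a sketch such as this one. The patterns are anchored with `\`' and `\'' so that the entire string must match; that anchoring is an addition made for the illustration.

```elisp
;; Two identical halves.
(string-match-p "\\`\\(.*\\)\\1\\'" "abcabc")  ; => 0
(string-match-p "\\`\\(.*\\)\\1\\'" "abcabd")  ; => nil

;; Group 2 sits inside the first alternative.  When the second
;; alternative (`lose') is used, group 2 never matched, so `\2' fails.
(string-match-p "\\`\\(foo\\(b*\\)\\|lose\\)\\2\\'" "lose")   ; => nil

;; Here group 1 matches `foob', group 2 matches `b', and `\2' matches
;; the final `b'.
(string-match-p "\\`\\(foo\\(b*\\)\\|lose\\)\\2\\'" "foobb")  ; => 0
```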
`\w'
matches any word-constituent character. The editor syntax table
determines which characters these are. See Syntax Tables.
`\W'
matches any character that is not a word constituent.
`\sCODE'
matches any character whose syntax is CODE. Here CODE is a
character that represents a syntax code: thus, `w' for word
constituent, `-' for whitespace, `(' for open parenthesis, etc.
To represent whitespace syntax, use either `-' or a space
character. See Syntax Class Table for a list of syntax codes
and the characters that stand for them.
`\SCODE'
matches any character whose syntax is not CODE.
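Because `\w', `\W', `\sCODE' and `\SCODE' consult the syntax table, the same regular expression can match differently under different tables. The sketch below assumes the standard syntax table, in which letters are word constituents and `_' is a symbol constituent; `my-word-run-end' is a hypothetical helper named only for this example.

```elisp
(defun my-word-run-end (table string)
  "Return the end index of the first run of word constituents in STRING.
TABLE is the syntax table to use.  Hypothetical helper for illustration."
  (with-syntax-table table
    (when (string-match "\\w+" string)
      (match-end 0))))

;; Under the standard table, `_' ends the word.
(my-word-run-end (standard-syntax-table) "foo_bar")   ; => 3

;; In a table where `_' has word syntax, the run covers the whole string.
(let ((table (make-syntax-table)))
  (modify-syntax-entry ?_ "w" table)
  (my-word-run-end table "foo_bar"))                  ; => 7

;; Whitespace syntax, using the `-' code.
(with-syntax-table (standard-syntax-table)
  (list (string-match-p "\\W" "hello world")     ; first non-word character
        (string-match-p "\\s-" "hello world")    ; first whitespace character
        (string-match-p "\\S-" "   x")))         ; first non-whitespace character
;; => (5 5 3)
```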
`\cC'
matches any character whose category is C. Here C is a character
that represents a category: thus, `c' for Chinese characters or
`g' for Greek characters in the standard category table.
`\CC'
matches any character whose category is not C.
The following regular expression constructs match the empty
string--that is, they don't use up any characters--but whether they
match depends on the context.
`\`'
matches the empty string, but only at the beginning of the buffer
or string being matched against.
`\''
matches the empty string, but only at the end of the buffer or
string being matched against.
`\='
matches the empty string, but only at point. (This construct is
not defined when matching against a string.)
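A brief sketch of these three anchors follows. The string cases use `string-match-p'; the `\=' case needs a buffer, so it uses a temporary one. `my-point-anchor-demo' is a hypothetical name introduced for the illustration, and the buffer text and position are arbitrary.

```elisp
;; Beginning and end of the string.
(string-match-p "\\`foo" "foo bar")    ; => 0
(string-match-p "\\`foo" "bar foo")    ; => nil
(string-match-p "foo\\'" "bar foo")    ; => 4
(string-match-p "foo\\'" "foo\nbar")   ; => nil

(defun my-point-anchor-demo ()
  "Show `\\=' matching only at point.  Hypothetical helper for illustration."
  (with-temp-buffer
    (insert "foo bar")
    (goto-char 5)                        ; point is just before `bar'
    (list (looking-at-p "\\=bar")
          ;; `foo' occurs only before point, so the search fails.
          (save-excursion (re-search-forward "\\=foo" nil t)))))

(my-point-anchor-demo)                 ; => (t nil)
```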
`\b'
matches the empty string, but only at the beginning or end of a
word. Thus, `\bfoo\b' matches any occurrence of `foo' as a
separate word. `\bballs?\b' matches `ball' or `balls' as a
separate word.
`\b' matches at the beginning or end of the buffer regardless of
what text appears next to it.
`\B'
matches the empty string, but _not_ at the beginning or end of a
word.
`\<'
matches the empty string, but only at the beginning of a word.
`\<' matches at the beginning of the buffer only if a
word-constituent character follows.
`\>'
matches the empty string, but only at the end of a word. `\>'
matches at the end of the buffer only if the contents end with a
word-constituent character.
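The word-boundary constructs can be compared with a sketch like the one below. It assumes the standard syntax table, where letters are word constituents and the space character is not; the sample strings are illustrative.

```elisp
(with-syntax-table (standard-syntax-table)
  (list
   (string-match-p "\\bfoo\\b" "a foo b")       ; `foo' as a separate word
   (string-match-p "\\bfoo\\b" "afoo b")        ; `foo' inside a longer word
   (string-match-p "\\bballs?\\b" "two balls")  ; optional plural
   (string-match-p "\\Boo" "foo")               ; `oo' not at a word edge
   (string-match-p "\\<" "  ab")                ; first beginning of a word
   (string-match-p "\\>" "ab  ")))              ; first end of a word
;; => (2 nil 4 1 2 2)
```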
Not every string is a valid regular expression. For example, a
string with unbalanced square brackets is invalid (with a few
exceptions, such as `[]]'), and so is a string that ends with a single
`\'. If an invalid regular expression is passed to any of the search
functions, an `invalid-regexp' error is signaled.
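One way to check a regular expression before using it is to trap the `invalid-regexp' error with `condition-case'. The following is a minimal sketch; `my-valid-regexp-p' is a hypothetical helper, not a function provided by Emacs.

```elisp
(defun my-valid-regexp-p (regexp)
  "Return t if REGEXP is accepted by the matcher, nil if it is invalid.
Hypothetical helper for illustration."
  (condition-case nil
      (progn
        ;; Matching against the empty string forces REGEXP to be compiled.
        (string-match-p regexp "")
        t)
    (invalid-regexp nil)))

(my-valid-regexp-p "[a-z")               ; => nil  (unbalanced bracket)
(my-valid-regexp-p "foo\\")              ; => nil  (ends with a single `\')
(my-valid-regexp-p "[]]")                ; => t    (a set containing `]')
(my-valid-regexp-p "\\(foo\\|bar\\)")    ; => t
```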