GNU Info

Info Node: (gawk.info)Regexp Operators

Regular Expression Operators
============================

   You can combine regular expressions with special characters, called
"regular expression operators" or "metacharacters", to increase the
power and versatility of regular expressions.

   The escape sequences described in *Note Escape Sequences::, are
valid inside a regexp.  They are introduced by a `\', and are
recognized and converted into the corresponding real characters as the
very first step in processing regexps.

   Here is a list of metacharacters.  All characters that are not escape
sequences and that are not listed in the table stand for themselves:

`\'
     This is used to suppress the special meaning of a character when
     matching.  For example, `\$' matches the character `$'.

`^'
     This matches the beginning of a string.  For example, `^@chapter'
     matches `@chapter' at the beginning of a string, and can be used
     to identify chapter beginnings in Texinfo source files.  The `^'
     is known as an "anchor", because it anchors the pattern to match
     only at the beginning of the string.

     It is important to realize that `^' does not match the beginning of
     a line embedded in a string.  The condition is not true in the
     following example:

          if ("line1\nLINE 2" ~ /^L/) ...

`$'
     This is similar to `^' but it matches only at the end of a string.
     For example, `p$' matches a record that ends with a `p'.  The `$'
     is an anchor and does not match the end of a line embedded in a
     string.  The condition is not true in the following example:

          if ("line1\nLINE 2" ~ /1$/) ...
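
     The two anchor examples above can be checked directly.  The
     following is only a sketch: it assumes an `awk' on the search
     path, and the shell variable name `anchors' and the two contrast
     tests (`^l' and `2$') are illustrative additions, not part of the
     original examples.

```shell
# Test both anchors against a string that contains an embedded newline.
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    # "^" anchors only at the start of the whole string, not after "\n".
    print "^L:", ((s ~ /^L/) ? "match" : "no match")
    # "$" anchors only at the end of the whole string, not before "\n".
    print "1$:", ((s ~ /1$/) ? "match" : "no match")
    # For contrast: the string does start with "l" and end with "2".
    print "^l:", ((s ~ /^l/) ? "match" : "no match")
    print "2$:", ((s ~ /2$/) ? "match" : "no match")
}')
printf '%s\n' "$anchors"
```

     The first two tests should report no match, because the `L' and
     the `1' sit next to the embedded newline rather than at either
     end of the string.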

`.'
     This matches any single character, _including_ the newline
     character.  For example, `.P' matches any single character
     followed by a `P' in a string.  Using concatenation, we can make a
     regular expression such as `U.A', that matches any three-character
     sequence that begins with `U' and ends with `A'.

     In strict POSIX mode (*note Command-Line Options: Options.), `.'
     does not match the NUL character, which is a character with all
     bits equal to zero.  Otherwise, NUL is just another character.
     Other versions of `awk' may not be able to match the NUL character.

`[...]'
     This is called a "character list".(1) It matches any _one_ of the
     characters that are enclosed in the square brackets.  For example,
     `[MVX]' matches any one of the characters `M', `V', or `X', in a
     string.  A full discussion of what can be inside the square
     brackets of a character list is given in *Note Using Character
     Lists: Character Lists.

`[^ ...]'
     This is a "complemented character list".  The first character after
     the `[' _must_ be a `^'.  It matches any one character _except_
     those in the square brackets.  For example, `[^awk]' matches any
     character that is not an `a', a `w', or a `k'.

`|'
     This is the "alternation operator" and it is used to specify
     alternatives.  The `|' has the lowest precedence of all the regular
     expression operators.  For example, `^P|[[:digit:]]' matches any
     string that matches either `^P' or `[[:digit:]]'.  This means it
     matches any string that starts with `P' or contains a digit.

     The alternation applies to the largest possible regexps on either
     side.

`(...)'
     Parentheses are used for grouping in regular expressions, similar
     to arithmetic.  They can be used to concatenate regular expressions
     containing the alternation operator, `|'.  For example,
     `@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'.
     (These are Texinfo formatting control sequences.)
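
     As a rough illustration of grouping with alternation, the
     following sketch runs the regexp above over three input lines.
     The third line (`@emph{baz}') and the shell variable name
     `grouped' are made up for the example.

```shell
# Feed three sample lines to awk; only @samp{...} and @code{...} should match.
grouped=$(printf '%s\n' '@code{foo}' '@samp{bar}' '@emph{baz}' |
    awk '/@(samp|code)\{[^}]+\}/ { print "match:", $0; next }
         { print "no match:", $0 }')
printf '%s\n' "$grouped"
```

     Without the parentheses, the `|' would split the whole regexp
     into `@samp' and `code\{[^}]+\}', which is a different pattern.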

`*'
     This symbol means that the preceding regular expression should be
     repeated as many times as necessary to find a match.  For example,
     `ph*' applies the `*' symbol to the preceding `h' and looks for
     matches of one `p' followed by any number of `h's.  This also
     matches just `p' if no `h's are present.

     The `*' repeats the _smallest_ possible preceding expression.
     (Use parentheses if you want to repeat a larger expression.)  It
     finds as many repetitions as possible.  For example, `awk
     '/\(c[ad][ad]*r x\)/ { print }' sample' prints every record in
     `sample' containing a string of the form `(car x)', `(cdr x)',
     `(cadr x)', and so on.  Notice the escaping of the parentheses by
     preceding them with backslashes.

`+'
     This symbol is similar to `*' except that the preceding expression
     must be matched at least once.  This means that `wh+y' would match
     `why' and `whhy', but not `wy', whereas `wh*y' would match all
     three of these strings.  The following is a simpler way of writing
     the last `*' example:

          awk '/\(c[ad]+r x\)/ { print }' sample

`?'
     This symbol is similar to `*' except that the preceding expression
     can be matched either once or not at all.  For example, `fe?d'
     matches `fed' and `fd', but nothing else.
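
     The difference between `*', `+', and `?' can be seen by testing
     the strings mentioned above.  This is a minimal sketch: the `^'
     and `$' anchors are added here so that each test applies to the
     whole string, and the extra test string `feed' and the variable
     names are illustrative.

```shell
# Compare the three repetition operators on short test strings.
reps=$(awk 'BEGIN {
    n = split("wy why whhy", w, " ")
    for (i = 1; i <= n; i++) {
        # "*" allows zero or more "h"; "+" requires at least one.
        star = (w[i] ~ /^wh*y$/) ? "yes" : "no"
        plus = (w[i] ~ /^wh+y$/) ? "yes" : "no"
        print w[i], "wh*y=" star, "wh+y=" plus
    }
    m = split("fd fed feed", f, " ")
    for (i = 1; i <= m; i++) {
        # "?" allows the "e" to appear once or not at all.
        opt = (f[i] ~ /^fe?d$/) ? "yes" : "no"
        print f[i], "fe?d=" opt
    }
}')
printf '%s\n' "$reps"
```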

`{N}'
`{N,}'
`{N,M}'
     One or two numbers inside braces denote an "interval expression".
     If there is one number in the braces, the preceding regexp is
     repeated N times.  If there are two numbers separated by a comma,
     the preceding regexp is repeated N to M times.  If there is one
     number followed by a comma, then the preceding regexp is repeated
     at least N times:

    `wh{3}y'
          Matches `whhhy', but not `why' or `whhhhy'.

    `wh{3,5}y'
          Matches `whhhy', `whhhhy', or `whhhhhy', only.

    `wh{2,}y'
          Matches `whhy' or `whhhy', and so on.

     Interval expressions were not traditionally available in `awk'.
     They were added as part of the POSIX standard to make `awk' and
     `egrep' consistent with each other.

     However, because old programs may use `{' and `}' in regexp
     constants, by default `gawk' does _not_ match interval expressions
     in regexps.  If either `--posix' or `--re-interval' is specified
     (*note Command-Line Options: Options.), then interval expressions
     are allowed in regexps.

     For new programs that use `{' and `}' in regexp constants, it is
     good practice to always escape them with a backslash.  Then the
     regexp constants are valid and work the way you want them to, using
     any version of `awk'.(2)
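
     The escaping advice above, together with footnote (2), can be
     sketched as follows.  The test string `f{x}' and the shell
     variable name `braces' are assumptions made for the example.

```shell
# Match literal braces with a regexp constant and with a string constant.
braces=$(awk 'BEGIN {
    s = "f{x}"
    # Regexp constant: one backslash before each brace.
    print "constant:", ((s ~ /f\{x\}/) ? "match" : "no match")
    # String constant used as a regexp: the backslashes are doubled,
    # because string processing consumes one level of escaping first.
    print "string:", ((s ~ "f\\{x\\}") ? "match" : "no match")
}')
printf '%s\n' "$braces"
```

     Both forms should report a match, whether or not the `awk' in use
     supports interval expressions.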

   In regular expressions, the `*', `+', and `?' operators, as well as
the braces `{' and `}', have the highest precedence, followed by
concatenation, and finally by `|'.  As in arithmetic, parentheses can
change how operators are grouped.
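
   The precedence rules can be checked with a few small tests.  The
following is a sketch rather than a definitive demonstration; the test
strings `abX' and `abab' and the variable names are invented for the
purpose.

```shell
# Show how precedence and parentheses change what a regexp matches.
prec=$(awk 'BEGIN {
    # "|" has the lowest precedence: /^ab|cd$/ groups as (^ab)|(cd$).
    r1 = ("abX" ~ /^ab|cd$/) ? "match" : "no match"
    # Parentheses make both anchors apply to either alternative.
    r2 = ("abX" ~ /^(ab|cd)$/) ? "match" : "no match"
    # Repetition binds tighter than concatenation: "+" repeats only "b".
    r3 = ("abab" ~ /^ab+$/) ? "match" : "no match"
    # Parentheses make "+" repeat the whole "ab" group.
    r4 = ("abab" ~ /^(ab)+$/) ? "match" : "no match"
    print "/^ab|cd$/ on abX:", r1
    print "/^(ab|cd)$/ on abX:", r2
    print "/^ab+$/ on abab:", r3
    print "/^(ab)+$/ on abab:", r4
}')
printf '%s\n' "$prec"
```

   In each pair, only the parentheses differ, yet the result flips.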

   In POSIX `awk' and `gawk', the `*', `+', and `?' operators stand for
themselves when there is nothing in the regexp that precedes them.  For
example, `/+/' matches a literal plus sign.  However, many other
versions of `awk' treat such a usage as a syntax error.

   If `gawk' is in compatibility mode (*note Command-Line Options:
Options.), POSIX character classes and interval expressions are not
available in regular expressions.

   ---------- Footnotes ----------

   (1) In other literature, you may see a character list referred to as
either a "character set", a "character class" or a "bracket expression".

   (2) Use two backslashes if you're using a string constant with a
regexp operator or function.

