Info Node: (gawk.info)Character Lists

Using Character Lists
=====================

   Within a character list, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set.  For example, in the default C
locale, `[a-dx-z]' is equivalent to `[abcdxyz]'.  Many locales sort
characters in dictionary order, and in these locales, `[a-dx-z]' is
typically not equivalent to `[abcdxyz]'; instead it might be equivalent
to `[aBbCcDdxXyYz]', for example.  To obtain the traditional
interpretation of bracket expressions, you can use the C locale by
setting the `LC_ALL' environment variable to the value `C'.
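
   As a minimal sketch of the `LC_ALL' setting described above, the
following shell fragment runs `awk' in the C locale for a single
command.  The function name `match_c_range' and the sample input are
assumptions made for illustration; they do not come from the manual.

```shell
# match_c_range is a hypothetical helper name used only for this sketch.
# LC_ALL=C applies to this one awk invocation, so [a-d] means exactly
# the characters a, b, c, and d.
match_c_range() {
    LC_ALL=C awk '/^[a-d]$/ { print "match:", $0 }'
}

printf 'b\nB\nx\n' | match_c_range
# prints "match: b"; B and x fall outside the range in the C locale
```

   Setting the variable on the command line, rather than exporting it,
leaves the locale of the rest of the shell session unchanged.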

   To include one of the characters `\', `]', `-', or `^' in a
character list, put a `\' in front of it.  For example:

     [d\]]

matches either `d' or `]'.
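
   The same regexp can be tried from the shell.  This is only a
sketch: the function name `match_d_or_bracket' and the three input
lines are hypothetical, and it assumes an `awk' that follows the
backslash treatment described here, as `gawk' does.

```shell
# match_d_or_bracket is a hypothetical helper name used only for this sketch.
# Inside the list, \] stands for a literal ] instead of closing the list.
match_d_or_bracket() {
    awk '/[d\]]/ { print }'
}

printf 'd\n]\nx\n' | match_d_or_bracket
# prints the lines "d" and "]"; the line "x" is skipped
```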

   This treatment of `\' in character lists is compatible with other
`awk' implementations and is also mandated by POSIX.  The regular
expressions in `awk' are a superset of the POSIX specification for
Extended Regular Expressions (EREs).  POSIX EREs are based on the
regular expressions accepted by the traditional `egrep' utility.

   "Character classes" are a new feature introduced in the POSIX
standard.  A character class is a special notation for describing lists
of characters that have a specific attribute, but the actual characters
can vary from country to country and/or from character set to character
set.  For example, the notion of what is an alphabetic character
differs between the United States and France.

   A character class is only valid in a regexp _inside_ the brackets of
a character list.  Character classes consist of `[:', a keyword
denoting the class, and `:]'.  Here are the character classes defined
by the POSIX standard:

`[:alnum:]'    Alphanumeric characters.
`[:alpha:]'    Alphabetic characters.
`[:blank:]'    Space and tab characters.
`[:cntrl:]'    Control characters.
`[:digit:]'    Numeric characters.
`[:graph:]'    Characters that are both printable and visible.  (A space is
               printable but not visible, whereas an `a' is both.)
`[:lower:]'    Lowercase alphabetic characters.
`[:print:]'    Printable characters (characters that are not control
               characters).
`[:punct:]'    Punctuation characters (characters that are not letters,
               digits, control characters, or space characters).
`[:space:]'    Space characters (such as space, tab, and formfeed, to name a
               few).
`[:upper:]'    Uppercase alphabetic characters.
`[:xdigit:]'   Characters that are hexadecimal digits.

   For example, before the POSIX standard, you had to write
`/[A-Za-z0-9]/' to match alphanumeric characters.  If your character
set had other alphabetic characters in it, this would not match them,
and if your character set collated differently from ASCII, this might
not even match the ASCII alphanumeric characters.  With the POSIX
character classes, you can write `/[[:alnum:]]/' to match the alphabetic
and numeric characters in your character set.
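
   The following sketch shows a few of the classes in use.  Each class
appears inside the brackets of a character list, as required.  The
function name `classify' and the sample input are assumptions made for
illustration, and the example assumes an `awk' that supports POSIX
character classes, such as `gawk'.

```shell
# classify is a hypothetical helper name used only for this sketch.
# Each test anchors the class with ^ and $ so the whole line must
# consist of characters from that class.
classify() {
    awk '{
        if ($0 ~ /^[[:alpha:]]+$/)      print $0 ": alphabetic"
        else if ($0 ~ /^[[:digit:]]+$/) print $0 ": numeric"
        else if ($0 ~ /^[[:alnum:]]+$/) print $0 ": alphanumeric"
        else                            print $0 ": other"
    }'
}

printf 'abc\n123\nab12\na-b\n' | classify
# prints:
#   abc: alphabetic
#   123: numeric
#   ab12: alphanumeric
#   a-b: other
```

   The sample input is plain ASCII, so the result should not depend on
the locale; with other character sets, which characters count as
alphabetic is decided by the locale.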

   Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called "collating elements") that are represented with more than one
character.  They can also have several characters that are equivalent for
"collating", or sorting, purposes.  (For example, in French, a plain "e"
and a grave-accented "è" are equivalent.)

Collating Symbols
     A "collating symbol" is a multicharacter collating element
     enclosed between `[.' and `.]'.  For example, if `ch' is a
     collating element, then `[[.ch.]]' is a regexp that matches this
     collating element, whereas `[ch]' is a regexp that matches either
     `c' or `h'.

Equivalence Classes
     An "equivalence class" is a locale-specific name for a list of
     characters that are equivalent.  The name is enclosed between
     `[=' and `=]'.  For example, the name `e' might be used to
     represent all of "e", "è", and "é".  In this case, `[[=e=]]' is a
     regexp that matches any of `e', `é', or `è'.

   These features are very valuable in non-English-speaking locales.

   *Caution:* The library functions that `gawk' uses for regular
expression matching currently only recognize POSIX character classes;
they do not recognize collating symbols or equivalence classes.

