GNU Info

Info Node: (gawk.info)String Functions

String Manipulation Functions
-----------------------------

   The functions in this minor node look at or change the text of one
or more strings.  Optional parameters are enclosed in square brackets
([ and ]).  Those functions that are specific to `gawk' are marked with
a pound sign (`#'):

   See also: Gory Details, more than you want to know about `\' and
`&' with `sub', `gsub', and `gensub'.

`asort(SOURCE [, DEST]) #'
     `asort' is a `gawk'-specific extension, returning the number of
     elements in the array SOURCE.  The contents of SOURCE are sorted
     using `gawk''s normal rules for comparing values, and the indices
     of the sorted values of SOURCE are replaced with sequential
     integers starting with one. If the optional array DEST is
     specified, then SOURCE is duplicated into DEST.  DEST is then
     sorted, leaving the indices of SOURCE unchanged.  For example, if
     the contents of `a' are as follows:

          a["last"] = "de"
          a["first"] = "sac"
          a["middle"] = "cul"

     A call to `asort':

          asort(a)

     results in the following contents of `a':

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     The `asort' function is described in more detail in Sorting
     Array Values and Indices with `gawk'.  `asort' is a `gawk'
     extension; it is not available in compatibility mode (see
     Command-Line Options).

`index(IN, FIND)'
     This searches the string IN for the first occurrence of the string
     FIND, and returns the position in characters where that occurrence
     begins in the string IN.  Consider the following example:

          $ awk 'BEGIN { print index("peanut", "an") }'
          -| 3

     If FIND is not found, `index' returns zero.  (Remember that string
     indices in `awk' start at one.)
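
     Because zero is never a valid position, the return value of
     `index' doubles as a "found" test.  The following is a minimal
     sketch, not part of the original manual; the shell function name
     `after_first' is a hypothetical helper chosen for illustration.
     It prints whatever follows the first occurrence of a substring:

```shell
# Hypothetical helper: print the part of $1 that follows the first
# occurrence of $2, or nothing at all when $2 does not occur in $1.
# Note that awk interprets backslash escapes in values passed with -v.
after_first() {
    awk -v s="$1" -v t="$2" 'BEGIN {
        pos = index(s, t)          # zero means "not found"
        if (pos > 0)
            print substr(s, pos + length(t))
    }'
}

after_first "key=value=more" "="
```

     With the arguments shown, the text after the first `=' is
     printed, including the later `=' character.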

`length([STRING])'
     This returns the number of characters in STRING.  If STRING is a
     number, the length of the digit string representing that number is
     returned.  For example, `length("abcde")' is 5.  By contrast,
     `length(15 * 35)' works out to 3. In this example, 15 * 35 = 525,
     and 525 is then converted to the string `"525"', which has three
     characters.

     If no argument is supplied, `length' returns the length of `$0'.

     *Note:* In older versions of `awk', the `length' function could be
     called without any parentheses.  Doing so is marked as
     "deprecated" in the POSIX standard.  This means that while a
     program can do this, it is a feature that can eventually be
     removed from a future version of the standard.  Therefore, for
     programs to be maximally portable, always supply the parentheses.
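
     As a small illustration of calling `length' with no argument,
     the sketch below reports the length of the longest input line.
     The function name `longest_line' is a hypothetical helper, not
     something defined by `awk'; the parentheses are supplied as the
     note above recommends:

```shell
# Hypothetical helper: read lines on standard input and print the
# length, in characters, of the longest one.
longest_line() {
    awk '{ if (length() > max) max = length() }   # length() means length($0)
         END { print max + 0 }'                  # + 0 prints 0 for empty input
}

printf 'abc\n\nhello world\n' | longest_line
```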

`match(STRING, REGEXP [, ARRAY])'
     The `match' function searches STRING for the longest leftmost
     substring matched by the regular expression, REGEXP.  It returns
     the character position, or "index", where that substring begins
     (one, if it starts at the beginning of STRING).  If no match is
     found, it returns zero.

     The order of the first two arguments is backwards from most other
     string functions that work with regular expressions, such as `sub'
     and `gsub'.  It might help to remember that for `match', the order
     is the same as for the `~' operator: `STRING ~ REGEXP'.

     The `match' function sets the built-in variable `RSTART' to the
     index.  It also sets the built-in variable `RLENGTH' to the length
     in characters of the matched substring.  If no match is found,
     `RSTART' is set to zero, and `RLENGTH' to -1.

     For example:

          {
              if ($1 == "FIND")
                  regex = $2
              else {
                  where = match($0, regex)
                  if (where != 0)
                      print "Match of", regex, "found at",
                            where, "in", $0
              }
          }

     This program looks for lines that match the regular expression
     stored in the variable `regex'.  This regular expression can be
     changed.  If the first word on a line is `FIND', `regex' is
     changed to be the second word on that line.  Therefore, if given:

          FIND ru+n
          My program runs
          but not very quickly
          FIND Melvin
          JF+KM
          This line is property of Reality Engineering Co.
          Melvin was here.

     `awk' prints:

          Match of ru+n found at 12 in My program runs
          Match of Melvin found at 1 in Melvin was here.
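
     The values that `match' stores in `RSTART' and `RLENGTH' are
     exactly the arguments `substr' needs to extract the matched
     text.  The following sketch assumes a hypothetical shell helper
     named `first_match'; it is an illustration only:

```shell
# Hypothetical helper: print RSTART, RLENGTH and the matched text for
# the first match of regexp $2 in string $1.  When nothing matches,
# only RSTART and RLENGTH are printed.
first_match() {
    awk -v s="$1" -v re="$2" 'BEGIN {
        if (match(s, re))                      # nonzero position: a match
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        else
            print RSTART, RLENGTH              # 0 and -1 on failure
    }'
}

first_match "My program runs" "ru+n"
first_match "but not very quickly" "ru+n"
```

     The first call reports position 12 and length 3 for `run'; the
     second call shows the values left behind by a failed match.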

     If ARRAY is present, it is cleared, and then the 0'th element of
     ARRAY is set to the entire portion of STRING matched by REGEXP.
     If REGEXP contains parentheses, the integer-indexed elements of
     ARRAY are set to contain the portion of STRING matching the
     corresponding parenthesized sub-expression.  For example:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2] }'
          -| foooo barrrrr

     The ARRAY argument to `match' is a `gawk' extension.  In
     compatibility mode (see Command-Line Options), using a third
     argument is a fatal error.

`split(STRING, ARRAY [, FIELDSEP])'
     This function divides STRING into pieces separated by FIELDSEP,
     and stores the pieces in ARRAY.  The first piece is stored in
     `ARRAY[1]', the second piece in `ARRAY[2]', and so forth.  The
     string value of the third argument, FIELDSEP, is a regexp
     describing where to split STRING (much as `FS' can be a regexp
     describing where to split input records).  If FIELDSEP is
     omitted, the value of `FS' is used.  `split' returns the number of
     elements created.  If STRING is null, ARRAY is empty and `split'
     returns zero.

     The `split' function splits strings into pieces in a manner
     similar to the way input lines are split into fields.  For example:

          split("cul-de-sac", a, "-")

     splits the string `cul-de-sac' into three fields using `-' as the
     separator.  It sets the contents of the array `a' as follows:

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     The value returned by this call to `split' is three.

     As with input field-splitting, when the value of FIELDSEP is
     `" "', leading and trailing whitespace is ignored and the elements
     are separated by runs of whitespace.  Also as with input
     field-splitting, if FIELDSEP is the null string, each individual
     character in the string is split into its own array element.
     (This is a `gawk'-specific extension.)

     Modern implementations of `awk', including `gawk', allow the third
     argument to be a regexp constant (`/abc/') as well as a string.
     (d.c.)  The POSIX standard allows this as well.

     Before splitting the string, `split' deletes any previously
     existing elements in the array ARRAY.  If STRING does not match
     FIELDSEP at all, ARRAY has one element only. The value of that
     element is the original STRING.
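
     The three cases described above (several pieces, no separator
     present, and a null string) can be checked with a short sketch.
     The helper name `count_pieces' is hypothetical and used only for
     this illustration:

```shell
# Hypothetical helper: split $1 on separator $2, then print the count
# returned by split followed by each array element in brackets.
count_pieces() {
    awk -v s="$1" -v sep="$2" 'BEGIN {
        n = split(s, parts, sep)
        printf "%d:", n
        for (i = 1; i <= n; i++)
            printf " [%s]", parts[i]
        printf "\n"
    }'
}

count_pieces "cul-de-sac" "-"    # separator occurs twice: three pieces
count_pieces "culdesac" "-"      # no separator: one element, the whole string
count_pieces "" "-"              # null string: no elements at all
```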

`sprintf(FORMAT, EXPRESSION1, ...)'
     This returns (without printing) the string that `printf' would
     have printed out with the same arguments (see Using `printf'
     Statements for Fancier Printing).  For example:

          pival = sprintf("pi = %.2f (approx.)", 22/7)

     assigns the string `"pi = 3.14 (approx.)"' to the variable `pival'.
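
     Since `sprintf' returns its result instead of printing it, the
     formatted text can be stored and reused.  The sketch below builds
     a fixed-width row and then wraps it in brackets; the helper name
     `format_row' and the column widths are assumptions made for this
     example:

```shell
# Hypothetical helper: format a name, a value and a count into
# fixed-width columns, keep the result in a variable, then print it.
format_row() {
    awk -v name="$1" -v value="$2" -v count="$3" 'BEGIN {
        # %-6s  left-justified string, 6 wide
        # %6.2f number with 2 decimals, 6 wide
        # %04d  integer, zero-padded to 4 digits
        row = sprintf("%-6s|%6.2f|%04d", name, value, count)
        print "[" row "]"
    }'
}

format_row pi 3.14159 42
```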

`strtonum(STR) #'
     Examines STR and returns its numeric value.  If STR begins with
     `0', `strtonum' assumes that STR is an octal number.  If STR
     begins with `0x' or `0X', `strtonum' assumes that STR is a
     hexadecimal number.  For example:

          $ echo 0x11 |
          > gawk '{ printf "%d\n", strtonum($1) }'
          -| 17

     Using the `strtonum' function is _not_ the same as adding zero to
     a string value; the automatic coercion of strings to numbers works
     only for decimal data, not for octal or hexadecimal.(1)

     `strtonum' is a `gawk' extension; it is not available in
     compatibility mode (see Command-Line Options).

`sub(REGEXP, REPLACEMENT [, TARGET])'
     The `sub' function alters the value of TARGET.  It searches this
     value, which is treated as a string, for the leftmost longest
     substring matched by the regular expression REGEXP.  Then the
     entire string is changed by replacing the matched text with
     REPLACEMENT.  The modified string becomes the new value of TARGET.

     This function is peculiar because TARGET is not simply used to
     compute a value, and not just any expression will do--it must be a
     variable, field, or array element so that `sub' can store a
     modified value there.  If this argument is omitted, then the
     default is to use and alter `$0'.  For example:

          str = "water, water, everywhere"
          sub(/at/, "ith", str)

     sets `str' to `"wither, water, everywhere"', by replacing the
     leftmost longest occurrence of `at' with `ith'.

     The `sub' function returns the number of substitutions made (either
     one or zero).

     If the special character `&' appears in REPLACEMENT, it stands for
     the precise substring that was matched by REGEXP.  (If the regexp
     can match more than one string, then this precise substring may
     vary.)  For example:

          { sub(/candidate/, "& and his wife"); print }

     changes the first occurrence of `candidate' to `candidate and his
     wife' on each input line.  Here is another example:

          $ awk 'BEGIN {
          >         str = "daabaaa"
          >         sub(/a+/, "C&C", str)
          >         print str
          > }'
          -| dCaaCbaaa

     This shows how `&' can represent a non-constant string and also
     illustrates the "leftmost, longest" rule in regexp matching (see
     How Much Text Matches?).

     The effect of this special character (`&') can be turned off by
     putting a backslash before it in the string.  As usual, to insert
     one backslash in the string, you must write two backslashes.
     Therefore, write `\\&' in a string constant to include a literal
     `&' in the replacement.  For example, the following shows how to
     replace the first `|' on each line with an `&':

          { sub(/\|/, "\\&"); print }

     As mentioned, the third argument to `sub' must be a variable,
     field or array reference.  Some versions of `awk' allow the third
     argument to be an expression that is not an lvalue.  In such a
     case, `sub' still searches for the pattern and returns zero or
     one, but the result of the substitution (if any) is thrown away
     because there is no place to put it.  Such versions of `awk'
     accept expressions such as the following:

          sub(/USA/, "United States", "the USA and Canada")

     For historical compatibility, `gawk' accepts erroneous code, such
     as in the previous example. However, using any other non-changeable
     object as the third parameter causes a fatal error and your program
     will not run.

     Finally, if the REGEXP is not a regexp constant, it is converted
     into a string, and then the value of that string is treated as the
     regexp to match.

`gsub(REGEXP, REPLACEMENT [, TARGET])'
     This is similar to the `sub' function, except `gsub' replaces
     _all_ of the longest, leftmost, _non-overlapping_ matching
     substrings it can find.  The `g' in `gsub' stands for "global,"
     which means replace everywhere.  For example:

          { gsub(/Britain/, "United Kingdom"); print }

     replaces all occurrences of the string `Britain' with `United
     Kingdom' for all input records.

     The `gsub' function returns the number of substitutions made.  If
     the variable to search and alter (TARGET) is omitted, then the
     entire input record (`$0') is used.  As in `sub', the characters
     `&' and `\' are special, and the third argument must be assignable.
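
     The difference between `sub' and `gsub', both in their return
     values and in what they change, can be seen side by side in the
     following sketch.  The helper name `count_subs' is hypothetical;
     the replacement text `<&>' simply brackets each match:

```shell
# Hypothetical helper: apply the same substitution to two copies of
# $1, once with sub and once with gsub, and print each return value
# followed by the modified string.
count_subs() {
    awk -v s="$1" 'BEGIN {
        one = s
        all = s
        n1 = sub(/a+/, "<&>", one)     # first leftmost, longest match only
        n2 = gsub(/a+/, "<&>", all)    # every non-overlapping match
        print n1, one
        print n2, all
    }'
}

count_subs "daabaaa"
```

     Both functions modify the variable given as the third argument,
     which is why the sketch works on copies of the original string.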

`gensub(REGEXP, REPLACEMENT, HOW [, TARGET]) #'
     `gensub' is a general substitution function.  Like `sub' and
     `gsub', it searches the target string TARGET for matches of the
     regular expression REGEXP.  Unlike `sub' and `gsub', the modified
     string is returned as the result of the function and the original
     target string is _not_ changed.  If HOW is a string beginning with
     `g' or `G', then it replaces all matches of REGEXP with
     REPLACEMENT.  Otherwise, HOW is treated as a number that indicates
     which match of REGEXP to replace. If no TARGET is supplied, `$0'
     is used.

     `gensub' provides an additional feature that is not available in
     `sub' or `gsub': the ability to specify components of a regexp in
     the replacement text.  This is done by using parentheses in the
     regexp to mark the components and then specifying `\N' in the
     replacement text, where N is a digit from 1 to 9.  For example:

          $ gawk '
          > BEGIN {
          >      a = "abc def"
          >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
          >      print b
          > }'
          -| def abc

     As with `sub', you must type two backslashes in order to get one
     into the string.

     In the replacement text, the sequence `\0' represents the entire
     matched text, as does the character `&'.

     The following example shows how you can use the third argument to
     control which match of the regexp should be changed:

          $ echo a b c a b c |
          > gawk '{ print gensub(/a/, "AA", 2) }'
          -| a b c AA b c

     In this case, `$0' is used as the default target string.  `gensub'
     returns the new string as its result, which is passed directly to
     `print' for printing.

     If the HOW argument is a string that does not begin with `g' or
     `G', or if it is a number that is less than or equal to zero, only
     one substitution is performed.  If HOW is zero, `gawk' issues a
     warning message.

     If REGEXP does not match TARGET, `gensub''s return value is the
     original unchanged value of TARGET.

     `gensub' is a `gawk' extension; it is not available in
     compatibility mode (see Command-Line Options).

`substr(STRING, START [, LENGTH])'
     This returns a LENGTH-character-long substring of STRING, starting
     at character number START.  The first character of a string is
     character number one.(2) For example, `substr("washington", 5, 3)'
     returns `"ing"'.

     If LENGTH is not present, this function returns the whole suffix of
     STRING that begins at character number START.  For example,
     `substr("washington", 5)' returns `"ington"'.  The whole suffix is
     also returned if LENGTH is greater than the number of characters
     remaining in the string, counting from character number START.

     The string returned by `substr' _cannot_ be assigned.  Thus, it is
     a mistake to attempt to change a portion of a string, as shown in
     the following example:

          string = "abcdef"
          # try to get "abCDEf", won't work
          substr(string, 3, 3) = "CDE"

     It is also a mistake to use `substr' as the third argument of
     `sub' or `gsub':

          gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

     (Some commercial versions of `awk' do in fact let you use `substr'
     this way, but doing so is not portable.)

     If you need to replace bits and pieces of a string, combine
     `substr' with string concatenation, in the following manner:

          string = "abcdef"
          ...
          string = substr(string, 1, 2) "CDE" substr(string, 6)
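
     The same concatenation idiom can be generalized so that the
     positions are computed rather than written by hand.  This is a
     sketch under the assumption that the replacement has the same
     length as the text it overwrites; `replace_at' is a hypothetical
     helper name:

```shell
# Hypothetical helper: overwrite the characters of $1 starting at
# position $2 (counting from one) with the string $3.
replace_at() {
    awk -v s="$1" -v start="$2" -v new="$3" 'BEGIN {
        head = substr(s, 1, start - 1)          # text before the change
        tail = substr(s, start + length(new))   # text after the change
        print head new tail
    }'
}

replace_at abcdef 3 CDE
```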

`tolower(STRING)'
     This returns a copy of STRING, with each uppercase character in
     the string replaced with its corresponding lowercase character.
     Non-alphabetic characters are left unchanged.  For example,
     `tolower("MiXeD cAsE 123")' returns `"mixed case 123"'.

`toupper(STRING)'
     This returns a copy of STRING, with each lowercase character in
     the string replaced with its corresponding uppercase character.
     Non-alphabetic characters are left unchanged.  For example,
     `toupper("MiXeD cAsE 123")' returns `"MIXED CASE 123"'.
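
     A common use of these two functions is comparing strings without
     regard to case, by converting both sides the same way first.  The
     following sketch uses a hypothetical helper named
     `same_ignoring_case':

```shell
# Hypothetical helper: print "same" when $1 and $2 are equal once
# letter case is ignored, and "different" otherwise.
same_ignoring_case() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # tolower returns strings, so this is a string comparison
        if (tolower(a) == tolower(b))
            print "same"
        else
            print "different"
    }'
}

same_ignoring_case "MiXeD cAsE 123" "mixed case 123"
```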

   ---------- Footnotes ----------

   (1) Unless you use the `--non-decimal-data' option, which isn't
recommended.  See Allowing Non-Decimal Input Data for more
information.

   (2) This is different from C and C++, where the first character is
number zero.

