Copyright (C) 2000-2012
More About `\' and `&' with `sub', `gsub', and `gensub'
.......................................................

When using `sub', `gsub', or `gensub', and trying to get literal
backslashes and ampersands into the replacement text, you need to
remember that there are several levels of "escape processing" going on.

First, there is the "lexical" level, which is when `awk' reads your
program and builds an internal copy of it that can be executed. Then
there is the runtime level, which is when `awk' actually scans the
replacement string to determine what to generate.

At both levels, `awk' looks for a defined set of characters that can
come after a backslash. At the lexical level, it looks for the escape
sequences listed in Note: Escape Sequences. Thus, for every `\' that
`awk' processes at the runtime level, type two backslashes at the
lexical level. When a character that is not valid for an escape
sequence follows the `\', Unix `awk' and `gawk' both simply remove the
initial `\' and put the next character into the string. Thus, for
example, `"a\qb"' is treated as `"aqb"'.

At the runtime level, the various functions handle sequences of `\' and
`&' differently. The situation is (sadly) somewhat complex.
Historically, the `sub' and `gsub' functions treated the two character
sequence `\&' specially; this sequence was replaced in the generated
text with a single `&'. Any other `\' within the REPLACEMENT string
that did not precede an `&' was passed through unchanged. To illustrate
with a table:

     You type      `sub' sees      `sub' generates
     --------      ----------      ---------------
     `\&'          `&'             the matched text
     `\\&'         `\&'            a literal `&'
     `\\\&'        `\&'            a literal `&'
     `\\\\&'       `\\&'           a literal `\&'
     `\\\\\&'      `\\&'           a literal `\&'
     `\\\\\\&'     `\\\&'          a literal `\\&'
     `\\q'         `\q'            a literal `\q'

This table shows both the lexical-level processing, where an odd number
of backslashes becomes an even number at the runtime level, as well as
the runtime processing done by `sub'.
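
The two cases on which every set of rules discussed here agrees can be
tried directly from the shell. What follows is only a sketch: it assumes
a POSIX-style `awk' on the search path, and the variable names `matched'
and `literal' are made up for the illustration. The program text is
single-quoted, so the shell hands each backslash to `awk' unchanged.

```shell
#!/bin/sh
# Sketch only: assumes a POSIX-style awk is on PATH.
# The awk program is single-quoted, so the shell does not touch its backslashes.

# A bare & in the replacement stands for the matched text.
matched=$(echo abc | awk '{ sub(/b/, "[&]"); print }')

# You type \\& ; after lexical processing sub sees \& and generates a literal &.
literal=$(echo abc | awk '{ sub(/b/, "[\\&]"); print }')

echo "$matched"
echo "$literal"
```

The first line printed is `a[b]c' and the second is `a[&]c'. Longer runs
of backslashes are left out on purpose, because their result depends on
which of the rule sets a given `awk' implements.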
(For the sake of simplicity, the rest of the tables below only show the
case of even numbers of backslashes entered at the lexical level.)

The problem with the historical approach is that there is no way to get
a literal `\' followed by the matched text.

The 1992 POSIX standard attempted to fix this problem. The standard
says that `sub' and `gsub' look for either a `\' or an `&' after the
`\'. If either one follows a `\', that character is output literally.
The interpretation of `\' and `&' then becomes:

     You type      `sub' sees      `sub' generates
     --------      ----------      ---------------
     `&'           `&'             the matched text
     `\\&'         `\&'            a literal `&'
     `\\\\&'       `\\&'           a literal `\', then the matched text
     `\\\\\\&'     `\\\&'          a literal `\&'

This appears to solve the problem. Unfortunately, the phrasing of the
standard is unusual. It says, in effect, that `\' turns off the special
meaning of any following character, but for anything other than `\' and
`&', such special meaning is undefined. This wording leads to two
problems:

   * Backslashes must now be doubled in the REPLACEMENT string,
     breaking historical `awk' programs.

   * To make sure that an `awk' program is portable, _every_ character
     in the REPLACEMENT string must be preceded with a backslash.(1)

The POSIX standard is under revision. Because of the problems just
listed, proposed text for the revised standard reverts to rules that
correspond more closely to the original existing practice. The proposed
rules have special cases that make it possible to produce a `\'
preceding the matched text:

     You type      `sub' sees      `sub' generates
     --------      ----------      ---------------
     `\\\\\\&'     `\\\&'          a literal `\&'
     `\\\\&'       `\\&'           a literal `\', followed by the matched text
     `\\&'         `\&'            a literal `&'
     `\\q'         `\q'            a literal `\q'

In a nutshell, at the runtime level, there are now three special
sequences of characters (`\\\&', `\\&' and `\&') whereas historically
there was only one.
However, as in the historical case, any `\' that is not part of one of
these three sequences is not special and appears in the output
literally.

`gawk' 3.0 and 3.1 follow these proposed POSIX rules for `sub' and
`gsub'. Whether these proposed rules will actually become codified into
the standard is unknown at this point. Subsequent `gawk' releases will
track the standard and implement whatever the final version specifies;
this Info file will be updated as well.(2)

The rules for `gensub' are considerably simpler. At the runtime level,
whenever `gawk' sees a `\', if the following character is a digit, then
the text that matched the corresponding parenthesized subexpression is
placed in the generated output. Otherwise, no matter what the character
after the `\' is, it appears in the generated text and the `\' does not:

     You type      `gensub' sees   `gensub' generates
     --------      -------------   ------------------
     `&'           `&'             the matched text
     `\\&'         `\&'            a literal `&'
     `\\\\'        `\\'            a literal `\'
     `\\\\&'       `\\&'           a literal `\', then the matched text
     `\\\\\\&'     `\\\&'          a literal `\&'
     `\\q'         `\q'            a literal `q'

Because of the complexity of the lexical and runtime level processing
and the special cases for `sub' and `gsub', we recommend the use of
`gawk' and `gensub' when you have to do substitutions.

Advanced Notes: Matching the Null String
----------------------------------------

In `awk', the `*' operator can match the null string. This is
particularly important for the `sub', `gsub', and `gensub' functions.
For example:

     $ echo abc | awk '{ gsub(/m*/, "X"); print }'
     -| XaXbXcX

Although this makes a certain amount of sense, it can be surprising.

---------- Footnotes ----------

(1) This consequence was certainly unintended.

(2) As this Info file was being finalized, we learned that the POSIX
standard will not use these rules. However, it was too late to change
`gawk' for the 3.1 release. `gawk' behaves as described here.
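
To return to the null-string match described under Advanced Notes: one
way to see that the `*' operator is responsible is to run the same
substitution with `+', which requires at least one `m'. This is only a
sketch; it assumes a POSIX-style `awk' on the search path, and the
variable names `with_star' and `with_plus' are made up for the
illustration.

```shell
#!/bin/sh
# Sketch only: assumes a POSIX-style awk is on PATH.

# m* can match the null string, so it matches before each character
# and once more at the end of the record.
with_star=$(echo abc | awk '{ gsub(/m*/, "X"); print }')

# m+ needs at least one m; nothing in abc matches, so the record is unchanged.
with_plus=$(echo abc | awk '{ gsub(/m+/, "X"); print }')

echo "$with_star"
echo "$with_plus"
```

The first line printed is `XaXbXcX' and the second is `abc'. When the
null-string matches are not wanted, a regexp that cannot match the null
string is the simplest remedy.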