Copyright (C) 2000-2012
GNU Info (gawk.info)Egrep Program

Searching for Regular Expressions in Files
------------------------------------------

The `egrep' utility searches files for patterns.  It uses regular
expressions that are almost identical to those available in `awk'
(Note: Regular Expressions).  It is used in the following manner:

     egrep [ OPTIONS ] 'PATTERN' FILES ...

The PATTERN is a regular expression.  In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards.  Normally, `egrep' prints
the lines that matched.  If multiple file names are provided on the
command line, each output line is preceded by the name of the file and
a colon.

The options to `egrep' are as follows:

`-c'
     Print out a count of the lines that matched the pattern, instead
     of the lines themselves.

`-s'
     Be silent.  No output is produced and the exit value indicates
     whether the pattern was matched.

`-v'
     Invert the sense of the test.  `egrep' prints the lines that do
     _not_ match the pattern and exits successfully if the pattern is
     not matched.

`-i'
     Ignore case distinctions in both the pattern and the input data.

`-l'
     Only print (list) the names of the files that matched, not the
     lines that matched.

`-e PATTERN'
     Use PATTERN as the regexp to match.  The purpose of the `-e'
     option is to allow patterns that start with a `-'.

This version uses the `getopt' library function (Note: Processing
Command-Line Options) and the file transition library program (Note:
Noting Data File Boundaries).

The program begins with a descriptive comment and then a `BEGIN' rule
that processes the command-line arguments with `getopt'.
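
To make the option semantics listed above concrete, the following sketch
runs the system's `grep -E' as a stand-in for the `awk' program.  It
assumes a POSIX `grep' and the widely available (but non-POSIX) `mktemp';
the file names and contents are made up for illustration.  One difference
to keep in mind: POSIX `grep' spells the silent option `-q', and its `-s'
only suppresses error messages about missing or unreadable files.

```shell
# Stand-in demo using the system grep -E, not the awk program itself.
demo_dir=$(mktemp -d)                 # scratch directory (illustrative data)
cd "$demo_dir"
printf 'apple\nBanana\ncherry\n-x marks the spot\n' > a.txt
printf 'banana split\nplum\n' > b.txt

# Two files given, so each matching line is prefixed with "FILE:".
grep -E 'banana|plum' a.txt b.txt     # b.txt:banana split / b.txt:plum

# -c: per-file counts instead of the lines themselves.
grep -E -c 'banana|plum' a.txt b.txt  # a.txt:0 / b.txt:2

# -v: lines that do NOT match.
grep -E -v 'an' a.txt

# -i with -l: ignore case, list only the names of files with a match.
grep -E -i -l 'banana' a.txt b.txt    # a.txt / b.txt

# -e: lets the pattern begin with a dash.
grep -E -e '-x' a.txt

# Silent test (-q here; the awk program described above calls it -s).
if grep -E -q 'plum' b.txt; then echo "plum found"; fi
```

The last command shows the case where only the exit status matters: no
matching line is printed, and the shell's `if' acts on success or failure.
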
The `-i' (ignore case) option is particularly easy with `gawk'; we just
use the `IGNORECASE' built-in variable (Note: Built-in Variables):

     # egrep.awk --- simulate egrep in awk
     # Options:
     #    -c    count of lines
     #    -s    silent - use exit value
     #    -v    invert test, success if no match
     #    -i    ignore case
     #    -l    print filenames only
     #    -e    argument is pattern
     #
     # Requires getopt and file transition library functions

     BEGIN {
         while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
             if (c == "c")
                 count_only++
             else if (c == "s")
                 no_print++
             else if (c == "v")
                 invert++
             else if (c == "i")
                 IGNORECASE = 1
             else if (c == "l")
                 filenames_only++
             else if (c == "e")
                 pattern = Optarg
             else
                 usage()
         }

Next comes the code that handles the `egrep'-specific behavior.  If no
pattern is supplied with `-e', the first non-option on the command line
is used.  The `awk' command-line arguments up to `ARGV[Optind]' are
cleared, so that `awk' won't try to process them as files.  If no files
are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the file names can precede
the matched lines in the output:

         if (pattern == "")
             pattern = ARGV[Optind++]

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""
         if (Optind >= ARGC) {
             ARGV[1] = "-"
             ARGC = 2
         } else if (ARGC - Optind > 1)
             do_filenames++

     #    if (IGNORECASE)
     #        pattern = tolower(pattern)
     }

The last two lines are commented out, since they are not needed in
`gawk'.  They should be uncommented if you have to use another version
of `awk'.

The next set of lines should be uncommented if you are not using
`gawk'.  This rule translates all the characters in the input line into
lowercase if the `-i' option is specified.(1)  The rule is commented out
since it is not necessary with `gawk':

     #{
     #    if (IGNORECASE)
     #        $0 = tolower($0)
     #}

The `beginfile' function is called by the rule in `ftrans.awk' when
each new file is processed.  In this case, it is very simple; all it
does is initialize a variable `fcount' to zero.
`fcount' tracks how many lines in the current file matched the pattern.
Naming the parameter `junk' shows we know that `beginfile' is called
with a parameter, but that we're not interested in its value:

     function beginfile(junk)
     {
         fcount = 0
     }

The `endfile' function is called after each file has been processed.
It affects the output only when the user wants a count of the number of
lines that matched.  `no_print' is true only if the exit status is
desired.  `count_only' is true if line counts are desired.  `egrep'
therefore prints line counts only if printing and counting are both
enabled.  The output format must be adjusted depending upon the number
of files to process.  Finally, `fcount' is added to `total', so that we
know how many lines altogether matched the pattern:

     function endfile(file)
     {
         if (! no_print && count_only) {
             if (do_filenames)
                 print file ":" fcount
             else
                 print fcount
         }

         total += fcount
     }

The following rule does most of the work of matching lines.  The
variable `matches' is true if the line matched the pattern.  If the
user wants lines that did not match, the sense of `matches' is inverted
using the `!' operator.  `fcount' is incremented with the value of
`matches', which is either one or zero, depending upon a successful or
unsuccessful match.  If the line does not match, the `next' statement
just moves on to the next record.

A number of additional tests are made, but they are only done if we are
not counting lines.  First, if the user only wants exit status
(`no_print' is true), then it is enough to know that _one_ line in this
file matched, and we can skip on to the next file with `nextfile'.
Similarly, if we are only printing file names, we can print the file
name, and then skip to the next file with `nextfile'.  Finally, each
line is printed, with a leading file name and colon if necessary:

     {
         matches = ($0 ~ pattern)
         if (invert)
             matches = ! matches

         fcount += matches    # 1 or 0

         if (! matches)
             next

         if (! count_only) {
             if (no_print)
                 nextfile

             if (filenames_only) {
                 print FILENAME
                 nextfile
             }

             if (do_filenames)
                 print FILENAME ":" $0
             else
                 print
         }
     }

The `END' rule takes care of producing the correct exit status.  If
there are no matches, the exit status is one; otherwise, it is zero:

     END    \
     {
         if (total == 0)
             exit 1
         exit 0
     }

The `usage' function prints a usage message in case of invalid options,
and then exits:

     function usage(    e)
     {
         e = "Usage: egrep [-csvil] [-e pat] [files ...]"
         e = e "\n\tegrep [-csvil] pat [files ...]"
         print e > "/dev/stderr"
         exit 1
     }

The variable `e' is used so that the function fits nicely on the
printed page.

Just a note on programming style: you may have noticed that the `END'
rule uses backslash continuation, with the open brace on a line by
itself.  This is so that it more closely resembles the way functions
are written.  Many of the examples in this major node use this style.
You can decide for yourself if you like writing your `BEGIN' and `END'
rules this way or not.

---------- Footnotes ----------

(1) It also introduces a subtle bug; if a match happens, we output the
translated line, not the original.
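
The bug described in footnote (1) can be reproduced in isolation.  The
sketch below is not part of `egrep.awk': the shell functions
`match_overwrite' and `match_copy' are hypothetical helpers written for
this illustration, and they assume only a POSIX `awk'.  The first
overwrites `$0' the way the commented-out rule does; the second compares
a lowercase copy of the record and leaves `$0' alone.

```shell
# Hypothetical helper: mimics the commented-out rule by overwriting the record.
match_overwrite() {
    awk -v pat="$1" '
        BEGIN { pat = tolower(pat) }
        { $0 = tolower($0) }          # the record itself is now lowercase
        $0 ~ pat { print }            # so the lowercased text is printed
    '
}

# Hypothetical helper: matches against a lowercase copy instead.
match_copy() {
    awk -v pat="$1" '
        BEGIN { pat = tolower(pat) }
        tolower($0) ~ pat { print }   # record untouched, original line printed
    '
}

printf 'Hello World\n' | match_overwrite hello   # prints: hello world
printf 'Hello World\n' | match_copy hello        # prints: Hello World
```

Matching against `tolower($0)' costs one extra string copy per record,
but the output keeps the case of the input, which is what a user of `-i'
would expect.
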