GNU Info

Info Node: (gawk.info)Split Program


Splitting a Large File into Pieces
----------------------------------

   The `split' program splits large text files into smaller pieces.
The usage is as follows:

     split [-COUNT] file [ PREFIX ]

   By default, the output files are named `xaa', `xab', and so on. Each
file has 1000 lines in it, with the likely exception of the last file.
To change the number of lines in each file, supply a number on the
command line preceded by a minus sign; e.g., `-500' for files with 500
lines in them instead of 1000.  To change the name of the output files
to something like `myfileaa', `myfileab', and so on, supply an
additional argument that specifies the file name prefix.
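
   As a quick check of the arithmetic, the following sketch computes
how many pieces a file produces and how many lines end up in the last
piece.  The figures (a 2500-line input and 1000 lines per piece) are
made up for illustration:

```shell
# Hypothetical figures: a 2500-line input split into 1000-line pieces.
result=$(awk 'BEGIN {
    lines = 2500; count = 1000
    pieces = int((lines + count - 1) / count)   # round up
    last = lines - (pieces - 1) * count         # lines in the final piece
    print pieces, last
}')
echo "$result"
```

   With these figures there are three pieces (`xaa', `xab', and `xac'),
and the last one holds the remaining 500 lines.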

   Here is a version of `split' in `awk'. It uses the `ord' and `chr'
functions presented in the section `Translating Between Characters and
Numbers'.
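
   For readers who do not have those library functions at hand, here is
a minimal sketch of stand-ins that are sufficient for this program.  It
is not the library version: it assumes an ASCII character set and only
covers codes 1 through 127, and the helper name `_ord_init' is chosen
here for illustration:

```shell
result=$(awk '
# Minimal stand-ins for the library functions; ASCII only (an assumption).
function _ord_init(   i) {
    for (i = 1; i <= 127; i++)
        _ord_[sprintf("%c", i)] = i      # map each character to its code
}
function ord(str) { return _ord_[substr(str, 1, 1)] }
function chr(c)   { return sprintf("%c", c + 0) }   # force c to be numeric

BEGIN {
    _ord_init()
    print ord("a"), chr(ord("a") + 1)
}')
echo "$result"
```

   On an ASCII system this prints `97 b', which is the kind of "next
letter" step that the program performs on the file name suffixes.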

   The program first sets its defaults and then checks that there are
not too many arguments.  It then looks at each argument in turn.  The
first argument may be a minus sign followed by a number.  Such an
argument looks like a negative number, so negating it yields the count
of lines.  The data file name is skipped over, and the final argument
is used as the prefix for the output file names.  Each argument that
the program consumes itself is set to the empty string so that `awk'
does not try to open it as an input file:

     # split.awk --- do split in awk
     #
     # Requires ord and chr library functions
     # usage: split [-num] [file] [outname]
     
     BEGIN {
         outfile = "x"    # default
         count = 1000
         if (ARGC > 4)
             usage()
     
         i = 1
         if (ARGV[i] ~ /^-[0-9]+$/) {
             count = -ARGV[i]
             ARGV[i] = ""
             i++
         }
         # test ARGV in case reading from stdin instead of a file
         if (i in ARGV)
             i++    # skip data file name
         if (i in ARGV) {
             outfile = ARGV[i]
             ARGV[i] = ""
         }
     
         s1 = s2 = "a"
         out = (outfile s1 s2)
     }
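
   The negation trick can be tried in isolation.  The sketch below
applies the same pattern match and unary minus to a sample string; the
variable name `arg' stands in for `ARGV[i]' and is used here only for
illustration:

```shell
# Apply the same test and negation to a sample option string.
result=$(awk 'BEGIN {
    arg = "-500"                 # what ARGV[1] would hold
    if (arg ~ /^-[0-9]+$/)
        count = -arg             # unary minus converts the string to a number
    print count
}')
echo "$result"
```

   The string `-500' is converted to the number -500 by the unary
minus, and negating that gives 500.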

   The next rule does most of the work. `tcount' (temporary count)
counts the lines destined for the current output file, including the
line just read. If it is greater than `count', it is time to close the
current file and start a new one.  `s1' and `s2' track the current
suffixes for the file name.  Normally, `s2' moves to the next letter in
the alphabet.  If `s2' is already `z', `s1' moves to the next letter
and `s2' starts over again at `a'.  If they are both `z', the file is
just too big:

     {
         if (++tcount > count) {
             close(out)
             if (s2 == "z") {
                 if (s1 == "z") {
                     printf("split: %s is too large to split\n",
                            FILENAME) > "/dev/stderr"
                     exit 1
                 }
                 s1 = chr(ord(s1) + 1)
                 s2 = "a"
             }
             else
                 s2 = chr(ord(s2) + 1)
             out = (outfile s1 s2)
             tcount = 1
         }
         print > out
     }
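
   To see why `tcount' is reset to 1 rather than 0, it may help to trace
the counter on a tiny input.  The following sketch uses the same test
with a count of 2, but prints a piece number in front of each line
instead of writing to files; the variable `piece' is not part of the
program and is introduced only for this trace:

```shell
# Trace which piece each of five input lines would go to when count is 2.
# Pieces are numbered here instead of named, to keep the sketch short.
result=$(printf 'a\nb\nc\nd\ne\n' | awk -v count=2 '
BEGIN { piece = 1 }
{
    if (++tcount > count) {   # this line no longer fits in the current piece
        piece++
        tcount = 1            # it becomes line 1 of the next piece
    }
    print piece ": " $0
}')
echo "$result"
```

   The line that triggers the switch is itself written to the new
piece, so the new piece already holds one line when the rule finishes.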

The `usage' function simply prints an error message and exits:

     function usage(   e)
     {
         e = "usage: split [-num] [file] [outname]"
         print e > "/dev/stderr"
         exit 1
     }

The variable `e' is used so that the function fits nicely on the screen.
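
   Note that `e' appears in the parameter list, set off by extra spaces.
This is the usual `awk' convention for declaring a local variable.  The
following sketch, with a hypothetical function `f', shows that
assigning to such a parameter leaves a global variable of the same name
alone:

```shell
result=$(awk '
function f(   e) {     # e is an extra parameter, so it is local to f
    e = "local"
    return e
}
BEGIN {
    e = "global"
    f()
    print e            # the global e is untouched
}')
echo "$result"
```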

   This program is a bit sloppy; it relies on `awk' to close the last
file for it automatically, instead of doing it in an `END' rule.  It
also assumes that letters are contiguous in the character set, which
isn't true for EBCDIC systems.
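
   Both points can be addressed with small changes.  The sketch below is
one possible variant, not the program above: it adds an `END' rule that
closes the last file, and it finds the next letter by position in a
string instead of by character code.  The function `next_letter' and
the variable `letters' are names invented for this sketch, and the
count and prefix are passed with `-v' instead of being parsed from
`ARGV', to keep it short:

```shell
# Work in a scratch directory so the pieces are easy to inspect.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '1\n2\n3\n4\n5\n' > data

awk -v count=2 -v outfile=x '
# Return the letter after c, looked up by position in a string, so
# contiguous letter codes are not assumed.
function next_letter(c) {
    return substr(letters, index(letters, c) + 1, 1)
}
BEGIN {
    letters = "abcdefghijklmnopqrstuvwxyz"
    s1 = s2 = "a"
    out = (outfile s1 s2)
}
{
    if (++tcount > count) {
        close(out)
        if (s2 == "z") {
            if (s1 == "z") {
                print "split: input is too large to split" > "/dev/stderr"
                exit 1
            }
            s1 = next_letter(s1)
            s2 = "a"
        } else
            s2 = next_letter(s2)
        out = (outfile s1 s2)
        tcount = 1
    }
    print > out
}
END { close(out) }    # close the last piece explicitly
' data

pieces=$(ls x??)      # names of the pieces that were written
last=$(cat xac)       # contents of the final piece
echo "$pieces"

# Remove the scratch directory again.
cd / && rm -r "$dir"
```

   With five input lines and a count of 2, this writes `xaa', `xab',
and `xac', and the last piece holds only the fifth line.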

