ls-lR Efficient FTP Mirrors

Ian Maclaine-cross <i.maclaine-cross@unsw.edu.au>


FTP archives and mirrors transfer large volumes of ls-lR and other index files over the Internet when only data files need be. Configuration changes reduce unnecessary transfers. Editing, differencing and compressing ls-lR files reduces ls-lR and index transfer to negligible volumes. Do this with my FTP server script mkls-lR and my FTP client script getls-lR or Lee McLoughlin's mirror.

© 1998--2001 Ian Maclaine-cross

You may use this only under the conditions of the General Public License in file GPL.

1. Introduction

2. Installation

3. Coding for the ls-lR files yourself

4. What the FTP server script does

5. Using the server script

6. Using the mirror package files

7. Editing the package files

8. Does ls-lR patching work?

9. Additional netiquette for ls-lR

10. Acknowledgements


1. Introduction

FTP servers like ftp.debian.org and ftp.tex.ac.uk have gigabyte directory trees which change slowly. Crude daily mirroring by FTP clients with programs or scripts like Lee McLoughlin's `mirror' transfers uncompressed directory listings which are often several times larger than the new files transferred. The daily index files are even larger than the listings. The file sizes in bytes for Sunday 22nd February 1998 were:

Archive Name                          CTAN          Debian     
Site                            ftp.tex.ac.uk   ftp.debian.org 
Directory                        /tex-archive     /pub/debian   
Full directory file, ls-lR          4617180         1476263    
Daily index files in tree          11933231         4385284    
New data files on day               1168164        61087257      
  

This day was typical and Debian's stable `bo' distribution had no new files. A crude mirror of bo would have downloaded 5 MB and no new files!

A client mirror can exclude index files by configuring the mirror script. This rarely happens. However, the FTP server can exclude them from its ls-lR files. Client mirrors using such ls-lR files ignore the index files automatically.

The above servers supplement daily ls-lR.gz files with compressed difference files ls-lR.patch.gz. mirror already provides a parameter, ls_lR_file=, to use ls-lR.gz files. If client mirrors also use configuration files patching a local copy of the server ls-lR with ls-lR.patch.gz, tracking improves and downloads and required bandwidth shrink even further.

The following files should accompany this one, mirror-lslR.html, but not necessarily in the same directory: GPL, getls-lR, getls-lR.1, mkls-lR, mkls-lR.1, ctan.unsw.edu.au, download, mm/debian-dists, packages/debian-dists, ftp.tex.ac.uk, mm/localhost, packages/localhost and simple.

mkls-lR uses the bash shell and GNU utilities which are standard on Linux systems. The package files, debian-dists, ftp.tex.ac.uk, localhost and simple also require a patched version of the Perl script mirror by Lee McLoughlin. Debian 2.2 patches mirror auto-magically during installation. Later mirror versions than 2.9 may not require patching.

Please read the manual pages for mirror, mirror-master, cron and crontab before running any of the commands suggested below.

2. Installation

If you have a Debian system version 2.2 or later just install the mirror package.

For systems other than Debian GNU/Linux mkls-lR requires the bash shell and GNU utilities to run. The GNU utilities used are: at, cp, date, diff, find, gzip, ls, mv, rm, sed and touch. Your bash shell should have these GNU utilities before others with the same name on its search path. If this is so mkls-lR will run without modification. The Debian utility /bin/tempfile is used if it exists. If this exists but differs on non-Debian systems mkls-lR may fail.

The ls-lR files mkls-lR creates can be used by any mirror program which can use a local copy of the directory. The short script getls-lR will run on your system if mirror does. This script downloads the ls-lR files and then patches a copy of the remote directory listing on your machine keeping it up to date with minimal download volume.

For Lee McLoughlin's perl script mirror version 2.9, you can install my patch instead of using the getls-lR script. Debian does this automagically on installation.

3. Coding for the ls-lR files yourself

Many FTP servers do not run Debian, Linux or bash and so cannot run mkls-lR. A shell is a feature of all Unix-like systems. Some simple shell code to make ls-lR.gz and ls-lR.patch.gz is listed below for a script your FTP server runs at a fixed time of day. You will need GNU diff and gzip, and may need to change the directory name on the first line and the path names of diff and gzip.

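cd /home/ftp/pub || exit 1;              # stop if the directory is missing
ls -lR -Ils-lR >ls-lR;                   # new listing, omitting ls-lR itself
cp -a ls-lR.gz ls-lR.old.gz;             # copy the previous compressed listing
gzip -d ls-lR.old.gz;                    # uncompress the copy to ls-lR.old
diff -u ls-lR.old ls-lR >ls-lR.patch;    # unified difference, old to new
rm ls-lR.old;
gzip -9f ls-lR.patch ls-lR;              # makes ls-lR.patch.gz and ls-lR.gz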

This code is less efficient and capable than mkls-lR but it makes a ls-lR.gz and a ls-lR.patch.gz usable with mirror. If you can get mkls-lR running, use mkls-lR.

4. What the FTP server script does

mkls-lR is a GNU bash script which a server's cron runs to make daily timezone, ls-lR.gz, ls-lR.patch.gz and ls-lR.times files in that order. The current version is 1.38. This script has temporary files larger than ls-lR in the directory /tmp but deletes them on termination. It copies /etc/timezone if one exists to the ls-lR directory. It makes a unified difference, ls-lR.patch, between the current and previous ls-lR being careful about the names and dates on the first two lines so that a mirror can easily check them.

If the difference is null mkls-lR removes temporary files and terminates without changing the previous ls-lR files. This saves unnecessary traffic in ls-lR.patch.gz and ls-lR.gz.
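
The null-difference test can be sketched in shell as follows. This is a minimal illustration, not the code in mkls-lR: the function name make_patch and its three arguments are hypothetical, and it relies only on the exit status of diff (0 for identical files, 1 for different files, 2 for trouble).

```shell
# make_patch OLD NEW PATCH  (hypothetical helper, not part of mkls-lR)
# Write a unified difference of OLD and NEW to PATCH and report the result.
make_patch() {
    local old=$1 new=$2 patch=$3
    diff -u "$old" "$new" >"$patch"
    case $? in
        0) rm -f "$patch"; echo unchanged ;;    # null difference: publish nothing
        1) echo changed ;;                      # PATCH now holds the difference
        *) rm -f "$patch"; echo error; return 2 ;;
    esac
}
```

A calling script would compress and publish new files only when make_patch prints changed, and would otherwise leave the previous files untouched.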

If a difference exists, mkls-lR compresses ls-lR to ls-lR.gz and then the difference to ls-lR.patch.gz. It then makes ls-lR.times which contains the modification time as decimal seconds since the local start of 1 Jan 1970 of the previous ls-lR file on its first line and of the current ls-lR on its second line. Lastly the temporary files are moved into place simultaneously perhaps at a specified later time. The file ls-lR.times is all of 22 bytes.
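
As an illustration of the file format only, the sketch below writes a two-line times file from the modification times of two listings. The function name write_times is hypothetical. It uses GNU date -r to print ordinary seconds since the Epoch, whereas mkls-lR derives its numbers from local time as described later in this section, so the values will generally differ from those mkls-lR would write.

```shell
# write_times PREVIOUS CURRENT OUT  (hypothetical helper, not part of mkls-lR)
# Write the modification times of PREVIOUS and CURRENT, one per line, to OUT.
# Assumes GNU date; the numbers are plain Epoch seconds.
write_times() {
    local previous=$1 current=$2 out=$3
    {
        date -r "$previous" +%s &&
        date -r "$current" +%s
    } >"$out"
}
```

Two ten-digit numbers and two newline characters account for the 22 bytes mentioned above.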

A mirror downloads ls-lR.times first to decide whether to download nothing or ls-lR.patch.gz or ls-lR.gz. If the time of the local copy of the remote ls-lR matches the last time in ls-lR.times, the mirror downloads nothing. If it matches the first time, the mirror downloads ls-lR.patch.gz and applies the patch. If it matches neither time, the mirror downloads ls-lR.gz and sets its local time from ls-lR.times.
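
The three-way decision just described can be sketched as a small shell function. The name choose_download and its arguments are hypothetical, and the sketch is not taken from getls-lR or the mirror patch. It only compares the time of the local copy with the two lines of an already downloaded ls-lR.times.

```shell
# choose_download LOCAL_TIME TIMES_FILE  (hypothetical helper)
# Print what a mirror should fetch, given the time of its local copy of the
# remote ls-lR and a downloaded ls-lR.times file.
choose_download() {
    local local_time=$1 times_file=$2 previous current
    # First line: time of the previous ls-lR. Second line: the current one.
    { read -r previous; read -r current; } <"$times_file" || return 1
    if [ "$local_time" = "$current" ]; then
        echo nothing                # local copy is already up to date
    elif [ "$local_time" = "$previous" ]; then
        echo ls-lR.patch.gz         # exactly one step behind: the patch applies
    else
        echo ls-lR.gz               # anything else needs a fresh listing
    fi
}
```

The test against the last time comes first, so an up-to-date mirror downloads only the times file.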

The option -a adds ls-lR to the files provided by mkls-lR.

The timezone file allows a local script started by the mirror's cron to know the archive's local time. This mirror script could start the mirror program using the at command, at a time calculated to follow daylight saving changes at the archive. This permits extra hours of night mirroring when the archive is to the west of the mirror and runs mkls-lR late. If the archive's mkls-lR released the ls-lR files at a precise time, mirroring could start only a minute later.
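
A mirror-side calculation of this kind might look like the sketch below. It is an assumption-laden illustration, not part of the distributed scripts: the function name start_after_release is hypothetical, the zone argument is assumed to be the text found in the archive's timezone file, and the conversion relies on the GNU date feature of accepting TZ="..." inside the date string.

```shell
# start_after_release ZONE HH:MM [YYYY-MM-DD]  (hypothetical helper)
# Print, as this machine's local clock time, the moment one minute after
# HH:MM in the archive's time zone ZONE. Assumes GNU date.
start_after_release() {
    local zone=$1 when=$2 secs
    [ -n "$3" ] && when="$3 $2"         # optional explicit date
    secs=$(date -d "TZ=\"$zone\" $when" +%s) || return 1
    date -d "@$((secs + 60))" +%H:%M    # one minute after the release
}
```

A cron-started script could then pass the printed time to at. Because the archive's zone rules are applied by date on every run, the start time follows daylight saving changes at the archive without edits to the mirror's crontab.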

In mkls-lR the unified difference, diff -u, uses local time. find is told to use local time and date uses the UTC time-zone to convert local time to seconds. The mirror script sets its ls-lR copy to the time received with ls-lR.times or ls-lR.patch.gz as appropriate. This time is assumed local if mirror's parameter use_timelocal=true, otherwise the time is assumed UTC. Neither server nor mirror knows the other's true time-zone.

5. Using the server script

You can test mkls-lR with a command mkls-lR -d . and watch what happens in your current directory each time you repeat it. If your system did not install mkls-lR in /usr/bin/, use the path where you have it installed. Start it daily by putting a line in an appropriate crontab like:

50 20 * * * /usr/bin/mkls-lR -d /home/ftp/pub -t 21:00
  

At 8:50 pm local time this starts making the ls-lR files in /tmp. It moves them to /home/ftp/pub, where mirrors can download them, at 9:00 pm precisely. The user who owns the crontab must have write permission for /home/ftp/pub/.

If your client mirrors are configured so they download your ls-lR files twice, you can prevent this by not listing ls-lR files inside your ls-lR files:

01 21 * * * /usr/bin/mkls-lR -d /home/ftp/pub "-Ils-lR*"
  

If you have more than one of these commands you should put them in a bash script and start the script with cron.

A common fault of CTAN mirrors is downloading daily the 12 MB of index files. If mkls-lR's directory was on our PATH, the following command could go in a cron script file:

mkls-lR -d /home/ftp/tex-archive "-Ils-lR* -IFILES.by*"
  

A replacement for specifying the paths (files or subdirectories) that mirrors do not want is specifying the paths they do:

mkls-lR -d /home/ftp/pub/debian -n ls-lR_stable dists/slink
  

The -n argument, ls-lR_stable, replaces the basename ls-lR so the output files become ls-lR_stable.gz, ls-lR_stable.patch.gz and ls-lR_stable.times. This is necessary if there are already more complete files in the same directory with basename ls-lR as above. This example is obsolete but can be changed to the subdirectories which comprise the current stable Debian distribution.

You can combine the ignore and specify techniques for paths and get the same selection as ls -lR gives at the command line. Complex selections should be tested at the command line first. The result of "-I ls-lR*" often differs from "-Ils-lR*" and the latter is usually what you want. The double quotes make the special pattern character * relative to the directory specified with -d. If you have a number of these commands in a bash script, a line set -f; turns file globbing off for the rest of the script. You need this to construct mkls-lR commands with string variables; double quotes are still necessary in defining the strings. Please use double quotes.
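
The effect of set -f on a command built from string variables can be seen in the short sketch below. It is an illustration only: show_args is a hypothetical stand-in for mkls-lR which prints each argument it receives, so the sketch runs on any system with bash, and it says nothing about how mkls-lR itself parses its arguments.

```shell
#!/bin/bash
# show_args is a hypothetical stand-in for mkls-lR: it prints each
# argument it receives on a line of its own.
show_args() { printf '%s\n' "$@"; }

set -f                              # file globbing is off from here on
ignore="-Ils-lR*"                   # double quotes when defining the string
args="-d /home/ftp/pub $ignore"
show_args $args                     # unquoted: split into words, * kept as is
```

With set -f the last line passes the three words -d, /home/ftp/pub and -Ils-lR* unchanged. Without it, bash would try to expand -Ils-lR* against file names in the current directory before the command ever saw the pattern.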

6. Using the mirror package files

The mirror package files described here assume a line:

 
local_ignore+|(^|/)(|remote_)ls-lR[^/]*$
in your mirror.defaults file and also that this file has no lines setting ls_lR_file or local_ls_lR_file. Your mirror.defaults will then be compatible and ls-lR efficient with the widest range of archives and mirror package files.

`download' is a test package file which uses your loopback interface (if you have one) not a real Internet connection. It downloads from /home/ftp/pub/ on your local FTP server using your FTP client and mirror to /tmp/pub. You can copy and edit it for any site which generates a ls-lR.gz and it requires no mirror patch. You can test this with a command like:

 
mirror /etc/mirror/packages/download 

Please copy /usr/share/doc/mirror/examples/iml/packages/download to /etc/mirror/packages/download first.

download has a defaults package and two files packages. You edit the defaults package and leave the files packages alone. The first files package downloads ls-lR.gz only when it is newer than the local copy in remote_ls-lR.gz and the second gets the data files. The download package file is an efficient way to download scattered files on a single day with multiple starts. There are better techniques for mirroring sites which provide ls-lR.patch.gz files.

`simple' and `ctan.unsw.edu.au' are package files which use ls-lR.times files and my patch to mirror.pl 2.9. They are simple and very ls-lR efficient. The local copy of the remote ls-lR file is named remote_ls-lR. It is updated and then missing files are immediately downloaded. More complex package files to update remote_ls-lR or download missing files or both are described in the following.

`debian-dists' and `ftp.tex.ac.uk' are example package files for mirror using ls-lR.patch.gz. The package files are set to initially download or update just the base directory, README files and mirror. If you have an Internet connection and mirror installed, start them daily by putting a line in an appropriate root crontab like:

 
15 23 * * * /usr/bin/mirror /etc/mirror/packages/debian-dists

Please do not try running ftp.tex.ac.uk unless you have over 128MB of memory plus swap available. Try man mirror or the mirror documentation for solutions to memory problems.

`localhost' is a test package file which also uses your loopback interface like `download'. As supplied it uses ls-lR.gz and ls-lR.patch.gz. Uncomment two lines and it also uses the ls-lR.times file to further reduce download.

The patch packages for these last three package files are identical. The package mirrors ls-lR.patch.gz, uncompresses it and applies it to the local copy remote_ls-lR or downloads a complete listing in ls-lR.gz. With a ls-lR.times file mirroring ls-lR.patch.gz is unnecessary.

With a times file, mirror can determine whether remote_ls-lR is up to date and, if not, whether the available patch applies. The mirror script can then download nothing, the patch file or a fresh listing as appropriate. Downloading the times file always saves downloading the directory listing needed to mirror ls-lR.patch.gz. When the patch does not apply it also saves downloading a redundant patch file.

The patch packages set a number of parameters to minimize the work mirror does to accomplish their objectives. The packages set exclude_patt=. and do_deletes=false telling mirror to do nothing with the data files. The recursive=false parameter and a shortcut at the end of my perl patch then make sure mirror does "nothing" quickly.

The `patch' packages for a site must create a new ls-lR before it is used by the data package. Using mirror directly ensures this, as does mirror-master with both packages given the same name. Putting max=1 at the top of the `mirror-master' configuration file with patch packages first also ensures correct order. You can however increase max after the first data package to run the `files' packages for different sites in parallel. Often, if the first max is no greater than the number of patch packages and they are in the same order as their data packages, mirror-master will finish patching ls-lR before it is used.

mirror can run just the `patch' packages to update the ls-lR files rapidly for example using:

mirror -p patch /etc/mirror/packages/*
  

If the ls-lR files are up to date, this takes a few seconds and downloads just a single directory listing. Running such a command 12 hours after your mirror start time often avoids missing an ls-lR.patch.gz when the net is down. A missed ls-lR.patch.gz download forces an extra ls-lR.gz download. Archive sites should start mkls-lR at a fixed local time and at no other time regardless of computer or communication failures. The suggested local time is 9pm. Three patch runs a day additional to the mirror run should save some ls-lR.gz downloads from sites with frequent failures.

7. Editing the package files

You should copy and edit the package files, download, debian-dists, ftp.tex.ac.uk and localhost, carefully to suit your site before using any of the above commands. Mistakes can result in the loss of files, failure of your computer system or large communication bills. Please note before editing:

  1. Parameter lines with trailing blank characters may not work.
  2. Blank spaces immediately after the "=" are also significant.
  3. You cannot put comment lines immediately after continued lines which end in &.
  4. Check mirror documentation and man mirror first.
  5. Package files are interpreted using mirror.defaults which usually you do not change.
  6. Lines in the defaults packages and at the end of a patch package are the only ones you need change in these package files.
  7. You may wish to add lines beginning exclude_patt+| or get_patt+| to the defaults package of the download, localhost, debian-dists and ftp.tex.ac.uk files.
  8. The patch packages are the same in the last three example files. You need never change these four packages.

For GNU/Linux systems, you do not need to edit the server script, mkls-lR.

8. Does ls-lR patching work?

The table following is for ctan.unsw.edu.au mirroring CTAN and Debian between 1 July 1998 and 30 June 1999 inclusive. C.u.e.a mirrors these sites twice a day, every day, and the table data comes from its log files. Data files are only downloaded after a new remote ls-lR.gz is created. CTAN and Debian did not provide ls-lR.times files during this year.

This table shows the total number of ls-lR.gz and ls-lR.patch.gz files downloaded for the year. Also shown are typical file sizes and resulting download volume savings. The savings are relative to downloading ls-lR.gz with one non-recursive mirror package and then downloading data files with a second using the downloaded directory. This two package technique used by my download package file is the most efficient of the ls-lR techniques used with just ls-lR.gz files available.

Archive Name                     CTAN          Debian     
Site                        ftp.tex.ac.uk   ftp.debian.org 
Directory                    /tex-archive     /pub/debian

File downloads   
ls-lR.gz                           10              45
ls-lR.patch.gz                    384             314

Typical file size (kB)
ls-lR.gz                          560             650
ls-lR.patch.gz                     10              50

Download saving (MB)
ls-lR.gz                            0               0
ls-lR.patch.gz                    201             158            
ls-lR.times                       202             161
  

The above table shows CTAN required less than a quarter of the ls-lR.gz downloads of Debian. CTAN makes its ls-lR files at the same time each morning in a maintenance script. Debian makes its files at a random time depending on progress with archive maintenance. A missed download of even one ls-lR.patch.gz requires a fresh ls-lR.gz before downloading data files again. The CTAN connection was much worse than Debian's but 35 fewer ls-lR.patch.gz were missed with a fixed ls-lR time.

The table also shows that CTAN made at least 19 more ls-lR.patch.gz files than there were days in the year. Some of the ls-lR.gz downloads may have resulted from extra ls-lR files made by boot and shutdown scripts at times the mirror missed. Debian appears not to have made ls-lR.patch.gz files on up to 51 days of the year. This reduces download volume with no loss of tracking if the archive is shut down or there are simply no new files to mirror.

If c.u.e.a mirrored just once daily, more ls-lR.patch.gz files would be missed and the download volume would be greater. The many time-zones on the Internet between c.u.e.a and CTAN result in usable bandwidth only just before and after the business day at c.u.e.a. Usually for the early morning mirror, c.u.e.a logs into CTAN after hours for about 10 minutes and downloads just a single directory listing. However when needed the bandwidth is used and c.u.e.a tracks CTAN better than closer mirrors.

9. Additional netiquette for ls-lR

For archives, I recommend these additional netiquette rules:

  1. Run mkls-lR at a fixed time every day from cron. Release the ls-lR files to mirrors at a precise later time using the -t option. The interval between start and release times should be about twice the normal run time of mkls-lR.
  2. Never start mkls-lR from boot, shutdown or other scripts which run at random times of day.
  3. Use mkls-lR's "-I" flag to remove all ls-lR files from the listings inside these ls-lR files.
  4. Use mkls-lR's "-I" flag to remove all files derived daily from directories (like CTAN's FILES.by*) from the listings in the ls-lR files.

For mirrors, the additional rules I recommend are:

  1. Use all ls-lR files the archive makes available as previously described here.
  2. Set your mirror to start at least one hour after the ls-lR files appear at the archive unless the archive is using precise release and you have a script which uses its timezone file.
  3. Set your mirror to ignore ls-lR files as data so you do not download them twice (see my example package files for how to do this with mirror).
  4. Make your mirror packages ignore all archive files derived daily from the directory tree like CTAN's FILES.by* (see my examples again).
  5. Email sites you mirror which do not run mkls-lR or an equivalent requesting they do and attach mkls-lR and mirror-lslR.html.

10. Acknowledgements

Early in 1998, Robin Fairbairns (f.t.a.u) and Guy Maor (f.d.o) initiated the first scripts making ls-lR.patch.gz at major archive sites. mkls-lR is a successor to these. Dirk Eddelbuettel (edd@debian.org) is the author of the Debian mirror package and has been a valuable source of encouragement and suggestions. Lee McLoughlin is the author and maintainer of the mirror perl scripts adapted after version 2.9 to use ls-lR.patch.gz for remote directory information.

Enjoy,

Ian Maclaine-cross (i.maclaine-cross@unsw.edu.au) 2001/10/22