FTP archives and mirrors transfer large volumes of
ls-lR and other index files over the Internet when only data files
need be. Configuration changes reduce unnecessary transfers. Editing, differencing and compressing ls-lR files reduces ls-lR and index transfer to negligible volumes. Do this with my FTP server script mkls-lR and my FTP client script getls-lR or Lee McLoughlin's mirror.
FTP servers like ftp.debian.org and ftp.tex.ac.uk have
gigabyte directory trees which change slowly. Crude daily mirroring by
FTP clients with programs or scripts like Lee McLoughlin's `mirror'
transfers uncompressed directory listings that are often several times
larger than the new files transferred. The daily index files are even
larger than the listings. The file sizes in bytes for Sunday 22nd
February 1998 were:
Archive Name                  CTAN            Debian
Site                          ftp.tex.ac.uk   ftp.debian.org
Directory                     /tex-archive    /pub/debian
Full directory file, ls-lR    4617180         1476263
Daily index files in tree     11933231        4385284
New data files on day         1168164         61087257
This day was typical and Debian's stable `bo' distribution had no
new files. A crude mirror of bo would have downloaded 5 MB and no
new files!
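The overhead implied by these figures can be checked with a little shell arithmetic. This is only an illustrative sketch: the function names overhead_bytes and overhead_percent are invented here, and the byte counts are the ones quoted in the table above.

```shell
#!/bin/bash
# Byte counts are copied from the table for Sunday 22 February 1998.
# overhead_bytes and overhead_percent are illustrative helper names.

# Bytes of listing plus index files a crude mirror transfers in one day.
overhead_bytes() {
    local listing=$1 index=$2
    echo $(( listing + index ))
}

# The same overhead as a percentage of the new data (integer division).
overhead_percent() {
    local listing=$1 index=$2 data=$3
    echo $(( (listing + index) * 100 / data ))
}

echo "CTAN overhead:   $(overhead_bytes 4617180 11933231) bytes," \
     "$(overhead_percent 4617180 11933231 1168164)% of new data"
echo "Debian overhead: $(overhead_bytes 1476263 4385284) bytes," \
     "$(overhead_percent 1476263 4385284 61087257)% of new data"
```

For CTAN on that day the listing and index files amount to about 14 times the volume of the new data files.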
A client mirror can exclude index files by configuring the mirror
script. This rarely happens. However, the FTP
server can exclude them from its ls-lR files. Client
mirrors using such ls-lR files ignore the index
files automatically.
The above servers supplement daily ls-lR.gz files with compressed
difference files ls-lR.patch.gz. mirror already provides a
parameter, ls_lR_file=, to use ls-lR.gz files. If client
mirrors also use configuration files that patch a local copy of the
server ls-lR with ls-lR.patch.gz, tracking improves and
downloads and required bandwidth shrink even further.
The following files should accompany this one, mirror-lslR.html,
but not necessarily in the same directory: GPL, getls-lR,
getls-lR.1, mkls-lR, mkls-lR.1, ctan.unsw.edu.au, download,
mm/debian-dists, packages/debian-dists, ftp.tex.ac.uk, mm/localhost,
packages/localhost and simple.
mkls-lR uses the bash shell and GNU utilities which are
standard on Linux systems. The package files, debian-dists,
ftp.tex.ac.uk, localhost and simple also require a
patched version of the Perl script mirror by Lee
McLoughlin. Debian 2.2 patches mirror automagically during
installation. mirror versions later than 2.9 may not require
patching.
Please read the manual pages for mirror, mirror-master,
cron and crontab before running any of the commands
suggested below.
If you have a Debian system version 2.2 or later just install the
mirror package.
For systems other than Debian GNU/Linux mkls-lR requires the
bash shell and GNU utilities to run. The GNU utilities used are:
at, cp, date, diff, find, gzip, ls, mv, rm, sed and touch.
These GNU utilities should come before others with the same name on
your bash shell's search path. If so, mkls-lR will run without
modification. The Debian utility /bin/tempfile is used if it exists.
If a different program of that name exists on a non-Debian system,
mkls-lR may fail.
The ls-lR files mkls-lR creates can be used by any mirror
program which can use a local copy of the directory. The short script
getls-lR will run on your system if mirror does. This script
downloads the ls-lR files and then patches a copy of the remote
directory listing on your machine keeping it up to date with minimal
download volume.
For Lee McLoughlin's perl script mirror version 2.9, you can
install my patch instead of using the getls-lR script. Debian
does this automagically on installation.
Many FTP servers do not run Debian, Linux or bash and so cannot
run mkls-lR. A shell is a feature of all Unix-like systems. Some
simple shell code to make ls-lR.gz and ls-lR.patch.gz is
listed below for a script your FTP server runs at a fixed time of
day. You will need GNU diff and gzip, and may need to change the
directory name on the first line and the path names of diff and gzip.
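# Run from the root of the public FTP tree (edit this path for your site).
cd /home/ftp/pub;
# List the tree recursively, leaving the listing file itself out.
ls -lR -Ils-lR >ls-lR;
# Unpack a copy of yesterday's listing (ls-lR.gz must already exist).
cp -a ls-lR.gz ls-lR.old.gz;
gzip -d ls-lR.old.gz;
# Unified difference between yesterday's and today's listings.
diff -u ls-lR.old ls-lR >ls-lR.patch;
rm ls-lR.old;
# Compress both, overwriting yesterday's ls-lR.gz and ls-lR.patch.gz.
gzip -9f ls-lR.patch ls-lR;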
This code is less efficient and capable than mkls-lR but it makes
a ls-lR.gz and a ls-lR.patch.gz usable with
mirror. If you can get mkls-lR running, use mkls-lR.
mkls-lR is a GNU bash script which a server's cron runs
to make daily timezone, ls-lR.gz, ls-lR.patch.gz and
ls-lR.times files in that order. The current version is
1.38. This script has temporary files larger than ls-lR in the
directory /tmp but deletes them on termination. It copies
/etc/timezone if one exists to the ls-lR directory. It
makes a unified difference, ls-lR.patch, between the current and
previous ls-lR being careful about the names and dates on the
first two lines so that a mirror can easily check them.
If the difference is null mkls-lR removes temporary files and
terminates without changing the previous ls-lR files. This saves
unnecessary traffic in ls-lR.patch.gz and ls-lR.gz.
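The effect of the null-difference check can be sketched in a few lines of shell. This is not code from mkls-lR: publish_if_changed is a hypothetical helper, and it compares the two listings with cmp where mkls-lR itself works from the unified difference.

```shell
#!/bin/bash
# Sketch only: publish_if_changed is a hypothetical helper, not part of mkls-lR.
# Usage: publish_if_changed NEW_LISTING PUBLISHED_LISTING
# Replaces PUBLISHED_LISTING with NEW_LISTING only when their contents differ.
publish_if_changed() {
    local new=$1 published=$2
    if [ -f "$published" ] && cmp -s "$new" "$published"; then
        # Null difference: discard the new listing and leave the old one,
        # with its old modification time, untouched.
        rm -f "$new"
        echo unchanged
    else
        mv -f "$new" "$published"
        echo updated
    fi
}
```

Because the published file keeps its old time when nothing changed, a mirror checking that time sees nothing new to fetch.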
If a difference exists, mkls-lR compresses ls-lR to
ls-lR.gz and then the difference to ls-lR.patch.gz. It then
makes ls-lR.times which contains the modification time as decimal
seconds since the local start of 1 Jan 1970 of the previous ls-lR
file on its first line and of the current ls-lR on its second
line. Lastly, the temporary files are moved into place together,
optionally at a specified later time. The file ls-lR.times is only
22 bytes long.
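A two-line times file of this kind could be written as in the sketch below. It is an illustration, not the code mkls-lR uses: write_times is a hypothetical name, GNU stat is assumed, and the sketch records ordinary UTC epoch seconds rather than seconds counted from the local start of 1 Jan 1970.

```shell
#!/bin/bash
# Sketch only: write_times is a hypothetical helper, not taken from mkls-lR.
# It assumes GNU stat and records ordinary (UTC) epoch seconds.
# Usage: write_times PREVIOUS_LISTING CURRENT_LISTING TIMES_FILE
write_times() {
    local previous=$1 current=$2 times=$3
    {
        stat -c %Y "$previous"   # line 1: time of the previous ls-lR
        stat -c %Y "$current"    # line 2: time of the current ls-lR
    } >"$times"
}
```

With ten-digit times each line is eleven bytes including the newline, which is consistent with the 22 bytes mentioned above.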
A mirror downloads ls-lR.times first to decide whether to download
nothing or ls-lR.patch.gz or ls-lR.gz. If the time of the
local copy of the remote ls-lR matches the last time in ls-lR.times,
the mirror downloads nothing. If it matches the first time, the mirror
downloads ls-lR.patch.gz and applies the patch. If it matches neither
time, the mirror downloads ls-lR.gz and sets its local time
from ls-lR.times.
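The three-way decision just described can be sketched as a small shell function. The names choose_download and choose_from_file are hypothetical and do not come from getls-lR or the mirror patch; the sketch only mirrors the rule stated above.

```shell
#!/bin/bash
# Sketch only: choose_download is a hypothetical name, not part of getls-lR
# or mirror. It takes the time of the local copy of the remote ls-lR and the
# two lines of ls-lR.times, and prints what the mirror should fetch.
# Usage: choose_download LOCAL_TIME PREVIOUS_TIME CURRENT_TIME
choose_download() {
    local local_time=$1 previous_time=$2 current_time=$3
    if [ "$local_time" = "$current_time" ]; then
        echo nothing            # local listing is already current
    elif [ "$local_time" = "$previous_time" ]; then
        echo ls-lR.patch.gz     # one patch brings the listing up to date
    else
        echo ls-lR.gz           # patch does not apply; fetch a full listing
    fi
}

# Read the two times from a downloaded ls-lR.times file.
# Usage: choose_from_file LOCAL_TIME TIMES_FILE
choose_from_file() {
    local previous_time current_time
    { read -r previous_time; read -r current_time; } <"$2"
    choose_download "$1" "$previous_time" "$current_time"
}
```

The current time is tested first, so a mirror that is already up to date fetches nothing.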
The option -a adds ls-lR to the files provided by
mkls-lR.
The timezone file lets a local script started by the mirror's
cron know the archive's local time. This script could
start the mirror program with the at command at a time calculated to
follow daylight saving changes at the archive. This permits extra
hours of night mirroring when the archive is to the west of the mirror
and runs mkls-lR late. If the archive's mkls-lR released
the ls-lR files at a precise time, mirroring could start only a minute
later.
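One way such a script might convert the archive's release time into the mirror's own local time is sketched below. It is an assumption-laden illustration: local_start_time is a hypothetical helper, GNU date is required, and the downloaded timezone file is assumed to hold a zone name such as Europe/London, as /etc/timezone does on Debian. The path in the commented example is also invented.

```shell
#!/bin/bash
# Sketch only: local_start_time is a hypothetical helper. It assumes GNU date
# and a zone name of the kind found in /etc/timezone on Debian.
# Usage: local_start_time ARCHIVE_ZONE HH:MM
# Prints the mirror's local time (HH:MM) for HH:MM today in the archive's zone.
local_start_time() {
    local zone=$1 archive_time=$2
    date -d "TZ=\"$zone\" $archive_time" +%H:%M
}

# Example use in a mirror's cron script (commented out; the path of the
# downloaded timezone file is an assumption):
#   zone=$(cat /var/spool/mirror/timezone)
#   echo "mirror /etc/mirror/packages/ftp.tex.ac.uk" |
#       at "$(local_start_time "$zone" 21:01)"
```

Because the conversion is done afresh each day, the start time follows daylight saving changes at the archive without editing the crontab.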
In mkls-lR the unified difference, diff -u, uses local
time. find is told to use local time and date uses the UTC
time-zone to convert local time to seconds. The mirror script sets its
ls-lR copy to the time received with ls-lR.times or
ls-lR.patch.gz as appropriate. This time is assumed local if
mirror's parameter use_timelocal=true; otherwise the
time is assumed UTC. Neither server nor mirror knows the other's true
time-zone.
You can test mkls-lR with a command mkls-lR -d . and
watch what happens in your current directory
each time you repeat it. If your system did not install mkls-lR in
/usr/bin/, use the path where you have it installed. Start it
daily by putting a line in an appropriate crontab like:
At 8:50 pm local time start making the ls-lR files in
/tmp. Move them to /home/ftp/pub where mirrors can
download them at 9:00 pm precisely. The user who owns the crontab must
have write permission for /home/ftp/pub/.
If your client mirrors are configured so they download your ls-lR
files twice, you can prevent this by not listing ls-lR files
inside your ls-lR files:
If you have more than one of these commands you should put them in a
bash script and start the script with cron.
A common fault of CTAN mirrors is downloading the 12 MB of
index files daily. If mkls-lR's directory is on your PATH, the
following command could go in a cron script file:
The -n argument, ls-lR_stable, replaces the basename
ls-lR so the output files become ls-lR_stable.gz,
ls-lR_stable.patch.gz and ls-lR_stable.times. This is essential
if there are already more complete files in the same directory with
basename ls-lR as above. This example is obsolete but can be adapted
to the subdirectories which comprise the current stable
Debian distribution.
You can combine the ignore and specify techniques for paths and get
the same selection as ls -lR gives at the command line.
Complex selections should be tested at the command line first. The
result of "-I ls-lR*" often differs from "-Ils-lR*" and
the latter is usually what you want. The double
quotes make the special pattern character * relative to the
directory specified with -d. If you have a number of these
commands in a bash script, a line set -f; turns file globbing
off for the rest of the script. You need this to construct mkls-lR
commands with string variables; however, double quotes are still necessary
in defining the strings. Please use double quotes.
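The effect of quoting and of set -f can be seen without running mkls-lR at all. The sketch below is only a demonstration of shell behaviour: count_args and demo_globbing are illustrative names, and the two files are created in a temporary directory that is removed afterwards.

```shell
#!/bin/bash
# Sketch only: count_args and demo_globbing are illustrative names.
# count_args prints how many arguments it received.
count_args() { echo $#; }

demo_globbing() {
    local dir opts
    dir=$(mktemp -d) || return 1
    touch "$dir/ls-lR.gz" "$dir/ls-lR.patch.gz"
    (
        cd "$dir" || exit 1
        # Unquoted: the shell expands ls-lR* to the two file names here.
        echo "unquoted:     $(count_args -I ls-lR*)"
        # Quoted: the pattern reaches the command unchanged.
        echo "quoted:       $(count_args -I "ls-lR*")"
        # Options held in a string variable need word splitting, so the
        # variable is expanded unquoted; set -f stops the * being expanded.
        opts="-I ls-lR*"
        set -f
        echo "with set -f:  $(count_args $opts)"
    )
    rm -rf "$dir"
}

demo_globbing
```

The unquoted form passes three arguments because the pattern is replaced by the two matching file names; the quoted form and the set -f form each pass two. An unquoted "-Ils-lR*" usually survives because no file name begins with -Ils-lR, which is one plausible reason the two spellings behave differently.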
The mirror package files described here assume a line:
local_ignore+|(^|/)(|remote_)ls-lR[^/]*$
in your mirror.defaults file and also that this file has
no lines setting ls_lR_file or local_ls_lR_file. Your
mirror.defaults will then be compatible and ls-lR efficient
with the widest range of archives and mirror package files.
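To see which path names the local_ignore pattern covers, it can be tried out with grep. This is an approximation for illustration: mirror is a Perl script, so the sketch rewrites the pattern for grep -E, with (remote_)? standing in for the empty alternative (|remote_). The function name is_ignored and the sample paths are invented.

```shell
#!/bin/bash
# Sketch only: is_ignored is an illustrative name. The pattern is the
# local_ignore expression rewritten for grep -E, with (remote_)? in place
# of the empty alternative (|remote_).
# Usage: is_ignored PATH   (exit status 0 when the path would be ignored)
is_ignored() {
    printf '%s\n' "$1" | grep -Eq '(^|/)(remote_)?ls-lR[^/]*$'
}

for path in ls-lR.gz pub/remote_ls-lR dists/ls-lR.patch.gz \
            pub/ls-lR.d/README notes-ls-lR.txt; do
    if is_ignored "$path"; then
        echo "ignored: $path"
    else
        echo "kept:    $path"
    fi
done
```

The first three sample paths are ignored. The last two are kept, because the pattern only matches a final path component that begins with ls-lR or remote_ls-lR.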
`download' is a test package file which uses your loopback
interface (if you have one) not a real Internet connection. It
downloads from /home/ftp/pub/ on your local FTP server using
your FTP client and mirror to /tmp/pub. You can copy and
edit it for any site which generates a ls-lR.gz and it requires
no mirror patch. You can test this with a command like:
mirror /etc/mirror/packages/download
Please copy
/usr/share/doc/mirror/examples/iml/packages/download to
/etc/mirror/packages/download first.
download has a `defaults' package and two `files' packages. You
edit the `defaults' package and leave the `files' packages alone. The
first `files' package downloads ls-lR.gz only when it is newer than
the local copy in remote_ls-lR.gz and the second gets
the data files. The download package file is an efficient way to
download scattered files on a single day with multiple starts. There
are better techniques for mirroring sites which provide
ls-lR.patch.gz files.
`simple' and `ctan.unsw.edu.au' are package files which
use ls-lR.times files and my patch to mirror.pl 2.9. They
are simple and very ls-lR efficient. The local copy of the
remote ls-lR file is named remote_ls-lR. It is updated
and then missing files are immediately downloaded. More complex
package files to update remote_ls-lR or download missing
files or both are described in the following.
`debian-dists' and `ftp.tex.ac.uk' are example package
files for mirror using ls-lR.patch.gz. The package files
are set to initially download or update just the base directory,
README files and mirror. If you have an Internet
connection and mirror installed, start them daily by putting a
line in an appropriate root crontab like:
Please do not try running ftp.tex.ac.uk unless you have over 128MB
of memory plus swap available. Try man mirror or the mirror
documentation for solutions to memory problems.
`localhost' is a test package file which also uses your loopback
interface like `download'. As supplied it uses ls-lR.gz and
ls-lR.patch.gz. Uncomment two lines and it also uses the
ls-lR.times file to further reduce download.
The `patch' packages for these last three
package files are identical. The package mirrors
ls-lR.patch.gz, uncompresses it and applies it to the
local copy remote_ls-lR, or else downloads a complete listing in
ls-lR.gz. With an ls-lR.times file, mirroring
ls-lR.patch.gz is unnecessary.
With a times file, mirror can determine whether remote_ls-lR
is up to date and, if not, whether the available patch applies. The
mirror script can then download nothing, the patch file or a
fresh listing as appropriate. Downloading the times file always saves
the directory listing that mirroring ls-lR.patch.gz would otherwise
download. When the patch does not apply, it also saves downloading a
redundant patch file.
The `patch' packages set a number of parameters to minimize the
work mirror does to accomplish their objectives. The
packages set exclude_patt=. and do_deletes=false,
telling mirror to do nothing with the data files. The
recursive=false parameter and a shortcut at the end of my perl
patch then make sure mirror does "nothing" quickly.
The `patch' packages for a site must create a new ls-lR
before it is used by the data package. Running mirror directly
ensures this, as does mirror-master when both packages are given the
same name. Putting max=1 at the top of the `mirror-master'
configuration file, with the patch packages first, also ensures the
correct order. You can however increase max after the first data package
to run the `files' packages for different sites in
parallel. Often, if the first max is no greater than the number of
patch packages and they are in the same order as their data
packages, mirror-master will finish patching ls-lR before it
is used.
mirror can run just the `patch' packages to update the
ls-lR files rapidly for example using:
mirror -p patch /etc/mirror/packages/*
If the ls-lR files are up to date, this takes a few seconds and
downloads just a single directory listing. Running such a command 12
hours after your mirror start time often avoids missing an
ls-lR.patch.gz when the net is down. A missed
ls-lR.patch.gz download forces an extra ls-lR.gz download.
Archive sites should start mkls-lR at a fixed local time and at no other
time regardless of computer or communication failures. The suggested
local time is 9 pm. Three patch runs a day in addition to the
mirror run should save some ls-lR.gz downloads from sites with
frequent failures.
You should copy and edit the package files, download, debian-dists,
ftp.tex.ac.uk and localhost, carefully to suit your
site before using any of the above commands. Mistakes can result in
the loss of files, failure of your computer system or large
communication bills. Please note before editing:
- Parameter lines with trailing blank characters may not work.
  Blank spaces immediately after the "=" are also significant.
- You cannot put comment lines immediately after continued lines
  which end in &.
- Check mirror documentation and man mirror first.
- Package files are interpreted using mirror.defaults which
  usually you do not change.
- Lines in the defaults packages and at the end of a
  patch package are the only ones you need change in these package
  files.
- You may wish to add lines beginning exclude_patt+| or
  get_patt+| to the defaults package of the download,
  localhost, debian-dists and ftp.tex.ac.uk files.
- The patch packages are the same in the last three example
  files. You need never change these four packages.
For GNU/Linux systems, you do not need to edit the server script,
mkls-lR.
The table following is for ctan.unsw.edu.au mirroring CTAN and
Debian between 1 July 1998 and 30 June 1999 inclusive. C.u.e.a
mirrors these sites twice a day, every day, and the table data comes
from its log files. Data files are only downloaded after a new remote
ls-lR.gz is created. CTAN and Debian did not provide
ls-lR.times files during this year.
This table shows the total number of ls-lR.gz and
ls-lR.patch.gz files downloaded for the year. Also shown are
typical file sizes and resulting download volume savings. The savings are
relative to downloading ls-lR.gz with one non-recursive
mirror package and then downloading data files with a second
using the downloaded directory. This two package technique used by
my download package file is the most efficient of the ls-lR
techniques used with just ls-lR.gz files available.
The above table shows CTAN required less than a quarter of the
ls-lR.gz downloads of Debian. CTAN makes its ls-lR files at
the same time each morning in a maintenance script. Debian makes them at
an unpredictable time that depends on progress with archive maintenance.
Missing the download of even one ls-lR.patch.gz requires a fresh
ls-lR.gz before downloading data files again. The CTAN
connection was much worse than Debian's but 35 fewer
ls-lR.patch.gz were missed with a fixed ls-lR time.
The table also shows that CTAN made at least 19 more
ls-lR.patch.gz files than there were days in the year. Some of
the ls-lR.gz downloads may have resulted from extra ls-lR files
made by boot and shutdown scripts at times the mirror missed. Debian
appears not to have made ls-lR.patch.gz files on up to 51 days of
the year. This reduces download volume with no loss of tracking if the
archive is shut down or there are simply no new files to mirror.
If c.u.e.a mirrored just once daily, more ls-lR.patch.gz
files would be missed and the download volume would be greater. The
many time-zones on the Internet between c.u.e.a and CTAN result in
usable bandwidth only just before and after the business day at
c.u.e.a. Usually for the early morning mirror, c.u.e.a logs
into CTAN after hours for about 10 minutes and downloads just a single
directory listing. However when needed the bandwidth is used and
c.u.e.a tracks CTAN better than closer mirrors.
For archives, I recommend these additional netiquette rules:
- Run mkls-lR at a fixed time every day from
  cron. Release the ls-lR files to mirrors at a precise later time
  using the -t option. The interval between start and release times
  should be about twice the normal run time of mkls-lR.
- Never start mkls-lR from boot, shutdown or other scripts
  which run at random times of day.
- Use mkls-lR's "-I" flag to remove
  all ls-lR files from the listings inside these ls-lR files.
- Use mkls-lR's "-I" flag to remove all files derived
  daily from directories (like CTAN's FILES.by*)
  from the listings in the ls-lR files.
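The rule of thumb for the interval between start and release can be worked out with shell arithmetic. The sketch below is illustrative: start_time is an invented helper, not an mkls-lR feature, and the five-minute run time in the example call is an assumption chosen to match a 9:00 pm release.

```shell
#!/bin/bash
# Sketch only: start_time is an illustrative helper, not an mkls-lR feature.
# Usage: start_time RELEASE_HH:MM RUN_MINUTES
# Prints the cron start time that leaves twice the normal run time
# before the release time.
start_time() {
    local release=$1 run_minutes=$2
    local hours=${release%%:*} minutes=${release##*:}
    # 10# forces base ten so that 08 and 09 are not read as octal.
    local total=$(( 10#$hours * 60 + 10#$minutes - 2 * run_minutes ))
    # Wrap around midnight if the start falls on the previous day.
    total=$(( (total % 1440 + 1440) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

start_time 21:00 5
```

For a release at 21:00 and a normal run of five minutes this prints 20:50, that is, a start ten minutes before release.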
For mirrors, the additional rules I recommend are:
- Use all ls-lR files the archive makes available as
  previously described here.
- Set your mirror to start at least one hour after the ls-lR files
  appear at the archive unless they are using precise release and
  you have a script which uses their timezone file.
- Set your mirror to ignore ls-lR files as data so you do
  not download them twice (see my example package files for how to do
  this with mirror).
- Make your mirror packages ignore all archive files derived
  daily from the directory tree like CTAN's FILES.by* (see my
  examples again).
- Email sites you mirror which do not run mkls-lR or an
  equivalent requesting they do and attach mkls-lR and
  mirror-lslR.html.
Early in 1998, Robin Fairbairns (f.t.a.u) and Guy Maor
(f.d.o) initiated the first scripts making ls-lR.patch.gz at
major archive sites. mkls-lR is a successor to these. Dirk
Eddelbuettel (edd@debian.org) is the author of the
Debian mirror package and has been a valuable source of
encouragement and suggestions. Lee McLoughlin is the author and
maintainer of the mirror perl scripts adapted after version 2.9
to use ls-lR.patch.gz for remote directory information.
Enjoy,
Ian Maclaine-cross (i.maclaine-cross@unsw.edu.au) 2001/10/22