The Webalizer is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser. The results
are presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser), username, search strings, entry/exit pages, and
country (some information may not be available if not present in the log
file being processed).
The Webalizer supports CLF (common log format) log files,
as well as Combined log formats as defined by NCSA and others,
and variations of these which it attempts to handle intelligently. In
addition, the Webalizer also supports wu-ftpd xferlog
formatted log files, allowing analysis of ftp servers, and
squid proxy logs. Logs may also be compressed, via gzip.
If a compressed log file is detected, it will be automatically uncompressed
while it is read. Compressed logs must have the standard gzip
extension of .gz.
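The extension rule above can be pictured with a small Python sketch. It is only an illustration of the described behaviour, not The Webalizer's own code; the function name open_log and the latin-1 decoding are assumptions made for the example.

```python
import gzip


def open_log(path):
    """Open a log file for reading as text.

    Files whose name ends in .gz are decompressed while they are read;
    anything else is opened as a plain file. (Hypothetical helper.)
    """
    if path.endswith(".gz"):
        # "rt" mode makes gzip.open decompress transparently on read
        return gzip.open(path, "rt", encoding="latin-1", errors="replace")
    return open(path, "r", encoding="latin-1", errors="replace")
```

A log named access_log.gz would be read through gzip, while access_log would be read directly.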
webazolver is normally just a symbolic link to the webalizer.
When run as webazolver, only DNS file creation/updates are performed,
and the program will exit once complete. All normal options and
configuration directives are available, however many will not be used.
In addition, a DNS cache file must be specified. If the number of DNS
children processes to use is not specified, webazolver will
default to 5.
This documentation applies to The Webalizer Version 2.01
RUNNING THE WEBALIZER
The Webalizer was designed to be run from a Unix command line prompt or
as a crond(8) job. Once executed, the general flow of the program is:
o
A default configuration file is scanned for. A file named
webalizer.conf is searched for in the current directory, and if
found, its configuration data is parsed. If the file is not
present in the current directory, the file /etc/webalizer.conf
is searched for and, if found, is used instead.
o
Any command line arguments given to the program are parsed. This
may include the specification of a configuration file, which is
processed at the time it is encountered.
o
If a log file was specified, it is opened and made ready for
processing. If no log file was given, STDIN is used for input.
If the log filename '-' is specified, STDIN will be forced.
o
If an output directory was specified, the program does a chdir(2) to
that directory in preparation for generating output. If no output
directory was given, the current directory is used.
o
If a non-zero number of DNS Children processes were specified, they will
be started, and the specified log file will be processed, creating or
updating the specified DNS cache file.
o
If no hostname was given, the program attempts to get the hostname
using a uname(2) system call. If that fails, localhost
is used.
o
A history file is searched for in the current directory (output
directory) and read if found. This file keeps totals for previous
months, which are used in the main index.html HTML document.
Note:
The file location can now be specified with the HistoryName
configuration option.
o
If incremental processing was specified, a data file is searched for
and loaded if found, containing the 'internal state' data of the
program at the end of a previous run.
Note:
The file location can now be specified with the IncrementalName
configuration option.
o
Main processing begins on the log file. If the log spans multiple
months, a separate HTML document is created for each month.
o
After main processing, the main index.html page is created, which
has totals by month and links to each month's HTML document.
o
A new history file is saved to disk, which includes totals generated
by The Webalizer during the current run.
o
If incremental processing was specified, a data file is written that
contains the 'internal state' data at the end of this run.
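Two of the start-up decisions listed above, the default configuration file lookup and the choice of input, can be sketched in Python. This is an illustration of the documented order only; the function names and parameters are hypothetical and do not come from the program.

```python
import os


def find_default_config(cwd=".", system_dir="/etc"):
    """Return the default configuration file path, or None if absent.

    The current directory is checked first; /etc is only consulted
    when no webalizer.conf exists in the current directory.
    """
    for directory in (cwd, system_dir):
        candidate = os.path.join(directory, "webalizer.conf")
        if os.path.isfile(candidate):
            return candidate
    return None


def uses_stdin(log_name):
    """True when input is read from STDIN.

    That is the case when no log file was given at all, or when the
    log filename is the single character '-'.
    """
    return log_name is None or log_name == "-"
```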
INCREMENTAL PROCESSING
Version 1.2x of The Webalizer adds incremental run capability. Simply
put, this allows processing large log files by breaking them up into
smaller pieces, and processing these pieces instead. What this means
in real terms is that you can now rotate your log files as often as you
want, and still be able to produce monthly usage statistics without the
loss of any detail. Basically, The Webalizer saves and restores all
internal data in a file named webalizer.current. This allows the
program to 'start where it left off' so to speak, and allows the
preservation of detail from one run to the next. The data file is
placed in the current output directory, and is a plain ASCII text
file that can be viewed with any standard text editor. Its location
and name may be changed using the IncrementalName configuration
keyword.
Some special precautions need to be taken when using the incremental
run capability of The Webalizer. Configuration options should not be
changed between runs, as that could cause corruption of the internal
data stored. For example, changing the MangleAgents level will cause
different representations of user agents to be stored, producing invalid
results in the user agents section of the report. If you need to change
configuration options, do it at the end of the month after normal
processing of the previous month and before processing the current month.
You may also want to delete the webalizer.current file.
The Webalizer also attempts to prevent data duplication by keeping
track of the timestamp of the last record processed. This timestamp
is then compared to current records being processed, and any records
that were logged before that timestamp are ignored. This, in
theory, should allow you to re-process logs that have already been
processed, or process logs that contain a mix of processed/not yet
processed records, and not produce duplication of statistics. The
only time this may break is if you have duplicate timestamps in two
separate log files: any records in the second log file that have
the same timestamp as the last record of the previously processed log file
will be discarded as if they had already been processed. There are
lots of ways to prevent this however, for example, stopping the web
server before rotating logs will prevent this situation. This setup
also necessitates that you always process logs in chronological order,
otherwise data loss will occur as a result of the timestamp compare.
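The timestamp comparison described above can be sketched as follows. This Python fragment is a simplified illustration, not the program's implementation; the function name and the (timestamp, line) record shape are assumptions.

```python
def drop_already_processed(records, last_processed):
    """Keep only records logged strictly after the saved timestamp.

    records        -- list of (timestamp, line) pairs in log order
    last_processed -- timestamp of the last record of the previous
                      run, or None when no previous state exists

    Records with a timestamp equal to last_processed are dropped too,
    which is why duplicate timestamps across two log files lose data.
    """
    if last_processed is None:
        return list(records)
    return [(ts, line) for ts, line in records if ts > last_processed]
```

The sketch also shows why logs must be processed in chronological order: once a later timestamp has been saved, every older record fails the comparison and is discarded.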
REVERSE DNS LOOKUPS
The Webalizer supports reverse DNS lookups through a DNS cache file
that is either created/updated at run-time, or has been previously
created, either by a previous run of the webalizer, or by running
the stand-alone version, webazolver. In order to perform reverse
DNS lookups, a DNSCache filename must be specified. In order to
create/update the cache file at run-time, the DNSChildren number
must be non-zero. The DNSChildren value specifies the number of
children processes to fork, each of which will perform reverse DNS
lookups in order to create/update the DNS cache file. See the file
DNS.README for additional information.
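How the two DNS settings combine can be summarised in a short Python sketch. The function name and the returned labels are invented for illustration. Raising an error when children are requested without a cache file is an assumption, since this document only states that a cache filename must be specified in that case.

```python
def dns_mode(dns_cache, dns_children):
    """Describe what the DNSCache / DNSChildren combination implies.

    dns_cache    -- cache filename, or None when not specified
    dns_children -- number of child processes (0 disables updates)
    """
    if not dns_cache:
        if dns_children:
            # Assumption: treat the unsupported combination as an error
            raise ValueError("DNSChildren requires a DNSCache filename")
        return "no-lookups"
    if dns_children:
        return "update-cache"        # cache is created/updated at run-time
    return "use-existing-cache"      # lookups use a previously built cache
```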
COMMAND LINE OPTIONS
The Webalizer supports many different configuration options that will
alter the way the program behaves and generates output. Most of these
can be specified on the command line, while some can only be specified
in a configuration file. The command line options are listed below,
with references to the corresponding configuration file keywords.
General Options
-h
Display all available command line options and exit program.
-v -V
Display program version and exit program.
-d
Debug. Display debugging information for errors and warnings.
-i
IgnoreHist. Ignore history. USE WITH CAUTION. This
will cause The Webalizer to ignore any previous monthly history
file only. Incremental data (if present) is still processed.
-p
Incremental. Preserve internal data between runs.
-q
Quiet. Suppress informational messages. Does not suppress
warnings or errors.
-Q
ReallyQuiet. Suppress all messages including warnings and errors.
-T
TimeMe. Force display of timing information at end of processing.
-c file
Use configuration file file.
-n name
HostName. Use the hostname name.
-o dir
OutputDir. Use output directory dir.
-t name
ReportTitle. Use name for report title.
-F ( clf | ftp | squid )
LogType. Specify log type to be processed. Value can be either
clf, ftp or squid format. If not specified, will
default to CLF format. FTP logs must be in standard
wu-ftpd xferlog format.
-f
FoldSeqErr. Fold out of sequence log records back into analysis,
by treating as if they were the same date/time as the last good record.
Normally, out of sequence log records are simply ignored.
-Y
CountryGraph. Suppress country graph.
-G
HourlyGraph. Suppress hourly graph.
-x name
HTMLExtension. Defines HTML file extension to use. If not
specified, defaults to html. Do not include the leading
period.
-H
HourlyStats. Suppress hourly statistics.
-L
GraphLegend. Suppress color coded graph legends.
-l num
GraphLines. Specify number of background lines. Default
is 2. Use zero ('0') to disable the lines.
-P name
PageType. Specify file extensions that are considered pages.
Sometimes referred to as pageviews.
-m num
VisitTimeout. Specify the Visit timeout period. Specified in
number of seconds. Default is 1800 seconds (30 minutes).
-I name
IndexAlias. Use the filename name as an additional alias
for index.*.
-M num
MangleAgents. Mangle user agent names according to the mangle
level specified by num. Mangle levels are:
5
Browser name and major version.
4
Browser name, major and minor version.
3
Browser name, major version, minor version to two decimal places.
2
Browser name, major and minor versions and sub-version.
1
Browser name, version and machine type if possible.
0
All information (left unchanged).
-g num
GroupDomains. Automatically group sites by domain. The
grouping level specified by num can be thought of as 'the
number of dots' to display in the grouping. The default value
of 0 disables any domain grouping.
-D name
DNSCache. Use the DNS cache file name.
-N num
DNSChildren. Use num DNS children processes to perform DNS
lookups, either creating or updating the DNS cache file. Specify zero
(0) to disable cache file creation/updates. If given, a DNS cache
filename must be specified.
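The 'number of dots' description of the -g (GroupDomains) level can be made concrete with a Python sketch. It reflects one reading of that description only. The function name is hypothetical, and the sketch does not special-case numeric addresses or hosts that were never resolved.

```python
def group_domain(hostname, level):
    """Return the grouping name for hostname, or None when disabled.

    level is the number of dots kept in the result, counted from the
    right-hand end of the name; 0 disables grouping.
    """
    if level <= 0:
        return None
    labels = hostname.split(".")
    if len(labels) <= level + 1:
        # Name already has no more dots than requested
        return hostname
    return ".".join(labels[-(level + 1):])
```

Under this reading, a level of 1 maps www.example.com to example.com, and a level of 2 leaves it as www.example.com.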
Hide Options
-a name
HideAgent. Hide user agents matching name.
-r name
HideReferrer. Hide referrer matching name.
-s name
HideSite. Hide site matching name.
-X
HideAllSites. Hide all individual sites (only display groups).
-u name
HideURL. Hide URL matching name.
Table size options
-A num
TopAgents. Display the top num user agents table.
-R num
TopReferrers. Display the top num referrers table.
-S num
TopSites. Display the top num sites table.
-U num
TopURLs. Display the top num URL's table.
-C num
TopCountries. Display the top num countries table.
-e num
TopEntry. Display the top num entry pages table.
-E num
TopExit. Display the top num exit pages table.
CONFIGURATION FILES
Configuration files are standard ascii(7) text files that may be created
or edited using any standard editor. Blank lines and lines that begin
with a pound sign ('#') are ignored. Any other lines are considered to
be configuration lines, and have the form "Keyword Value", where the
'Keyword' is one of the currently available configuration keywords defined
below, and 'Value' is the value to assign to that particular option. Any
text found after the keyword up to the end of the line is considered the
keyword's value, so you should not include anything after the actual value
on the line that is not actually part of the value being assigned. The
file sample.conf provided with the distribution contains lots of useful
documentation and examples as well.
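The line format described above is simple enough to sketch in Python. This is an illustrative parser, not the one used by the program. The sample paths are hypothetical, and ignoring leading whitespace before the keyword or the pound sign is an assumption.

```python
SAMPLE = """\
# Sample configuration (paths are made up for the example)
LogFile access_log
OutputDir /var/www/usage

ReportTitle Usage Statistics for
Incremental yes
"""


def parse_config(text):
    """Return a list of (keyword, value) pairs from configuration text.

    Blank lines and lines starting with '#' are skipped. Everything
    after the keyword, up to the end of the line, becomes the value,
    so trailing remarks on a line end up inside the value.
    """
    settings = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)      # keyword, then the rest of the line
        value = parts[1] if len(parts) > 1 else ""
        settings.append((parts[0], value))
    return settings
```

A list of pairs is returned rather than a dictionary so that a keyword appearing on several lines would not be lost.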
General Configuration Keywords
LogFile name
Use log file named name. If none specified, STDIN will be used.
LogType name
Specify log file type as name. Values can be either web,
squid or ftp, with the default being web.
OutputDir dir
Create output in the directory dir. If none specified, the current
directory will be used.
HistoryName name
Filename to use for history file. Relative to output directory unless
absolute name is given (ie: starts with '/'). Defaults to
'webalizer.hist' in the standard output directory.
ReportTitle name
Use the title string name for the report title. If none
specified, use the default of (in english) "Usage Statistics for ".
HostName name
Set the hostname for the report as name. If none specified, an
attempt will be made to gather the hostname via a uname(2) system
call. If that fails, localhost will be used.
UseHTTPS ( yes | no )
Use https:// on links to URLs, instead of the default http://,
in the 'Top URL's' table.
Quiet ( yes | no )
Suppress informational messages. Warning and Error messages will not be
suppressed.
ReallyQuiet ( yes | no )
Suppress all messages, including Warning and Error messages.
Debug ( yes | no )
Print extra debugging information on Warnings and Errors.
TimeMe ( yes | no )
Force timing information at end of processing.
GMTTime ( yes | no )
Use GMT (UTC) time instead of local timezone for reports.
IgnoreHist ( yes | no )
Ignore previous monthly history file. USE WITH CAUTION. Does
not prevent Incremental file processing.
FoldSeqErr ( yes | no )
Fold out of sequence log records back into analysis by treating them
as if they had the same date/time as the last good record. Normally,
out of sequence log records are ignored.
CountryGraph ( yes | no )
Display Country Usage Graph in output report.
DailyGraph ( yes | no )
Display Daily Graph in output report.
DailyStats ( yes | no )
Display Daily Statistics in output report.
HourlyGraph ( yes | no )
Display Hourly Graph in output report.
HourlyStats ( yes | no )
Display Hourly Statistics in output report.
PageType name
Define the file extensions to consider as a page. If a file
is found to have the same extension as name, it will be counted
as a page (sometimes called a pageview).
GraphLegend ( yes | no )
Allows the color coded graph legends to be enabled/disabled.
GraphLines num
Specify the number of background reference lines displayed on the
graphs produced. Disable by using zero ('0'), default is 2.
VisitTimeout num
Specifies the visit timeout value. Default is 1800 seconds (30
minutes). A visit is determined by looking at the difference in time
between the current and last request from a specific site. If the
difference is greater or equal to the timeout value, the request is
counted as a new visit. Specified in seconds.
IndexAlias name
Use name as an additional alias for index.*.
MangleAgents num
Mangle user agent names based on mangle level num. See the
-M command line switch for mangle levels and their meaning.
The default is 0, which doesn't mangle user agents at all.
SearchEngine name variable
Allows the specification of search engines and their query strings.
The name is the name to match against the referrer string for
a given search engine. The variable is the cgi variable that
the search engine uses for queries. See the sample.conf file
for example usage with common search engines.
Incremental ( yes | no )
Enable Incremental mode processing.
IncrementalName name
Filename to use for incremental data. Relative to output directory unless
an absolute name is given (ie: starts with '/'). Defaults to
'webalizer.current' in the standard output directory.
DNSCache name
Filename to use for the DNS cache. Relative to output directory unless
an absolute name is given (ie: starts with '/').
DNSChildren num
Number of children DNS processes to run in order to create/update the
DNS cache file. Specify zero (0) to disable.
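The VisitTimeout rule given above can be illustrated with a short Python sketch. It counts visits for a single site from request times given in seconds, in ascending order. The function name is hypothetical, and counting the very first request as a visit is an assumption.

```python
def count_visits(request_times, timeout=1800):
    """Count visits for one site from ascending request times (seconds).

    A request starts a new visit when the gap since the previous
    request from the same site is greater than or equal to timeout.
    """
    visits = 0
    last = None
    for t in request_times:
        if last is None or t - last >= timeout:
            visits += 1
        last = t
    return visits
```

With the default of 1800 seconds, requests at 0, 100, 1900 and 3700 seconds count as three visits, because the gaps of 1800 seconds each start a new one.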
Top Table Keywords
TopAgents num
Display the top num User Agents table. Use zero to disable.
AllAgents ( yes | no )
Create separate HTML page with All User Agents.
TopReferrers num
Display the top num Referrers table. Use zero to disable.
AllReferrers ( yes | no )
Create separate HTML page with All Referrers.
TopSites num
Display the top num Sites table. Use zero to disable.
TopKSites num
Display the top num Sites (by KByte) table. Use zero to disable.
AllSites ( yes | no )
Create separate HTML page with All Sites.
TopURLs num
Display the top num URLs table. Use zero to disable.
TopKURLs num
Display the top num URLs (by KByte) table. Use zero to disable.
AllURLs ( yes | no )
Create separate HTML page with All URLs.
TopCountries num
Display the top num Countries in the table. Use zero to disable.
TopEntry num
Display the top num Entry Pages in the table. Use zero to disable.
TopExit num
Display the top num Exit Pages in the table. Use zero to disable.
TopSearch num
Display the top num Search Strings in the table. Use zero to disable.
AllSearchStr ( yes | no )
Create separate HTML page with All Search Strings.
TopUsers num
Display the top num Usernames in the table. Use zero to disable.
Usernames are only available if using http based authentication.
AllUsers ( yes | no )
Create separate HTML page with All Usernames.
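The effect of a Top* value can be pictured with a few lines of Python. This is a sketch of the general idea only; the function name is hypothetical, and ranking by a single count is a simplification.

```python
from collections import Counter


def top_entries(counts, num):
    """Return the num highest-count (name, count) pairs.

    counts maps a name (URL, site, agent, ...) to its count.
    A num of zero disables the table, giving an empty list.
    """
    if num <= 0:
        return []
    return Counter(counts).most_common(num)
```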
Hide/Ignore/Group/Include Keywords
HideAgent name
Hide User Agents that match name.
HideReferrer name
Hide Referrers that match name.
HideSite name
Hide Sites that match name.
HideAllSites ( yes | no )
Hide all individual sites. This causes only grouped sites to be displayed.
HideURL name
Hide URL's that match name.
HideUser name
Hide Usernames that match name.
IgnoreAgent name
Ignore User Agents that match name.
IgnoreReferrer name
Ignore Referrers that match name.
IgnoreSite name
Ignore Sites that match name.
IgnoreURL name
Ignore URL's that match name.
IgnoreUser name
Ignore Usernames that match name.
GroupAgent name [Label]
Group User Agents that match name. Display Label in 'Top Agent'
table if given (instead of name).
GroupReferrer name [Label]
Group Referrers that match name. Display Label in 'Top Referrer'
table if given (instead of name).
GroupSite name [Label]
Group Sites that match name. Display Label in 'Top Site'
table if given (instead of name).
GroupDomains num
Automatically group sites by domain. The value num specifies the
level of grouping, and can be thought of as the 'number of dots' to
be displayed. The default value of 0 disables domain grouping.
GroupURL name [Label]
Group URL's that match name. Display Label in 'Top URL'
table if given (instead of name).
GroupUser name [Label]
Group Usernames that match name. Display Label in 'Top
Usernames' table if given (instead of name).
IncludeSite name
Force inclusion of sites that match name. Takes precedence
over Ignore* keywords.
IncludeURL name
Force inclusion of URL's that match name. Takes precedence
over Ignore* keywords.
IncludeReferrer name
Force inclusion of Referrers that match name. Takes precedence
over Ignore* keywords.
IncludeAgent name
Force inclusion of User Agents that match name. Takes precedence
over Ignore* keywords.
IncludeUser name
Force inclusion of Usernames that match name. Takes precedence
over Ignore* keywords.
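The precedence of Include* over Ignore* keywords can be sketched in Python. The function name is hypothetical, and plain substring matching is an assumption made for the example, since the exact matching rules are not described here.

```python
def is_ignored(value, ignore_patterns, include_patterns):
    """Decide whether a record value (site, URL, ...) is ignored.

    An Include match always wins; only when no Include pattern
    matches are the Ignore patterns consulted.
    """
    if any(p in value for p in include_patterns):
        return False
    return any(p in value for p in ignore_patterns)
```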
HTML Generation Keywords
HTMLExtension text
Defines the HTML file extension to use. Default is html. Do not
include the leading period!
HTMLPre text
Insert text at the very beginning of the generated HTML file.
Defaults to a standard HTML 3.2 DOCTYPE record.
HTMLHead text
Insert text within the <HEAD></HEAD> block of the HTML file.
HTMLBody text
Insert text in HTML page, starting with the <BODY> tag. If used, the
first line must be a <BODY ...> tag. Multiple lines may be specified.
HTMLPost text
Insert text at top (before horiz. rule) of HTML pages. Multiple lines
may be specified.
HTMLTail text
Insert text at bottom of the HTML page. The text is top and
right aligned within a table column at the end of the report.
HTMLEnd text
Insert text at the very end of the HTML page. If not specified,
the default is to insert the ending </BODY> and </HTML> tags. If used,
you must supply these tags yourself.
Dump Object Keywords
The Webalizer allows you to export processed data to other programs by
using tab delimited text files. The Dump* commands specify
which files are to be written, and where.
DumpPath name
Save dump files in directory name. If not specified, the default
output directory will be used. Do not specify a trailing slash (/).
DumpExtension name
Use name as the filename extension for dump files. If not given,
the default of tab will be used.
DumpHeader ( yes | no )
Print a column header as the first record of the file.
DumpSites ( yes | no )
Dump the sites data to a tab delimited file.
DumpURLs ( yes | no )
Dump the url data to a tab delimited file.
DumpReferrers ( yes | no )
Dump the referrer data to a tab delimited file. This data is only
available if using a log that contains referrer information
(ie: a combined format web log).
DumpAgents ( yes | no )
Dump the user agent data to a tab delimited file. This data is only
available if using a log that contains user agent information
(ie: a combined format web log).
DumpUsers ( yes | no )
Dump the username data to a tab delimited file. This data is only available
if processing a wu-ftpd xferlog or a web log that contains http authentication
information.
DumpSearchStr ( yes | no )
Dump the search string data to a tab delimited file. This data is only
available if processing a web log that contains referrer information and
had search string information present.
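Because dump files are tab delimited text, another program can read them with very little code. The Python sketch below is one way to do so. The function name is hypothetical, and the columns actually present in each dump file are not described in this document.

```python
import csv


def read_dump(path, has_header=False):
    """Read a tab delimited dump file.

    Returns (header, rows). header is the first record when the file
    was written with DumpHeader enabled (has_header=True), else None.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows
```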
FILES
webalizer.conf
Default configuration file. Is searched for in the current directory
and if not found, in the /etc/ directory.
webalizer.hist
Monthly history file for previous 12 months. (can be changed)
webalizer.current
Current state data file (Incremental processing). (can be changed)
xxxxx_YYYYMM.html
Various monthly HTML output files produced. (extension can be changed)
xxxxx_YYYYMM.png
Various monthly image files used in the reports.
xxxxx_YYYYMM.tab
Monthly tab delimited text files. (extension can be changed)
Copyright (C) 1997-2000 by Bradford L. Barrett. Distributed under
the GNU GPL. See the files "COPYING" and "Copyright",
supplied with all distributions for additional information.