Checkbot verifies the links in a specific portion of the World Wide
Web. It creates HTML pages with diagnostics.
Checkbot uses LWP to find URLs on pages and to check them. It supports
the same schemes as LWP does, and finds the same links that
HTML::LinkExtor will find.
Checkbot considers links to be either 'internal' or
'external'. Internal links are links within the web space that needs
to be checked. If an internal link points to a web document, this
document is retrieved, and its links are extracted and
processed. External links are only checked to see whether they
work. Checkbot checks links as it finds them, so internal and external
links are checked at the same time, even though they are treated
differently.
Options for Checkbot are:
--url <start URL>
Set the start URL. Checkbot starts checking at this URL, and then
recursively checks all links found on this page. The start URL takes
precedence over additional URLs specified on the command line.
If no scheme is specified for the URL, the file protocol is assumed.
--match <match string>
This option selects which pages Checkbot considers local. If the
match string is contained within the URL, then Checkbot considers
the page local, retrieves it, and will check all the links contained
on it. Otherwise the page is considered external and it is only
checked with a HEAD request.
If no explicit match string is given, the start URLs (See option
"--url") will be used as a match string instead. In this case the
last page name, if any, will be trimmed. For example, a start URL like
"http://some.site/index.html" will result in a default match
string of "http://some.site/".
The match string can be a perl regular expression. For example, to
check the main server page and all HTML pages directly underneath it,
but not the HTML pages in the subdirectories of the server, the
match string would be "www.someserver.xyz/($|[^/]+.html)".
--exclude <exclude string>
URLs matching the exclude string are considered to be external,
even if they happen to match the match string (See option "--match").
The exclude string can be a perl regular expression.
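The interplay of the match string, the exclude string, and the default
match string can be sketched as follows. This is an illustrative Python
sketch, not Checkbot's own Perl code: the function names default_match
and classify are hypothetical, Python's re module stands in for Perl
regular expressions, and the exact trimming rule for the default match
string is an assumption based on the example given for "--match".

```python
import re
from urllib.parse import urlsplit, urlunsplit


def default_match(start_url):
    """Trim the last page name from a start URL (assumed trimming rule)."""
    parts = urlsplit(start_url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


def classify(url, match, exclude=None):
    """Return "internal" or "external" for a URL."""
    # The exclude string wins even when the match string also matches.
    if exclude is not None and re.search(exclude, url):
        return "external"
    return "internal" if re.search(match, url) else "external"
```

With the match string "www.someserver.xyz/($|[^/]+.html)", this sketch
classifies http://www.someserver.xyz/index.html as internal and
http://www.someserver.xyz/sub/page.html as external.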
--filter <filter string>
This option defines a filter string, which is a perl regular
expression. This filter is run on each URL found, thus rewriting the
URL before it enters the queue to be checked. It can be used to remove
elements from a URL. This option can be useful when symbolic links
point to the same directory, or when a content management system adds
session IDs to URLs.
For example "/old/new/" would replace occurrences of 'old' with 'new'
in each URL.
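The effect of a filter string can be sketched in Python as shown here.
This is only an illustration under stated assumptions: the helper
apply_filter is hypothetical, it assumes that neither the pattern nor
the replacement contains a "/", and it replaces every occurrence in the
URL. How Checkbot itself parses and applies the filter is not shown
here. The "jsessionid" parameter in the usage note is likewise a
made-up example of a session ID.

```python
import re


def apply_filter(url, filter_string):
    """Apply a "/pattern/replacement/" filter string to a URL."""
    # "/old/new/" splits into ["", "old", "new", ""].
    _, pattern, replacement, _ = filter_string.split("/")
    return re.sub(pattern, replacement, url)
```

For instance, apply_filter("http://some.site/old/page.html", "/old/new/")
rewrites the URL to use 'new', and a filter such as
"/;jsessionid=[A-Za-z0-9]+//" removes a session ID of that form.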
--ignore <ignore string>
If a URL has an error, and matches the ignore string, its error
will not be listed. This can be useful to stop certain errors from
being listed.
The ignore string can be a perl regular expression.
--proxy <proxy URL>
This attribute specifies the URL for a proxy server. Only external URLs
are queried through this proxy server, because Checkbot assumes all
internal URLs can be accessed directly. Currently only HTTP and FTP
requests will be sent to the proxy server.
--internal-only
Skip the checking of external links at the end of the Checkbot
run. Only matching links are checked. Note that some redirections may
still cause external links to be checked.
--mailto <email address>
Send mail to the email address when Checkbot is done
checking. Includes a small summary of the results.
--note <note>
The note is included verbatim in the mail message (See option
"--mailto"). This can be useful to include the URL of the summary HTML page
for easy reference, for instance.
Only meaningful in combination with the "--mailto" option.
--help
Show a brief help message on the standard output.
--verbose
Show verbose output while running. Includes all links checked, results
from the checks, etc.
--debug
Enable debugging mode. Not really supported anymore, but it will keep
some files around that otherwise would be deleted.
--sleep <seconds>
Number of seconds to sleep in between requests. Default is 0
seconds, i.e. do not sleep at all between requests. Setting this
option can be useful to keep the load on the web server down while
running Checkbot. This option can also be set to a fractional number,
e.g. a value of 0.1 will sleep one tenth of a second between requests.
--timeout <timeout>
Default timeout for the requests, specified in seconds. The default is
2 minutes.
--interval <seconds>
The maximum interval between updates in seconds. Default is 3 hours
(10800 seconds). Checkbot will start the interval at one minute, and
gradually extend it towards the maximum interval.
--file <file name>
Write the summary pages to the given file name. Default is "checkbot.html".
--style <URL of style file>
When this option is used, Checkbot embeds this URL as a link to a
style file on each page it writes. This makes it easy to customize the
layout of pages generated by Checkbot.
--dontwarn <HTTP response codes regular expression>
Do not include warnings on the result pages for those HTTP response
codes which match the regular expression. For instance, --dontwarn
"(301|404)" would not include 301 and 404 response codes.
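A minimal Python sketch of how a "--dontwarn" regular expression could
suppress warnings for certain response codes. The function should_warn
is hypothetical, and whether Checkbot anchors the expression against the
whole code is not stated here; this sketch uses an unanchored search.

```python
import re


def should_warn(status_code, dontwarn=None):
    """Return True if a response code should be listed as a warning."""
    if dontwarn is None:
        return True
    # Suppress the warning when the code matches the expression.
    return re.search(dontwarn, str(status_code)) is None
```

With dontwarn set to "(301|404)", responses with code 301 or 404 are
left out, while a 500 response is still reported.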
--enable-virtual
This option enables dealing with virtual servers. Checkbot then
assumes that all hostnames for internal servers are unique, even
though their IP addresses may be the same. Normally Checkbot uses the
IP address to distinguish servers. This has the advantage that if a
server has two names (e.g. www and bamboozle) its pages only get
checked once. When you want to check multiple virtual servers this
causes problems, which this feature works around by using the hostname
to distinguish the server.
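The difference between the two ways of distinguishing servers can be
sketched in Python. This is an assumption-laden illustration, not
Checkbot's code: server_key is a hypothetical helper, and the resolve
argument is a stand-in for a DNS lookup so that the example runs without
network access. The host names and addresses are made up.

```python
from urllib.parse import urlsplit


def server_key(url, resolve, enable_virtual=False):
    """Return the value used to tell servers apart."""
    host = urlsplit(url).hostname
    # With virtual servers enabled the hostname is the key;
    # otherwise the resolved IP address is used.
    return host if enable_virtual else resolve(host)


# Hypothetical lookup table standing in for DNS resolution.
addresses = {
    "www.some.site": "192.0.2.10",
    "bamboozle.some.site": "192.0.2.10",
}
same = server_key("http://www.some.site/", addresses.get) == \
    server_key("http://bamboozle.some.site/", addresses.get)
distinct = server_key("http://www.some.site/", addresses.get, True) != \
    server_key("http://bamboozle.some.site/", addresses.get, True)
```

In this sketch the two names count as one server by default (same is
True) and as two servers when virtual servers are enabled (distinct is
True).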
--allow-simple-hosts
This option turns off warnings about URLs which contain unqualified
host names. This is useful for intranet sites which often use just a
simple hostname or even "localhost" in their links.
--language
The argument for this option is a two-letter language code. Checkbot
will use language negotiation to request files in that language. The
default is to request English language (language code 'en').
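In HTTP, language negotiation is carried by the Accept-Language request
header. A small Python sketch of a request that asks for a given
language follows; build_request is a hypothetical helper, it only
constructs the request object without sending it, and it is not taken
from Checkbot itself.

```python
from urllib.request import Request


def build_request(url, language="en"):
    """Build a request that asks for the given two-letter language."""
    return Request(url, headers={"Accept-Language": language})
```

For example, build_request("http://some.site/", "nl") produces a request
whose Accept-Language header is "nl".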
HINTS AND TIPS
Problems with checking FTP links
Some users may experience consistent problems with checking FTP
links. In these cases it may be useful to instruct Net::FTP to use
passive FTP mode to check files. This can be done by setting the
environment variable FTP_PASSIVE to 1. For example, using the bash
shell: "FTP_PASSIVE=1 checkbot ...". See the Net::FTP documentation
for more details.
Run-away Checkbot
In some cases Checkbot literally takes forever to finish. There are two
common causes for this problem. First, there might be a database
application as part of the website which generates a new page based on
links on another page. Since Checkbot tries to travel through all
links this will create an infinite number of pages. Second, a server
configuration problem can cause a loop in generating URLs for pages
that really do not exist. This will result in URLs of the form
http://some.server/images/images/images/logo.png, with ever more
'images' included. Checkbot cannot check for this because the server
should have indicated that the requested pages do not exist.
PREREQUISITES
This script uses the "LWP" modules.
COREQUISITES
This script can send mail when "Mail::Send" is present.