                                  [Linbot]


  ------------------------------------------------------------------------

Installing Linbot

Installation is relatively easy. Note that these instructions are for
Unix-like systems; the steps may differ on other operating systems.

  1. Unpack the gzipped tar archive. Be sure to add the resulting linbot
     directory to your PYTHONPATH environment variable.

     $ tar zxvf linbot-1.0b6.tar.gz -C /usr/local/lib
     $ PYTHONPATH="/usr/local/lib/linbot:$PYTHONPATH"
     $ export PYTHONPATH

  2. Add a symbolic link to linbot.py somewhere in your PATH.

     $ ln -s /usr/local/lib/linbot/linbot.py /usr/local/bin/linbot

  3. Edit the config.py file to suit your needs. Most of the defaults are
     safe, and the important ones can be overridden with command-line
     flags. You may want to keep a copy of the original config.py file just
     in case. The config.py options are documented within the file.

  ------------------------------------------------------------------------

Running Linbot

It is simple to run Linbot.

Executing Linbot without any command-line arguments will cause it to give a
simple synopsis of its usage and then quit.

$ linbot
linbot [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location]...

Before running Linbot on a site, you need to do a little preparation.

One thing that Linbot needs is a directory in which to publish its reports.
It is recommended that you choose a directory which is empty and will only
contain Linbot reports. This directory must exist and be writable by the
user running Linbot before Linbot is run.

$ mkdir /usr/local/apache/share/htdocs/linbot
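
The same preparation can be scripted. The following is a minimal sketch in
Python and is not part of Linbot; the helper name prepare_report_dir is made
up for illustration, and a temporary directory stands in for the web
server's document root.

```python
import os
import tempfile


def prepare_report_dir(path):
    """Create path if needed and confirm the current user can write to it.

    Hypothetical helper for illustration; Linbot does not ship it.
    """
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError("cannot write reports to %s" % path)
    return os.path.abspath(path)


if __name__ == "__main__":
    # A temporary directory stands in for the real document root.
    with tempfile.TemporaryDirectory() as root:
        print(prepare_report_dir(os.path.join(root, "linbot")))
```

Running a check like this as the same user that will run Linbot catches
permission problems before a long run starts.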

The report can be viewed using most web browsers. Browsers that support
frames can initially open the "index.html" file. Browsers that do not
support frames, or users who do not like frames, can initially open the
"navbar.html" file. Note that these are the default filenames for Linbot and
may be changed via the config file.

You should decide beforehand which documents on your site should be
considered "internal" and which should be considered "external". Linbot
defines internal and external documents as follows:

An internal document is a part of your site that you control and want
checked, along with the links that it points to. Basically, an internal
document is one that, if broken, you have the power to fix.

An external document is one that an internal document points to but that you
have no jurisdiction over. It can also be a document that you have the power
to change but that need not be checked, such as documents pointed to by CGI
scripts or other automated tools such as Linbot.

Your base url is the url pointing to the document that is the top level of
your site. Commonly referred to as the "home page", it is the url that
points to all other urls, either directly or indirectly. The base url can be
on one web server but point to documents on another server that hosts other
internal documents. An example would be a main server
www.someplaceonthenet.com in which there may be links to an alternate server
called www2.someplaceonthenet.com. In this case www2.someplaceonthenet.com
would host internal documents even though your "home page" is on
www.someplaceonthenet.com.
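
The host-based part of this distinction can be sketched in a few lines of
Python. This is only an illustration of the idea, not Linbot's actual code:
the function name is_internal, its parameters, and the sample links other
than the two someplaceonthenet.com hosts are assumptions.

```python
from urllib.parse import urlsplit


def is_internal(url, base_url, extra_hosts=()):
    """Return True if url is on the base url's host or a listed extra host.

    Hypothetical helper; it only compares host names and ignores paths.
    """
    internal_hosts = {urlsplit(base_url).netloc.lower()}
    internal_hosts.update(host.lower() for host in extra_hosts)
    return urlsplit(url).netloc.lower() in internal_hosts


base = "http://www.someplaceonthenet.com/"
extra = ["www2.someplaceonthenet.com"]
for link in ("http://www.someplaceonthenet.com/about.html",
             "http://www2.someplaceonthenet.com/docs/",
             "http://www.elsewhere.example/"):
    print(link, "internal" if is_internal(link, base, extra) else "external")
```

Without the extra host, the www2 link would be treated as external, which is
why a second server has to be named explicitly.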

That said, you should have a basic idea of what you do and do not want
Linbot to check. Don't be surprised if you do not get it exactly right the
first time. Also, consider using a robots.txt file, as explained at
http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html.
Currently Linbot identifies itself as User-Agent: Linbot.

You can allow Linbot to search a directory but restrict other bots, for
example, like this:

User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
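
To see how a robots.txt parser reads these rules, they can be fed to the
parser in Python's standard library. This is a sketch for checking the rules
themselves; it does not claim that Linbot uses this module, and the bot name
OtherBot and the index.html url are made up for the test.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse the text directly, no network access

url = "http://www.someplaceonthenet.com/index.html"
print("Linbot allowed:", parser.can_fetch("Linbot", url))      # True
print("OtherBot allowed:", parser.can_fetch("OtherBot", url))  # False
```

Note that not every robot understands the Allow line, so it is worth testing
the file against the robots you care about.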

Okay, you have heard enough and you just want to run the darn thing. The
simplest way to run Linbot is:

$ linbot http://www.someplaceonthenet.com/

This will first read the robots.txt file at www.someplaceonthenet.com, if
that file exists, and then proceed to examine every link on that site except
those denied by robots.txt.

The exact usage for linbot is given below.

  ------------------------------------------------------------------------

Synopsis

 linbot [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir]
        [-w sec][-d level] url [location[:port]]...



 -x regex   Use this option to tell Linbot to consider any url matching
            regex to be external. Uses perl-type regular expressions. Can
            be used multiple times.
 -y regex   Like the -x flag, except that Linbot will not check links
            matching regex at all, whereas -x checks the link but not its
            children. Uses perl-type regular expressions. Can be used
            multiple times.
 -l url     Use url for the logo image on all reports. The url should
            point to a valid image.
 -b         Base urls only. Tells Linbot to consider any url that does not
            start with the base url to be external. For example, if you
            run linbot -b
            http://www.someplaceonthenet.com/~somebody/foo.html then
            http://www.someplaceonthenet.com/~somebody/misc/index.html
            will be considered internal whereas
            http://www.someplaceonthenet.com/ will be considered external.
 -a         Avoid external links. Normally if Linbot is examining an HTML
            page and it finds a link that points to an external document,
            it will check to see if that external document exists. This
            flag disables that action, so external links will not be
            checked.
 -q         Quiet. Do not print out the progress as Linbot traverses a
            site (equivalent to -d 0).
 -o dir     Output directory. Use to specify the directory where Linbot
            will dump its reports. The default is the current directory or
            as specified by config.py. If this directory does not exist it
            will be created for you (if possible).
 -r depth   Redirect depth. The number of redirects Linbot should follow
            when following a link. 0 implies follow all redirects.
 -w secs    Wait secs seconds between link checks. Usually Linbot will
            process a url and immediately move on to the next. However, on
            some loaded systems it may be desirable to have Linbot pause
            between requests. This option can be set to any non-negative
            number.
 -d level   Set debug level to level. For programmer-level debugging use a
            level greater than 1.
 url        The base url. Linbot checks this link first, then all the
            links it points to on down the "tree".
 location   Specifies additional hosts that are to be considered internal.
            By default Linbot only considers urls pointing to the host of
            the base url to be internal. However, if your site resides on
            multiple servers, use this parameter to tell Linbot what other
            servers should be considered internal. May be used multiple
            times, but must follow url.
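
How the -x, -y and -b options interact can be sketched in Python. This is an
illustration under stated assumptions, not Linbot's implementation: the
function classify and its parameters are hypothetical, patterns are matched
anywhere in the url with re.search, the -b prefix is taken to be the
directory part of the base url (inferred from the -b example above), and the
host check done by the location arguments is left out. The cgi-bin url is
made up.

```python
import re


def classify(url, base_url, external=(), skip=(), base_only=False):
    """Return 'skip', 'external', or 'internal' for url.

    external and skip are lists of regular expressions standing in for
    repeated -x and -y flags; base_only stands in for -b.
    """
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"        # like -y: do not check the link at all
    if any(re.search(pattern, url) for pattern in external):
        return "external"    # like -x: check the link, not its children
    if base_only:
        # Assumption: compare against the directory part of the base url.
        prefix = base_url[:base_url.rfind("/") + 1]
        if not url.startswith(prefix):
            return "external"
    return "internal"


base = "http://www.someplaceonthenet.com/~somebody/foo.html"
site = "http://www.someplaceonthenet.com/"
print(classify(site + "~somebody/misc/index.html", base, base_only=True))
# internal
print(classify(site, base, base_only=True))
# external
print(classify(site + "~somebody/cgi-bin/x", base, skip=[r"/cgi-bin/"]))
# skip
```

Checking skip patterns first is a design choice of this sketch: a url that
matches both a -y and a -x pattern is simply not checked.
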
  ------------------------------------------------------------------------

Examples

Here are some examples of running Linbot.

$ linbot -x /linbot http://manson.ddns.org/ starship.skyport.net
$ linbot -o /stats/altavista/ http://altavista.digital.com/
$ linbot -o ~/Lang/Python/linbot -b -l \
    http://manson.ddns.org/images/marduk.gif http://manson.ddns.org/~marduk/
  ------------------------------------------------------------------------

Running Periodically

Linbot may be safely run unattended, either periodically or during off-peak
hours, using cron or at. You may want to redirect Linbot's output to the
null device or a log file, or have it emailed to an account. Consult your
operating system manuals for how this can be done on your system.
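
If a shell redirect is not convenient, a small Python wrapper can capture
the output in a log file instead. The sketch below is not part of Linbot;
the helper name run_logged is hypothetical, and a harmless stand-in command
is used so that it runs without Linbot installed.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(command, log_path):
    """Run command, appending its stdout and stderr to log_path.

    Returns the command's exit status. Hypothetical helper for illustration.
    """
    with open(log_path, "a") as log:
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Stand-in command. A scheduled job might instead pass something like
    # ["linbot", "-q", "-o", "/path/to/reports", "http://www.someplaceonthenet.com/"].
    with tempfile.TemporaryDirectory() as tmp:
        log_file = os.path.join(tmp, "linbot.log")
        status = run_logged([sys.executable, "-c", "print('checked')"], log_file)
        print("exit status:", status)
```

Appending rather than overwriting keeps the output of earlier runs, which
helps when comparing results over time.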

  ------------------------------------------------------------------------

Feedback

If you have any questions about Linbot or would like to report a bug, the
recommended way of doing so is through the mailing list. You should also
check the archives to make sure a bug hasn't already been reported (see the
BUGS file as well). It helps a lot to include a url where the problem can be
found, an HTML file where the error occurs or a (small) tar of the site
where the error occurs. Suggestions for improvements are also welcomed.
Patches and code contributions are even better. Please subscribe to the
mailing list and send your reports to it. Please do not email marduk
directly concerning bug reports.
