Web log analysis notes

This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine.

(See also Network tools)

Separate reports vs. merging

If you want detailed statistics for a site that has several vhosts, you would typically configure the analyser to read each vhost's log file and create a separate report for it.

If, however, you want an overall analysis (instead or as well), merge the logs properly by date first, so that analysers don't trip over the out-of-order entries that simply catting the files together would produce. See e.g. mergelog.
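As an illustration of what merging by date means, here is a minimal Python sketch (not mergelog itself). It assumes every input is in Common/Combined Log Format, that each file is already in time order internally, and that month names are English (strptime's %b depends on the locale). The names log_time and merge_logs are made up for this example.

```python
import heapq
import re
from datetime import datetime

# Matches the bracketed timestamp of a Common/Combined Log Format line,
# e.g. [10/Oct/2000:13:55:36 -0700]
_TS = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def log_time(line):
    """Return the timezone-aware timestamp of one log line."""
    match = _TS.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*sources):
    """Lazily merge iterables of log lines into one time-ordered stream.

    Each source (an open file, a list of lines) must already be sorted
    by time; heapq.merge then only keeps one line per source in memory.
    """
    return heapq.merge(*sources, key=log_time)
```

Passing open file objects to merge_logs and writing the result with writelines() gives one combined log. Because the timestamps are parsed with their UTC offsets, logs written in different timezones still interleave correctly.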


Webalizer

It seems that when you supply a configuration file to webalizer, it applies on top of the configuration in /etc/webalizer.conf.

This means that individual reports for virtual hosts can be done by creating extra configuration files containing just a LogFile, an OutputDir and (probably) a HostName. You can leave the general /etc/webalizer.conf to make an overall report (with non-working links, mind) and add files like /etc/webalizer.wiki.conf for each separate site.
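A small sketch of generating such per-site files, in Python. Webalizer's keyword for the input log is LogFile. The site name, log path, output directory and hostname below are placeholders, not taken from any real setup, and the function names are made up for this example.

```python
def vhost_config(log_file, output_dir, host_name):
    """Render a minimal per-site webalizer configuration as text."""
    lines = [
        "LogFile    %s" % log_file,
        "OutputDir  %s" % output_dir,
        "HostName   %s" % host_name,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical sites: short name -> (log file, output dir, hostname).
SITES = {
    "wiki": (
        "/var/log/apache2/wiki.access.log",
        "/var/www/stats/wiki",
        "wiki.example.com",
    ),
}


def all_configs(sites):
    """Map each config path (/etc/webalizer.NAME.conf) to its contents."""
    return {
        "/etc/webalizer.%s.conf" % name: vhost_config(*settings)
        for name, settings in sorted(sites.items())
    }
```

Each resulting file would then be passed to webalizer with its -c option, for example webalizer -c /etc/webalizer.wiki.conf, so that only these three settings differ from the general configuration.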

Quick examples

These are webalizer config lines.

Referrers

GroupReferrer  http://www.google.                  (Web searches)
HideReferrer   http://www.google.
GroupReferrer  .104/search                         (Web searches)
HideReferrer   .104/search
GroupReferrer  search.msn.                         (Web searches)
HideReferrer   search.msn.
GroupReferrer  search.live.                        (Web searches)
HideReferrer   search.live.
GroupReferrer  search.yahoo.                       (Web searches)
HideReferrer   search.yahoo.

GroupReferrer  mozbot.com/search                   (Web searches)
HideReferrer   mozbot.com/search
GroupReferrer  .dogpile.                           (Web searches)
HideReferrer   .dogpile.

GroupReferrer  images.google.                      (Web image searches)
HideReferrer   images.google.

GroupReferrer  /language/translatedPage         (Translation)
HideReferrer   /language/translatedPage
GroupReferrer  /translate_c                     (Translation)
HideReferrer   /translate_c
GroupReferrer  http://translate.google.         (Translation)
HideReferrer   http://translate.google.

GroupReferrer  .stumbleupon.  (Social sites)
HideReferrer   .stumbleupon.
GroupReferrer  del.icio.us    (Social sites)
#HideReferrer   del.icio.us  #These may be interesting to visit, so don't hide them.

GroupReferrer  .pingdom.                           (Web-based tools)
HideReferrer   .pingdom.
GroupReferrer  .whois.                             (Web-based tools)
HideReferrer   .whois.
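Since every entry above is a Group line plus a matching Hide line, the pairs can be generated from a table instead of being typed twice. The following Python sketch does that; group_and_hide and its keep_visible parameter are names invented for this example, and the output is plain text to paste into a webalizer config.

```python
def group_and_hide(kind, groups, keep_visible=()):
    """Build Group<kind>/Hide<kind> config lines.

    kind         -- "Referrer" or "Agent"
    groups       -- mapping of group label to a list of match patterns
    keep_visible -- patterns whose Hide line is emitted commented out,
                    so the individual entries stay in the report
    """
    lines = []
    for label, patterns in groups.items():
        for pattern in patterns:
            lines.append("Group%s  %s %s" % (kind, pattern.ljust(30), label))
            if pattern in keep_visible:
                lines.append("#Hide%s  %s" % (kind, pattern))
            else:
                lines.append("Hide%s   %s" % (kind, pattern))
    return lines


if __name__ == "__main__":
    table = {"(Social sites)": [".stumbleupon.", "del.icio.us"]}
    print("\n".join(group_and_hide("Referrer", table, {"del.icio.us"})))
```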

Agents

See also user-agents.org (and Google searches) for reference.

Major crawlers

GroupAgent  Googlebot/        Crawler: Google
HideAgent   Googlebot/
GroupAgent  Googlebot-Image/  Crawler: Google
HideAgent   Googlebot-Image/
#Why aren't these Googlebot, though?
GroupAgent  GoogleSpider   Crawler: Google
HideAgent   GoogleSpider
GroupAgent  Mediapartners-Google/   Crawler: Google
HideAgent   Mediapartners-Google/


GroupAgent  msnbot/           Crawler: MSN
HideAgent   msnbot/
GroupAgent  msnbot-media/     Crawler: MSN
HideAgent   msnbot-media/

GroupAgent  slurp             Crawler: Yahoo
HideAgent   slurp
GroupAgent  Yahoo-MMCrawler   Crawler: Yahoo
HideAgent   Yahoo-MMCrawler

GroupAgent  Jeeves            Crawler: Jeeves
HideAgent   Jeeves

Minor crawlers

You'll get endless minor bots. There are lists with dozens of these in webalizer config format, but I figure they're quite outdated, so I just add what I see.

#Big? Small? Not sure.
GroupAgent  IlseBot           Crawler: minor crawlers
HideAgent   IlseBot
GroupAgent  iearthworm        Crawler: minor crawlers
HideAgent   iearthworm


GroupAgent  heritrix          Crawler: minor crawlers
HideAgent   heritrix
GroupAgent  larbin            Crawler: minor crawlers
HideAgent   larbin
GroupAgent  MJ12bot           Crawler: minor crawlers
HideAgent   MJ12bot

GroupAgent  Exabot            Crawler: minor crawlers
HideAgent   Exabot
GroupAgent  Nutch             Crawler: minor crawlers
HideAgent   Nutch
GroupAgent  SBIder            Crawler: minor crawlers
HideAgent   SBIder
GroupAgent  VadixBot          Crawler: minor crawlers
HideAgent   VadixBot
GroupAgent  Twiceler          Crawler: minor crawlers
HideAgent   Twiceler
GroupAgent  psbot             Crawler: minor crawlers
HideAgent   psbot
GroupAgent  VoilaBot          Crawler: minor crawlers
HideAgent   VoilaBot
GroupAgent  Yoono             Crawler: minor crawlers
HideAgent   Yoono
GroupAgent  yoono/            Crawler: minor crawlers
HideAgent   yoono/
GroupAgent  Gigabot           Crawler: minor crawlers
HideAgent   Gigabot
GroupAgent  TurnitinBot       Crawler: minor crawlers
HideAgent   TurnitinBot
GroupAgent  NextGenSearchBot  Crawler: minor crawlers
HideAgent   NextGenSearchBot
GroupAgent  findlinks         Crawler: minor crawlers
HideAgent   findlinks
GroupAgent  MyFamilyBot       Crawler: minor crawlers
HideAgent   MyFamilyBot
GroupAgent  ZyBorg            Crawler: minor crawlers
HideAgent   ZyBorg
GroupAgent  converacrawler/   Crawler: minor crawlers
HideAgent   converacrawler/
GroupAgent  BecomeBot/        Crawler: minor crawlers
HideAgent   BecomeBot/
GroupAgent  FurlBot/        Crawler: minor crawlers
HideAgent   FurlBot/

GroupAgent  Baiduspider       Crawler: minor crawlers
HideAgent   Baiduspider
GroupAgent  lanshanbot        Crawler: minor crawlers
HideAgent   lanshanbot
GroupAgent  OnetSzukaj        Crawler: minor crawlers
HideAgent   OnetSzukaj
GroupAgent  sogou             Crawler: minor crawlers
HideAgent   sogou


GroupAgent  Bloglines         Crawler: feed crawlers
HideAgent   Bloglines
GroupAgent  Feedster          Crawler: feed crawlers
HideAgent   Feedster
GroupAgent  OctBot            Crawler: feed crawlers
HideAgent   OctBot


Browsers

GroupAgent  Gecko/       Browser: Gecko-based (Firefox, Mozilla, etc.)
HideAgent   Gecko/

GroupAgent  KHTML,       Browser: KHTML-based (Safari, Konqueror, etc.)
HideAgent   KHTML,
GroupAgent  Konqueror/   Browser: KHTML-based (Safari, Konqueror, etc.)
HideAgent   Konqueror/

GroupAgent  Opera        Browser: Opera
HideAgent   Opera

GroupAgent  MSIE         Browser: MSIE
HideAgent   MSIE


GroupAgent  Sleipnir     Niche browsers, command-line browsers
HideAgent   Sleipnir
GroupAgent  Avant        Niche browsers, command-line browsers
HideAgent   Avant
GroupAgent  Dillo        Niche browsers, command-line browsers
HideAgent   Dillo

GroupAgent  Lynx         Niche browsers, command-line browsers
HideAgent   Lynx
GroupAgent  w3m/         Niche browsers, command-line browsers
HideAgent   w3m/
GroupAgent  Links        Niche browsers, command-line browsers
HideAgent   Links
GroupAgent  edbrowse     Niche browsers, command-line browsers
HideAgent   edbrowse

Other

GroupAgent  Acrobat     Applications and libraries
HideAgent   Acrobat
GroupAgent  Java        Applications and libraries
HideAgent   Java
GroupAgent  libwww      Applications and libraries
HideAgent   libwww

GroupAgent  Pingdom     Web-based tools and playthings
HideAgent   Pingdom
GroupAgent  webGobbler  Web-based tools and playthings
HideAgent   webGobbler

GroupAgent  Wget/        Downloaders
HideAgent   Wget/
GroupAgent  curl/        Downloaders
HideAgent   curl/
GroupAgent  lftp         Downloaders
HideAgent   lftp
GroupAgent  WebCopier    Downloaders
HideAgent   WebCopier
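To see which agents are not yet covered by any GroupAgent line, something like the following Python sketch can help. It assumes combined-format logs (user agent as the last quoted field) and approximates webalizer's matching as a case-insensitive substring test, which may differ from what webalizer really does. The function names are made up for this example.

```python
import re
from collections import Counter

# In combined log format the user agent is the last double-quoted field.
_AGENT = re.compile(r'"([^"]*)"\s*$')


def parse_group_agents(config_text):
    """Collect the match patterns of all active GroupAgent lines."""
    patterns = []
    for line in config_text.splitlines():
        fields = line.split()
        # Commented-out lines start with '#', so fields[0] will not match.
        if len(fields) >= 2 and fields[0] == "GroupAgent":
            patterns.append(fields[1])
    return patterns


def unmatched_agents(log_lines, patterns):
    """Return (agent, hits) pairs, most frequent first, for agents that
    none of the patterns match."""
    lowered = [p.lower() for p in patterns]
    counts = Counter()
    for line in log_lines:
        match = _AGENT.search(line)
        if match is None:
            continue
        agent = match.group(1)
        if not any(p in agent.lower() for p in lowered):
            counts[agent] += 1
    return counts.most_common()
```

The most frequent entries in the result are the candidates for new GroupAgent/HideAgent pairs.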

Search engines

These are used to extract the search strings. There are various lists out there you can copy-paste. Note, however, that these things change over time, and various lists have outdated entries, both in the URL fragments they match on and sometimes in the query variable they look for.

A minimal start:

SearchEngine google.             q=
SearchEngine search.yahoo.       p=
SearchEngine search.msn.         q=
SearchEngine search.live.        q=
SearchEngine search.lycos.       query=
SearchEngine search.netscape.    search=
SearchEngine search.aol.         query=
SearchEngine ask.                q=
SearchEngine altavista.          q=
SearchEngine hotbot.             query=

I don't think it's easy to see false negatives -- that is, to see what engines it's missing and what configuration lines are incorrect.
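One way to at least surface likely false negatives is to run the referrers from a log through the same list outside webalizer. The Python sketch below is an approximation: it treats the first field as a substring of the referrer and the second as a query variable, which is a simplification of whatever webalizer does internally. COMMON_KEYS is a guess at typical search variables, not an authoritative list, and the function names are made up for this example.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Same pairs as the SearchEngine lines above, without the trailing '='.
ENGINES = [
    ("google.", "q"), ("search.yahoo.", "p"), ("search.msn.", "q"),
    ("search.live.", "q"), ("search.lycos.", "query"),
    ("search.netscape.", "search"), ("search.aol.", "query"),
    ("ask.", "q"), ("altavista.", "q"), ("hotbot.", "query"),
]

# Assumption: variable names that often carry a search string.
COMMON_KEYS = {"q", "p", "query", "search", "text", "wd"}


def check_referrer(referrer, engines=ENGINES):
    """Classify one referrer URL.

    ("found", term)      a listed engine matched and its variable is set
    ("no-term", pattern) a listed engine matched, but the variable is
                         missing (the config line may be outdated)
    ("unlisted", host)   no engine matched, yet the URL looks like a search
    None                 nothing search-like
    """
    parts = urlsplit(referrer)
    query = parse_qs(parts.query)
    for pattern, key in engines:
        if pattern in referrer:
            if key in query:
                return ("found", query[key][0])
            return ("no-term", pattern)
    if any(key in query for key in COMMON_KEYS):
        return ("unlisted", parts.netloc)
    return None


def suspicious(referrers, engines=ENGINES):
    """Count the "no-term" and "unlisted" results, most frequent first."""
    counts = Counter()
    for referrer in referrers:
        result = check_referrer(referrer, engines)
        if result is not None and result[0] != "found":
            counts[result] += 1
    return counts.most_common()
```

Frequent "unlisted" hosts suggest a missing SearchEngine line, and frequent "no-term" patterns suggest a line whose query variable is no longer right.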