Web log analysis notes
This article/section is a stub: probably a pile of half-sorted notes and assertions, some of which may well be wrong, and not verified as a whole. Feel free to add or refine.
(See also Network tools)
Separate reports vs. merging
If you want detailed statistics for a site that has several vhosts, you would usually configure the analyser to read each vhost's log file and produce a separate report for it.
If, however, you want an overall analysis (instead or as well), merge the logs properly, by date: simply catting them together leaves the entries out of order, which analysers tend to trip over. See e.g. mergelog.
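As a rough illustration of what merging by date means, here is a minimal Python sketch. It assumes each input is already in time order on its own and uses the usual Common/Combined Log Format timestamp; the names log_time and merge_logs are made up for this example. This is not how mergelog itself is implemented.

```python
import heapq
import re
from datetime import datetime

# Matches the bracketed timestamp of Common/Combined Log Format lines,
# e.g. [10/Oct/2000:13:55:36 -0700]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def log_time(line):
    """Return the line's timestamp as a timezone-aware datetime."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*streams):
    """Lazily merge iterables of log lines that are each already in time order."""
    return heapq.merge(*streams, key=log_time)
```

Passing open file objects (one per vhost log) to merge_logs and writing the result out line by line gives a single log that an analyser can read in order. Because the merge is lazy, it never holds whole files in memory.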
Webalizer
It seems that when you supply a configuration file to webalizer (with -c), it is applied on top of the configuration in /etc/webalizer.conf.
This means that individual reports for virtual hosts can be done by creating extra configuration files containing just a LogFile, an OutputDir and (probably) a HostName. You can let the general /etc/webalizer.conf produce an overall report (with non-working links, mind) and have files like /etc/webalizer.wiki.conf for each separate site.
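A small Python sketch of generating such per-site files. The site name, paths and hostname are hypothetical, and the helper name vhost_config is made up for this example. LogFile is the directive name as I know it from webalizer's sample configuration.

```python
def vhost_config(log_file, output_dir, host_name=None):
    """Return the text of a small per-site webalizer config file."""
    lines = ["LogFile %s" % log_file, "OutputDir %s" % output_dir]
    if host_name:
        lines.append("HostName %s" % host_name)
    return "\n".join(lines) + "\n"

# Hypothetical site, purely for illustration:
# name -> (log file, report directory, hostname)
SITES = {
    "wiki": ("/var/log/apache2/wiki.access.log",
             "/var/www/stats/wiki",
             "wiki.example.com"),
}

for name, (log_file, output_dir, host_name) in sorted(SITES.items()):
    print("# /etc/webalizer.%s.conf" % name)
    print(vhost_config(log_file, output_dir, host_name))
```

Each generated text would be saved under the file name shown in its comment line, and then run with something like webalizer -c /etc/webalizer.wiki.conf.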
Quick examples
These are webalizer config lines.
Referrers
GroupReferrer http://www.google. (Web searches)
HideReferrer http://www.google.
GroupReferrer .104/search (Web searches)
HideReferrer .104/search
GroupReferrer search.msn. (Web searches)
HideReferrer search.msn.
GroupReferrer search.live. (Web searches)
HideReferrer search.live.
GroupReferrer search.yahoo. (Web searches)
HideReferrer search.yahoo.
GroupReferrer mozbot.com/search (Web searches)
HideReferrer mozbot.com/search
GroupReferrer .dogpile. (Web searches)
HideReferrer .dogpile.

GroupReferrer images.google. (Web image searches)
HideReferrer images.google.

GroupReferrer /language/translatedPage (Translation)
HideReferrer /language/translatedPage
GroupReferrer /translate_c (Translation)
HideReferrer /translate_c
GroupReferrer http://translate.google. (Translation)
HideReferrer http://translate.google.

GroupReferrer .stumbleupon. (Social sites)
HideReferrer .stumbleupon.
GroupReferrer del.icio.us (Social sites)
#These may be interesting to visit, so don't hide them.
#HideReferrer del.icio.us

GroupReferrer .pingdom. (Web-based tools)
HideReferrer .pingdom.
GroupReferrer .whois. (Web-based tools)
HideReferrer .whois.
Agents
See also user-agents.org (and Google searches) for reference.
Major crawlers
GroupAgent Googlebot/ Crawler: Google
HideAgent Googlebot/
GroupAgent Googlebot-Image/ Crawler: Google
HideAgent Googlebot-Image/
#Why aren't these Googlebot, though?
GroupAgent GoogleSpider Crawler: Google
HideAgent GoogleSpider
GroupAgent Mediapartners-Google/ Crawler: Google
HideAgent Mediapartners-Google/

GroupAgent msnbot/ Crawler: MSN
HideAgent msnbot/
GroupAgent msnbot-media/ Crawler: MSN
HideAgent msnbot-media/

GroupAgent slurp Crawler: Yahoo
HideAgent slurp
GroupAgent Yahoo-MMCrawler Crawler: Yahoo
HideAgent Yahoo-MMCrawler

GroupAgent Jeeves Crawler: Jeeves
HideAgent Jeeves
Minor crawlers
You'll get endless minor bots. There are lists with dozens of these in webalizer config format, but I figure they're quite outdated, so I just add what I see.
#Big? Small? Not sure.
GroupAgent IlseBot Crawler: minor crawlers
HideAgent IlseBot
GroupAgent iearthworm Crawler: minor crawlers
HideAgent iearthworm
GroupAgent heritrix Crawler: minor crawlers
HideAgent heritrix
GroupAgent larbin Crawler: minor crawlers
HideAgent larbin
GroupAgent MJ12bot Crawler: minor crawlers
HideAgent MJ12bot
GroupAgent Exabot Crawler: minor crawlers
HideAgent Exabot
GroupAgent Nutch Crawler: minor crawlers
HideAgent Nutch
GroupAgent SBIder Crawler: minor crawlers
HideAgent SBIder
GroupAgent VadixBot Crawler: minor crawlers
HideAgent VadixBot
GroupAgent Twiceler Crawler: minor crawlers
HideAgent Twiceler
GroupAgent psbot Crawler: minor crawlers
HideAgent psbot
GroupAgent VoilaBot Crawler: minor crawlers
HideAgent VoilaBot
GroupAgent Yoono Crawler: minor crawlers
HideAgent Yoono
GroupAgent yoono/ Crawler: minor crawlers
HideAgent yoono/
GroupAgent Gigabot Crawler: minor crawlers
HideAgent Gigabot
GroupAgent TurnitinBot Crawler: minor crawlers
HideAgent TurnitinBot
GroupAgent NextGenSearchBot Crawler: minor crawlers
HideAgent NextGenSearchBot
GroupAgent findlinks Crawler: minor crawlers
HideAgent findlinks
GroupAgent MyFamilyBot Crawler: minor crawlers
HideAgent MyFamilyBot
GroupAgent ZyBorg Crawler: minor crawlers
HideAgent ZyBorg
GroupAgent converacrawler/ Crawler: minor crawlers
HideAgent converacrawler/
GroupAgent BecomeBot/ Crawler: minor crawlers
HideAgent BecomeBot/
GroupAgent FurlBot/ Crawler: minor crawlers
HideAgent FurlBot/
GroupAgent Baiduspider Crawler: minor crawlers
HideAgent Baiduspider
GroupAgent lanshanbot Crawler: minor crawlers
HideAgent lanshanbot
GroupAgent OnetSzukaj Crawler: minor crawlers
HideAgent OnetSzukaj
GroupAgent sogou Crawler: minor crawlers
HideAgent sogou

GroupAgent Bloglines Crawler: feed crawlers
HideAgent Bloglines
GroupAgent Feedster Crawler: feed crawlers
HideAgent Feedster
GroupAgent OctBot Crawler: feed crawlers
HideAgent OctBot
Browsers
GroupAgent Gecko/ Browser: Gecko-based (Firefox, Mozilla, etc.)
HideAgent Gecko/
GroupAgent KHTML, Browser: KHTML-based (Safari, Konqueror, etc.)
HideAgent KHTML,
GroupAgent Konqueror/ Browser: KHTML-based (Safari, Konqueror, etc.)
HideAgent Konqueror/
GroupAgent Opera Browser: Opera
HideAgent Opera
GroupAgent MSIE Browser: MSIE
HideAgent MSIE

GroupAgent Sleipnir Niche browsers, command-line browsers
HideAgent Sleipnir
GroupAgent Avant Niche browsers, command-line browsers
HideAgent Avant
GroupAgent Dillo Niche browsers, command-line browsers
HideAgent Dillo
GroupAgent Lynx Niche browsers, command-line browsers
HideAgent Lynx
GroupAgent w3m/ Niche browsers, command-line browsers
HideAgent w3m/
GroupAgent Links Niche browsers, command-line browsers
HideAgent Links
GroupAgent edbrowse Niche browsers, command-line browsers
HideAgent edbrowse
Other
GroupAgent Acrobat Applications and libraries
HideAgent Acrobat
GroupAgent Java Applications and libraries
HideAgent Java
GroupAgent libwww Applications and libraries
HideAgent libwww

GroupAgent Pingdom Web-based tools and playthings
HideAgent Pingdom
GroupAgent webGobbler Web-based tools and playthings
HideAgent webGobbler

GroupAgent Wget/ Downloaders
HideAgent Wget/
GroupAgent curl/ Downloaders
HideAgent curl/
GroupAgent lftp Downloaders
HideAgent lftp
GroupAgent WebCopier Downloaders
HideAgent WebCopier
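Since new bots keep showing up, it can help to list the user agents that none of the GroupAgent lines catch. A minimal Python sketch of that idea follows, with made-up function names. It assumes combined log format and plain case-insensitive substring matching, which only approximates webalizer's own matching (as far as I know that also allows a leading or trailing * wildcard).

```python
import re
from collections import Counter

def agent_patterns(config_lines):
    """Collect the match strings of all active (uncommented) GroupAgent lines."""
    patterns = []
    for line in config_lines:
        parts = line.split(None, 2)  # keyword, match string, optional label
        if len(parts) >= 2 and parts[0] == "GroupAgent":
            patterns.append(parts[1].lower())
    return patterns

# In combined log format the user agent is the last double-quoted field.
AGENT = re.compile(r'"([^"]*)"\s*$')

def unmatched_agents(log_lines, patterns):
    """Count user agents that contain none of the given match strings."""
    counts = Counter()
    for line in log_lines:
        match = AGENT.search(line)
        if match is None:
            continue  # not a combined-format line; skip it
        agent = match.group(1)
        if not any(p in agent.lower() for p in patterns):
            counts[agent] += 1
    return counts
```

Feeding it the config file and an access log, then printing unmatched_agents(...).most_common(20), gives a shortlist of agents that may deserve their own GroupAgent/HideAgent pair.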
Search engines
These are used to extract the search strings from referrers. There are various lists out there you can copy-paste. Note, however, that these things change over time, and various lists have outdated entries, both in the URLs they match and sometimes in the query variable they name.
A minimal start:
SearchEngine google. q=
SearchEngine search.yahoo. p=
SearchEngine search.msn. q=
SearchEngine search.live. q=
SearchEngine search.lycos. query=
SearchEngine search.netscape. search=
SearchEngine search.aol. query=
SearchEngine ask. q=
SearchEngine altavista. q=
SearchEngine hotbot. query=
I don't think it's easy to see false negatives -- that is, to see what engines it's missing and what configuration lines are incorrect.
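One way to get at least a hint is to replay the SearchEngine lines against the referrers yourself. The Python sketch below uses made-up function names and assumes a rule matches when its string occurs anywhere in the referrer URL, which may not be exactly what webalizer does. The COMMON_VARIABLES list is just a guess at typical query variable names.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

def search_rules(config_lines):
    """Return (match string, query variable) pairs from SearchEngine lines."""
    rules = []
    for line in config_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "SearchEngine":
            rules.append((parts[1].lower(), parts[2].rstrip("=")))
    return rules

def search_terms(referrer, rules):
    """Return the search string for the first matching rule, or None."""
    query = parse_qs(urlsplit(referrer).query)
    for pattern, variable in rules:
        if pattern in referrer.lower() and variable in query:
            return query[variable][0]
    return None

# Guessed names of search-like query variables; not taken from webalizer.
COMMON_VARIABLES = ("q", "query", "p", "search", "s")

def possible_misses(referrers, rules):
    """Count hosts whose referrers look like searches but match no rule."""
    hosts = Counter()
    for referrer in referrers:
        if search_terms(referrer, rules) is not None:
            continue  # already covered by a SearchEngine line
        url = urlsplit(referrer)
        query = parse_qs(url.query)
        if any(variable in query for variable in COMMON_VARIABLES):
            hosts[url.netloc] += 1
    return hosts
```

Hosts that show up often in possible_misses are candidates for a new or corrected SearchEngine line. It will also flag plenty of ordinary sites that happen to use a q= or s= parameter, so treat the output as a list to eyeball and not as a verdict.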

