Bash Webstat for Apache Logfiles
I am sure you know the annoying “accept cookies from this site” popups. Indeed, all other available webstat solutions require cookies as well as additional code that needs to be inserted into your webpage. This solution is different. You only need your Apache logs, something that should be available for every website running on the widely used Apache web server, often only via FTP download. Though we at elstel.org also have an online python-CGI/mysql GUI solution for analyzing Apache logs, the solution presented here is the simplest and most basic one. However, it requires downloading the logs first. Nonetheless this can also be seen as an advantage, since as a side effect it leaves you with a log backup for several years, possibly spanning multiple server installations with different log formats.
The solution we are talking about is a simple bash command line script to view the visits of the last day. Page views can be listed by URL or by country. Search engine results are grouped by URL and shown with a count. For a given, possibly wildcarded URL you can view the countries its visitors came from as well as the whole timeline of each visit: the browser in use, the country and of course the pages that were visited in sequence. Visits can be selected not only by a URL they contain but also by country. Downloads and search engine crawls can be viewed separately. Most of the time this information is viewed for a specific day given by date, although an aggregation feature for the most recent log files is also available. Just have a look at the command lines below.
|webstat-bash v1.1||better spider and bot recognition; distinguishes between Google search and Google Ads hits|
|webstat-bash v1.0||first published version|
> alc --at 21/Feb/2014
…
GeoIP Country Edition: AT, Austria 4
GeoIP Country Edition: IN, India 4
GeoIP Country Edition: RU, Russian Federation 5
GeoIP Country Edition: JP, Japan 6
GeoIP Country Edition: FR, France 6
GeoIP Country Edition: UA, Ukraine 7
GeoIP Country Edition: CA, Canada 8
GeoIP Country Edition: CN, China 9
GeoIP Country Edition: DE, Germany 31
GeoIP Country Edition: US, United States 33
sum: 153
On that day 153 distinct IPs visited the site; the list shows how many of these IPs belong to each country.
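To make the notion of "distinct IPs per day" concrete, here is a minimal sketch of how such a count could be obtained with awk. It is not taken from the alc sources; the function name `count_unique_ips` is made up for illustration, and it assumes the ‘I--DCSLRA’ layout where the IP is the first field and the date is the fourth. The country grouping (GeoIP lookup) is left out.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Counts distinct client IPs for one day in a log read from stdin.
# Assumption: 'I--DCSLRA' layout, i.e. field 1 is the IP and field 4 looks
# like "[21/Feb/2014:18:36:36".
count_unique_ips() {
  local day="$1"
  awk -v day="$day" '
    # keep only lines of the requested day and count each IP once
    index($4, "[" day ":") == 1 && !seen[$1]++ { n++ }
    END { print n + 0 }
  '
}

# Usage (file name is an example):
#   count_unique_ips 21/Feb/2014 < access.log
```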
> alc --at 21/Feb/2014 pages
…
8 /qemu/
10 /html5video/Html5VideoScripting.html.en
10 /xchroot/
11 /index.html.en
11 /html5video/FlashVersusHtml5Video.html.en
14 /FilmReviewSamsara.html.de
55 /OS2Warp/InstallUpdate.html
sum: 175 pages 172/3 - pages/rss
The sum splits into 172 page accesses and 3 RSS accesses (172+3=175); ‘rss’ covers ViewRSS.php and elstel.rss. These figures count page accesses rather than IPs: a page can be retrieved multiple times from the same IP.
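The difference between access counts and IP counts can be illustrated with a small sketch that tallies requests per URL for one day. Again this is only an assumption about how such a list could be produced, not the code alc uses: `page_counts` is a hypothetical name, the ‘I--DCSLRA’ layout is assumed (URL in field 7), and every requested path is counted, whereas alc apparently distinguishes pages, RSS and downloads.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Prints "count url" for every URL requested on the given day,
# sorted by ascending count like the listing above.
# Assumption: 'I--DCSLRA' layout, so field 4 holds the date and field 7 the URL.
page_counts() {
  local day="$1"
  awk -v day="$day" '
    index($4, "[" day ":") == 1 { hits[$7]++ }   # one hit per request line
    END { for (url in hits) print hits[url], url }
  ' | LC_ALL=C sort -k1,1n -k2,2
}

# Usage (file name is an example):
#   page_counts 21/Feb/2014 < access.log
```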
> alc --at 21/Feb/2014 seo-landing
…
4 google /html5video/Html5VideoScripting.html.en
6 google /html5video/FlashVersusHtml5Video.html.en
6 google /qemu/
7 google /xchroot/
12 http://de.wikipedia.org/wiki/Samsara_ /FilmReviewSamsara.html.de
13 google /OS2Warp/InstallUpdate.html
sum: 100

> alc --at 21/Feb/2014 referers
…
3 http://darkbooks.org/
12 http://de.wikipedia.org/wiki/Samsara_(2011)
44 google
sum: 100
Result counts by search engine and referer.
> alc --at 21/Feb/2014 --page elstel.rss
GeoIP Country Edition: DE, Germany 1
GeoIP Country Edition: AT, Austria 1
GeoIP Country Edition: EU, Europe 2
sum: 4
If you are interested in the number of distinct IPs (visitors) that requested a certain page rather than in raw access counts, this command gives you what you want, grouped by country.
> alc --at 21/Feb/2014 --page xchroot visits
…
22.214.171.124 (GeoIP Country Edition: EU, Europe) Mozilla/5.0_(Windows_NT_6.1;_WOW64;_Trident/7.0;_rv:11.0)_like_Gecko::
0 21/Feb/2014:18:36:36 /xchroot/ 200 << http://www.bing.com/search?q=x11+in+chroot&go=&qs=ds&form=QBRE
1 21/Feb/2014:18:36:36 /elstel.rss 200 << http://www.elstel.org/xchroot/

> alc --at 21/Feb/2014 --from EU visits
126.96.36.199 (GeoIP Country Edition: EU, Europe) Mozilla/5.0_(Windows_NT_6.1;_WOW64;_Trident/7.0;_rv:11.0)_like_Gecko::
0 21/Feb/2014:18:36:36 /xchroot/ 200 << http://www.bing.com/search?q=x11+in+chroot&go=&qs=ds&form=QBRE
1 21/Feb/2014:18:36:36 /elstel.rss 200 << http://www.elstel.org/xchroot/
188.8.131.52 (GeoIP Country Edition: EU, Europe) Mozilla/5.0_(Windows_NT_6.1;_WOW64)_AppleWebKit/537.36_(KHTML,_like_Gecko)_Chrome/30.0.1599.101_Safari/537.36::
0 21/Feb/2014:07:39:04 /OS2Warp/InstallUpdate.html 200 << http://de.wikipedia.org/wiki/OS/2
1 21/Feb/2014:07:39:04 /elstel.rss 200 << http://www.elstel.org/OS2Warp/InstallUpdate.html
Finally, the visits command lists the individual visits that match the given criteria. A visit consists of all page requests from the same IP and browser on that day. Different browsers (agents) from the same IP make up different visits.
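The definition of a visit (same IP and same agent on one day) can be sketched in a few lines of awk. This is a simplified illustration under stated assumptions, not the implementation found in alc: `count_visits` is a hypothetical name, the ‘I--DCSLRA’ layout is assumed, and agents containing escaped double quotes are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Counts visits on a given day, where a visit is a distinct (IP, agent) pair.
# Assumption: 'I--DCSLRA' layout. Splitting on double quotes yields
#   $1 = IP, idents and date   $2 = request   $4 = referer   $6 = agent
count_visits() {
  local day="$1"
  awk -F'"' -v day="$day" '
    {
      split($1, head, " ")                 # head[1] = IP, head[4] = "[date:time"
      if (index(head[4], "[" day ":") != 1) next
      key = head[1] SUBSEP $6              # IP plus agent identifies a visit
      if (!(key in seen)) { seen[key] = 1; n++ }
    }
    END { print n + 0 }
  '
}

# Usage (file name is an example):
#   count_visits 21/Feb/2014 < access.log
```

Two requests from the same IP with different agents are counted as two visits here, which matches the definition above.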
> alc --at 06/Jan/2020 downloads
/database/dbschemacmd-v1.1.tar.gz Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36 200
/database/dbschemacmd-v1.1.tar.gz Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36 200
/confinedrv-v1.7.7 Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0 301
There are some more commands for alc which can be shown by alc --help.
Before you can use our solution for your own log files you need to specify the log format. For each location (in this example ‘/home/weblog/dotplex/’ and ‘/home/weblog/revido/old/’) you specify the prefix of the logfile names (‘access.log’ or ‘access_log_’), their sort order, their format and possibly a URL-match token if accesses for multiple sites end up in the same logfiles:
#format: M-domain, I-IP, D-date, C-command line with GET/HEAD, S-http-status (200=ok),
#  L-content/file-length, R-http-referer (site the user visited before this one),
#  A-agent (browser id), ‘-’ skipped
logdirs=(
  "/home/weblog/dotplex/ access.log .+2. MI--DCSLRA https?://(www.)?elstel.(org|com)"
  "/home/weblog/revido/old/ access_log_ _-2_w-1---1. I--DCSLRAM https?://(www.)?elstel.(org|com)"
);
I will give you an example for an ‘MI--DCSLRA’ log entry:
elstel.com:443 184.108.40.206 - - [03/Jan/2021:02:59:55 +0100] "GET /robots.txt HTTP/1.1" 200 3614 "-" "Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)"
… and here an example for an ‘I--DCSLRAM’ log entry:
220.127.116.11 - - [09/Oct/2012:02:29:04 +0200] "GET / HTTP/1.1" 200 1361 "-" "check_http/v1.4.15 (nagios-plugins 1.4.15)" www.elstel.com
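To show how the letters of a format string line up with the fields of a log entry, here is a small bash sketch. It is an illustration only: `parse_entry` is a hypothetical function, and the tokenizing rules (a bracketed date, a double-quoted field or a bare word each count as one token) are my assumption about the log layout, not a description of how alc parses internally.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Splits one log line into tokens and labels each token with the letter at
# the same position of a format string such as 'MI--DCSLRA'.
# Tokens labelled '-' are skipped.
parse_entry() {
  local format="$1" rest="$2" i=0 letter
  local re_bracket='^\[([^]]*)\] ?(.*)$'   # [date +zone]
  local re_quoted='^"([^"]*)" ?(.*)$'      # "quoted field"
  local re_plain='^([^ ]+) ?(.*)$'         # bare word
  while [[ -n "$rest" && $i -lt ${#format} ]]; do
    if [[ $rest =~ $re_bracket || $rest =~ $re_quoted || $rest =~ $re_plain ]]; then
      letter="${format:i:1}"
      if [[ $letter != "-" ]]; then
        printf '%s=%s\n' "$letter" "${BASH_REMATCH[1]}"
      fi
      rest="${BASH_REMATCH[2]}"            # continue with the remainder
      i=$((i + 1))
    else
      break
    fi
  done
}

# Usage with the second layout (IP first, domain last); the IP is a placeholder:
#   parse_entry 'I--DCSLRAM' '192.0.2.1 - - [09/Oct/2012:02:29:04 +0200] "GET / HTTP/1.1" 200 1361 "-" "check_http/v1.4.15" www.elstel.com'
```

Each output line has the form letter=value, so the same function handles both layouts simply by passing a different format string.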
The sort order field for ‘access.log.NN[.gz]’ is ‘.+2.’: take the field that starts at the second dot (‘.+2’), sort it in ascending order (‘+’) and read it until another dot is encountered (the trailing ‘.’). Ascending order effectively means that access.log.1, which holds the more recent entries, is read before access.log.2.gz. A file name without such a field, i.e. ‘access.log’, is the most recent logfile.
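The resulting read order can be reproduced with standard tools. The following sketch only mimics the effect of ‘.+2.’ for this particular naming scheme; `order_logs` is a hypothetical name and the approach (awk plus sort) is my own, not how alc evaluates the sort order field.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Reads log file names on stdin and prints them most recent first,
# assuming the scheme access.log, access.log.1, access.log.2.gz, ...
order_logs() {
  # The number after the second dot becomes the sort key;
  # a name without that field (plain access.log) gets key 0.
  awk -F. '{ n = ($3 == "" ? 0 : $3 + 0); print n "\t" $0 }' |
    sort -k1,1n |
    cut -f2-
}

# Usage:
#   ls access.log* | order_logs
```

Sorting numerically matters here: a plain alphabetical sort would put access.log.10.gz before access.log.2.gz.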
We also have a more difficult example for logfile ordering:
access_log_2012_w53-0.gz
access_log_2013_w10-0.gz
access_log_2013_w11-0.gz
access_log_2013_w12-0.gz
Here we need to read until the second ‘_’ and take the field up to the next ‘_’ in descending order so that we start with the most recent log (‘_-2_’). Besides the year there is another field for the week, starting at the first ‘w’ and ending at ‘-’ (‘w-1-’). The last field seems to be always zero here, but it is also considered for the sort order by taking the field from the first ‘-’ until the next ‘.’ (‘--1.’). All three fields are sorted in descending order (‘-’).
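For comparison, the same ordering can be obtained by hand with sed and sort. This is a sketch for exactly the naming scheme shown above, with the hypothetical name `order_weekly_logs`; it does not interpret the ‘_-2_w-1---1.’ notation itself, it only hard-codes the three keys that notation describes.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of alc.
# Reads names like access_log_2013_w12-0.gz on stdin and prints them most
# recent first, sorting by year, then week, then the trailing counter,
# all in descending order.
order_weekly_logs() {
  # Prefix every name with its three numeric keys, sort, then drop the keys.
  sed -E 's/^(.*_([0-9]+)_w([0-9]+)-([0-9]+)\..*)$/\2 \3 \4 \1/' |
    sort -k1,1nr -k2,2nr -k3,3nr |
    cut -d' ' -f4-
}

# Usage:
#   ls access_log_* | order_weekly_logs
```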
Adjust the logdirs definition directly in the source of the alc script with an editor.
Finally you need to be in the right logfile directory as specified in the logdirs-line when you invoke alc.