This is a quick review of free tools for web analytics / stats analysis / weblog analysis. Follow-up posts will go into more detail on non-web tracking and extend this into game development, but this post is purely about web stuff.
Where to start with analysing data?
Analysing data about what your consumers are doing is invaluable to any company looking to optimize its sales and profitability, especially so for online games. But today (2008) there is little or nothing in the way of credible products for doing analysis suitable for games. Many companies have built proprietary systems, with all the costs and horrors that come with that. But that’s way out of the reach of startups and games studios.
So, where can we start? Well, website analysis is a mass-market use for these tools, and it has a lot in common with what games need, so there should be a good range of free software, and open-source stuff too.
Free webstats: Historical Perspective
(not *just* boring history :) – this briefly explains something fundamental about the different types and complexities of stats analysis tools that will be useful background knowledge for future posts)
A little over ten years ago, when I was still a student, and didn’t want to spend money on anything if I didn’t have to, I wrote my own web log analyser, that would run over my Apache and/or IIS logfiles and tell me lots of fascinating things about who was visiting my web site.
There were commercial alternatives at the time, varying from cheap ones ($50-$150) that lacked basic features I quickly wrote for myself, to expensive ones ($500-$1000) that had everything I could ever want and a lot more. I soon gave up maintaining my proprietary software (lack of time, real job – and the assumption that good analyser software would come down in price). I started using the open-source analysers, because they were “almost good enough” for very basic usage, and – in theory – they would improve over time.
The cheap stuff was mostly better than the free stuff just because it had pretty graphs and convenient user interfaces for “1st order” data, i.e. anything that could be determined merely by looking directly at raw data, e.g.
- total number of visitors
- which countries visitors came from
- sites that linked to yours
The expensive stuff was mostly better than the cheap stuff for having “2nd order” data, i.e. things that could only be calculated by looking at raw data BUT ALSO by using some pre-calculated 1st order data, e.g.
- percentage of visitors from each site (requires you to count number of visitors from each site AND ALSO to calculate the total number of visitors who came from any site)
…and also “higher order” data, i.e. things that could only be calculated by using 1st order and/or 2nd order data and combining it in new ways, e.g.
- number of visitors from each site who did NOT also come in from another site (requires you to first generate a list of visitors from each site, then start again from start of the logs checking for each one whether they ALSO came in from a site that was NOT the original one)
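To make the three “orders” concrete, here is a minimal Python sketch. The record layout (a visitor ID plus the site they arrived from) and all the names are made up for illustration; a real analyser would first have to parse these out of the logfiles.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (visitor_id, referring_site).
visits = [
    ("alice", "news.example"),
    ("bob", "news.example"),
    ("alice", "blog.example"),
    ("carol", "blog.example"),
    ("dave", "news.example"),
]

def first_order(visits):
    """1st order: unique visitors per referring site, straight from raw data."""
    return Counter(site for _visitor, site in set(visits))

def second_order(visits):
    """2nd order: percentages, which need the per-site counts AND the total."""
    counts = first_order(visits)
    total = len({visitor for visitor, _site in visits})
    # A visitor who arrived from two sites counts towards both percentages.
    return {site: 100.0 * n / total for site, n in counts.items()}

def higher_order(visits):
    """Higher order: visitors per site who never came in from any other site."""
    sites_by_visitor = {}
    for visitor, site in visits:
        sites_by_visitor.setdefault(visitor, set()).add(site)
    exclusive = Counter()
    for sites in sites_by_visitor.values():
        if len(sites) == 1:
            exclusive[next(iter(sites))] += 1
    return exclusive

if __name__ == "__main__":
    print(first_order(visits))
    print(second_order(visits))
    print(higher_order(visits))
```

Note how each step leans on the one before it: the percentages reuse the per-site counts, and the “exclusive” figure needs a per-visitor view of the whole log before it can count anything.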
The most famous, and most desired, piece of higher-order data was “find out where each user is going, the sequence of pages they click through whilst on the site, from the first page they visit to the last page”. None of the free stuff did that, and most of the cheap stuff didn’t do it, or did it very badly.
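A rough Python sketch of that click-path idea, assuming the log has already been parsed into (timestamp, visitor, page) records. Those field names are invented here, and deciding what counts as “one visitor” (IP address, cookie, user agent) is a problem this sketch simply skips.

```python
from collections import Counter, defaultdict

# Hypothetical parsed log records: (timestamp, visitor_id, page).
hits = [
    (3, "bob", "/"),
    (1, "alice", "/"),
    (2, "alice", "/games"),
    (5, "alice", "/games/buy"),
    (4, "bob", "/about"),
]

def click_paths(hits):
    """Return each visitor's pages in the order they were requested."""
    paths = defaultdict(list)
    for _ts, visitor, page in sorted(hits, key=lambda hit: hit[0]):
        paths[visitor].append(page)
    return dict(paths)

def common_paths(hits):
    """Count how many visitors followed each complete path."""
    return Counter(tuple(path) for path in click_paths(hits).values())

if __name__ == "__main__":
    for visitor, path in click_paths(hits).items():
        print(visitor, " -> ".join(path))
```

A real tool would probably also cut each visitor’s hits into separate sessions (for example after a long gap with no requests), which is left out here.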
So … what is free today?
10 years later, AWStats is not noticeably better than it was then, despite being actively maintained. It’s picked up some – IMHO – relatively frivolous features and still hasn’t gained the most basic of analysis features from the commercial products of 10 years ago: it still can’t/won’t track for you the progress of a user through the site.
Analog and Webalizer, the two other free analysers I tried around the time I stopped doing my own (both of which were vastly inferior even to AWStats), don’t seem to have gone anywhere in that time either, although someone has forked Webalizer to make a slightly improved version.
Has *no-one* been playing with the source of these tools and adding basic features? I know a few sites, like the excellent InternetOfficer page on AWStats, have been adding and sharing basic hacks to vastly improve it, but these really just scratch the surface of what is needed. (If you’re using AWStats and you haven’t added the IO stuff, I highly recommend looking at the hacks and cherrypicking some you like – although it’s a real PITA to add more than one hack because of the stupid config system used by AWStats: you have to remember to manually increment a unique ID for each module you add. ARGH!)
For the record: I have been using AWStats continuously for the last 6 or 7 years, and have hacked a lot of stuff to work with it. *I* don’t have problems with it, but it’s disappointingly lacking in areas where I need more.
So, I thought it was time to have a look around at what else is out there.
When Google bought Urchin, I thought maybe this would mean we wouldn’t need to rely on AWStats any more. The truth turned out to be a bit different – Google Analytics is, in many ways, as “almost but not quite enough” as AWStats. In particular, getting meaningful Referrer analysis out of GA is a nightmare (I have no idea why we’re still having to hack in custom regexps just to get one of the most fundamental pieces of info out of GA – and note that the manual regexp additions still don’t work for a lot of sites: I’ve sometimes set it up on a GA site and nothing happens, for no apparent reason).
GA is awesome for some things – like marketing-centric tracking – and is adaptable (as above) – but it’s still missing so much that it’s no surprise to me that other alternatives continue to be heavily used. Apart from the many things you need to make custom strings to track (like the referrers above), it:
- is several days behind “live” data (at least in Europe, it’s nearly always more than 24 hours behind)
- over-simplifies reports (very little data is provided for most reports)
- provides no easy way to combine output of one report with output of another – no mashups allowed! – cf. Yahoo Pipes for an example of what GA could trivially provide to the user to become totally awesome
Now, if there’s a chance GA might be “good enough” for you, then I suggest you take that route and run with it – GA “can do” a lot (if you muck around with it a lot), it’s owned by Google, and it’s very well-known. You can google for a lot of tips on using it, and I suggest reading things like Andrew Chen’s blog which has a lot of tips on what you should be looking for when doing your web metrics. I’ll be coming back to the topic of “what you should be looking for” in another post – but first I want to get the basic state of tools out of the way.
Free Alternatives – a future?
What’s on the scene today? Here are six other free webstats analysers I found, in addition to the market-leader (AWStats) and the aforementioned Analog and Webalizer (which you really shouldn’t bother looking at).
Microsoft’s adCenter Analytics
This seems to be being pitched as a direct competitor to GA, right down to similar naming and presentation (as well as being free to use, and requiring the creation of a Microsoft account to be eligible for using it).
I tried signing up, but then I got this very disappointing response:
Thank you for registering for the Microsoft adCenter AnalyticsBeta project.
You will receive your adCenter Analytics invitation as capacity allows.
This is pretty fricking stupid: if you’re competing against Google, you shouldn’t go around offering users access to your program and then getting all high and mighty about how you might deign to allow them to use it at some non-specified future time of your choosing.
So, for now: Microsoft’s product is effectively vaporware. Sigh.
ClickHeat
“ClickHeat is a visual heatmap of clicks on a HTML page, showing hot and cold click zones.” – i.e. it tracks exactly where the user clicked the mouse on your page, and then shows you an aggregate of “all clicks by all people”, with places that were clicked more often showing up in a lighter colour than places clicked less often. Heatmaps are a great visualization tool for aggregate data like this.
They have a nice live demo that you can try out to see what happens on their site – use the username “demo” and password “demo”. It defaults to showing clicks from “today”, which for their site is too few to be interesting, but you can just click on the “month” button in the navbar at the top to see an interesting map of their site.
In particular, the way you can change the transparency level in real time is awesome – if a map gets too bright in one area, and you can’t see what people were clicking on, change the transparency to get a better look.
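The aggregation behind a click heatmap can be sketched in a few lines of Python. This is not ClickHeat’s actual code, just the general idea under my own assumptions: clicks arrive as (x, y) pixel coordinates, get bucketed into a grid of square cells (the 50-pixel cell size is arbitrary), and each cell’s count is scaled so the busiest cell is drawn lightest.

```python
from collections import Counter

def bin_clicks(clicks, cell=50):
    """Count clicks per (column, row) cell of a grid of cell x cell pixels."""
    return Counter((x // cell, y // cell) for x, y in clicks)

def intensity(grid):
    """Scale counts to the range 0.0-1.0; the busiest cell gets 1.0."""
    peak = max(grid.values(), default=0)
    if peak == 0:
        return {}
    return {cell: count / peak for cell, count in grid.items()}

# Hypothetical click positions from several visitors, in page pixels.
clicks = [(10, 12), (40, 30), (60, 20), (12, 45), (300, 410)]

if __name__ == "__main__":
    for cell, level in sorted(intensity(bin_clicks(clicks)).items()):
        print(cell, round(level, 2))
```

Rendering is then just a matter of painting each cell with a colour (and transparency) picked from its 0.0-1.0 level.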
Works fine, but … this is nothing more or less than a simplified view of the AWStats core data – it’s got less data than AWStats but makes it easier to read all in one place.
This is a first-order analyser only. That makes it a complete waste of time, IMHO. Any first-order stats I want to track I can do *from the command line* in Linux by typing something about this long:
cat access*.log | cut -d" " -f7,9 | sort | uniq -c
…which looks obscure and obtuse, but you can google to find premade ones that do what you want, and then you only need to change the numbers 7 and 9 in there to change what data summaries are provided. And when you use Linux regularly, you can remember the whole command line off the top of your head easily, bung it in a script, and you’re done.
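If the pipeline is too cryptic, roughly the same summary can be written in Python. This is a sketch that assumes Apache-style common/combined log lines, where splitting on single spaces makes field 7 the requested path and field 9 the HTTP status code; the function name and sample lines are invented for illustration.

```python
from collections import Counter

def count_fields(lines, fields=(7, 9)):
    """Count distinct combinations of the given 1-based, space-separated fields."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) >= max(fields):  # skip lines that are too short
            counts[tuple(parts[i - 1] for i in fields)] += 1
    return counts

# Hypothetical log lines in Apache common log format.
sample = [
    '10.0.0.1 - - [01/Jan/2008:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2008:10:00:05 +0000] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.1 - - [01/Jan/2008:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    for key, n in count_fields(sample).most_common():
        print(n, *key)
```

As with the shell version, changing `fields` changes which summary you get; for instance `fields=(1,)` would count requests per client address in this format.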
Roxr Software’s Clicky Analytics
EDIT: DECEIVED! This one isn’t free at all; it’s like a bunch of the commercial ones today that “pretend” to be free, but have absurdly low limits on the free usage; if my niche blog is enough to go over their daily limits (hint: yes, it does), then the service is clearly a waste of time
This one looks really good. The only problem I can see so far is that it won’t work for sites with “more than 100,000 daily page views” – that’s not going to be a problem for anyone here; when your site gets that popular, you should have the spare manpower to build/spare money to buy whatever you need.
I’ve only just started using this, so I can’t comment on it yet. But I do want to point out they are nice enough to provide a WordPress plugin for you that automatically adds the tracking stuff to each page as required, so that makes life easier for anyone wanting to track their WP blogs.
Reinvigorate used to be a web stats analyser, and I know a few people who used to swear by it. But apparently not any more: they’ve replaced it with a desktop application that is “powered by REinvigorate” but appears to be a lot less what we want here than the old Reinvigorate stats analysis.
There appears to be no way to get access to the *actual* Reinvigorate, the product we wanted to use; all links just go back to the download site for the desktop application instead. Oh, well.
Looks promising – but (like the Microsoft product) it’s currently an invite-only beta, with a low limit on the number of daily pageviews, so although it *could become* totally awesome, for now it’s a case of “may work for you – IF you can get into the beta – and IF the final product doesn’t turn out too expensive”. Some big unknowns there. But worth a look, IMHO.
If you need something doing properly, you gotta do it yourself?
So, although this started off as a review of free web tools, now that I’ve got this far I’m considering digging out the source code for my old proprietary web server log analyser and starting to use it again. Maybe even share it with other people if anyone’s interested.
It was very fast (at least for some uses it was much faster than AWStats), although I think in the latest version I was doing some slightly nasty and interesting-but-silly things with using the local file system as a dynamic database – not flat files, but on-disk hashes. The point was to be able to process arbitrarily complex relationships (“show all users who did X after doing Y more than twice in the previous week, but only if they used Internet Explorer on their first visit”) over large files in very low memory (hey, back then my server had about 64 MB RAM; memory was at a premium!).
This time, I think it would be interesting to do the whole thing in SQL instead, and run against an in-memory SQL DB like HSQLDB.
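As a rough idea of what the SQL route might look like, here is a sketch of that kind of “did X after doing Y more than twice in the previous week, but only if their first visit was in a given browser” query. Note the assumptions: it uses Python’s built-in sqlite3 with an in-memory database as a stand-in for HSQLDB (which is a Java library), and the `events` table, its columns, and the day-number timestamps are all invented for the example.

```python
import sqlite3

QUERY = """
SELECT DISTINCT x.visitor
FROM events AS x
WHERE x.action = :did_x
  AND (SELECT COUNT(*) FROM events AS y
       WHERE y.visitor = x.visitor
         AND y.action = :did_y
         AND y.day < x.day
         AND y.day >= x.day - :window) > 2
  AND (SELECT f.browser FROM events AS f
       WHERE f.visitor = x.visitor
       ORDER BY f.day LIMIT 1) = :browser
ORDER BY x.visitor
"""

def matching_visitors(rows, did_x="X", did_y="Y", browser="IE", window=7):
    """rows: (visitor, day, action, browser) tuples; day is a plain day number."""
    db = sqlite3.connect(":memory:")  # whole database lives in RAM
    db.execute(
        "CREATE TABLE events (visitor TEXT, day INTEGER, action TEXT, browser TEXT)"
    )
    db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    params = {"did_x": did_x, "did_y": did_y, "browser": browser, "window": window}
    found = [row[0] for row in db.execute(QUERY, params)]
    db.close()
    return found

# Hypothetical events: only alice satisfies all three conditions.
rows = [
    ("alice", 1, "Y", "IE"), ("alice", 2, "Y", "IE"),
    ("alice", 3, "Y", "Firefox"), ("alice", 5, "X", "Firefox"),
    ("bob", 1, "Y", "Firefox"), ("bob", 2, "Y", "IE"),
    ("bob", 3, "Y", "IE"), ("bob", 4, "X", "IE"),
    ("carol", 1, "Y", "IE"), ("carol", 2, "Y", "IE"), ("carol", 4, "X", "IE"),
    ("dave", 1, "Y", "IE"), ("dave", 2, "Y", "IE"),
    ("dave", 3, "Y", "IE"), ("dave", 20, "X", "IE"),
]

if __name__ == "__main__":
    print(matching_visitors(rows))
```

The appeal is that each extra condition is just another clause in the query rather than another pass over the logs; whether it stays fast on big logfiles is a separate question that this sketch doesn’t answer.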
(really, though, I’m hoping that this absurd suggestion – that I might write a log analyser myself :) – will poke at least one person into pointing out how ignorant and unobservant I am for not noticing some open-source tool out there already which rocks and does the few things that GA doesn’t :))
For another time, I want to cover some of this:
- How is this useful for game developers (apart from the obvious)?
- What other options are there for people doing online games?
- If you’re going to roll your own metrics for games development, how should you do it?