The difference between data and information

Data is valuable. Nobody holds data in higher esteem than data miners and web analysts, but there is a world of difference between data and information.

Take the famous (to devotees of data mining) and much admired (by the same devotees) Walmart customer basket database. The introduction of barcode scanners meant that each collection of products bought by a Walmart customer could be recorded, and recorded they were. Back in the early years of the new century this database was probably the biggest in the world. It ran into terabytes back when that was an impressive achievement (even to devotees of data mining). Depending on how you reckon 1 TB, it's either 1,000,000,000,000 or 1,099,511,627,776 bytes. The good folks at the University of California, Berkeley, reckon that amount of data at 50,000 trees, or a good-sized forest worth of paper. To put it another way, Dr Jess' PhD thesis weighed in at 2,464 KB, or one 405,844th of a terabyte.
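The byte-counting above is easy to verify. A few lines of Python (assuming the decimal reading of KB and TB used in this post) reproduce both terabyte figures and the thesis fraction:

```python
# Quick check of the terabyte arithmetic above. Whether 1 TB means 10**12
# bytes (decimal) or 2**40 bytes (binary) depends on which convention you
# adopt; both figures from the post are reproduced here.
decimal_tb = 10**12   # 1,000,000,000,000 bytes
binary_tb = 2**40     # 1,099,511,627,776 bytes

# Dr Jess' thesis: 2464 KB, taking 1 KB = 1000 bytes to match the post
thesis_bytes = 2464 * 1000

print(decimal_tb)                  # 1000000000000
print(binary_tb)                   # 1099511627776
print(decimal_tb // thesis_bytes)  # 405844 -- "one 405,844th of a terabyte"
```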

Cute arithmetic aside, a TB is a lot of data, and there are plenty of databases of that size around today. Many of the server logs we deal with run well into the gigabytes or beyond. Wading through all those numbers is the job of data mining algorithms, not people, and that's as it should be.

It’s the job of web analysts to sort through the data spat out by programs like Google Analytics and Webalizer, sometimes using algorithms and software tools and sometimes their own judgment.

The task isn’t just to build a giant database; it’s to convert that data into usable information. Walmart doesn’t find out whether coffee and sugar are bought together by reading through their data, and nobody trying to make money from a website needs to wade through masses of facts and figures and sieve out what’s important. That’s our job.

We know what to look for and how to extract information from data, and we also know how to present that information in an easily accessible form, which is half the battle when dealing with large databases.


The significance of significant figures

At this point in time, the use of significant figures is almost exclusively confined to scientific disciplines. That’s a pity, because these days so many businesses are relying on analytics to provide the basis for serious decisions, and the figures they have are rarely as accurate as they might appear.

The significant figures (or significant digits) of any number are those that contribute meaningfully once uncertainty is taken into account. At DrJess we like to get stuck in and really investigate uncertainty, but the first step is to realise it’s there.

Say Google Analytics reports that the drjess.com/blog site has seen 1,621,483 unique visitors in the last month. A touch on the generous side, but in thought experiments we are allowed to be optimistic. Neither the Goog nor almost any other free web analytics suite has much to say on the subject of accuracy, but in any halfway scientific context, presenting a number with that many significant figures implies very high confidence in its accuracy, percentage-wise.

Web analytics pros know that Google Analytics and its tag-based friends tend to undercount traffic, often by rather more than 10%. More on the reasons why in another post, but let’s be generous and assume a maximum 10% error for the sake of a simple thought experiment. Applying that to my imaginary count puts the true number of unique visitors somewhere between 1,621,483 and roughly 1,783,631.

If I had to pick a single number to represent that range, it would be 1,700,000, not 1,621,483. Not only is it probably more accurate, it gives anyone looking at it a much better indication of what the uncertainty is.
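The thought experiment above can be sketched in a few lines of Python: take the hypothetical measured count, apply the assumed maximum 10% undercount, and round the midpoint of the resulting range to two significant figures.

```python
# A sketch of the thought experiment above: the hypothetical measured count,
# an assumed maximum 10% undercount, and a single representative figure
# rounded to two significant figures.
from math import floor, log10

def round_sig(x, figs):
    """Round x to the given number of significant figures."""
    return round(x, -int(floor(log10(abs(x)))) + (figs - 1))

measured = 1_621_483                     # what the analytics program reports
max_undercount = 0.10                    # generous assumption from the post

low = measured                           # 1,621,483
high = measured * (1 + max_undercount)   # about 1,783,631
midpoint = (low + high) / 2              # about 1,702,557

print(int(round_sig(midpoint, 2)))       # 1700000
```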

Now, if I were going to base some serious and potentially very expensive business decisions on analytics numbers, I’d want to know just how good my figures were. So would most people who commission analytics, if it occurred to them to wonder whether precision might be lacking in the first place. With the over-precise numbers spat out by most analytics programs, the question is rarely posed unless inconsistencies crop up.

I’m not blaming Google. The simple truth is that given the choice between a web analytics program that spits out 1,700,000 and one that spits out 1,621,483, the overwhelming majority of users will perceive the latter as more accurate.

Legend states that the first survey of Mount Everest measured the height as exactly 29,000 ft. It was reported as 29,002 ft, supposedly because the surveyor didn’t think anyone would believe the rounder figure had a reasonable degree of precision.

That was back in 1856. Unfortunately, it seems we’re still having this kind of problem. The only solution is to state the uncertainty clearly: bring it out of hiding and discuss error, accuracy, and precision before making decisions.


Welcome to Dr Jess

Welcome to DrJess.com, the new home of JM Spate Web Analytics. Our new site will have a lot more functionality and will eventually host more content than the previous one, and it already has greater bandwidth. We’ll be moving content over to this site over the next few months.