Data is valuable. Nobody holds data in higher esteem than data miners and web analysts, but there is a world of difference between data and information.
Take the famous (to devotees of data mining) and much admired (by devotees of data mining) Walmart customer basket database. The introduction of barcode scanners meant that each collection of products bought by a Walmart customer could be recorded, and Walmart did just that. Back in the early years of the new century this database was probably the biggest in the world. It ran into terabytes back when that was an impressive achievement (even to devotees of data mining). Depending on how you reckon 1 TB, it’s either 1,000,000,000,000 or 1,099,511,627,776 bytes. The good folks at the University of California, Berkeley, reckon that much data amounts to 50,000 trees’ worth of paper, a good-sized forest. To put it another way, Dr Jess’s PhD thesis weighed in at 2,464 KB, or one 405,844th of a terabyte.
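If you’d like to check the sums yourself, here’s the arithmetic in a few lines of Python. The thesis size is the figure quoted above; the two values of a terabyte are the standard decimal (SI) and binary reckonings, and the kilobyte is taken as 1,000 bytes to match the decimal figure:

```python
# Sanity-check the terabyte arithmetic from the text.
decimal_tb = 10**12   # SI reckoning: 1,000,000,000,000 bytes
binary_tb = 2**40     # binary reckoning: 1,099,511,627,776 bytes

thesis_bytes = 2464 * 1000  # Dr Jess's thesis: 2,464 KB (decimal kilobytes)

print(f"Decimal TB: {decimal_tb:,} bytes")
print(f"Binary TB:  {binary_tb:,} bytes")
print(f"Theses per decimal TB: {decimal_tb / thesis_bytes:,.0f}")  # ~405,844
```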
Cute arithmetic aside, a TB is a lot of data, and there are plenty of databases of that size around today. Many of the server logs we deal with are well into the gigabytes, or larger. Wading through all those numbers is the job of data mining algorithms and not people, and that’s as it should be.
It’s the job of web analysts to sort through the data spat out by programs like Google Analytics and Webalizer, sometimes using algorithms and software tools, and sometimes their own judgment.
The task isn’t just to build a giant database; it’s to convert that data into usable information. Walmart doesn’t find out whether coffee and sugar are bought together by having someone read through the raw data, and nobody trying to make money from a website needs to wade through masses of facts and figures to sieve out what’s important. That’s our job.
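To make that concrete, here’s a minimal sketch of the kind of counting an algorithm does instead of a person: tallying how often pairs of items turn up in the same basket. The baskets and item names here are invented for illustration; a real basket database would be queried, not held in a Python list:

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets; each set is the items from one checkout.
baskets = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"bread", "milk"},
    {"coffee", "milk"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: the fraction of all baskets containing a given pair.
for pair, count in pair_counts.most_common(3):
    print(pair, f"support = {count / len(baskets):.2f}")
```

Scaled up to millions of baskets, this same counting is what association-rule algorithms automate; no human ever reads the rows.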
We know what to look for and how to extract information from data, and we also know how to present that information in an easily accessible form, which is half the battle when dealing with large databases.
