So Much Data, So Little Time
In a microsecond economy, most data is useful only in the first few milliseconds, or at most the first few hours, after it is created. But given the way the industry is collecting data, or more accurately hoarding it, you'd think its value lasted a lifetime.
Yes, storage costs are falling, and selecting data to delete is no easy task, especially for unstructured and unclassified sets. And the fear of deleting something that could one day be useful will always be a concern. But does that give firms the go-ahead to be data hoarders?
"There's debate about the industry collecting data over time, and how much of that long-term tail is useful," explains Dane Atkinson, CEO of SumAll, a NYC-based data analytics startup. "People think of data as an endless repository, but most of the data's value only lasts for the five seconds after it's created."
Atkinson says that in the long tail, what matters is how much data can be economized given the cost of storage. Storage costs may be dropping, but hosting fees aren't negligible, and firms still have to pay to keep their history.
He says firms imagine they will eventually reach an inflection point, when someone or some tool comes along to leverage all the data and generate deep insights, ones that ultimately justify the warehousing costs.
"They want infinite data sets so that one day you can ask any question. But it's impractical and rarely used correctly." He argues it's better to sit down and come up with specific questions based on what you need to know later (rather than "just in case") then narrow down the data sets to relevant indexes that tally the transactions.
How does a company reverse course on data hoarding? Given data's steep drop in value, Atkinson suggests adjusting the granularity as a good place to start. While some firms may see value in storing transactions in a minute-by-minute, hourly, or daily log, other data sets are most sensibly rolled up into a weekly or monthly metric, especially when looking at a ten-year-plus timeframe.
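To make the granularity trade-off concrete, here is a minimal sketch of that kind of rollup, condensing a minute-level transaction log into weekly aggregates. The pandas library, the column names, and the weekly bucket are assumptions made for illustration, not anything specific to SumAll.

```python
import pandas as pd

# Minute-level transaction log: one row per transaction event.
# The column names ("timestamp", "amount") are illustrative assumptions.
transactions = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=7 * 24 * 60, freq="min"),
    "amount": 1.0,  # placeholder standing in for real transaction amounts
})

# Roll the granular log up into a weekly metric: keep the aggregates
# (count and total) and let the per-minute detail be archived or dropped.
weekly = (
    transactions
    .set_index("timestamp")
    .resample("W")
    .agg(transaction_count=("amount", "size"), total_amount=("amount", "sum"))
)

print(weekly)
```

The weekly table is orders of magnitude smaller than the raw log, which is the point: keep the metric you expect to query, not every event behind it.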
Of course, that's rarely the reality. "Every company I talk to has stored everything they possibly can," says Atkinson. "But most data, more than 50 percent, is like driving a car off the lot. The value drops significantly seconds later."
Leveraging tiered storage models can also help, he suggests, and archiving should be done as far off-line and at as large a scale as possible. "Way too many people keep data on expensive, highly granular tiers in the aspiration that one day they will use it. And at the end of the day, once you get down to the results people want to see, it tends to be small files."
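A tiering rule of that kind can be as simple as an age-based cutoff. The sketch below is purely illustrative; the tier names and the 30-day and one-year thresholds are invented assumptions, not a recommendation from Atkinson.

```python
from datetime import date

# Hypothetical tier names and age cutoffs, invented purely for illustration.
HOT_DAYS = 30       # recent, highly granular data on fast (expensive) storage
WARM_DAYS = 365     # older rollups on cheaper, slower storage

def storage_tier(created: date, today: date) -> str:
    """Pick a storage tier for a data set based on its age in days."""
    age = (today - created).days
    if age <= HOT_DAYS:
        return "hot"        # keep at full granularity, query directly
    if age <= WARM_DAYS:
        return "warm"       # rolled up, still queryable on demand
    return "cold-archive"   # off-line, large-scale archive

print(storage_tier(date(2014, 1, 1), date(2014, 6, 1)))  # -> "warm"
```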
For example, of the roughly 120 gigabytes an average customer accumulates in a year, what they will really want to use comes to 2-3 megabytes.
Despite the realities of how firms are actually using their data, or how they expect to leverage it for discovery purposes in the future, the industry has shown no inclination to slow down.
"You're living in the fantasy if you think you're going to leverage it."