Big Data Challenge: Data Paring

"As the amount of data increases exponentially, the amount of interesting data doesn’t."

David Beer, Adaptive Computing

One of the well-known challenges of big data is that you now have to pay far more attention to where the data lives. This isn’t a new challenge (I remember a customer laughing as he told me how, 15 or so years ago, his team would express-mail hard drives from one site to another because the postal service was faster than the network), but data is growing so quickly that this is changing from a fringe concern into a core one.

You always hear about the need to localize computational resources and make intelligent data staging decisions, but one dimension of this problem that deserves more discussion is data paring. The need is fairly obvious: data is growing exponentially, and growing your compute resources exponentially to match would require budgets that aren’t realistic.

One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t; I would bet that for most purposes the interesting data is a tiny percentage of what gets added to the overall pool.

To illustrate, suppose I'm building a tool that predicts which stocks to buy or sell based on market events. A near-infinite number of things happen every day that fit under the umbrella of market events: trading trends throughout the day, changes in government regulations, purchasing trends, the weather, a team winning its playoff series, gas prices, mergers, acquisitions, emerging markets; the list goes on and on. Finding correlations among a near-infinite number of combinations requires a near-infinite amount of computing power to stay ahead of. There is simply no way to solve this problem without ignoring, or at least devaluing, less significant data.
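To make the scale of the problem concrete, here is a minimal Python sketch of how quickly pairwise comparisons grow and of one simple way to pare signals before analyzing them. Everything in it is an assumption made for illustration: the function names, the signal names, the sample numbers, and the 0.5 correlation cutoff are all hypothetical, and a plain Pearson correlation stands in for whatever scoring a real tool would use.

```python
from math import comb, sqrt


def pairwise_comparisons(n_signals):
    """Number of distinct signal pairs to compare: n * (n - 1) / 2."""
    return comb(n_signals, 2)


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pare_signals(signals, target, min_abs_corr=0.5):
    """Keep only signals whose correlation with the target is strong enough.

    The 0.5 cutoff is an arbitrary illustrative value, not a recommendation.
    """
    return {
        name: series
        for name, series in signals.items()
        if abs(pearson(series, target)) >= min_abs_corr
    }


# Hypothetical daily observations; the numbers are made up for the example.
stock_price = [1.0, 2.0, 3.0, 4.0, 5.0]
signals = {
    "gas_price": [2.0, 4.1, 5.9, 8.2, 10.0],    # moves with the stock
    "silver_price": [5.0, 4.0, 3.0, 2.0, 1.0],  # moves against the stock
    "playoff_win": [1.0, 0.0, 1.0, 0.0, 1.0],   # unrelated to the stock
}

kept = pare_signals(signals, stock_price)
print(sorted(kept))
print(pairwise_comparisons(1_000), pairwise_comparisons(10_000))
```

Even this toy version shows the trade-off: going from 1,000 to 10,000 signals multiplies the number of pairs by roughly a hundred, while a cheap first-pass filter drops the unrelated signal before any expensive analysis runs.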

As I'm comparing all of the different scenarios, how much more likely is one event to influence another if the companies share board members, have the same parent companies, are located in the same region, share supply routes, or are part of an otherwise tightly coupled economy? How many of these events can be deemed irrelevant or insignificant? Does a change in food prices immediately affect defense contractors? Does a change in the value of silver affect the value of software companies? At what point does the sheer volume of trading in a stock outweigh the real-world events behind the trading?

Answering these questions through paring and sifting algorithms is a dimension of Big Data that will only grow in significance over time. Data capture will always be fundamentally faster and easier than data analysis, and data will continue to multiply faster than bunny rabbits. Not wasting time on irrelevant data will be one of the keys to staying ahead of the competition.

The scientific community has been working out how to remove irrelevant data for a long time, so long that the term "outlier" is mainstream. As Big Data moves to the forefront, organizations that can adapt these techniques to set aside outliers and draw intelligent conclusions from more strongly correlated data are going to lead the way.
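As one example of the kind of long-established technique this refers to, the sketch below drops outliers using the interquartile range rule (often called Tukey's fences). It is only an illustration, not a prescription: the function name, the sample readings, and the conventional 1.5 multiplier are assumptions chosen for the example, and other outlier tests may suit a given data set better.

```python
from statistics import quantiles


def drop_outliers(values, k=1.5):
    """Return the values that fall inside the interquartile-range fences.

    k=1.5 is the conventional multiplier; treat it as a tunable assumption.
    """
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical readings with one obviously anomalous value.
readings = [10, 12, 11, 13, 12, 11, 95]
print(drop_outliers(readings))
```

Whether a value like 95 is noise or the single most interesting data point depends on the question being asked, which is exactly why the paring rules deserve as much thought as the analysis itself.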

David Beer is a senior software engineer at Adaptive Computing.