What is a Data Lake?

A new concept in big data is making the rounds, and it could change the way firms approach long-term analysis.

About two years ago, while many firms were still struggling with Hadoop and wrapping their minds around the potential of big data, a new and potentially revolutionary concept, a data lake, piqued the attention of technology officers.

A data lake (or "enterprise data hub," as Cloudera markets it), as opposed to a data warehouse, contains the mess of raw unstructured or multi-structured data that for the most part has unrecognized value for the firm. While traditional data warehouses clean up and convert incoming data for specific analyses and applications, the raw (and processed) data residing in a lake is still waiting for applications to discover ways to manufacture insights from it.

The data channels with unrecognized value include social media, clickstream data, sensor data, server logs, customer transactions, videos, and more. As big data and its sources of information multiply, this category is likely to keep growing rapidly.

"The data is stored there only once and it can be accessed in many ways," explains Chris Twogood, VP of product and services marketing at Teradata. "It's always there in its original fidelity for you to go back, manipulate, change, and refine." Customers use Hadoop to sift through the lake and extract the relevant data for the query.
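To make the "stored once, accessed in many ways" idea concrete, here is a minimal sketch in Python. It is not how Hadoop or any vendor product works internally; the record layout, the `RAW_LAKE` list, and the function names are all hypothetical, chosen only to show raw data being kept untouched while structure is applied at read time.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no cleanup on write).
RAW_LAKE = [
    '{"source": "clickstream", "user": "u1", "page": "/loans"}',
    '{"source": "call_center", "user": "u1", "note": "asked about closing account"}',
    '{"source": "clickstream", "user": "u2", "page": "/home"}',
    'not valid json, stored anyway',
]


def read_records(raw_lines):
    """Parse raw lines at read time, skipping anything that is not a JSON object."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def pages_by_user(raw_lines):
    """One view of the data: the pages each user visited."""
    pages = {}
    for record in read_records(raw_lines):
        if record.get("source") == "clickstream":
            pages.setdefault(record["user"], []).append(record["page"])
    return pages


def users_mentioning(raw_lines, phrase):
    """A second view of the same data: users whose call notes contain a phrase."""
    return sorted(
        {
            record["user"]
            for record in read_records(raw_lines)
            if record.get("source") == "call_center"
            and phrase in record.get("note", "")
        }
    )
```

Both functions read the same untouched list, so a new question only needs a new reader function, not a reload or a reshaped copy of the data. In a real deployment the list would be files in a distributed store and the readers would be distributed jobs.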


The data lake concept has been gaining popularity in the last eight months, with applications coming from Hortonworks and Cloudera. Lower storage costs and Hadoop's ability to store more data types at scale have made the idea increasingly tangible. Use cases in financial services are still being worked out, but it's easy to see where a data lake might aid risk analysis and fraud analysis by identifying patterns and building profiles of customers and clients likely to cause or experience issues.

Using call center notes, detailed web log information, cookies, tweets, and branch and ATM information (all largely unstructured data), a bank could use the lake to identify customers and better understand their behavior. For example, are they showing signs of closing their account, going through notable life changes, or taking an interest in new products?

True, it may be the latest method of data hoarding, but the marketplace has shown no sign of reversing the trend of storing more data over longer periods of time. Increasingly, firms want to go back to the raw source to rebuild and repopulate analytic engines. "We don't see Hadoop becoming a data warehouse, or vice versa," says Twogood, so the emergence of the new, flexible architecture is well timed.

Becca Lipman is Senior Editor for Wall Street & Technology. She writes in-depth news articles with a focus on big data and compliance in the capital markets. She regularly meets with information technology leaders and innovators and writes about cloud computing, datacenters, ...