November - 2014 - issue > CXO View Point

Whither Hadoop?

M.C. Srivas

CTO & Founder, MapR Technologies Inc.

Tuesday, December 2, 2014

Hadoop came out of technology pioneered by Google, where it was used to index the World Wide Web. Since then, Hadoop has emerged as the platform of choice for enterprise-wide data processing. When I was at Google, I had the privilege of working at the center of such big data technology, and saw the potential it had to transform how businesses gain value from their data. Hadoop allows businesses to take advantage of their data to gain intelligence about their customers, optimize revenues, mitigate risks, track every little bit of information, and achieve all of this at a fraction of what it would cost with traditional database technology.

Some examples are worth mentioning. The largest cellphone manufacturer in the world uses Hadoop to track how well its phones are performing (e.g., what functionality is used, or how often the battery is recharged). The largest cancer center in the world uses Hadoop to understand which genes are implicated in which ailments. Farmers can buy insurance against bad weather ruining their crops from insurers that use Hadoop to forecast weather right down to a 4-square-kilometer area anywhere in the United States. The Indian government uses biometric information served by Hadoop to provide identities and proof of residence for the 1.3 billion residents in India.

Why does Hadoop win? As Google has famously shown in the paper, "The Unreasonable Effectiveness of Data," simple models and a lot of data trump more elaborate models based on less data. The key to Hadoop's success is the fact that it was designed to scale to tens of petabytes of storage.

There are many lessons that we can learn from Google about architectural advances, and here I present three.

First of all, you must start with a top-notch storage engine capable of holding all the data. You need a system that can hold an unlimited number of files, so you can connect as many data sources as possible to directly deposit and update data continuously without fear of running out of files. Without the ability to continuously update, you are limiting yourself to batch processing.

Secondly, you need a top-notch database engine capable of indexing all of the data you just collected: a database that can handle BOTH semi-structured and multimedia data. You need a database that can run both online and analytics workloads on the same copy. Why? In a world that is increasingly demanding real-time insights, there just isn't time to ETL the data to another system; there's just too much data and it's arriving too fast.