The ABC of Big Data





Date: Sunday, March 04, 2012

Headquartered in Sunnyvale, CA, NetApp (NASDAQ: NTAP) creates innovative storage and data management solutions that accelerate business breakthroughs and deliver outstanding cost efficiency. With revenue of over $5 billion, the company has more than 150 offices across the globe and over 11,000 employees.

The last few years have been an era of nebulous buzzwords in our industry that have caught on like wildfire. First it was "SaaS", then "Cloud" and now "Big Data", and the first two have not gone away. It is also interesting to see how the latest kid on the block (Big Data) plays in the context of the former ones. But first, let's set some context.

Big Data covers a number of different dimensions and means different things to different organizations. "Big Data" has been defined as something as simple as "ABC": the Big Data space has been segmented into Big Data Analytics, Big Bandwidth Applications and Big Content.

Big Bandwidth applications like media streaming, full-motion video and video editing are examples where the infrastructure needs to provide large bandwidth for data I/O. Because media can be streamed to a large number of end devices, each with its own rendering capability, the onus is on the streaming side of the infrastructure to deliver the right media format. Thus, there are a large number of formats into which media must be transcoded and streamed. Both these aspects are bandwidth intensive on the storage side. Simultaneous video editing in a digital production house has similar needs, since multiple video frames are being accessed and modified at once.

Big Content is typically characterized by large amounts of data that, once written, are never modified.
Immutable data of this nature occurs in the form of media objects (like pictures and videos), patient records (in medical diagnostics), seismic data (for oil exploration), call detail records (in the telecom industry) or simply click-stream data (in Internet companies). Some of this data will be read multiple times in the first few days after it is generated and will then slowly become cold. Depending on the retention policies for that dataset, it may be retained for a few years without being actively accessed. Storage solutions in this space need to provide reasonable access bandwidth, but the focus is on providing online data storage at the lowest $/GB.

Analytics has traditionally been part of an organization's Business Intelligence investments, realized using traditional data warehouses. The technologies that allow enterprises to extract insights from this data have now evolved to the point where the cost of producing a "unit of insight" per TB of raw data is less than the value that "unit of insight" provides to the enterprise. In other words, the ROI on enterprise data has now tipped in favor of processing that data rather than throwing it away. Parallel data processing technologies like Hadoop and NoSQL have allowed enterprises to march over this tipping point. This, in turn, has left enterprises clamoring for more, and they have started to retain almost all the data that was earlier considered "junk-drawer" data. Most of this data is either not actively being processed or is being processed with very relaxed response-time SLAs. Thus, Big Data Analytics fuels the growth in demand for Big Data Content solutions. This typically leads to batch-oriented data processing slowly becoming an integral part of an organization's core data processes.
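
The tipping-point argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: the dataset names, the cost and value figures, and the helpers `net_value` and `worth_processing` are all hypothetical, and real estimates would have to come from an enterprise's own accounting.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_tb: float
    cost_per_tb: float   # assumed cost to store and process one TB
    value_per_tb: float  # assumed value of the insight drawn from one TB


def net_value(ds: Dataset) -> float:
    """Net return from processing the whole dataset."""
    return (ds.value_per_tb - ds.cost_per_tb) * ds.size_tb


def worth_processing(ds: Dataset) -> bool:
    """True once the insight value per TB exceeds the processing cost per TB."""
    return ds.value_per_tb > ds.cost_per_tb


# Hypothetical figures, for illustration only.
datasets = [
    Dataset("click_stream", size_tb=400, cost_per_tb=50.0, value_per_tb=80.0),
    Dataset("old_debug_logs", size_tb=900, cost_per_tb=50.0, value_per_tb=10.0),
]

for ds in datasets:
    decision = "process" if worth_processing(ds) else "discard or archive"
    print(f"{ds.name}: net value {net_value(ds):+.0f} -> {decision}")
```

As the cost per TB falls, more of the former "junk-drawer" datasets cross the line from the second category into the first, which is the retention effect described above.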
As the organization sees a positive impact from these investments, on either the top line (growing the business) or the bottom line (improving efficiencies), it typically wants those insights faster. This is where near-realtime analytics comes into the picture. Being able to analyze billions of financial transactions in a day and point out the fraudulent ones is a case of batch-oriented analytics. Being able to do this before approving a transaction is real-time analytics. Being able to predict a fraudulent transaction by looking at emerging patterns is an example of near-realtime analytics. Each of these classes of analytics needs very different technology solutions.

Below are some of the "getting started" myths in the Big Data Analytics space:

1. Virtualization has no place here: Virtualized environments, along with the right tools to keep the focus on developing business logic, help in getting started with Big Data Analytics.
2. I need a 100-node cluster to do anything meaningful: A 10-node cluster can be used to process 250 billion records of auto-support data. Most enterprises don't need anything more than a 30-node cluster, and only a few need a 100-node cluster. Trying to emulate a 6000-node cluster, as touted by Internet companies, is overkill for most enterprises.
3. Commodity is cheap: Using commodity components such as DAS (direct-attached storage) is not always cheap. It is important to look at more practical metrics, like usable storage capacity, total cost of ownership, and management simplicity that reduces operator interventions. Enterprises should develop their own cost models to determine the degree of impact of these aspects on their solutions.
4. Store all the data: Some organizations will still store all of it, for compliance or other reasons. But most organizations would want to be given the option of knowing what to delete.
Thus, being able to provide insight into which data is more important than the rest will become critical for an organization.
5. I still need to know the question: Machine-learning techniques are an interesting class of mechanisms that are emerging so that patterns present themselves, rather than a human having to manually arrange data along different dimensions and experiment with it to find the interesting ones. Clustering and classification algorithms help in eliminating noise and arranging data in such a manner that the insights present themselves. These are some of the advanced analytics techniques that data scientists will specialize in. However, this is not a panacea, and interactive drill-down analytics will not go away.

I still consider most Big Data solutions in the market an extension of the "experiment" conducted by some Internet companies in the open source. That experimental infrastructure does not extend itself naturally into enterprises, and I still do not see any of the Hadoop distributions or solutions providing an enterprise-grade Hadoop. There are still a lot of inefficiencies:

• Resource utilization is very low and highly imbalanced in a Hadoop cluster
• The total cost of ownership is high
• The tenets of cluster expansion only tend to further worsen resource utilization in Hadoop clusters
• Protection against data loss still comes with probabilities attached, rather than a 100 percent guarantee
• Faults are not contained within a node and thus result in lowered predictability, etc.
• "Data unrolling" as a means of providing HA is highly inefficient at large scale

All these aspects may be fine for a low-grade batch-oriented solution, but not for a production environment with stringent SLAs. Further, there is no good way to leverage emerging technologies like flash in this space to eliminate hot spots. There is no way I can use flash for effective and automated data tiering in these solutions.
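
The "commodity is cheap" myth and the cost-of-ownership concern above both come down to what a usable terabyte really costs. What follows is a minimal sketch of the kind of cost model an enterprise might build for itself. The function name, its parameters and every figure are hypothetical, chosen only to show how data-protection overhead and operating cost enter the calculation. The one outside fact assumed is that HDFS keeps three copies of each block by default.

```python
def cost_per_usable_tb(raw_tb, hardware_cost, yearly_opex, years,
                       raw_per_usable):
    """Total cost of ownership divided by usable (not raw) capacity.

    raw_per_usable is the raw capacity consumed per TB of user data,
    e.g. 3.0 for three-way replication or 1.25 for a parity-based layout.
    """
    usable_tb = raw_tb / raw_per_usable
    tco = hardware_cost + yearly_opex * years
    return tco / usable_tb


# Hypothetical inputs, for illustration only; not measurements of any product.
das = cost_per_usable_tb(raw_tb=1000, hardware_cost=300_000,
                         yearly_opex=100_000, years=3, raw_per_usable=3.0)
shared = cost_per_usable_tb(raw_tb=1000, hardware_cost=500_000,
                            yearly_opex=50_000, years=3, raw_per_usable=1.25)

print(f"replicated DAS:   ${das:,.2f} per usable TB")
print(f"parity-protected: ${shared:,.2f} per usable TB")
```

With these made-up inputs the cheaper hardware ends up costing more per usable terabyte, but different inputs can reverse the ranking. That is why each enterprise needs to plug in its own numbers rather than compare purchase prices alone.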
Big Data is just emerging, and the infrastructures and solutions for Big Data Analytics are still in their infancy. I am keenly watching MRv2 (the next version of MapReduce) in Hadoop, as that would help expand the applicability of Hadoop to different industry verticals that have their own computational models. I am also keenly watching the HCatalog project of Hadoop evolve, and I hope to see it become the metadata repository for an enterprise, also helping bridge the different data center technology islands. I expect Hadoop Mahout to come of age as iterative paradigms are supported with MRv2, which would give advanced analytics a huge impetus. Finally, I expect near-realtime analytics to become mainstream and computational models supporting interactive and drill-down analysis to become commonplace.

A couple of indirect implications follow. Big Data Content solutions with highly optimized $/GB and high resiliency characteristics will emerge, and replication as a mechanism for data resiliency in these solutions will diminish. Finally, server-side footprints will emerge as a critical control point going forward and into real-time analytics solutions, while batch analytics will move into the cloud, provided through a SaaS model.