Does Big Data Mean Big Storage?


DOES BIG DATA MEAN BIG STORAGE?
Mikhail Gloukhovtsev
Sr. Cloud Solutions Architect
Orange Business Services

Table of Contents
1. Introduction
2. Types of Storage Architecture for Big Data
2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
2.3 Data Lake
2.4 SMAQ Stack
2.5 Big Data Storage Access Patterns
2.6 Taxonomy of Storage Architectures for Big Data
2.7 Selection of Storage Solutions for Big Data
3. Hadoop Framework
3.1 Hadoop Architecture and Storage Options
3.2 Enterprise-class Hadoop Distributions
3.3 Big Data Storage and Security
3.4 EMC Isilon Storage for Big Data
3.5 EMC Greenplum Distributed Computing Appliance (DCA)
3.6 NetApp Storage for Hadoop
3.7 Object-Based Storage for Big Data
3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?
3.7.2 EMC Atmos
3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing
3.9 Virtualization of Hadoop
4. Cloud Computing and Big Data
5. Big Data Backups
5.1 Challenges of Big Data Backups and How They Can Be Addressed
5.2 EMC Data Domain as a Solution for Big Data Backups
6. Big Data Retention

6.1 General Considerations for Big Data Archiving
6.1.1 Backup vs. Archive
6.1.2 Why Is Archiving Needed for Big Data?
6.1.3 Pre-requisites for Implementing Big Data Archiving
6.1.4 Specifics of Big Data Archiving
6.1.5 Archiving Solution Components
6.1.6 Checklist for Selecting Big Data Archiving Solution
6.2 Big Data Archiving with EMC Isilon
6.3 RainStor and Dell Archive Solution for Big Data
7. Conclusions
8. References

Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect the views, processes or methodologies of EMC Corporation or Orange Business Services (my employer).

1. Introduction

Big Data has become a buzzword today, and we can hear about Big Data from early morning – reading the newspaper that tells us "How Big Data Is Changing the Whole Equation for Business"1 – through our entire day. A search for "big data" on Google returned about 2,030,000,000 results in December 2013. So what is Big Data?

According to Krish Krishnan,2 the so-called three V's definition of Big Data that became popular in the industry was first suggested by Doug Laney in a research report published by META Group (now Gartner) in 2001. In a more recent report,3 Doug Laney and Mark Beyer define Big Data as follows: "'Big Data' is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Let us briefly review these characteristics of Big Data in more detail.

1. Volume of data is huge (for instance, billions of rows and millions of columns). People create digital data every day by using mobile devices and social media. Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants, X-ray and scanning devices, and consumer-driven data from social media. According to IBM, as of 2012, 2.5 exabytes of data were created every day, and 90% of the data in the world today was created in the last two years alone.4 This data growth is being accelerated by the Internet of Things (IoT), defined as the network of physical objects that contain embedded technology to communicate and interact with their internal states or the external environment (IoT excludes PCs, tablets, and smartphones). The IoT will grow to 26 billion installed units in 2020, an almost 30-fold increase from 0.9 billion in 2009, according to Gartner.5

2. Velocity of new data creation and processing. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. In the case of Big Data, the data streams in continuously, and time-to-value can be achieved when data capture, data preparation, and processing are fast. This requirement is all the more challenging given that the data generation speed changes and the data size varies.

3. Variety of data. In addition to traditional structured data, the data types include semi-structured (for example, XML files), quasi-structured (for example, clickstream strings), and unstructured data.

A misunderstanding that big volume is the key characteristic defining Big Data can result in the failure of a Big Data-related project unless it also focuses on the variety, velocity, and complexity of the Big Data, which are becoming its leading features. What is seen as a large data volume today can become the new normal data size in a year or two.

A fourth V – Veracity – is frequently added to this definition of Big Data. Data veracity deals with uncertain or imprecise data. How accurate is that data in predicting business value? Does Big Data analytics give meaningful results that are valuable for the business? Data accuracy must be verifiable.

Just retaining more and more data of various types does not create any business advantage unless the company has developed a Big Data strategy to get business information from Big Data sets. Business benefits are frequently higher when addressing the variety of the data rather than just the data volume. Business value can also be created by combining the new Big Data types with existing information assets, which results in even larger data type diversity. According to research done by MIT and the IBM Institute for Business Value,6 organizations applying analytics to create a competitive advantage within their markets or industries are more than twice as likely to substantially outperform their peers.

The requirement of time-to-value calls for innovations in data processing, which are challenged by Big Data complexity. Indeed, in addition to the great variety of Big Data types, the combination of different data types, each presenting different challenges and requiring different analytical methods to generate business value, makes data management more complex. Complexity combined with an increasing volume of unstructured data (80%–90% of the data in existence is unstructured) means that different standards, data processing methods, and storage formats can exist for each asset type and structure.

The level of complexity and/or data size of Big Data has led to another definition: data that cannot be efficiently managed using only traditional data-capture technology, processes, or methods. Therefore, new applications and new infrastructure, as well as new processes and procedures, are required to use Big Data. The storage infrastructure for Big Data applications should be capable of managing large data sets and providing the required performance. Development of new storage solutions should address the three V characteristics of Big Data.

Big Data creates great potential for business development, but at the same time it can mean "Big Mistakes": spending a lot of money and time on poorly defined business goals or opportunities. The goal of this article is to help readers anticipate how Big Data will affect storage infrastructure design and data lifecycle management so that they can work with storage vendors to develop Big Data road maps for their companies and spend their Big Data storage budget wisely.

While this article considers the storage technologies for Big Data, I want readers to keep in mind that Big Data is about more than just technology. To get the business advantages of Big Data, companies have to make changes in the way they do business and develop enterprise information management strategies that address the Big Data lifecycle, including hardware, software, services, and policies for capturing, storing, and analyzing Big Data. For more detail, refer to the excellent EMC course, Data Science and Big Data Analytics.8

2. Types of Storage Architecture for Big Data

2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
Big Data architecture is based on two different technology types: one for real-time, interactive workloads and one for batch processing requirements. These classes of technology are complementary and frequently deployed together – for example, the Pivotal One Platform, which includes Pivotal Data Fabric9,10 (Pivotal is partly owned by EMC, VMware, and General Electric).

Big Data frameworks such as Hadoop are batch-process-oriented. They address the problems of the cost and speed of Big Data processing by using open source software and massively parallel processing. Server and storage costs are reduced by implementing scale-out solutions based on commodity hardware.

NoSQL databases, which are highly optimized key-value data stores such as HBase, are used for high-performance, index-based retrieval in real-time processing. A NoSQL database will process large amounts of data from various sources in a flexible data structure with low latency. It will also provide real-time data integration with a Complex Event Processing (CEP) engine to enable actionable real-time Big Data analytics. High-speed processing of Big Data in flight – so-called Fast Big Data – is typically done using in-memory computing (IMC). IMC relies on in-memory data management software to deliver high-speed, low-latency access to terabytes of data across a distributed application. Some Fast Big Data solutions, such as Terracotta BigMemory,11 maintain all data in memory with the motto "Ditch the Disk" and use disk-based storage only for storing data copies and redo logs for database startup and fault recovery, as SAP HANA, an in-memory database, does.12 Other solutions for Fast Big Data, such as the Oracle Exalytics In-Memory Machine,13 the Teradata Active EDW platform,14 and DataDirect Networks' SFA12KX-series appliances,15 use a hybrid storage architecture (see Section 2.3).

As data of various types are captured, they are stored and processed in traditional DBMSs (for structured data), simple files, or distributed clustered systems such as NoSQL data stores and the Hadoop Distributed File System (HDFS). Due to the size of Big Data sets, the raw data is not moved directly to a data warehouse. The raw data undergoes transformation using MapReduce processing, and the resulting reduced data sets are loaded into the data warehouse environment, where they are used for further analysis – conventional BI reporting/dashboards, statistical, semantic, and correlation capabilities, and advanced data visualization.
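To make that batch path concrete, the following is a minimal sketch of the kind of MapReduce-style reduction described above, written as a pair of Hadoop Streaming scripts in Python. The tab-separated layout of the raw clickstream records and the file names are illustrative assumptions, not taken from the article.

#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper: turns raw tab-separated clickstream
# records (timestamp, page, user) into (page, 1) pairs. Field layout is assumed.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                      # skip malformed raw records
    page = fields[1]
    print("%s\t1" % page)

#!/usr/bin/env python
# reducer.py - Hadoop Streaming reducer: sums the counts per page; this small,
# reduced data set is what would be loaded into the data warehouse.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page and current_page is not None:
        print("%s\t%d" % (current_page, count))
        count = 0
    current_page = page
    count += int(value)
if current_page is not None:
    print("%s\t%d" % (current_page, count))

These two scripts would be submitted with the Hadoop Streaming JAR, passing -mapper and -reducer along with the HDFS -input and -output paths. The reduced per-page counts are small enough to be bulk-loaded into the warehouse for conventional BI reporting.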

Figure 1: Enterprise Big Data Architecture (Ref. 10)

Storage requirements for batch processing and for real-time analytics are very different. While the capability to store and manage hundreds of terabytes cost-effectively is an important requirement for batch processing of Big Data, low I/O latency is key for large-capacity, performance-intensive Fast Big Data analytics applications. Storage architectures for such I/O-performance-intensive applications, which include all-flash storage appliances and/or in-memory databases, are outside the scope of this article.

2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
The integration of a Big Data platform with traditional BI enterprise data warehouse (EDW) architecture is important, and this requirement, which many enterprises have, results in the development of so-called consolidated storage systems. Instead of a "rip & replace" of the existing storage ecosystems, organizations can leverage the existing storage systems and adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration into the data warehouse. The consolidated storage includes storage tiering and is used for very different data management processes: primary workloads, real-time online analytics queries, and offline batch-processing analytics. These different data processing types result in the heterogeneous or hybrid storage environments discussed later. Readers can find more information about consolidated storage in Ref. 16.

The new integrated EDW should have three main capabilities:
1. Hadoop-based analytics to process and analyze any data type across commodity server clusters
2. Real-time stream processing with a Complex Event Processing (CEP) engine with sub-millisecond response times
3. Data warehousing providing insight with advanced in-database analytics

The integration of Big Data and EDW environments is reflected in the definition of the "Data Lake".

2.3 Data Lake
Booz Allen Hamilton introduced the concept of the Data Lake.17 Instead of storing information in discrete data structures, the Data Lake consolidates an organization's complete repository of data in a single, large table. The Data Lake includes data from all data sources – unstructured, semi-structured, streaming, and batch data. To be able to store and process terabytes to petabytes of data, a Data Lake should scale in both storage and processing capacity in an affordable manner. An enterprise Data Lake should provide highly available, protected storage; support existing data management processes and tools as well as real-time data ingestion and extraction; and be capable of data archiving.

Support for the Data Lake architecture is included in Pivotal's Hadoop products, which are designed to work within existing SQL environments and can co-exist alongside in-memory databases for simultaneous batch and real-time analytic queries.18 Customers implementing Data Lakes can use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured and unstructured.

Pentaho has created an optimized system for organizing the data that is stored in the Data Lake, allowing customers to use Hadoop to sift through the data and extract the chunks that answer the questions at hand.19
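Because HAWQ exposes a PostgreSQL-compatible SQL interface, data landed in such a Data Lake can be queried with ordinary SQL tooling. The snippet below is a minimal sketch only; the host name, credentials, and the clickstream_events table are hypothetical examples, not taken from Pivotal documentation.

# query_data_lake.py - querying a HAWQ-backed Data Lake over the PostgreSQL
# wire protocol (host, credentials, and table name are made-up examples).
import psycopg2

conn = psycopg2.connect(
    host="hawq-master.example.com",   # hypothetical HAWQ master host
    port=5432,
    dbname="datalake",
    user="analyst",
    password="secret",
)
cur = conn.cursor()
# Aggregate raw clickstream events landed in the Data Lake alongside
# structured warehouse tables.
cur.execute(
    """
    SELECT page, COUNT(*) AS visits
    FROM clickstream_events
    WHERE event_date >= %s
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
    """,
    ("2014-01-01",),
)
for page, visits in cur.fetchall():
    print(page, visits)
cur.close()
conn.close()

The same connection can serve both ad hoc exploration and scheduled loads into BI dashboards, which is the co-existence of batch and interactive querying that the Data Lake is meant to provide.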

2.4 SMAQ Stack
The term SMAQ stack, coined by Edd Dumbill in a blog post at O'Reilly Radar,20 refers to a processing stack for Big Data that consists of layers of Storage, MapReduce technologies, and Query technologies. SMAQ systems are typically open source and distributed, and run on commodity hardware. Similar to the commodity LAMP stack of Linux, Apache, MySQL, and PHP, which has played a critical role in the development of Web 2.0, SMAQ systems are expected to be a framework for the development of Big Data-driven products and services. While Hadoop-based architectures dominate SMAQ, SMAQ systems also include a variety of NoSQL databases.

Figure 2: The SMAQ Stack for Big Data (Ref. 8)

As expected, storage is the foundational layer of the SMAQ stack and is characterized by distributed and unstructured content. At the intermediate layer, MapReduce technologies enable the distribution of computation across many servers as well as supporting a batch-oriented processing model of data retrieval and computation. Finally, at the top of the stack are the query functions. Characteristic of this layer is the ability to find efficient ways of defining computation and to provide a platform for "user-friendly" analytics.

2.5 Big Data Storage Access Patterns
Typical Big Data storage access patterns are write-once, read-many-times workloads with metadata lookups and large block-sized reads (64 MB to 128 MB, e.g. Hadoop HDFS) as well as small-sized accesses for HBase. Therefore, the Big Data processing design should provide for efficient data reads.
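A quick back-of-the-envelope calculation (my own illustration, with arbitrary numbers) shows why such large block sizes matter: the fewer blocks a data set occupies, the less block metadata has to be tracked and looked up for each large sequential read.

# block_math.py - how the HDFS block size drives the amount of block metadata
# the NameNode must track (illustrative numbers only).
TB = 1024 ** 4
MB = 1024 ** 2

dataset_bytes = 100 * TB          # a hypothetical 100 TB raw data set
for block_size in (4 * MB, 64 * MB, 128 * MB):
    blocks = dataset_bytes // block_size
    print("block size %4d MB -> %10d blocks to track" % (block_size // MB, blocks))

# Approximate output:
#   block size    4 MB ->   26214400 blocks to track
#   block size   64 MB ->    1638400 blocks to track
#   block size  128 MB ->     819200 blocks to track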

Both scale-out file systems and object-based systems can meet Big Data storage requirements. Scale-out file systems provide a global-namespace file system, whereas the use of metadata in object storage (see Section 3.7 below) allows for high scalability for large data sets.

2.6 Taxonomy of Storage Architectures for Big Data
Big Data storage architectures can be categorized into shared-nothing, shared primary, or shared secondary storage. Implementation of Hadoop using Direct Attached Storage (DAS) is common, as many Big Data architects see shared storage architectures as relatively slow, complex, and, above all, expensive. However, DAS has its own limitations (first of all, inefficiency in storage use) and is one extreme in the broad spectrum of storage architectures. As a result, in addition to DAS-based HDFS systems, enterprise-class storage solutions using shared storage (scale-out NAS, e.g. EMC Isilon, or SAN), alternative distributed file systems, cloud object-based storage for Hadoop (using a REST API such as CDMI, S3, or Swift; see Ref. 7 for detail), and decoupled storage and compute nodes, such as solutions using vSphere BDE (see Section 3.9), are gaining popularity (see Figure 3).

For Hadoop workloads, the storage-to-compute resource ratios vary by application, and it is often difficult to determine them in advance. This challenge makes it imperative that a Hadoop cluster be designed for flexibility, scaling storage and compute independently. Decoupling storage and compute resources allows each to be scaled on its own; object-based storage reached over a REST API is one way to achieve this (see the sketch below).
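As an illustration of the object-based option, the sketch below writes and then reads a data set through an S3-compatible REST endpoint using boto3; the endpoint URL, credentials, bucket, and key names are hypothetical. Because any compute node that can reach the endpoint can read the data, the storage tier scales independently of the Hadoop compute tier.

# object_store_demo.py - write-once, read-many access to an S3-compatible
# object store (endpoint, credentials, bucket, and key are hypothetical).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",   # any S3-compatible service
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Write once ...
with open("part-0000.gz", "rb") as f:
    s3.put_object(Bucket="bigdata-landing",
                  Key="clickstream/2014/01/part-0000.gz", Body=f)

# ... read many times, from any compute node that can reach the endpoint.
obj = s3.get_object(Bucket="bigdata-landing",
                    Key="clickstream/2014/01/part-0000.gz")
print(len(obj["Body"].read()), "bytes retrieved")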
