A Primer on Big Data Testing - QA Consultants


A Primer on Big Data Testing

For more information please contact
Maria Kelebeev
416-238-5333 ext m

PREFACE

This report is the output of a research project by QA Consultants - the North American leader in onshore software testing. This paper focuses on the primary challenges of testing Big Data systems and proposes methodology to overcome those challenges. Because of the complex nature of both Big Data and the highly distributed, asynchronous systems that process it, organizations have been struggling to define testing strategies and to set up optimal testing environments. The focus of this primer is on important aspects of methods, tools and processes for Big Data testing. It was completed with the support of the National Research Council of Canada.

TABLE OF CONTENTS

1. Big Data and Bad Data
2. Characteristics of Big Data
   2.1 Volume: The quantity of data
   2.2 Velocity: Streaming data
   2.3 Variety: Different types of data
3. Testing Big Data Systems
4. Testing Methods, Tools and Reporting for Validation of Pre-Hadoop Processing
   4.1 Tools for validating pre-Hadoop processing
5. Testing Methods, Tools and Reporting for Hadoop MapReduce Processes
   5.1 Methods and tools
6. Testing Methods, Tools and Reporting for Data Extract and EDW Loading
   6.1 Methods
   6.2 Different methods for ETL testing
   6.3 Areas of ETL testing
   6.4 Tools
7. Testing Methods, Tools and Reporting on Analytics
   7.1 Four Big Data reporting strategies
   7.2 Methodology for report testing
   7.3 Apache Falcon
8. Testing Methods, Tools and Reporting on Performance and Failover Testing
   8.1 Performance testing
   8.2 Failover testing
   8.3 Methods and tools
   8.3.1 Jepsen
9. Infrastructure Setup, Design and Implementation
   9.1 Hardware selection for master nodes (NameNode, JobTracker, HBase Master)
   9.2 Hardware selection for slave nodes (DataNodes, TaskTrackers, RegionServers)
   9.3 Infrastructure setup key points
Conclusion

1 Big Data and Bad Data

"70% of enterprises have either deployed or are planning to deploy big data projects and programs this year."
- Analyst firm IDG

"75% of businesses are wasting 14% of revenue due to poor data quality."
- Experian Data Quality Global Research report

"Big Data is growing at a rapid pace and with Big Data comes bad data. Many companies are using Business Intelligence to make strategic decisions in the hope of gaining a competitive advantage in a tough business landscape. But bad data will cause them to make decisions that will cost their firms millions of dollars.

According to analyst firm Gartner, the average organization loses $8.2 million annually through poor Data Quality, with 22% estimating their annual losses resulting from bad data at $20 million and 4% putting that figure as high as an astounding $100 million. Yet according to the Experian Data Quality report, 99% of organizations have a data quality strategy in place. This is disturbing in that these Data Quality practices are not finding the bad data that exists in their Big Data."
- Query Surge

With respect to software development and verification processes, testing teams may not yet fully understand the implications of Big Data's impact on the design, configuration and operation of systems and databases. Testers need a clear plan to execute their tests, but there are many new unknowns as Big Data systems are layered on top of enterprise systems struggling with data quality. Added to those struggles are the challenges of replicating and porting that information into the Big Data analytic and predictive software suite. How do you measure the quality of data, particularly when it is unstructured or generated through statistical processes? How do you confirm that highly concurrent systems do not have deadlock or race conditions? What tools should be used?

It is imperative that software testers understand that Big Data is about far more than simply data volume. For example, having a two-petabyte Oracle database doesn't necessarily constitute a true Big Data situation, just a high-load one. Big Data management involves fundamentally different methods for storing and processing data, and the outputs may also be of a quite different nature. With the increased likelihood that bad data is embedded in the mix, the challenges facing quality assurance testing departments increase dramatically. This primer on Big Data testing provides guidelines and methodologies on how to approach these data quality problems.

2 Characteristics of Big Data

Big Data is often characterized as involving the so-called "three Vs": Volume, Velocity and Variety. (Some authors also add a fourth characteristic: Veracity.)

2.1 Volume: The quantity of data

With the rise of the Web, then mobile computing, the volume of data generated daily around the world has exploded. For example, organizations such as Facebook generate terabytes of data daily that must be stored and managed.

As the number of communications and sensing devices being deployed annually accelerates to create the encompassing "Internet of Things," the volumes of data continue to rise exponentially. By recording the raw, detailed data streaming from these devices, organizations are beginning to develop high-resolution models based on all available data rather than just a sample. Important details that would otherwise have been washed out by the sampling process can now be identified and exploited. The bottleneck in modern systems is increasingly the limited speed at which data can be read from a disk drive. Sequential processing simply takes too long when such data volumes are involved. New database technologies that are resistant to failure and enable massive parallelism are a necessity.

2.2 Velocity: Streaming data

It is estimated that 2.3 trillion gigabytes of data are created each day. In our highly connected world, trends of interest may last only a few days, hours or even just minutes. Some important events, such as online fraud or hacking attempts, may last only seconds, and need to be responded to immediately. The need for near-real-time sensing, processing and response is driving the development of new technologies for identifying patterns in data within very small and immediate time windows.

[Figure: "The Dawn of Big Data: the uncertainty of new information is growing alongside its complexity." Projected data volume in exabytes, 2010-2020, broken out by source: enterprise data, VoIP, social media, and sensors and devices. Chart courtesy of International Business Machines - SlideShare.]

2.3 Variety: Different types of data

A common theme in Big Data systems is that the source data is increasingly diverse, involving types of data and structures that are more complex and/or less structured than the traditional strings of text and numbers that are the mainstay of relational databases. Increasingly, organizations must be able to deal with text from social networks, image data, voice recordings, video, spreadsheet data, and raw feeds directly from sensor sources.

Even on the Web, where computer-to-computer communication ought to be straightforward, the reality is that data is messy. Different browsers send differently formatted data, users withhold information, and they may be using different software versions or vendors to communicate with you. Traditional relational database management systems are well suited for storing transactional data but do not perform well with mixed data types.

In response to the "three Vs" challenge, Hadoop, an open source software framework, has been developed by a number of contributors. Hadoop is designed to capture raw data using a cluster of relatively inexpensive, general-purpose servers. Hadoop achieves reliability that equals or betters specialized storage equipment by managing redundant copies of each file, intentionally and carefully distributed across the cluster.

Hadoop uses its own distributed file system, HDFS, which extends the native file system of the host operating system, typically Linux or Windows.

A common Hadoop usage pattern involves three stages:

1. Loading data into HDFS
2. MapReduce operations
3. Retrieving results from HDFS

This process is by nature a batch operation, suited to analytical or non-interactive computing tasks. For this reason, Hadoop is not itself a full general-purpose database or data warehouse solution, but can act as an analytical adjunct to or the basis of one when supplemented with other tools.

One of the best-known Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data, which is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends' interests. Facebook then transfers the results back into MySQL for use in pages served to users.

Hadoop is not actually a single product but is instead a growing collection of components and related projects. Following are a few of the many components that would need to be tested for correct installation, configuration and functioning in a typical Hadoop environment.

1. Hadoop HDFS: the file system
2. Hadoop YARN: the Hadoop resource coordinator
3. Apache Pig: a query system
4. Apache Hive: a system that allows data stored in Hadoop to be structured as tables and queried using SQL
5. Apache HCatalog: the Hadoop metadata store
6. Apache Zookeeper: a coordination system
7. Apache Oozie: a process scheduling and sequencing system
8. Apache Sqoop: a tool for connecting Hadoop to relational databases
9. Hue: a browser-based user interface for several of the tools above
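To make the load/compute/retrieve pattern described above concrete, the following minimal Java sketch stages a local file into HDFS and later pulls the job output back out, using the standard Hadoop FileSystem API. The cluster URI, directories and file names are hypothetical placeholders, and the MapReduce job itself (stage 2) is assumed to be submitted separately.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of the load / compute / retrieve pattern.
    // The HDFS URI, directories and file names are hypothetical placeholders.
    public class HadoopBatchPattern {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Stage 1: load raw data into HDFS.
            fs.copyFromLocalFile(new Path("/tmp/raw_events.csv"),
                                 new Path("/data/incoming/raw_events.csv"));

            // Stage 2: a MapReduce (or equivalent) job is submitted against
            // /data/incoming and writes its output under /data/results.
            // Job submission is not shown here.

            // Stage 3: retrieve the results from HDFS for downstream use.
            fs.copyToLocalFile(new Path("/data/results/part-r-00000"),
                               new Path("/tmp/results.txt"));

            fs.close();
        }
    }

In a deployment like the Facebook example above, the transfers between the relational database and Hadoop would more likely be handled by a tool such as Apache Sqoop than by hand-written file copies.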

3 Big Data Testing Aspects

Some of the key aspects of Big Data testing are described below. Poor implementation of these key components of testing Big Data environments can lead to poor quality, delays in testing, and increased cost.

Defining the test approach for process and data validation early in the implementation life cycle ensures that defects are identified as soon as possible, reducing the overall cost and time to implementation of the finished system. Performing functional testing can identify data quality issues that originate in errors with coding or node configuration; effective test data and test environment management ensures that data from a variety of sources is of sufficient quality for accurate analysis and can be processed without error.

Apart from functional testing, non-functional testing (specifically performance and failover testing) plays a key role in ensuring the scalability of the process. Functional testing is performed to identify functional coding issues and requirements issues, while non-functional testing identifies performance bottlenecks and validates the non-functional requirements. The size of Big Data applications often makes it difficult or too costly to replicate the entire system in a test environment; a smaller environment must be created instead, but this introduces the risk that applications that run well in the smaller test environment behave differently in production.

Therefore, it is necessary that the system engineers are careful when building the test environment, as many of these concerns can be mitigated by carefully designing the system architecture. A proper systems architecture can help eliminate performance issues (such as an imbalance in input splits or redundant shuffle and sort), but, of course, this approach alone doesn't guarantee a system that performs well.

Even with a smaller test environment, the data volumes may still be huge, requiring hours to run a single test. Careful planning is required to exercise all paths with smaller data sets in a manner that fully verifies the application but allows each test to run in a short enough period of time to make repeated testing feasible.

The numerical stability of algorithms also becomes an issue when dealing with statistical or machine learning algorithms. Applications that run well with one dataset may abort unexpectedly or produce poor results when presented with a similar but poorly conditioned set of inputs. Verification of numerical stability is particularly important for customer-facing systems.

Two key areas of the testing problem are (1) establishing efficient test datasets and (2) the availability of Hadoop-centric testing tools (such as PigUnit, JUnit for Pig, Hive UDF testing, and BeeTest for Hive).

Testing should include all four phases described in the sections that follow. Data quality issues can manifest themselves at any of these stages.

4 Testing Methods, Tools and Reporting for Validation of Pre-Hadoop Processing

Big Data systems typically process a mix of structured data (such as point-of-sale transactions, call detail records, general ledger transactions, and call center transactions), unstructured data (such as user comments, doctors' notes, insurance claims descriptions and web logs) and semi-structured social media data (from sites like Twitter, Facebook, LinkedIn and Pinterest). Often the data is extracted from its source location and saved in its raw or a processed form in Hadoop or another Big Data database management system. Data is typically extracted from a variety of source systems and in varying file formats, e.g. relational tables, fixed-size records, flat files with delimiters (CSV), XML files, JSON and text files.

Most Big Data database management systems are designed to store data in its rawest form, creating what has come to be known as a "data lake," a largely undifferentiated collection of data as captured from the source. These DBMSs use an approach called "schema on read," i.e. the data is given a simple structure appropriate to the application as it is read, but very little structure is imposed during the loading phase.

The most important activity during data loading is to compare data to ensure extraction has happened correctly and to confirm that the data loaded into HDFS (the Hadoop Distributed File System) is a complete, accurate copy.
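A basic way to automate this completeness check is to reconcile record counts between the source extract and the copy landed in HDFS; checksums or key-level hashes can be added for stronger guarantees. The sketch below is a minimal illustration using the Hadoop FileSystem API, with hypothetical file paths and a hypothetical cluster URI.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: verify that an HDFS copy contains the same number of records
    // as the source flat file. Paths and the HDFS URI are placeholders.
    public class LoadCompletenessCheck {

        static long countLines(BufferedReader reader) throws Exception {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            reader.close();
            return count;
        }

        public static void main(String[] args) throws Exception {
            // Count records in the source extract on the local file system.
            long sourceCount = countLines(
                    new BufferedReader(new FileReader("/extracts/pos_transactions.csv")));

            // Count records in the copy that was loaded into HDFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            long hdfsCount = countLines(new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/data/incoming/pos_transactions.csv")))));
            fs.close();

            if (sourceCount != hdfsCount) {
                throw new AssertionError("Record count mismatch: source=" + sourceCount
                        + " hdfs=" + hdfsCount);
            }
            System.out.println("Load complete: " + sourceCount + " records verified.");
        }
    }

For very large files, the same reconciliation could instead be expressed as a Hive or MapReduce aggregate rather than a single-threaded read.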

Data sources can include a local file system, HDFS, Hive tables, streaming sources, and relational or other databases. If the data is loaded into Hive, it can be validated and transformed using HiveQL, as one can do in SQL. If not, a MapReduce or equivalent process (e.g. Spark) will be needed.

Typical tests include:

1. Data type validation. Data type validation is customarily carried out on one or more simple data fields. The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types as defined in a programming language or data storage and retrieval mechanism.

2. Range and constraint validation. Simple range and constraint validation may examine user input for consistency with a minimum/maximum range, or consistency with a test for evaluating a sequence of characters, such as one or more tests against regular expressions.

3. Code and cross-reference validation. Code and cross-reference validation includes tests for data type validation, combined with one or more operations to verify that the user-supplied data is consistent with one or more external rules, requirements or validity constraints relevant to a particular organization, context or set of underlying assumptions. These constraints may involve cross-referencing the supplied data with a known look-up table or directory information service such as LDAP.

4. Structured validation. Structured validation allows for the combination of any number of various basic data type validation steps, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
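As a simple illustration of these four kinds of checks, the sketch below expresses each one as a plain Java predicate. The field names, ranges, regular expression and reference values are hypothetical examples; in practice such rules would be derived from the data dictionaries of the source systems, and the predicates would be exercised from a JUnit, PigUnit or Hive UDF test harness against small, representative datasets.

    import java.util.Set;
    import java.util.regex.Pattern;

    // Minimal illustration of the four validation categories described above.
    // Field names, ranges and reference values are hypothetical examples.
    public class RecordValidator {

        // Hypothetical reference data for cross-reference validation
        // (in practice this might come from a look-up table or LDAP).
        private static final Set<String> VALID_REGION_CODES = Set.of("NA", "EU", "APAC");

        // Regular expression for a Canadian-style postal code, as an example constraint.
        private static final Pattern POSTAL_CODE =
                Pattern.compile("[A-Z]\\d[A-Z] ?\\d[A-Z]\\d");

        // 1. Data type validation: the field must parse as an integer.
        static boolean isInteger(String value) {
            try {
                Integer.parseInt(value);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }

        // 2. Range and constraint validation: numeric range plus a regex constraint.
        static boolean isValidAge(String value) {
            return isInteger(value)
                    && Integer.parseInt(value) >= 0
                    && Integer.parseInt(value) <= 130;
        }

        static boolean isValidPostalCode(String value) {
            return POSTAL_CODE.matcher(value).matches();
        }

        // 3. Code and cross-reference validation: value must exist in reference data.
        static boolean isKnownRegion(String value) {
            return VALID_REGION_CODES.contains(value);
        }

        // 4. Structured validation: basic field checks combined with a
        //    conditional constraint spanning the whole record.
        static boolean isValidRecord(String age, String postalCode, String region) {
            if (!isValidAge(age) || !isKnownRegion(region)) {
                return false;
            }
            // Conditional constraint: a postal code is only required for NA records.
            return !region.equals("NA") || isValidPostalCode(postalCode);
        }

        public static void main(String[] args) {
            // Example records: age, postal code, region code.
            System.out.println(isValidRecord("42", "M5V 2T6", "NA"));  // true
            System.out.println(isValidRecord("-7", "M5V 2T6", "NA"));  // false: range check fails
        }
    }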
HDFS can support up to 200 PB of storage and a single cluster of 4,500 servers, with close to a billion files and blocks. When HDFS ingests data, it splits the file into smaller pieces and distributes them across different nodes in a cluster. This splitting is key to Hadoop's ability to perform resilient parallel processing. For those transformations that can be parallelized, Hadoop will use all the processors across the cluster to perform the computation as quickly as possible.

One node of the cluster is reserved to serve as the NameNode, which knows the identifiers (names) of each of the other nodes in the cluster. The client (the device initiating a data load or other computation) asks the NameNode for the list of DataNodes, the servers where the data resides. When a client writes data, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a direct pipeline from the source server and sends the data. When the first block is filled, the client requests a DataNode to be chosen for the next block, most likely a different one than received the first block. A new pipeline is organized, and the client sends the next block of data from the file.

1. Checkpoint Node

The Checkpoint Node periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The Checkpoint Node usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.

Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal are unavailable. Creating a checkpoint also lets the NameNode truncate the journal when the new checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts, during which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of the journal file increases. Also, a very large journal extends the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.

2. Backup Node

The Backup Node can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations. It can perform all operations of the regular NameNode that do not involve modification of the namespace or knowledge of block locations. Use of a Backup Node provides the option of running the NameNode without persistent storage, delegating responsibility of persisting the namespace state to the Backup Node.

3. Upgrades and File System Snapshots

The snapshot (only one can exist) is created at the cluster administrator's discretion whenever the system is started. If a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the old checkpoint and journal remain unchanged.

During handshake the NameNode tells DataNodes whether to create a local snapshot. The local snapshot on the DataNode cannot be created by replicating the directories containing the data files, as this would require doubling the storage capacity of every DataNode on the cluster. Instead, each DataNode creates a copy of the storage directory and hard-links existing block files into it. When the DataNode removes a block, it removes only the hard link, and block modifications during appends use the copy-on-write technique. Thus old block replicas remain untouched in their old directories.

4.1 Tools for validating pre-Hadoop processing

The following tools are components of the Hadoop ecosystem and can be used to assist with validating pre-Hadoop processing.

1. Apache Flume

Apache Flume is a reliable service for efficiently transferring large volumes of streaming event data, such as logs, into HDFS. Its file channel stores all events on disk, so if the OS crashes or reboots, events that were not successfully delivered are not lost.
