BigData Environment - Policloud.polimi.it

Transcription

BigData environmentat PoliCloudInterdepartmental Research LaboratoryMario MarchenteDEIB - Politecnico di Milano23/09/2014

BigDataNowadays data world : isstructuredand(muchmore)unstructured s,websites,sensors,meters,andotherdata- ‐genera5ngmachines). ischaracterizedbygreatvolumes(tera/petabytes) ess.“Big Data is now almost universally understood to refer to therealization of greater business intelligence working with data thatwas previously ignored due to the limitations of traditional datamanagement technologies.” (Harness the Power of Big Data – 2012,McGrawHill)

What is BigDataNew data management and analytic technologies are beingimplemented to complement rather than replace traditionalapproaches to data management and analytics.

The MapReduce paradigmMapReduce is a framework for executing highly parallelizableand distributable algorithms across huge data sets using alarge number of commodity computers. Google built it andcalled this computing paradigm MapReduce, based on thefollowing "map" and "reduce" functional programmingparadigm:

The MapReduce paradigm Map In contrast with traditional relational database–orientedinformation— which organizes data into fairly rigid rows andcolumns that are stored in tables — MapReduce input datainto the map() function as key/value pairs. The map()function then produces one or more intermediate valuesalong with an output key from the input. Reduce - After the Map phase is over, all the intermediatevalues for a given output key are combined together into alist. The reduce() function then combines the intermediatevalues into one or more final values for the same key.

The MapReduce paradigmIn a typical MapReduce environment: Components will fail at a high rate: normally usedinexpensive, commodity hardware. Data will be contained in a relatively small number of bigfiles: each file will be 100 MB to several GB. Data files are write-once: but they can be appended. Lots of streaming reads: many threads accessing files at anygiven time. Higher sustained throughput across large amounts ofdata: MapReduce implementations work best when processconsistently and predictably colossal amounts of information across theentire environment,as opposed to achieving irregular sub-secondresponse on random instances.

Apache Hadoop

Apache HadoopA project to develop open-source software library frameworkfor reliable, scalable, distributed computing, that: s. achines,eachofferinglocalcomputa5onandstorage. deliversahighly- eapplica5onlayer.Hadoop cluster components are: Master node (few) with the following processes:– JobTracker: interacts with client applications. It is also responsiblefor distributing MapReduce tasks to particular nodes within a cluster.– TaskTracker: is capable, in the cluster, of receiving tasks (includingMap, Reduce, and Shuffle) from a JobTracker

Apache Hadoop― NameNode: : a) stores a directory tree of all files in the HDFS), andb) keeps track of where the file data is within the cluster. Clientapplications contact NameNodes when they need to locate a file, oradd, copy, or delete a file.– Data Nodes a) stores data in the HDFS, and b) is responsible forreplicating data across clusters. Interact with client applications, whenthe NameNode has supplied the DataNode’s address. Worker nodes (many) which provide enough processingpower to analyze a few hundred terabytes up to onepetabyte. Each worker node includes a DataNode or aTaskTracker, or both.

Apache Hadoop

BigData HW Configuration

BigData HW Configuration

BigData HW Configuration

BigData HW Configuration

BigData HW Configuration

BigData SW ConfigurationNow: IBM Infosphere BigInsights Enterpise Edition 2.1.0.1(General Parallel File System – File Placement Optimizer(GPFS-FPO), Hadoop, MapReduce) – Linux RedHat 6.5 Data Integration DB2 e Blu Accelator – AIXNext: Cognos / SPSS Ilog Cplex

IBM Infosphere BigInsights Enterpise Edition

IBM Infosphere BigInsights Enterpise EditionKey Areas Built-in analytics– Text analytics: a vast library of extractors enabling actionableinsights from large amounts of native textual data.– Social Data Analytics Accelerator: takes and processes largevolumes of social media data, yielding key insights that can beused to develop programs/applications (ex. customer retentionand acquisition, campaign effectiveness).– Machine Data Analytics Accelerator: provides the capability toingest and process large volumes of machine data sources,including IT machines, sensors, meters, GPS devices and more.– Big R: Big R enables the use of R as a query language toexplore, visualize, transform, and model big data right from theirR environment and without any explicit programming usingMapReduce or Jaql.

IBM Infosphere BigInsights Enterpise EditionKey Areas Usability– Big SQL: SQL on Hadoop. It provides a single point of accessand view across all big data.– BigSheets: Web-based analysis and visualization tool with aspreadsheet-like interface, featuring D3 graphs, enabling largeamounts of data analysis and supporting design andmanagement of data collection jobs.– Development Tools: Eclipse based development environmentfor building and deploying analytic applications.– Management Console: Auditing helps tighten security andaccess control, while monitoring provides the ability to control allapplications from a centralized dashboard.– GPFS-FPO: Provides POSIX compliant, enterprise-class bigdata distributed file system capabilities to the Hadoop andMapReduce environment.

IBM Infosphere BigInsights Enterpise EditionKey Areas Usability– Workload Optimization: Adaptive MapReduce adapts to userneeds and system workloads automatically to improveperformance and simplify job tuning, while workload schedulerprovides optimization and control of job scheduling based onuser-selected metrics. Enterprise integration– Big Data Integration: Codeless creation of data integration logicand jobs, reusable across the enterprise through ETL jobspowered by Information Server. Enable data governanceincluding data lineage, business rule and policy management,data quality.– Data Privacy for Hadoop: Mitigate risk with sensitive datadiscovery. Maintain an acceptable risk tolerance with datamonitoring, within source systems and on Hadoop itself.

IBM Infosphere BigInsights Enterpise EditionKey Areas Enterprise performance and integration– Big Data Integration: Codeless creation of data integration logicand jobs, reusable across the enterprise through ETL jobspowered by Information Server. Enable data governanceincluding data lineage, business rule and policy management,data quality.– Customer Identification in Hadoop: Enhance customeranalytics by establishing a unique identifier for customerinformation stored in Hadoop with Master Data Managementprobabilistic matching.– Data Privacy for Hadoop: Mitigate risk with sensitive datadiscovery. Maintain an acceptable risk tolerance with datamonitoring, within source systems and on Hadoop itself.

IBM Infosphere BigInsights Enterpise EditionThe Web Interface

IBM Infosphere BigInsights Enterpise EditionBigSheet Interface

IBM Infosphere BigInsights Enterpise EditionBigSQLIBM's SQL interface to its Hadoop-based platform, that: managedbyHadoop. usdatasources. ouseitforqueries. structandarray,ratherthansimply"flat"rows).

IBM Infosphere BigInsights Enterpise EditionBigSQLBigSQL supports several underlying storage mechanismsas: Delimitedfiles(suchascomma- ‐separatedfiles)storedinHDFSorGPFS- ‐FPO enta5on) HBasetables(HBaseisHadoop'skey- ‐valueorcolumn- ‐baseddatastore)

IBM Infosphere BigInsights Enterpise EditionBigSQL

IBM Infosphere BigInsights Enterpise EditionBigSQL Web Interface

IBM Infosphere BigInsights Enterpise EditionBigSQLA table em(L ORDERKEYINT,L PARTKEYINT,L SUPPKEYINT,L LINENUMBERINT,L QUANTITYDOUBLE,L EXTENDEDPRICEDOUBLE,L DISCOUNTDOUBLE,L TAXDOUBLE,L RETURNFLAGSTRING,L LINESTATUSSTRING,L SHIPDATESTRING,L COMMITDATESTRING,L RECEIPTDATESTRING,L SHIPINSTRUCTSTRING,L SHIPMODESTRING,L ' 'STOREDASTEXTFILE;

IBM Infosphere BigInsights Enterpise EditionApache HBaseA column-oriented not relational database management system(*)modeled after Google's Bigtable (a distributed storage system forstructured data designed to scale to a very large size: petabytes of data acrossthousands of commodity servers) on top of Hadoop and HDFS: bigdatausecases. doesnotsupportastructuredquerylanguagelikeSQL; oconductupdates,insertsanddeletes eapplica5on. vroorThricgatewayAPIs

IBM Infosphere BigInsights Enterpise EditionApache HBaseHBase: daryindexes,triggers,andadvancedquerylanguages). stedoncommodityclassservers). randTaskTrackerslaves)opera5veconfigura5ons).

IBM Infosphere BigInsights Enterpise EditionApache HBaseConsider an HBase table moreas a multi-dimensional map,than a traditional table with rowsand columns.

IBM Infosphere BigInsights Enterpise EditionApache HiveA data warehouse infrastructure (*) built on top of Hadoop, towork with large datasets residing in distributed storage. Itprovides: oad(ETL)opera5on; heHDFS)andquerythedatatobeanalyzedusingaSQL- ‐likelanguagecalledHiveQL ormances.

IBM Infosphere BigInsights Enterpise EditionApache Hive

IBM Infosphere BigInsights Enterpise EditionApache length;

IBM Infosphere BigInsights Enterpise EditionApache PigA high-level Hadoop programming language for data analysisprograms that provides: adata- ‐flowlanguage anexecu5onframeworkforparallelcomputa5onPig programs structure is suited for substantial parallelization,enabling them to handle very large data sets.Actually, Pig’s infrastructure layer is based on a compiler thatproduces sequences of Map-Reduce programs, for which largescale parallel implementations already exists.

IBM Infosphere BigInsights Enterpise EditionApache PigPig's language layer currently consists of a textual languagecalled Pig Latin (*), characterized by the following features: Easeofprogrammingparalleldataanalysistasks. ng. Extensibilitythroughfunc5onsdevelopedforspecial- ‐purposeprocessing.Pig supports running scripts (and Jar files) that are stored inHDFS, Amazon S3, and other distributed file systems.

IBM Infosphere BigInsights Enterpise EditionApache PigPig runs (execute Pig Latin statements and Pig commands)using various modes.a) Interactive Mode (both Local/ Mapreduce Mode) via ashell called grunt invoked by command pig .or java- ‐cppig.jarorg.apache.pig.Main b) Batch Mode (both Local/ Mapreduce Mode) calling a pigscript (file.pig) with command pigfile.pig

IBM Infosphere BigInsights Enterpise EditionApache ank/java/piggybank.jar;records ouped tring.LENGTH(word);final ;DUMPfinal;

IBM Infosphere BigInsights Enterpise EditionEclipse SDK

IBM Infosphere BigInsights Enterpise EditionEclipse SDK – Text Analytics

DataStudio – IBM DB2 Visual Client

DataStudio – IBM DB2 Visual Client

The MapReduce paradigm In a typical MapReduce environment: Components will fail at a high rate: normally used inexpensive, commodity hardware. Data will be contained in a relatively small number of big files: each file will be 100 MB to several GB. Data files are write-once: but they can be appended. Lots of streaming reads: many threads accessing files at any