Pentaho - HPCC

Transcription

Welcome
HPCC Systems Architecture Overview & Pentaho Spoon/Kettle Integration
Monday, March 12, 2012, 7:00 pm EST

Logistics
Twitter hashtags: #cfbi & #hpccmeetup

Agenda
1. High-level architecture of the HPCC platform: 20 minutes
2. Pentaho plugin integration demo: 20 minutes
3. Q&A (raise your hand): 20 minutes

Presenters
Flavio Villanustre & Arjuna Chala
http://hpccsystems.com

What is HPCC Systems?
HPCC is a massively parallel-processing computing platform.
(Slide shows the platform architecture diagram, including ESP.)

The HPCC cluster computing paradigm and an efficient data-centric programming language are key factors in our company's success.

"Grid" computing: splits problems into pieces to be worked on in parallel by commodity servers.
Data-centric language (ECL): a "Big Data" language that brings the computing to the data.
Integrated delivery system: a consistent approach across data ingestion, processing, and delivery systems.

The three main HPCC components

1. HPCC Data Refinery (Thor)
- Massively parallel Extract, Transform and Load (ETL) engine
- Built from the ground up as a parallel data environment; leverages inexpensive locally attached storage and does not require a SAN infrastructure
- Enables data integration on a scale not previously available: the current LexisNexis person data build process generates 350 billion intermediate results at peak
- Suitable for massive joins/merges and massive sorts and transformations
- Programmable using ECL

2. HPCC Data Delivery Engine (Roxie)
- A massively parallel, high-throughput, structured query response engine
- Ultra-fast, low-latency and highly available due to its read-only nature
- Allows indices to be built onto data for efficient multi-user retrieval of data
- Suitable for volumes of structured queries and full-text ranked Boolean search
- Programmable using ECL

3. Enterprise Control Language (ECL)
- An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
- Highly efficient; automatically distributes workload across all nodes
- Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
- Large library of efficient modules to handle common data manipulation tasks

Conclusion: an end-to-end solution, with no need for any third-party tools (Cassandra, GreenPlum, MongoDB, RDBMS, Oozie, Pig, etc.)
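To make the Thor description concrete, here is a minimal ECL sketch of the kind of parallel join-and-sort job Thor runs. The logical file names, record layouts, and field names are assumptions for illustration only.

// Record layouts for two hypothetical input files (illustrative only).
PersonRec := RECORD
    UNSIGNED4 personid;
    STRING20  lastname;
    STRING20  firstname;
END;

AddressRec := RECORD
    UNSIGNED4 personid;
    STRING40  city;
    STRING2   state;
END;

// Logical files assumed to already exist on the Thor cluster.
persons   := DATASET('~demo::persons', PersonRec, THOR);
addresses := DATASET('~demo::addresses', AddressRec, THOR);

// Output layout: person fields plus the joined address fields.
CombinedRec := RECORD
    PersonRec;
    STRING40 city;
    STRING2  state;
END;

CombinedRec JoinThem(PersonRec L, AddressRec R) := TRANSFORM
    SELF.city  := R.city;
    SELF.state := R.state;
    SELF := L;          // copy the remaining fields from the left record
END;

// Massive join and sort: the platform parallelizes both across all nodes.
joined := JOIN(persons, addresses,
               LEFT.personid = RIGHT.personid,
               JoinThem(LEFT, RIGHT));
sorted := SORT(joined, state, city, lastname);

OUTPUT(sorted, , '~demo::persons_with_address', OVERWRITE);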

BI and Analytics
(Slide diagram: structured data sources such as an RDBMS and a data warehouse flow through workflow and Big Data processing stages into BI and analytics for the business.)

Pentaho Spoon

Kettle – ECL General Functions
- DISTRIBUTE
- ROLLUP
- PARSE
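As a hedged sketch of what two of these primitives do (the record layout and logical file name are assumptions for illustration), DISTRIBUTE spreads records across the cluster by a hash of a key, and ROLLUP then combines adjacent matching records:

// Hypothetical transaction records (layout assumed for illustration).
TxRec := RECORD
    STRING20    account;
    DECIMAL10_2 amount;
END;

txs := DATASET('~demo::transactions', TxRec, THOR);

// DISTRIBUTE: place all records with the same account on the same node.
byAccount := DISTRIBUTE(txs, HASH32(account));

// Sort locally on each node, then ROLLUP adjacent records per account,
// summing their amounts into a single record per account.
TxRec SumThem(TxRec L, TxRec R) := TRANSFORM
    SELF.amount := L.amount + R.amount;
    SELF := L;
END;

totals := ROLLUP(SORT(byAccount, account, LOCAL),
                 LEFT.account = RIGHT.account,
                 SumThem(LEFT, RIGHT),
                 LOCAL);

OUTPUT(totals);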

Kettle – ECL ML Functions
- Clustering: K-Means, Agglomerative
- Linear regression: OLS
- Classifiers: Logistic Regression, Naïve Bayes, Perceptron
- Correlations: Pearson and Spearman, Kendall's Tau
- Document parsing: N-Gram extraction, histograms, frequent pattern mining
- Statistics: random data generators, PCA
- Discretizers: rounding, bucketing, tiling
- Linear algebra library: distributed matrix operations, SVD
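The functions above are exposed from the HPCC machine-learning library. As a small hedged illustration of the statistics end of that list, the sketch below uses ECL's built-in CORRELATION, COVARIANCE and VARIANCE aggregates rather than the library modules themselves; the record layout and file name are assumptions.

// Hypothetical paired observations (layout and file name assumed).
ObsRec := RECORD
    REAL8 x;
    REAL8 y;
END;

obs := DATASET('~demo::observations', ObsRec, THOR);

// Built-in aggregates are evaluated in parallel across the cluster.
pearson := CORRELATION(obs, x, y);   // Pearson correlation coefficient
cov     := COVARIANCE(obs, x, y);
varX    := VARIANCE(obs, x);

OUTPUT(pearson, NAMED('PearsonXY'));
OUTPUT(cov,     NAMED('CovarianceXY'));
OUTPUT(varX,    NAMED('VarianceX'));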

JDBC
- Standard support for SQL queries
- Filters
- Aggregation
- Joins
- Optimized by available indexed fields

Core Benefits
A simple technology stack results in more effective resource leveraging. Fewer required skill sets mean more flexibility in staffing projects, and we no longer have silos of specialized expertise that can't be shared.

A single data store:
- More efficient use of data and storage
- An online data warehouse

Speed:
- Faster scoring
- Increased speed of development leads to faster production/delivery
- Improved developer productivity

Better analytics:
- Better classification and correlation
- Higher productivity across data analysts
- Innovative products
- Increased precision and recall

Capacity:
- Scales to extreme workloads quickly and easily
- Enables massive joins, merges, sorts and transformations

Appendix

ECL: Enterprise Control Language
- Declarative programming language: describe what needs to be done, not how to do it
- Powerful: unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher-level code means fewer programmers and shorter time to delivery
- Extensible: as new attributes are defined, they become primitives that other programmers can use
- Implicitly parallel: parallelism is built into the underlying platform; the programmer need not be concerned with it
- Maintainable: a high-level programming language, no side effects and attribute encapsulation provide for more succinct, reliable and easier-to-troubleshoot code
- Complete: unlike Pig and Hive, ECL provides a complete programming paradigm
- Homogeneous: one language to express data algorithms across the entire HPCC platform, including data ETL and high-speed data delivery
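A short hedged sketch of the declarative, attribute-based style described above (the dataset, layout and field names are assumptions for illustration). Each definition is an attribute that later code can reuse like a built-in primitive; the platform decides how to execute and parallelize it.

// A reusable record layout and dataset attribute (file name assumed).
PersonRec := RECORD
    STRING20  firstname;
    STRING20  lastname;
    UNSIGNED2 age;
END;

persons := DATASET('~demo::persons', PersonRec, THOR);

// Declarative attributes: we state WHAT we want, not HOW to compute it.
adults := persons(age >= 18);                 // filter
byName := SORT(adults, lastname, firstname);  // ordering

// Extensibility: a new attribute built from the ones above.
GreetRec := RECORD
    PersonRec;
    STRING40 greeting;
END;

GreetRec MakeGreeting(PersonRec L) := TRANSFORM
    SELF.greeting := 'Hello ' + TRIM(L.firstname);
    SELF := L;
END;

withGreeting := PROJECT(byName, MakeGreeting(LEFT));

// Other ECL code can now reuse withGreeting as if it were a primitive.
OUTPUT(CHOOSEN(withGreeting, 100));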

Roxie Delivery Engine
- Low latency: data queries are typically completed in fractions of a second
- Not a key-value store: unlike HBase, Cassandra and others, Roxie is not limited by the constraints of key-value data stores, allowing for complex queries, multi-key retrieval and fuzzy matching
- Highly available: Roxie is designed to operate in critical environments, under the most rigorous service-level requirements
- Scalable: horizontal linear scalability provides room to accommodate future data and performance growth
- Highly concurrent: in a typical environment, thousands of concurrent clients can simultaneously execute transactions on the same Roxie system
- Redundant: a shared-nothing architecture with no single points of failure provides extreme fault tolerance
- ECL inside: one language to describe both the data transformations in Thor and the data delivery strategies in Roxie
- Consistent tools: Thor and Roxie share the exact same set of tools, which provides consistency across the platform
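A hedged sketch of what a Roxie-style query can look like in ECL (the index name, layouts and parameter name are assumptions for illustration): a STORED definition receives the client's parameter at run time, and a keyed INDEX lookup returns the matching rows with low latency.

// Layout of the base data (illustrative).
PersonRec := RECORD
    STRING20  lastname;
    STRING20  firstname;
    UNSIGNED4 personid;
END;

persons := DATASET('~demo::persons', PersonRec, THOR);

// Index assumed to have been built previously on Thor with BUILD.
personIdx := INDEX(persons, {lastname}, {firstname, personid},
                   '~demo::key::person_by_lastname');

// Query parameter supplied by the client (SOAP/JSON/HTTP) at run time.
STRING20 searchName := '' : STORED('LastName');

// Keyed, low-latency lookup; LIMIT guards against runaway result sets.
results := LIMIT(personIdx(KEYED(lastname = searchName)), 1000);

OUTPUT(results);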

Data Model
HPCC supports a flexible data model: the data model is defined by the data analyst in whatever way best fits the organizational skills, the data at hand and/or the process that needs to happen.
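As a hedged sketch of that flexibility (all names assumed for illustration), ECL record layouts are not restricted to flat key-value rows; a parent record can embed a nested child dataset directly:

// Child rows nested inside the parent record (illustrative layout).
OrderRec := RECORD
    UNSIGNED4   orderid;
    DECIMAL10_2 total;
END;

CustomerRec := RECORD
    UNSIGNED4 custid;
    STRING40  name;
    DATASET(OrderRec) orders;   // nested child dataset
END;

customers := DATASET('~demo::customers', CustomerRec, THOR);

// Queries can work directly on the nested structure.
bigSpenders := customers(SUM(orders, total) > 10000);
OUTPUT(bigSpenders);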

Processing Model
HPCC supports a flexible dataflow-oriented model: the complexities of parallel processing and of the distributed platform are abstracted behind high-level data primitives.
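A hedged sketch of the dataflow idea (names assumed for illustration): each definition below is a node in the execution graph, and independent branches, such as the two outputs, can be evaluated in parallel by the platform without any explicit threading code.

// Hypothetical web log records (layout and file name assumed).
LogRec := RECORD
    STRING15  ip;
    STRING10  status;
    UNSIGNED4 bytes;
END;

logs := DATASET('~demo::weblogs', LogRec, THOR);

// Two independent branches of the dataflow graph.
errors := logs(status[1] = '5');                                    // 5xx responses
byIp   := TABLE(logs, {logs.ip, UNSIGNED8 hits := COUNT(GROUP)}, logs.ip);

// Downstream nodes consume the branches; the optimizer schedules them.
topTalkers := TOPN(byIp, 10, -hits);

OUTPUT(COUNT(errors), NAMED('ErrorCount'));
OUTPUT(topTalkers,    NAMED('TopTalkers'));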

Programming Model
Data flows and data queries in HPCC are programmed in ECL: a complete, high-level, declarative, dataflow-oriented language created for readability, extensibility and code/data reuse.

Beyond MapReduce
- Open data model: unlike Hadoop, the data model is defined by the user and is not constrained by the limitations of a strict key-value paradigm
- Simple: unlike Hadoop MapReduce, solutions to complex data problems can be expressed easily and directly in terms of high-level ECL primitives. With Hadoop, creating MapReduce solutions to all but the simplest data problems can be a daunting task. Many of these complexities are eliminated by the HPCC programming model
- Truly parallel: unlike Hadoop, nodes of a data graph can be processed in parallel as data seamlessly flows through them. In Hadoop MapReduce (Java, Pig, Hive, Cascading, etc.) almost every complex data transformation requires a series of MapReduce cycles; each phase of these cycles cannot start until the previous phase has completed for every record, which contributes to the well-known "long tail problem" in Hadoop. HPCC avoids this, resulting in higher and more predictable performance
- Powerful optimizer: the HPCC optimizer ensures that submitted ECL code is executed at the maximum possible speed for the underlying hardware. Advanced techniques such as lazy execution and code reordering are thoroughly utilized to maximize performance

Enterprise Ready
- Batteries included: all components are included in a consistent and homogeneous platform. A single configuration tool, a complete management system, seamless integration with existing enterprise monitoring systems, and all the documentation needed to operate the environment are part of the package
- Backed by over 10 years of experience: the HPCC platform is the technology underpinning LexisNexis data offerings, serving multi-billion-dollar, critical 24/7 business environments with the strictest SLAs. In use by the US Government and in commercial settings for critical operations
- Fewer moving parts: unlike Hadoop, HPCC is an integrated solution extending across the entire data lifecycle, from data ingest and data processing to data delivery. No third-party tools are needed
- Multiple data types: supported out of the box, including fixed- and variable-length delimited records and XML

"To boldly go where no open source data intensive platform has gone before"

The power of "what if"

Imagine you had a platform designed for [Big Data] Data Analysts from the ground up, where organizations already invested in Data Analyst (BI) talent no longer have to worry about reinvesting in Java developers. Or where new talent can be selected on the basis of data knowledge and creativity rather than coding experience in a systems programming language. Java developers, by nature, are not necessarily BI or Big Data developers, and conversely, Data Analysts are not Java developers.

The HPCC Systems platform's ECL programming language was designed by data scientists for [Big Data] Data Analysts. It is made for people who think in terms of the "what if" rather than the "how".

Specifically, ECL is a declarative, data-centric, distributed processing language for Big Data. The declarative nature of the ECL language lets the analyst focus on finding the right question instead of the steps required to answer it. Interactive, iterative "what if" scenario running helps the data analyst pinpoint and formulate the correct questions faster.

ECL is to Big Data what SQL is to the RDBMS. Good SQL developers will be able to quickly adapt to developing in ECL. Organizations investing in the HPCC Systems platform can reuse their existing BI staff to help solve Big Data problems, with no need for an additional layer of (Java) developers.

ConclusionQ&AThank YouWeb: http://hpccsystems.comEmail : info@hpccsystems.comContact us: 877.316.9669Risk Solutions21
