The Open Source HPCC Systems Platform

Transcription

Welcome

Welcome attendees! ALE Meetup: The Open Source HPCC Systems Platform

Agenda
7:30-7:45pm: Welcome, Greetings & Announcements
7:45-8:45pm: HPCC Systems Architecture Overview & Demo
8:45-9:15pm: Q&A / Open discussion & Nook give-away
9:15-9:25pm: Linux "Help Desk" Switchboard
9:30 – lights out: Wrap-up

Twitter event hashtag: #hpccmeetup
hpccsystems.com

Linux Trivia and memorabilia (can we call it "Livia"?)

From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Summary: small poll for my new operating system
Message-ID:
Date: 25 Aug 91 20:57:08 GMT
Organization: University of Helsinki

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I'll get something practical within a few months, and I'd like to know what features most people would want. Any suggestions are welcome, but I won't promise I'll implement them :-)

Linus (torvalds@kruuna.helsinki.fi)

PS. Yes - it's free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.

What is HPCC Systems?
HPCC is a massive parallel-processing computing platform.
[Architecture diagram; visible label: ESP]

The HPCC cluster computing paradigm and an efficient data-centric programming language are key factors in our company's success.

"Grid" Computing: Splits problems into pieces to be worked in parallel by commodity servers
Data-centric language (ECL): "Big Data" language brings the computing to the data
Integrated Delivery System: Consistent approach across data ingestion, processing, and delivery systems

The Three Main HPCC Components

1. HPCC Data Refinery (Thor)
- Massively Parallel Extract Transform and Load (ETL) engine
- Built from the ground up as a parallel data environment. Leverages inexpensive locally attached storage. Doesn't require a SAN infrastructure.
- Enables data integration on a scale not previously available: the current LexisNexis person data build process generates 350 billion intermediate results at peak
- Suitable for: massive joins/merges, massive sorts & transformations
- Programmable using ECL

2. HPCC Data Delivery Engine (Roxie)
- A massively parallel, high throughput, structured query response engine
- Ultra fast, low latency and highly available due to its read-only nature
- Allows indices to be built onto data for efficient multi-user retrieval of data
- Suitable for: volumes of structured queries, full text ranked Boolean search
- Programmable using ECL

3. Enterprise Control Language (ECL)
- An easy to use, data-centric programming language optimized for large-scale data management and query processing
- Highly efficient; automatically distributes workload across all nodes
- Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
- Large library of efficient modules to handle common data manipulation tasks
- No need for any third party tools (Cassandra, GreenPlum, MongoDB, RDBMS, Oozie, Pig, etc.)

Conclusion: End to end solution
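Both Thor and Roxie are programmed in ECL. As a rough illustration only (the record layout, logical file names and field names below are invented for this sketch, not part of the original slides), a Thor-style ETL step might look like this:

    IMPORT STD;

    PersonRec := RECORD
      STRING20  firstname;
      STRING25  lastname;
      UNSIGNED4 dob;
    END;

    // Read a distributed logical file that already lives on the Thor cluster
    people := DATASET('~tutorial::raw::people', PersonRec, THOR);

    // Normalize case with a TRANSFORM applied record-by-record via PROJECT
    PersonRec Clean(PersonRec L) := TRANSFORM
      SELF.firstname := STD.Str.ToUpperCase(L.firstname);
      SELF.lastname  := STD.Str.ToUpperCase(L.lastname);
      SELF := L;
    END;

    cleaned := PROJECT(people, Clean(LEFT));

    // Sort and drop exact duplicates, then write the result back to the cluster
    deduped := DEDUP(SORT(cleaned, lastname, firstname, dob),
                     lastname, firstname, dob);

    OUTPUT(deduped, , '~tutorial::cleaned::people', OVERWRITE);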

BI and Analytics
[Workflow diagram; visible labels: Data Sources, Structured Data, RDBMS, DW, Big Data Processing, BI and Analytics, Workflow]

Pentaho Spoon

Kettle – ECL General Functions
- ITERATE
- DEDUP
- TABLE
- DISTRIBUTE
- ROLLUP
- PARSE
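To give a feel for a few of these primitives, here is a small hedged ECL sketch; the sample records and field names are invented for illustration:

    SaleRec := RECORD
      STRING2   state;
      STRING20  product;
      UNSIGNED4 amount;
    END;

    sales := DATASET([{'GA','widget',100},{'GA','widget',250},
                      {'FL','gadget',300},{'GA','gadget',175}], SaleRec);

    // TABLE: crosstab-style aggregation grouped by state
    byState := TABLE(sales,
                     {state,
                      UNSIGNED4 total := SUM(GROUP, amount),
                      UNSIGNED4 cnt   := COUNT(GROUP)},
                     state);

    // DEDUP: drop adjacent duplicates after a SORT
    uniqueProducts := DEDUP(SORT(sales, product), product);

    // ROLLUP: merge adjacent matching records with a TRANSFORM
    SaleRec DoRollup(SaleRec L, SaleRec R) := TRANSFORM
      SELF.amount := L.amount + R.amount;
      SELF := L;
    END;
    byStateRolled := ROLLUP(SORT(sales, state),
                            LEFT.state = RIGHT.state,
                            DoRollup(LEFT, RIGHT));

    OUTPUT(byState);
    OUTPUT(uniqueProducts);
    OUTPUT(byStateRolled);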

Kettle – ECL ML Functions
- Clustering: K-Means, Agglomerative
- Linear Regression: OLS
- Classifiers: Logistic Regression, Naïve Bayes, Perceptron
- Correlations: Pearson and Spearman, Kendall's Tau
- Document Parsing: N-Gram Extraction, Histograms, Frequent pattern mining
- Statistics: Random data generators, PCA
- Discretizers: Rounding, bucketing, tiling
- Linear algebra library: Distributed matrix operations, SVD

JDBC
- Standard support for SQL queries
- Filters
- Aggregation
- Joins
- Optimized by available indexed fields

Reporting out of BIRT using HPCC JDBC (numbers are fictitious, of course!)

Core Benefits

Speed
- A simple technology stack results in more effective resource leveraging
- Faster scoring
- Better classification and correlation

Better Analytics
- More efficient use of data and storage
- Higher productivity across data analysts
- Innovative products
- Increased precision and recall
- An online Data Warehouse

A Single Data Store
- Fewer required skill sets means more flexibility in staffing projects
- We no longer have silos of specialized expertise that can't be shared

Capacity
- Scales to extreme workloads quickly and easily
- Increased speed of development leads to faster production/delivery
- Improved developer productivity
- Enables massive joins, merges, sorts, transformations

What is HPCC Systems?
HPCC is a massive parallel-processing computing platform.
[Architecture diagram; visible label: ESP]

Hadoop: Many platforms, fragmented, heterogeneous: Complex & Expensive

Comparison between the HPCC Thor and Hadoop stacks
[Diagram: HPCC Thor stack vs. Hadoop stack]

Appendix

ECL: Enterprise Control Language
- Declarative programming language: Describe what needs to be done and not how to do it
- Powerful: Unlike Java, high level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher level code means fewer programmers and shorter time to delivery
- Extensible: As new attributes are defined, they become primitives that other programmers can use
- Implicitly parallel: Parallelism is built into the underlying platform; the programmer need not be concerned with it
- Maintainable: A high level programming language, no side effects and attribute encapsulation provide for more succinct, reliable and easier-to-troubleshoot code
- Complete: Unlike Pig and Hive, ECL provides a complete programming paradigm
- Homogeneous: One language to express data algorithms across the entire HPCC platform, including data ETL and high speed data delivery
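As a minimal sketch of this declarative, attribute-based style (all record layouts and data values below are hypothetical):

    NameRec := RECORD
      UNSIGNED4 id;
      STRING25  lastname;
    END;

    AddrRec := RECORD
      UNSIGNED4 id;
      STRING40  city;
    END;

    names := DATASET([{1,'SMITH'},{2,'JONES'}], NameRec);
    addrs := DATASET([{1,'ATLANTA'},{2,'BOCA RATON'}], AddrRec);

    CombinedRec := RECORD
      UNSIGNED4 id;
      STRING25  lastname;
      STRING40  city;
    END;

    // An attribute: once defined, other programmers can reuse it as a primitive
    CombinedRec JoinThem(NameRec L, AddrRec R) := TRANSFORM
      SELF.city := R.city;
      SELF := L;
    END;

    // Declarative dataflow: what to do, not how; the platform parallelizes it
    combined := JOIN(names, addrs, LEFT.id = RIGHT.id, JoinThem(LEFT, RIGHT));

    OUTPUT(SORT(combined, lastname));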

Roxie Delivery Engine
- Low latency: Data queries are typically completed in fractions of a second
- Not a key-value store: Unlike HBase, Cassandra and others, Roxie is not limited by the constraints of key-value data stores, allowing for complex queries, multi-key retrieval and fuzzy matching
- Highly available: Roxie is designed to operate in critical environments, under the most rigorous service level requirements
- Scalable: Horizontal linear scalability provides room to accommodate future data and performance growth
- Highly concurrent: In a typical environment, thousands of concurrent clients can be simultaneously executing transactions on the same Roxie system
- Redundant: A shared-nothing architecture with no single point of failure provides extreme fault tolerance
- ECL inside: One language to describe both the data transformations in Thor and the data delivery strategies in Roxie
- Consistent tools: Thor and Roxie share the exact same set of tools, which provides consistency across the platform
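A rough sketch of what "ECL inside" means on the Roxie side, assuming a payload index built earlier on Thor; the file names, key fields and query parameter are hypothetical:

    PersonRec := RECORD
      STRING25 lastname;
      STRING20 firstname;
      STRING40 city;
    END;

    people := DATASET('~thor::people', PersonRec, THOR);

    // Payload index: built on Thor, deployed to and read by Roxie
    personIdx := INDEX(people, {lastname, firstname}, {city},
                       '~key::people::byname');

    // Query parameter supplied by the calling client (SOAP/JSON/JDBC, etc.)
    STRING25 searchName := '' : STORED('LastName');

    // The published query: an indexed, keyed lookup returned to the caller
    OUTPUT(personIdx(KEYED(lastname = searchName)));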

Data Model
HPCC supports a flexible data model: the data model is defined by the data analyst in whatever way best fits the organizational skills, the data at hand and/or the process that needs to happen.

Processing Model
HPCC supports a flexible, dataflow oriented model: the complexities of parallel processing and of the distributed platform are abstracted behind high level data primitives.

Programming Model
Data flows and data queries in HPCC are programmed in ECL: a complete, high level, declarative, dataflow oriented language created for readability, extensibility and code/data reuse.
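As a small sketch of the code/data reuse mentioned here (the module, file and attribute names are hypothetical): attributes exported from an ECL module become building blocks that other jobs can reuse.

    // PeopleLib.ecl -- attributes exported here become primitives for other jobs
    EXPORT PeopleLib := MODULE

      EXPORT Layout := RECORD
        UNSIGNED4 id;
        STRING25  lastname;
      END;

      // The shared logical file, described once and reused everywhere
      EXPORT File := DATASET('~thor::people', Layout, THOR);

      // A reusable, parameterized filter
      EXPORT ByLastName(STRING25 name) := File(lastname = name);

    END;

Another job could then simply write IMPORT PeopleLib; OUTPUT(PeopleLib.ByLastName('SMITH')); without redefining any of it.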

Beyond MapReduce
- Open Data Model: Unlike Hadoop, the data model is defined by the user, and is not constrained by the limitations of a strict key-value paradigm
- Simple: Unlike Hadoop MapReduce, solutions to complex data problems can be expressed easily and directly in terms of high level ECL primitives. With Hadoop, creating MapReduce solutions to all but the simplest data problems can be a daunting task. Many of these complexities are eliminated by the HPCC programming model
- Truly parallel: Unlike Hadoop, nodes of a data graph can be processed in parallel as data seamlessly flows through them. In Hadoop MapReduce (Java, Pig, Hive, Cascading, etc.) almost every complex data transformation requires a series of MapReduce cycles; each phase of these cycles cannot start until the previous phase has completed for every record, which contributes to the well-known "long tail problem" in Hadoop. HPCC avoids this, which results in higher and more predictable performance
- Powerful optimizer: The HPCC optimizer ensures that submitted ECL code is executed at the maximum possible speed for the underlying hardware. Advanced techniques such as lazy execution and code reordering are thoroughly utilized to maximize performance
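To make the "truly parallel" point concrete, here is a hedged sketch of a multi-step ECL dataflow (all names invented); the compiler turns the whole sequence into a single execution graph rather than a chain of MapReduce-style cycles:

    LogRec := RECORD
      STRING15  ip;
      STRING40  url;
      UNSIGNED4 bytes;
    END;

    logs := DATASET('~thor::raw::weblogs', LogRec, THOR);

    // Spread records across the cluster by a hash of the key field
    spread := DISTRIBUTE(logs, HASH32(ip));

    // Sort and dedup locally within each node's partition
    deduped := DEDUP(SORT(spread, ip, url, LOCAL), ip, url, LOCAL);

    // Aggregate bytes per ip, still local because of the DISTRIBUTE above
    byIp := TABLE(deduped, {ip, UNSIGNED8 totalBytes := SUM(GROUP, bytes)},
                  ip, LOCAL);

    // One action; the optimizer schedules the whole graph end to end
    OUTPUT(SORT(byIp, -totalBytes));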

Enterprise Ready
- Batteries included: All components are included in a consistent and homogeneous platform – a single configuration tool, a complete management system, seamless integration with existing enterprise monitoring systems, and all the documentation needed to operate the environment is part of the package
- Backed by over 10 years of experience: The HPCC platform is the technology underpinning LexisNexis data offerings, serving multi-billion dollar, critical 24/7 business environments with the strictest SLAs. In use by the US Government and in commercial settings for critical operations
- Fewer moving parts: Unlike Hadoop, HPCC is an integrated solution extending across the entire data lifecycle, from data ingest and data processing to data delivery. No third party tools are needed
- Multiple data types: Supported out of the box, including fixed and variable length delimited records and XML

"To boldly go where no open source data intensive platform has gone before"

The power of the "what if"

Imagine you had a platform designed for [Big Data] Data Analysts from the ground up? Where organizations already invested in Data Analyst (BI) talent no longer have to worry about reinvesting in Java developers.

Or perhaps where new talent can be selected on the basis of data knowledge and creativity, rather than upon coding experience in a systems programming language. Java developers, by nature, are not necessarily BI or Big Data developers. And conversely, Data Analysts are not Java developers.

The HPCC Systems platform's ECL programming language was designed by data scientists for [Big Data] Data Analysts. It is made for people who think in terms of the "what if" rather than the "how".

Specifically, ECL is a declarative, data centric, distributed processing language for Big Data. The declarative nature of the ECL language lets the analyst focus on finding the right question instead of the steps required to answer it. Interactive, iterative "what if" scenario running helps the data analyst pinpoint and formulate the correct questions faster.

ECL is to Big Data what SQL is to RDBMS. Good SQL developers will be able to quickly adapt to developing in ECL. Organizations investing in the HPCC Systems platform can reuse their existing BI staff to help solve Big Data problems. No need for an additional layer of (Java) developers.

Q&A

Thank You

Web: http://hpccsystems.com
Email: info@hpccsystems.com
Contact us: 877.316.9669
