Simplifying The Process Of Uploading And Extracting Data

Transcription

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop
Rohit Bakhshi, Solution Architect, Hortonworks
Jim Walker, Director Product Marketing, Talend
Hortonworks Inc. 2012

About Us
Rohit Bakhshi
- Solution Architect at Hortonworks
- Experience: Hadoop in enterprise architecture
- Building advanced analytical applications
- Enjoys live jazz and drinking espresso
Jim Walker
- Director Product Marketing at Talend
- Experience: 10 years as developer, 10 as marketer
- Computer Security, DQ, MDM, Big data
- Is a bit of a foodie and enjoys baseball (White Sox)

Agenda
- Introduction
- Impact Big Data in the Enterprise
- Hortonworks Data Platform
- Talend Overview
- Demo
- Q&A

Hortonworks Vision
We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
How to achieve that vision?
Enable ecosystem around enterprise-viable open source data platform.

Hortonworks Strategic Focus
Enable Hadoop to be next-generation enterprise data platform
Lead within Hadoop Community
- Engineering team that delivered every major Hadoop release since 0.1
- Experience managing world’s largest deployment
- Ongoing access to Y!’s 1,000 users and 40k nodes for testing, QA, etc.
Unify & Enable Hadoop Ecosystem
- Provide 100% open source product
Expert Role-based Training
- Empower customers and partners to overcome Hadoop knowledge gaps
- Enable organizations to successfully develop and deploy solutions based on Hadoop
Full Lifecycle Support and Services
- Evaluate, Pilot, Production

Impact of Big Data on Data Analytics
- One of our top 5 IT priorities, 45%
- One of our top 10 IT priorities, 27%
Source: Enterprise Strategy Group, 2012

Transactions – Interactions – Observations
AND FAR FAR BEYOND
Chart content courtesy of Teradata, Inc.

Data-Driven Business
"The days are over when you build a product once and it just works. You have to take ideas, test them, iterate them, use data and analytics to understand what works and what doesn't in order to be successful. And that's how we use our big data infrastructure."
Aaron Batalion, CTO of LivingSocial
The Big Promise of Big Data, PCWorld, March 13, 251754/the big promise of big data.html

What is Apache Hadoop?
Solution for Big Data
- Designed for volume, velocity, variety & complexity of data
Data Platform Deployed on Commodity Hardware that
- Stores petabytes of data reliably
- Runs highly distributed applications
- Enables a rational economics model
Set of Open Source Projects
- Apache Software Foundation
- Loosely coupled, ship early/often
One of the best examples of open source driving innovation and creating a market

Connecting All Of Your Big Data
(Diagram)
Traditional Data Warehouses, BI & Analytics: EDW, Data Marts, BI / Analytics
Serving Applications: Web Serving, NoSQL, RDBMS
Store, Transform, Refine, Iterative Analytics, Archive all data
Unstructured Systems: Serving Logs, Social Media, Sensor Data, Text Systems

Bridging Classic & Big Data Worlds
Integrating EDW & Hadoop
Classic Method (EDW): “Capture only what’s needed”
- Structured & Repeatable Analysis
- Business determines what questions to ask
- IT structures the data to answer those questions
Big Data Method (Hadoop): “Capture in case it’s needed”
- Multi-structured & Iterative Analysis
- IT delivers a platform for storing, refining, and analyzing all data sources
- Business explores data for questions worth answering

Bridging Classic & Big Data Worlds
Enabling Developers, Data Scientists, and Business Analysts
Java, C/C++, Pig, JavaScript, Python, R, SAS, SQL, Excel, Reporting, etc.
Ingest, Transform, Archive, Discover, Explore, Analyze, Report
Batch
- Fast data loading
- ELT/ETL and refinement
- Iterative analysis
- Online archival
Interactive
- Path & pattern analysis
- Graph analysis
- Text analysis
- Machine learning
Active
- Operational analysis
- Transactional analysis
- High volume ad-hoc
- Elastic data marts

Key Components of Hadoop Stack
Core Components
- HDFS (Hadoop Distributed File System)
- MapReduce (Distributed Programming Framework)
- HCatalog (Table & Schema Management)
- Pig (Data Flow)
- Hive (SQL)
- HBase (Columnar NoSQL Store)
- Zookeeper (Cluster Coordination)
Extended Components
- Ambari & Other Monitoring & Management
- Oozie & Other Workflow Scheduling
- Sqoop & Other Ingest, ETL tools
- Mahout & Other libraries
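As an illustration of the "Distributed Programming Framework" entry above, here is a minimal single-process sketch of the map, shuffle, and reduce pattern in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and exist only for this example. On a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit one (word, 1) pair per word in a line of text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big problems"]))
```

The point of the split is that map and reduce only ever see one record or one key at a time, which is what lets a framework distribute them.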

Enable Ecosystem Around Platform
(Diagram: Hortonworks Data Platform, Operational APIs, Operations)
The market needs a platform that is Open across 3 facets:
- Data: directly processes as well as coexists/integrates with any data flowing through a business
- Apps: delivers business value by enabling innovative new apps and enhancing existing apps
- Operations: integrates with operational models within the enterprise datacenter and the cloud

Talend and big data
everything old is new again!

Challenge 1: Where project rmation
Information provides value to the business
If you can't rely on your information then the result can be missed opportunities, or higher costs.
BIG decisions drives BIG business
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

Challenge 2: Data Quality
Poor Data Quality * Big Data = Big Problems^2

Challenge 3: Complex technology, limited resources
volume, velocity, variety
resources?

Talend Big Data Strategy
4 strategic pillars
Big Data Integration
- Land data in a BD cluster without coding
- Code generation for Hadoop HDFS, Hive, Sqoop
Big Data Manipulation
- Simplify manipulation, such as sort and filter
- Pig components, HBase
Big Data Quality & Governance
- Identify linkages & duplicates, validate big data
- Match component, execute basic quality features
Big Data Project Management
- Place frameworks around big data projects
- Common Repository, scheduling, monitoring
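To make the "sort and filter" and "identify linkages & duplicates" items above concrete, the following is a rough sketch in plain Python. It is not Talend's match component or any code Talend generates; it uses only the standard library's difflib, and the record fields (id, name, amount), the helper names, and the 0.85 similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lower-case a name and collapse runs of whitespace before comparing."""
    return " ".join(name.lower().split())


def find_duplicates(records, threshold=0.85):
    """Return id pairs whose normalized names are at least `threshold` similar.

    Compares every pair, so it is only suitable for small inputs; the
    threshold is an assumed value, not a recommended setting.
    """
    matches = []
    for a, b in combinations(records, 2):
        ratio = SequenceMatcher(
            None, normalize(a["name"]), normalize(b["name"])
        ).ratio()
        if ratio >= threshold:
            matches.append((a["id"], b["id"]))
    return matches


def filter_and_sort(records, min_amount):
    """Keep records at or above min_amount, largest amount first."""
    kept = [r for r in records if r["amount"] >= min_amount]
    return sorted(kept, key=lambda r: r["amount"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "Acme Corp", "amount": 50},
        {"id": 2, "name": "ACME  Corp", "amount": 10},
        {"id": 3, "name": "Globex", "amount": 30},
    ]
    print(find_duplicates(sample))
    print(filter_and_sort(sample, 20))
```

At big data scale the all-pairs comparison would have to be replaced by some form of blocking or partitioning, which is the kind of work a generated Hadoop job would take on.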

Why Talend

Demonstration
Talend 2011

2012 Talend Roadmap
(Timeline: big data, 2012, spring, fall)
Computationally Intense Functions
- Matching (complete)
- Profiling
- Parsing
- Survivorship
Continue to monitor “winning” technologies & expand partnerships

Talend Open Studio for Big Data
Democratize Big Data
- Improves efficiency of big data job design with graphic interface
- Abstracts and generates code
- Run transforms inside Hadoop
- Native support for HDFS, Pig, HBase, Sqoop and Hive
- Apache License
- Available at talend.com
- Embedded in HWx Data Platform
(Diagram: Pig, an open source ecosystem)

Other Resources
Next webinar: HDFS Federated
- April 18, 2012 @ 10am PST
- Register now: http://hortonworks.com/webinars/
Hadoop Summit
- June 13-14
- San Jose, California
- www.Hadoopsummit.org
Hadoop Training and Certification
- Developing Solutions Using Apache Hadoop
- Administering Apache Hadoop
- http://hortonworks.com/training/

Thank You!
Questions?
