Unleash the Power of Big Data with Informatica for Hadoop

Transcription

Unleash the Power of Big Data with Informatica for Hadoop
Wei Zheng, Senior Director, Product Management, Informatica

Agenda
Big Data Overview
What is Hadoop?
Informatica for Hadoop:
Getting Data In and Out
Parsing and Preparing Data
Profiling and Discovering Data
Transforming and Cleansing Data
Orchestrating and Monitoring Hadoop
Roadmap

Big Data Overview

What's happening? Explosive growth of data in volume, variety, and velocity.
Chart: data volume growing across years (source: IDC); business value increases as latency shrinks from years down to sub-second time scales.

Big Data: the confluence of big transaction data, big interaction data, and big data processing.
Big transaction data: online transaction processing (OLTP), online analytical processing (OLAP) and DW appliances.
Big interaction data: social media data; cloud data (Salesforce.com, Concur, Google App Engine, Amazon); machine/device and sensor data; call detail records, image and clickstream data; scientific and genomic data.
Big data integration brings big transaction data and big interaction data together for big data processing.

What is Hadoop?

What is Hadoop?
Hadoop is a big data platform for data storage and processing that is scalable, fault tolerant, and open source.
Core Hadoop components: the Hadoop Distributed File System (HDFS) provides file sharing and data protection across physical servers, and MapReduce provides distributed computing across physical servers.
Hadoop design axioms: 1. The system shall manage and heal itself. 2. Performance shall scale linearly. 3. Compute shall move to data. 4. Simple core, modular and extensible.
Distribution example, Cloudera (CDH 3.0): file system mount (FUSE-DFS), workflow and scheduling (Apache Oozie), UI framework and SDK (Hue), metadata (Apache Hive), languages/compilers (Apache Pig, Apache Hive), data integration (Apache Flume, Apache Sqoop), fast read/write access (Apache HBase), and coordination (Apache ZooKeeper).
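
To make the MapReduce model concrete, here is a minimal, generic word-count job (not part of the deck) showing how map and reduce functions run on the nodes that hold the HDFS blocks, i.e. "compute moves to data". This is a Hadoop 2.x-style sketch; class names and input/output paths are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // emit (word, 1) for each token
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));   // total count per word
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);    // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }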

Hadoop Distributions

What Can Hadoop Help You With?
Improve decisions: predictive analytics (recommendations, outcomes, MRO); customer analytics (customer sentiment and satisfaction).
Modernize the business: mergers, acquisitions and divestitures; acquire and retain customers.
Improve efficiency and reduce costs: outsource non-core functions; pattern recognition (fraud, pricing, supply chain); risk and portfolio analysis.
Increase the value of big data: timely, actionable, accessible, relevant, holistic, secure, trustworthy, authoritative.
Lower the cost of big data: business costs, labor costs, software costs, hardware costs, storage costs.

Informatica for Hadoop

Unleash the Power of Hadoop with Informatica 9.5.1 (available now)
1. Ingest data into Hadoop: account transactions, product and service offerings, marketing campaigns, customer profile, social media, customer service logs and surveys.
2. Discover and profile Hadoop data for anomalies, relationships and domain types.
3. Parse and prepare data in Hadoop (MapReduce).
4. Transform and cleanse/standardize data in Hadoop (MapReduce).
5. Invoke custom business analytics on Hadoop.
6. Extract data from Hadoop to the sales and marketing data mart and the customer service portal.
Orchestrate workflows and monitor and manage both Hadoop and non-Hadoop jobs/processes.

Why Informatica? What Are the Benefits?
Repeatability: predictable, repeatable deployments and methodology.
Isolation from rapid Hadoop changes: frequent new versions and projects; avoid placing bets on the wrong technology.
Reuse of existing assets: apply existing integration logic to load data to/from Hadoop; reuse existing data quality rules to validate Hadoop data.
Reuse of existing skills: enable ETL developers to leverage the power of Hadoop.
Governance: enforce and validate data security, data quality and regulatory policies.

Get Data Into and Out of Hadoop
PowerExchange for Hadoop
hStream with MapR
Data Archiving for Hadoop
Replication for Hadoop

Data Ingestion and Extraction
Moving tens of terabytes per hour of transaction, interaction and streaming data: transactions (OLTP, OLAP) in batch to Hadoop and the data warehouse; social media and web logs as streams; machine/device data and industry standards.

Unleash the Power of Hadoop with High-Performance Universal Data Access
Messaging and web services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Relational and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Netezza, ODBC, JDBC, flat files, ASCII reports
Mainframe and midrange: VSAM, C-ISAM, binary flat files, tape formats, RPG, ANSI, LDAP
Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, HTML
MPP appliances: EMC/Greenplum, Vertica, AsterData
Packaged applications and SaaS: JD Edwards, SAP NetWeaver, Lotus Notes, SAP NetWeaver BI, Oracle E-Business Suite, PeopleSoft, Siebel, Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP Business ByDesign, Oracle …
Industry standards: Cargo IMP, MVR, XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
Social media: LinkedIn, …

PowerExchange for Hadoop: HDFS and Hive Adapters
Native HDFS and Hive source/target support.
Support for pushdown of source and target connections to ensure maximum performance and scale.
Integrated development environment with metadata and preview support.
Perform any pre-processing needed before ingestion.
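
Not Informatica's adapter code: the following is a minimal sketch of the underlying HDFS client API (org.apache.hadoop.fs.FileSystem) that native HDFS source/target connectivity builds on, writing a record into HDFS and reading it back. The NameNode URI and paths are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; on a real cluster this comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path staged = new Path("/staging/customer_sample.txt");

        // Target side of an ingest: write a small delimited record into HDFS.
        try (PrintWriter writer = new PrintWriter(new OutputStreamWriter(fs.create(staged, true)))) {
          writer.println("1001,ACME Corp,US");
        }

        // Source side of an extract: stream the file back out of HDFS.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(staged)))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
        fs.close();
      }
    }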

hStream with MapR – Continuous Ingestion
Informatica Ultra Messaging streams data continuously into Hadoop through the MapR Network File System (NFS) interface.
Sources: transactions (OLTP, OLAP); documents and email; social media and web logs; machine/device and scientific data; industry standards.

Informatica Data Archive: Archiving to Hadoop
Archive production data to an optimized file archive stored on the Hadoop file system.
Archiving to an optimized file format reduces storage:
Compressed (up to 90%)
Immutable
Accessible (SQL, ODBC, JDBC)

Informatica Data Archive: Archiving from Hadoop
Move data from Hadoop into the file archive.

Parse and Prepare Data on Hadoop
hParser

Informatica HParser: Tackling the Diversity of Big Data
The broadest coverage for big data: flat files and documents; XML; positional, name/value and delimited formats; industry standards; interaction data (social); device/sensor and scientific data.
Productivity: visual parsing environment and predefined translations.
Fits any DI/BI architecture: Pig, EDW, MDM.

Parse and Prepare Data on Hadoop: How Does It Work?
1. Define the parser in the HParser visual studio.
2. Deploy the parser on the Hadoop Distributed File System (HDFS).
3. Run HParser to extract data and produce tabular format in Hadoop, for example:
hadoop dt-hadoop.jar My Parser /input/*/input*.txt

Informatica HParser Productivity: Data Transformation Studio
Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0, Lockbox, CREST DEX, IFX, TWIST, Enhanced UNIFI (ISO 20022)
Insurance: ACORD XML
B2B standards: UN/EDIFACT, EDI-X12, EDI UCS and WINS, EDI VICS, RosettaNet, OAGI
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
Other: IATA PADIS, PLMXML, NIEM
Easy example-based visual enhancements and edits.
Out-of-the-box transformations for all messages in all versions.
Updates and new versions delivered from Informatica.

An hParser Example: Proprietary Web Logs
Why Hadoop? Extremely large data sets; information is often split across multiple files; we are not sure what we are looking for.

Profiling and Discovering Data
Informatica Profiling for Hadoop

Discovery of Hadoop Issues/Anomalies (beta)
1. Import metadata via native connectivity to Hadoop (Hive, HDFS, HBase, MapReduce).
2. Create/run a profile to discover Hadoop data attributes: single table/data object, cross table/data object, and data domain discovery. The profile is auto-converted to Hadoop queries/code (Hive, MapReduce, etc.) and executed natively on Hadoop.
3. Review and share results via browser or Eclipse clients.

Hadoop Data Profiling Results (beta)
Hadoop data profiling results are exposed to anyone in the enterprise via a browser.
1. Profiling stats: min/max values, NULLs, inferred data types, etc. – stats that identify outliers and anomalies in the data (e.g. CUSTOMER ID).
2. Value and pattern analysis of Hadoop data: value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns (e.g. COUNTRY CODE, ZIP CODE).
3. Drilldown analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates.
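
The statistics listed above are the kind that can be pushed down to Hive as plain aggregate queries. Below is a minimal sketch (not Informatica's profiling engine) that runs such a column profile over Hive JDBC; the host, the table name customer, and the column country_code are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveColumnProfile {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-node:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

          // Basic stats: row count, nulls, distinct values, min/max for one column.
          ResultSet stats = stmt.executeQuery(
              "SELECT COUNT(*), SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END), " +
              "COUNT(DISTINCT country_code), MIN(country_code), MAX(country_code) FROM customer");
          if (stats.next()) {
            System.out.printf("rows=%d nulls=%d distinct=%d min=%s max=%s%n",
                stats.getLong(1), stats.getLong(2), stats.getLong(3),
                stats.getString(4), stats.getString(5));
          }

          // Value-frequency profile: which values occur, and how often.
          ResultSet freq = stmt.executeQuery(
              "SELECT country_code, COUNT(*) AS cnt FROM customer " +
              "GROUP BY country_code ORDER BY cnt DESC LIMIT 20");
          while (freq.next()) {
            System.out.println(freq.getString(1) + "\t" + freq.getLong(2));
          }
        }
      }
    }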

Hadoop Data Domain Discovery: Finding the Functional Meaning of Hadoop Data (beta)
1. Leverage Informatica rules/mapplets to identify the functional meaning of Hadoop data: sensitive data (e.g. SSN, credit card number) and liability and compliance risk (PHI: Protected Health Information; PII: Personally Identifiable Information). Scalable to look for/discover any domain type.
2. View/share a report of the data domains/sensitive data contained in Hadoop, with the ability to drill down to see suspect data values.
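
Informatica's actual rules and mapplets are not shown in the deck; as a rough illustration of the underlying idea, the sketch below matches sampled column values against regular expressions for two sensitive domains and reports the match ratio per domain. The patterns and sample data are illustrative only; real rules also use checksums and reference data.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class DomainDiscoverySketch {
      // Illustrative patterns for two sensitive domains (hypothetical, simplified).
      private static final Map<String, Pattern> DOMAINS = new LinkedHashMap<>();
      static {
        DOMAINS.put("SSN", Pattern.compile("\\d{3}-\\d{2}-\\d{4}"));
        DOMAINS.put("CREDIT_CARD", Pattern.compile("\\d{4}([ -]?\\d{4}){3}"));
      }

      // For each domain, return the fraction of sampled values matching its pattern.
      public static Map<String, Double> profileColumn(List<String> sampleValues) {
        Map<String, Double> matchRatio = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> domain : DOMAINS.entrySet()) {
          long matches = sampleValues.stream()
              .filter(v -> v != null && domain.getValue().matcher(v.trim()).matches())
              .count();
          matchRatio.put(domain.getKey(),
              sampleValues.isEmpty() ? 0.0 : (double) matches / sampleValues.size());
        }
        return matchRatio;
      }

      public static void main(String[] args) {
        // A column where most values match the SSN pattern would be flagged as a
        // suspected SSN domain and surfaced for drill-down review.
        List<String> column = Arrays.asList("123-45-6789", "987-65-4321", "not an id");
        profileColumn(column).forEach((domain, ratio) ->
            System.out.printf("%s: %.0f%% of sampled values match%n", domain, ratio * 100));
      }
    }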

Transforming and Cleansing Data
PowerCenter for Hadoop
Informatica Data Quality for Hadoop

Data Integration and Data Quality: Hadoop MapReduce Processing (beta)
1. An Informatica mapping (built in Informatica Developer) is translated to optimized Hive HQL.
2. The HQL invokes a custom UDF within the Informatica DTM (Data Transformation Library) for certain specialized data transformations.
3. The optimized HQL is translated to MapReduce.
4. The MapReduce jobs and UDFs are executed on Hadoop.
Generated HQL example:
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
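
The Informatica DTM UDF itself is proprietary and not shown in the deck; the snippet below is a generic Hive UDF sketch showing how custom Java logic gets hooked into HQL in the same way, with a made-up standardization rule. Such a class is packaged into a jar, registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION, and then called from HQL like a built-in function.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Generic Hive UDF sketch (not Informatica's DTM UDF): trims and upper-cases a
    // country name so that variants like " united states " standardize to one value.
    public class StandardizeCountryUdf extends UDF {
      private final Text result = new Text();

      public Text evaluate(Text input) {
        if (input == null) {
          return null;                 // pass NULLs through unchanged
        }
        result.set(input.toString().trim().toUpperCase());
        return result;
      }
    }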

Reuse and Import PowerCenter Metadata for Hadoop (beta)
Import existing PowerCenter artifacts into the Hadoop development environment.
Validate import logic before the actual import process to ensure compatibility.

Design Mappings as Usual (beta)
Design integration and quality logic for Hadoop in a graphical, metadata-driven environment.
Configure where the integration logic should run: Hadoop or native.

View Generated HiveQL (beta)
View the complete generated and pushed-down Hive or MapReduce code from Hadoop mappings.

Orchestrating and Monitoring Hadoop
Informatica Workflow & Administration for Hadoop

Mixed Workflow Orchestration (beta)
One workflow running tasks on Hadoop and local environments.

Monitoring – Hive Query Plan Details (beta)
The same Hive query is available in the Developer tool.

Monitoring – Hive Query Drilldown to MapReduce (beta)
View Hive query details, with traceability to individual MapReduce jobs, links to JobTracker URLs, and a summary of JobTracker status.

Product Roadmap (capability by release)
Available now: PowerExchange for Hadoop (HDFS and PowerCenter); hParser (including JSON parsing).
Hadoop Beta, 9.5 release (1H 2012): native HDFS and Hive connectivity; integrated parsing on Hadoop; Data Integration and Data Quality push-down execution on Hadoop; data discovery on Hadoop; mixed workload orchestration and administration.
Hadoop GA, 9.5.1 release (2H 2012): native HDFS and Hive connectivity; integrated parsing on Hadoop; Data Integration and Data Quality push-down execution on Hadoop; data discovery on Hadoop; mixed workload orchestration and administration; support for parallel processing of large file parsing; support for parsing of archived files; managed file transfer; Metadata Manager and lineage integration.
1H 2013: translation to Pig support; profiling API on Hadoop (callable from Java or MapReduce); persistence of profiling stats on Hadoop; additional DI and DQ transformations running on Hadoop.

When Is It Available?
Hadoop planned release: Beta/Early Access August – October 2012; GA in the 9.5.1 release, December 2012.
PowerCenter Big Data Edition – Q3 2012 (tentative), including:
PowerCenter Standard Edition
Enterprise Grid Option for PowerCenter
PowerExchange for Hadoop
PowerExchange for Social Media
PowerExchange for Data Warehouse Appliances
hParser
PowerCenter on Hadoop (available December 2012)

Unleash the Power of Hadoop with Informatica 9.5.1 – available now: ingest data into Hadoop from account transactions, product and service offerings, marketing campaigns, customer profiles, social media, and customer service logs and surveys; parse and prepare, profile, and transform and cleanse it there; and extract the results to the sales and marketing data mart and customer service portal.