You Call It Data Lake; We Call It Data Historian - O'Reilly Media

Transcription

You call it Data Lake; we call it Data Historian. Naghman Waheed, Data Platforms Lead. Brian Arnold, Data Platforms Architect. May 24, 2018.

Naghman Waheed, Data Platforms Lead: 25-year career at Monsanto. Data Warehousing, Business Intelligence, Data Architecture, Cloud Engineering. Data solutions spanning key business functions such as Supply Chain, Manufacturing, Order-To-Cash, Finance and Procurement.
Brian Arnold, Data Platforms Architect: 10-year career in IT, 6 years in Big Data. Software Development, Functional Programming, Streaming, Big Data, Cloud Engineering. Ecommerce, Recommendation Engines.

Monsanto - Who are we? Bringing a broad range of solutions to help nourish our growing world. Headquartered in Saint Louis, Missouri. 20,000 employees in 66 countries. A global company with 50% of employees based outside of the United States. One of the 25 World’s Best Multinational Workplaces by Great Place to Work Institute. Produce with more judicious use of limited natural resources. Increase production to meet needs of a growing population. Improve the lives of the world’s farmers. “We succeed when farmers succeed.” - Hugh Grant, Monsanto CEO

Solving real challenges in agriculture industry
Rising Population: Growing enough for a growing world. Chart: Global Population, 4.4B (1980), 7.1B (today), 9.6B (2050).
Changing Economies and Diets: A growing global middle class is choosing animal protein (meat, eggs, and dairy) as a larger part of their diet. Chart: Dietary Percentage of Protein, 9% (1965), 14% (2030).
Limited Farmland: Farmers will need to produce enough food with fewer resources to support our world population. Chart: Acres per Person, 1 (1961), 1/3 (2050).
Changing Climate: Farmers are impacted by climate change in many ways: water availability issues, insect range expansion, increasingly unpredictable weather, weed pressure changes, crop disease increases, planting zone shifts.

Our Solutions for Sustainable Agriculture. Our toolkit includes: Plant Breeding, Biotechnology, Crop Protection, Precision Agriculture.

Key Technology Trends In Agriculture
Trends: Economies of Data Science at Scale; Low-cost Observation Technology / IoT; Mobile Device Proliferation among Growers.
A typical farm is generating 20GB of unique field data every year.
Computing unit costs have gone down by 1,000x in the last 10 years.
Connected sensors on tractors, combines, and in fields have increased over 1,000x in the last 10 years.
The cost of the average digital sensor has dropped by more than half over that time.
94% of US farmers own a mobile phone or a smartphone, compared to less than 10% 10 years ago.
Source: Gartner Technology Trends 2015

Why Data Historian?
Strategy: Cloud First, Open Source, API First, Ecosystem fit.
Capabilities: Ingestion, Access, Integration, Self Service.
Architecture: Scalable, Fault Tolerance, Performance.
TCO: License, Infrastructure, Cost review, Support.
Build vs. Buy: Customization, Iterative release, Technology commitment.

Data Strategy (diagram). Labels: Enterprise Data Warehouse, Change Data Capture, Location360, Other Datastores, Enterprise Data Hub, Kafka, Insights, Analytics Platform, Visualization, Haystack, Data eDiscover, others.

Data Platforms Ecosystem (diagram). Recoverable labels: Analytics Platform, Partner Portal, Geospatial Platform, Authentication, Custom API Harvester, Data Front Door, Metadata linked to search, Enterprise Data Hub, Event360, API Gateway, Topic Metadata, Kafka, Location360, Insights, Transactional Systems, Archive Log, Change Data Capture, Ancestry Datastore, Other Datastores, IDM, Data Stores, Data Historian. Ingestion paths: Batch Ingestion, Streaming Ingestion, API Ingestion, UI Ingestion.

Data Historian - Reference Architecture (diagram). Recoverable labels: Authentication / Authorization, Security, Data Stores, Historian UI, Data Historian, Processing Engine, Query Engine, AWS S3 Storage, API Gateway, Adhoc Analysis, Kafka, Streaming, Archive, Glacier Storage, Access, Data Storage, Monsanto Internal Users.

Data Historian – Technology: Lambda, S3, Glacier

AWS Data Historian Architecture

Ingest: Batch imports from RDBMS (full, delta, merge). Streaming from Kafka (Datahub). File ingestion through API and Data Historian UI. Users can append files to existing datasets as well.
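
The slide names three batch import modes but does not define them. The sketch below shows one plausible reading, and every name in it is hypothetical: "full" replaces the dataset, "delta" appends the incoming rows, and "merge" upserts by a key column (assumed here to be `id`).

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_import(existing: List[Row], incoming: List[Row],
                 mode: str, key: str = "id") -> List[Row]:
    """Return the dataset contents after one import (assumed semantics)."""
    if mode == "full":
        # Full: the incoming snapshot replaces everything.
        return list(incoming)
    if mode == "delta":
        # Delta: incoming rows are appended as new records.
        return existing + list(incoming)
    if mode == "merge":
        # Merge: upsert by key; incoming rows win on conflict.
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown import mode: {mode}")
```

The three modes differ only in how they treat rows that are already stored, which is why a single function with a mode switch is enough for the illustration.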

Ingestion Process (diagram). Labels: Scheduler, Import Raw Records, Build Hive Staging Tables in HDFS, Export Raw Data To S3, Validation, Export Data To Master Tables in S3, Rebuild Materialized View, Export, Archive.
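
The transcription loses the arrows of the ingestion diagram, so the stage order below is an assumption, as are the function and variable names. The sketch only illustrates the general shape of a scheduled, staged import that stops when a stage (for example Validation) fails.

```python
from typing import Callable, List, Tuple

# Stage names come from the slide; their order here is an assumption.
STAGES = [
    "Import Raw Records",
    "Build Hive Staging Tables in HDFS",
    "Export Raw Data To S3",
    "Validation",
    "Export Data To Master Tables in S3",
    "Rebuild Materialized View",
]


def run_ingestion(run_stage: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Run stages in order; stop at the first one that reports failure."""
    completed: List[str] = []
    for name in STAGES:
        if not run_stage(name):
            return completed, False
        completed.append(name)
    return completed, True
```

Passing the stage runner in as a callable keeps the sketch independent of Hive, HDFS, or S3 clients, none of which are shown on the slide in enough detail to reproduce.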

Metadata: Required fields: Name, Description, Source, Publisher, etc. Optional fields: Tags, custom fields. Forwarded to our metadata platform (Haystack). Metadata objects pushed through Kafka (Datahub).
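
A required-field check of the kind the slide implies could look like the sketch below. The lowercase field keys and the function name are assumptions, and the slide's "etc." means the real list of required fields is longer than the four shown.

```python
from typing import Dict, List

# Required fields named on the slide; the slide's "etc." means this list
# is incomplete. The lowercase keys are an assumption.
REQUIRED_FIELDS = ("name", "description", "source", "publisher")


def missing_required(metadata: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or blank."""
    return [field for field in REQUIRED_FIELDS
            if not str(metadata.get(field) or "").strip()]
```

Optional fields such as tags or custom fields pass through untouched; only the required set is checked.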

Exports: Export to RDBMS (full, delta, merge). Export to Kafka. Export to Redshift. Export to S3. Diagram labels: Scheduler, Source, Materialized, Validation, Export, Kafka Export, S3 Export, Archive, Purge.

Governance: Archive & Retention. Automated Compliance Checks. Security Permissions.
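
The slide does not say how retention is evaluated. As a minimal sketch, assuming each dataset carries a retention period in days and a last-updated date (both hypothetical attributes), an automated check could flag datasets that are due for archiving:

```python
from datetime import date, timedelta


def is_due_for_archive(last_updated: date, retention_days: int,
                       today: date) -> bool:
    """True when the dataset is older than its retention period."""
    return today - last_updated > timedelta(days=retention_days)
```

Taking `today` as a parameter rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.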

APIs & Integration: Get/List/Put Datasets. Get/Put Dataset Metadata. Get Dataset Status. Query. SDKs: Java, Scala, R, Python.
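
The operations listed on the slide can be pictured with an in-memory stand-in. This is not the Data Historian SDK: the class, method names, and status strings below are all invented for illustration, and Put is assumed to append to an existing dataset.

```python
from typing import Dict, List


class InMemoryHistorian:
    """Hypothetical stand-in for a dataset API; not the real SDK."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[dict]] = {}
        self._metadata: Dict[str, dict] = {}

    def put_dataset(self, name: str, rows: List[dict]) -> None:
        # Put: create the dataset, or append rows if it already exists.
        self._rows.setdefault(name, []).extend(rows)

    def get_dataset(self, name: str) -> List[dict]:
        return list(self._rows[name])

    def list_datasets(self) -> List[str]:
        return sorted(self._rows)

    def put_metadata(self, name: str, metadata: dict) -> None:
        self._metadata[name] = dict(metadata)

    def get_metadata(self, name: str) -> dict:
        return dict(self._metadata.get(name, {}))

    def get_status(self, name: str) -> str:
        # Status values are invented for the sketch.
        return "AVAILABLE" if name in self._rows else "NOT_FOUND"
```

A short usage run: `h = InMemoryHistorian()`, then `h.put_dataset("yield", [{"field": 1}])`, after which `h.list_datasets()` includes `"yield"` and `h.get_status("yield")` reports the dataset as available.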

APIs - Query (diagram). Labels: Client, Physical, Virtual, Data Historian UI, Data Historian API, Data Historian JDBC Driver, Data Historian Security Service.

Data Historian UI - Query Interface

Data Historian UI – Browse Datasets

Data Historian UI – Dataset Details

Data Historian UI – Permissions Management

Data Historian UI - Future

Highlights: v1.0 production release 16 months ago. 164 active datasets in prod. 10TB of data in prod. 1,000 query requests per day.
Early Adopters: Internal Security Office, Research & Development.
Early Majority: IoT, Data Assets, Supply Chain.
Late Majority: Finance, Commercial, HR, Other.

Lessons Learned: Open Source: Flexibility, Learning Curve, Specialized Skill Set. Cloud – AWS: Agility, Security. Support: Resource Staffing.

Questions?
