Data Lake Or Data Swamp? - OSIsoft


Data Lake or Data Swamp?
Keeping the Data Lake from Becoming a Data Swamp.
KSG Solutions

INTRODUCTION

Increasingly, businesses of all kinds are beginning to see their data as an important asset that can help make their operations more effective and profitable. As our ability to gather time-series data grows, more technologies are becoming available to help us make sense of it. How do we choose the right technology and approach for our business problems?

ABOUT THE AUTHOR

John de Koning, success advisor in industrial data processing, has his roots in the oil and gas industry. As a technology and innovation manager for Shell, John focused on generating 500 million in value annually by introducing innovative ways of processing manufacturing and production data. He became an industry leader by introducing architectures to contextualize, integrate and aggregate manufacturing and production data at a corporate level. The experience and understanding gained form the foundation for this white paper.

The paper is focused on helping industry leaders understand the characteristics of the various data processing techniques, and how they link together to form an optimum solution architecture for processing time-series data in combination with enterprise data lake initiatives.

MANAGEMENT SUMMARY

Data lakes are a simple way for businesses to collect and store raw data from a variety of inputs, without having to know in advance exactly how the data will be used. But in order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. A variety of advanced real-time software systems are available that integrate with enterprise data lake software and can help collect and structure data, so it can be used effectively.

Various solutions are available for processing time-series data. Some of them even claim to be the Holy Grail for data management and advocate streaming sensor and machine data straight to a data lake and the cloud, and suggest organizing it later. But what about the nature of industrial data streams and the legacy automation equipment that is already out there? Especially in industrial data environments, automation systems have a life cycle of up to 20 years, and replacement is a serious investment. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist.

The solution architecture for time-series data should follow a few strict rules:

1. Connectivity
Ensure the corporate solution is able to connect to the variety of (legacy) data sources and potential new sources.

2. Time-Series Capacity
The system should be able to deal with time-series data (high fidelity, time indexing, and time synchronization).

3. Context
Systems should have easy-to-understand asset/equipment-based relations between the individual data streams, to enable business users to easily compare, view and analyze data on an equipment level without being an IT specialist or data scientist.

4. Accessibility
Access to the data should be simple and affordable, but still enable enterprise-wide reporting and analyses. Process users should be able to analyze and visualize the data to help optimize the use of production facilities.

5. Security
Keep your production facility safe and secure! Don't allow unintended back-door access to your automation system.

Often positioned as one-size-fits-all, the currently available data lake technologies cannot yet handle the above key rules in an effective and efficient way. To assure that data from a large variety of (legacy) source systems lands in the cloud with the correct timestamp, synchronized in time and with the right context, it is important to add an infrastructure layer specially designed for this purpose.

This combination of time-series and data lake technologies (cloud or on premise) will bring flexibility and criticality to the various levels of the organization: (a) on the production level, ensuring that data will be secure but accessible; and (b) on the corporate level, allowing data to be contextualized, integrated, and aggregated for better business decision making.

Various solutions are available in the combination of real-time infrastructure and data lake technologies. Based on the above rules, the technology combination of the OSIsoft PI System toolkit with supporting Data Context Automation tools, like those delivered by Element Analytics, is a leading strategy for a solution architecture that supports both the time-series data needs within operations and enterprise data lake initiatives. Dedicated integration tools are available to easily integrate with Enterprise Data Warehouses and data lake technologies from Microsoft, SAP, or Hadoop, both in the cloud and on premise.

Large companies, like global energy enterprises, have proven this technology combination can easily drive 500 million benefit per year by introducing enterprise tools and processes for Proactive Monitoring, Exception Based Surveillance, Rotating Equipment Monitoring, Condition Based Maintenance, Margin visualizing, etc. All of these will result in better uptime and higher efficiency of the facilities.

Figure 1: Hybrid Environment

TRANSITION TO DATA LAKES

Traditional data warehouse technologies use predefined data models to describe the database. The advantage is that you know upfront what the data structure looks like. The downside is that data warehouses are inflexible. A traditional data warehouse cannot keep up with rapid changes in the data model due to the proliferation of new data sources and new questions people want to ask of the data. This overwhelming rate of change rules out the traditional way of working, in which a data model and a database schema are built first. In addition, traditional (data) change management no longer works, as version control is hard with a fast-changing data model.

In a data lake environment, raw data is pushed to the store in its original state. This can be structured, unstructured, blobs, etc. Instead of predefining how the data elements are related to each other (the data model), as with a data warehouse, you create the relationships once you need to retrieve the data from the data lake. This is also the major downside of a data lake. With databases and warehouses, it was possible for business professionals (non-IT) to query the data, as the complex data modeling had already been done upfront by IT specialists. In the case of a data lake, you need to be a data scientist to be able to analyze the various chunks of data and link them together to make sense of them. Table 1 summarizes key characteristics of data warehouses versus data lakes.

Table 1: Data Warehouse vs. Data Lake

Data Warehouse                       Data Lake
structured, processed                structured / semi-structured / unstructured, raw
schema-on-write                      schema-on-read
expensive for large data volumes     designed for low-cost storage
less agile, fixed configuration      highly agile, configure and reconfigure as needed
mature                               maturing
business users                       data scientists et al.
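The difference between schema-on-write and schema-on-read in Table 1 can be illustrated with a small Python sketch. It is not taken from any particular warehouse or data lake product; the field names (`pump`, `pressure_bar`) and the functions are hypothetical and chosen only for illustration.

```python
import json

# Schema-on-write (warehouse style): the structure is fixed before loading,
# and records that do not match are rejected at write time.
# The schema below is a hypothetical example.
WAREHOUSE_SCHEMA = {"pump": str, "pressure_bar": float}

def write_to_warehouse(table, record):
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)

# Schema-on-read (lake style): anything is stored as-is,
# and structure is applied only when the data is queried.
def write_to_lake(lake, raw):
    lake.append(raw)

def read_pressures(lake):
    """Interpret the raw entries at query time, skipping what cannot be used."""
    readings = []
    for raw in lake:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON; this particular reader cannot interpret it
        if isinstance(doc, dict) and "pump" in doc and "pressure_bar" in doc:
            readings.append((doc["pump"], float(doc["pressure_bar"])))
    return readings

warehouse, lake = [], []
write_to_warehouse(warehouse, {"pump": "P-101", "pressure_bar": 4.2})
write_to_lake(lake, '{"pump": "P-101", "pressure_bar": 4.2}')
write_to_lake(lake, '{"pump": "P-102", "pressure_bar": "3.9", "site": "A"}')
write_to_lake(lake, "free-text operator note")
```

The lake accepts all three entries without complaint, but every reader has to carry its own interpretation logic, which is the effort the text attributes to data scientists. The warehouse would reject the second and third entries at load time.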

THE “PERFECT WORLD” IN AN INDUSTRIAL ENVIRONMENT

The “perfect world” is very simple. You want to have access to all the data that is available (internal and external), query the data in any combination, run integrated analytics to find the missing pieces and visualize the information you are looking for with the tool of your preference. However, the reality is often different. When combined with a real-time, time-series environment, the core concerns are related to the diversity of (legacy) data sources, network latency and reliability, data latency, time synchronization of data streams, and the context or relationship between data streams.

Figure 2: “Perfect World” of Data Processing. What’s Missing?

A GREAT ALTERNATIVE TO THE “PERFECT WORLD” IN AN INDUSTRIAL ENVIRONMENT

A hybrid model delivered by an ecosystem of suppliers will help to bridge the gap between the ‘Perfect World’ and the technology constraints.

Depending on the company size and the equipment used for production, the variety in time-series data sources can be significant. There will be a legacy of control and automation systems, especially in older companies with various production locations that may include systems from various brands, various types per brand and various versions per type. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist. Also, the facility location can introduce significant data reliability concerns. Remote facilities connected via low-bandwidth connections, like satellites, need additional functionality to avoid data loss. Another important aspect is security. To assure the integrity and safe operation of your facility, the interface technology must be very secure. The following overview shows the benefits of adding advanced real-time, time-series systems to the hybrid model to address the key concerns with data lake technology.

Figure 3: Overview of system characteristics
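The "additional functionality to avoid data loss" on unreliable links is commonly some form of store-and-forward buffering. The following Python sketch shows the general idea only; it is an assumption for illustration and does not describe how any specific product implements buffering. The class name, the `send` callable and the tag `FI-101` are hypothetical.

```python
from collections import deque

class StoreAndForwardBuffer:
    """Buffers readings locally and forwards them when the link is available."""

    def __init__(self, send, max_size=10_000):
        self._send = send  # callable that returns True when delivery succeeded
        # Bounded queue: if the outage outlasts the buffer, the oldest readings are dropped.
        self._queue = deque(maxlen=max_size)

    def record(self, timestamp, tag, value):
        # The timestamp is taken at the source, so late delivery keeps the original time.
        self._queue.append((timestamp, tag, value))

    def flush(self):
        """Forward buffered readings in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent

    def pending(self):
        return len(self._queue)

# Simulated satellite link that is down at first and then restored.
link = {"up": False}
received = []

def send(reading):
    if not link["up"]:
        return False
    received.append(reading)
    return True

buffer = StoreAndForwardBuffer(send)
buffer.record(1000, "FI-101", 12.5)
buffer.record(1001, "FI-101", 12.7)
sent_while_down = buffer.flush()   # link is down: nothing is sent, nothing is lost
link["up"] = True
sent_after_restore = buffer.flush()  # link restored: readings arrive in original order
```

Because each reading carries its source timestamp, the data that arrives late can still be placed correctly in time at the receiving end.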

THE OPTIMUM OF DATA PROCESSING IN THE REAL-TIME, TIME-SERIES WORLD

The combination of data lake technology and time-series infrastructure will help to address the core concerns of the “Perfect World”. In this situation, the time-series data infrastructure will collect all the data from the field. It will also assure the availability of data in the field for local viewing, processing and reporting (Edge computing) or for feeding data to (near) real-time optimization or advanced control. This Edge computing will assure the data and system availability needed to run and monitor the equipment in the production process itself by avoiding network availability and data latency issues.

BENEFIT OF INTEGRATING AND STANDARDIZING ACCESS TO DATA

The integrated combination of time-series systems and data lakes delivers a ‘One-Stop-Shop’ model for data access across the business and operations enterprise. This enables enterprise-wide reporting, enterprise big data analytics and the delivery of enterprise applications across a broad spectrum of use cases. These enterprise applications and reports can be reused throughout the enterprise from one single platform. As the definition of a piece of equipment is the same throughout the company, it is very easy to reuse use cases throughout the company. Best practices from one location can be re-deployed with very low effort at other locations to rapidly generate value. With an IT architecture that offers a consistent way of accessing data and a consistent way of building a data model, it is very easy to build one consistent set of analytics per equipment type and deploy it to all facilities throughout the enterprise. This avoids reinventing the wheel at the various facilities; development and deployment of applications become very agile; and, most important, the time to value is very short.

Large companies like global energy enterprises can easily drive 500 million benefit per year by introducing enterprise tools for Proactive Monitoring, Exception Based Surveillance, Rotating Equipment Monitoring, Condition Based Maintenance, Margin visualizing, etc. All of this will result in better uptime and higher efficiency of the facilities.

In the Energy world, the use of heat exchangers is quite common. Fouling of heat exchangers is a serious concern as it slows down production or forces unplanned outages. This concern is addressed at all facilities by technologists, all of whom try to invent a way to predict the fouling of ‘their’ heat exchangers. In the end, however, such efforts often result in a huge amount of rework by reinventing the same wheel.
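To make the heat exchanger example concrete, the sketch below shows what "one consistent set of analytics per equipment type" could look like: one equipment definition, one calculation, applied unchanged to exchangers at different sites. This is a simplified illustration, not a validated fouling model. The indicator (loss of heat duty relative to a clean baseline, with duty computed as Q = m · cp · (T_out − T_in)), the assumption that the cold-side fluid is water, and all names and values are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class HeatExchanger:
    """One shared equipment definition used at every site (hypothetical fields)."""
    site: str
    name: str
    flow_kg_s: float      # cold-side mass flow
    t_in_c: float         # cold-side inlet temperature
    t_out_c: float        # cold-side outlet temperature
    clean_duty_kw: float  # duty measured when the exchanger was clean

CP_KJ_PER_KG_K = 4.18  # assumption: the cold-side fluid is water

def duty_kw(hx):
    # Q = m * cp * (T_out - T_in)
    return hx.flow_kg_s * CP_KJ_PER_KG_K * (hx.t_out_c - hx.t_in_c)

def duty_loss_fraction(hx):
    """Fraction of the clean-condition duty that has been lost (0.0 = as clean)."""
    return 1.0 - duty_kw(hx) / hx.clean_duty_kw

# The same analytic runs for every exchanger, regardless of site.
fleet = [
    HeatExchanger("Site A", "E-101", 10.0, 20.0, 60.0, 1672.0),
    HeatExchanger("Site B", "E-204", 10.0, 20.0, 50.0, 1672.0),
]
report = {f"{hx.site}/{hx.name}": round(duty_loss_fraction(hx), 2) for hx in fleet}
```

Because both sites share the equipment definition, improving the calculation once improves it everywhere, which is the reuse benefit the text describes.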

TIME-SERIES DATA INFRASTRUCTURE WITH DATA LAKE INTEGRATION

The choice of time-series or real-time infrastructure technology will depend on the enterprise characteristics and requirements. The market of real-time infrastructure systems falls into a few groups:

- Automation vendor-based, like Honeywell PHD or Yokogawa Exaquantum
- Open source-based, like InfluxDB, Graphite, and Prometheus
- Large equipment vendor-based, like Siemens XHQ
- Vendor-independent systems, like the OSIsoft PI System

Automation vendor-based time-series data infrastructure

Automation vendors like Honeywell and Yokogawa deliver their own dedicated real-time infrastructure. These tools integrate very well in their automation toolkit. The downside is that these tools have limited analytical capabilities compared to other toolkits, and don’t integrate well in a big data environment.

Open source time-series data infrastructure

Systems like InfluxData have their origin in collecting real-time information from online systems for performance monitoring and alerting. Soon after the introduction of InfluxData in 2013, the interfaces for collecting real-time data rapidly extended into the world of social media. Use cases continued to extend into the IoT world. InfluxData is an integration of various open source initiatives: Telegraf for interfacing, InfluxDB for time-series storage, Chronograf for visualization, and Kapacitor for detecting and alerting.

Equipment-based time-series data infrastructure

Equipment vendors like Siemens need systems to optimize the service they deliver. They need time-series systems for remote monitoring of their large rotating equipment, like wind turbines. The growth of this turbine market pushed the development of these platforms forward.

Independent vendor-based time-series data infrastructure

Independent vendors started to address the various gaps in data collection, analyses and visualization. Two vendors stand out in this area: AspenTech with the InfoPlus 21 system, and OSIsoft with the PI System time-series data infrastructure. Where InfoPlus 21 is more focused on smaller scale, MES-like functionality and local plants, the OSIsoft PI System is designed to be an all-purpose real-time infrastructure, scaling from a single set of assets like wind turbines to a whole plant, an enterprise, or even a community of enterprises, vendors and regulators who need to capture, share and analyze data. The broad variety of interfaces to various types of data sources (450+) is one of the major advantages of the OSIsoft PI System toolkit. There is no restriction in getting data into the system. This means no additional development or unexpected IT cost to connect data sources. Meanwhile, a full context engine with streaming analytics handles the huge volume, variety and diversity of captured data and turns it into valuable information in real time that anyone can consume, from a plant engineer to a data scientist working within a data lake.

OVERVIEW OF CAPABILITIES

Table 2: Comparison of Infrastructure Capabilities

Table 2: Comparison of Infrastructure Capabilities - continued

DATA CONTEXT IS THE KEY TO SUCCESS

In order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. Individual data points have value for engineers very close to the production facility. Engineers usually know in detail how the facility is built and how to find each data point. However, as soon as reporting, monitoring or analyses happen outside of the local environment, it becomes important to add structure, governance, and context to the huge amount of available data points. Knowing the data individually by name is not an option anymore.

Figure 4: Streaming Operational Data to Multiple Applications

Example: Consider the contextual data that surrounds a single lube oil pump in a large facility. Each pump will have a parameter for pump name, power consumption, outlet pressure, outlet flow, outlet temperature, and filter differential pressure. Furthermore, anyone in the organization should know where the pump resides, where it sits in a process, and what may flow through the pump. Given the diversity of pumps and their various applications and processes, simply comparing all “pumps” is meaningless for analytics without this context.
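The pump example can be pictured as a small asset record that carries both the context (location, process, fluid) and the link from each parameter to its underlying data stream. The Python sketch below is only an illustration of that idea; the class, the pump names and the tag names such as `PL1.PI1011.PV` are hypothetical and do not reflect any product's data model.

```python
from dataclasses import dataclass

@dataclass
class PumpContext:
    name: str
    location: str  # where the pump resides
    process: str   # where in the process it sits
    fluid: str     # what flows through it
    tags: dict     # parameter name -> data stream (tag) name

# Hypothetical assets and tag names, for illustration only.
PUMPS = [
    PumpContext("P-101", "Plant 1", "compressor lubrication", "lube oil",
                {"outlet_pressure": "PL1.PI1011.PV", "outlet_flow": "PL1.FI1012.PV"}),
    PumpContext("P-220", "Plant 2", "compressor lubrication", "lube oil",
                {"outlet_pressure": "PL2.PI2201.PV", "outlet_flow": "PL2.FI2202.PV"}),
    PumpContext("P-305", "Plant 1", "cooling water supply", "water",
                {"outlet_pressure": "PL1.PI3051.PV", "outlet_flow": "PL1.FI3052.PV"}),
]

def comparable_pumps(pumps, process, fluid):
    """Select only pumps that serve the same process and handle the same fluid."""
    return [p for p in pumps if p.process == process and p.fluid == fluid]

def tag_for(pumps, pump_name, parameter):
    """Resolve a parameter of a named pump to its data stream name."""
    for pump in pumps:
        if pump.name == pump_name:
            return pump.tags[parameter]
    raise KeyError(pump_name)

lube_oil_pumps = [p.name for p in comparable_pumps(PUMPS, "compressor lubrication", "lube oil")]
pressure_tag = tag_for(PUMPS, "P-220", "outlet_pressure")
```

With the context attached, a comparison across sites can be restricted to pumps in the same service, and a user only needs the pump name and the parameter to reach the right data stream.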

A template approach makes this complex data context more accessible to all users. With templates, users don’t have to search for multiple tag names for a data stream or know the name of the tag. All they need to know is the pump name. For the other parameters, they don’t need to know the data stream names anymore. The connection between a specific pump and the actual data streams for this pump is made at the time the pump is added (instantiated) to the system.

Once all your assets are modeled on asset templates, access to the data is very easy. This makes it simple for non-IT staff to use the data, and building applications and reports also becomes very fast and easy to deploy. However, one of the gaps for all available systems is the high amount of manual labor needed to make the connections between the data streams and the asset definitions. The problem is not with building the templates themselves, but with connecting instances of the templates to measuring points in the field. With larger systems of 100k data streams, this can become quite labor intensive and costly.

In the case of OSIsoft’s PI System, a toolkit is available to automate and significantly reduce the effort needed to build templates and map data streams into a structure. This toolkit is delivered by Element Analytics and reduces the time to value by 80%.

Figure 5: Leveraging Element Analytics for accelerating implementation of data structure

Data scientists need to consider not only the context of data, but also the data preparation. This is where most of the effort can be spent. Data scientists need to prepare the data by selecting the data set, cleansing the data, aligning data in time and formatting it in the correct layout. This is the greatest challenge to data scientists looking to utilize time-series data for advanced analytics.
Agile self-service data preparation tools like OSIsoft’s Business Integrators, in combination with tools like Element Analytics, help to open up big data analytics for business users who are not IT specialists or data scientists. Companies like Cemex have shown that a traditional time-series data preparation that would take six months before it’s ready for analytics can be reduced to four minutes of preparation time with the right tools. This agile and user-friendly way of working with the OSIsoft toolkit will significantly reduce the time to value for business ideas. In addition, less involvement is needed from IT specialists and data scientists, which significantly reduces the Total Cost of Ownership (TCO) for the same business value.
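The data preparation steps named in this section (selecting, cleansing, aligning in time, formatting) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a description of how any of the tools mentioned work: timestamps are integers in seconds, bad values are simply dropped, and alignment uses the last known value at each point of a regular time grid. The stream names and sample values are made up.

```python
from bisect import bisect_right

def cleanse(samples):
    """Drop samples without a numeric value and sort by timestamp."""
    good = [(t, v) for t, v in samples if isinstance(v, (int, float))]
    return sorted(good)

def value_at(samples, t):
    """Last known value at or before time t, or None if there is none yet."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None

def align(streams, start, stop, step):
    """Put several irregular streams on one regular time grid, one row per time."""
    cleaned = {name: cleanse(samples) for name, samples in streams.items()}
    rows = []
    for t in range(start, stop + 1, step):
        row = {"time": t}
        for name, samples in cleaned.items():
            row[name] = value_at(samples, t)
        rows.append(row)
    return rows

# Two hypothetical streams with different sampling times and some bad values.
streams = {
    "flow": [(0, 10.0), (7, 11.0), (13, None), (21, 12.0)],
    "pressure": [(3, 4.1), (12, "BAD"), (18, 4.3)],
}
table = align(streams, start=0, stop=20, step=10)
```

The result is a list of rows with one column per stream, a layout that typical analytics tools can consume directly. Other alignment choices, such as interpolation or averaging per interval, would fit the same structure.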

CONCLUSION

The perfect world for industrial data processing does not exist yet. Pushing all production and operational data in a raw format to a central big data store will result in a data swamp instead of a data lake. Only specialized data scientists will be able to make sense out of the data. In an industrial environment, pre-processing of all real-time data is essential. Bringing context to the data is a must to assure that business users can leverage the data to optimize operations.

This means that in an industrial environment the combination of a data lake with a real-time infrastructure will bring all the benefits of Big Data processing, like:

- Connectivity to the very diverse production and automation world
- Enterprise application development and reporting, enabled by having a ‘One-Stop-Shop’ for data with a standardized data model for all assets
- Direct access for operational staff to real-time data in a structured and agile way to optimize day-to-day operations
- The ability for data scientists to find the big value items by combining all the data

Figure 6: An Enterprise Operations Infrastructure provides the foundation to ensure analytics-ready data for data initiatives.

The seamless integration of OSIsoft’s PI System on the production and automation level (interfaces and connectors) and the seamless integration on the Business Intelligence level with cloud and data lake integration make OSIsoft’s PI System infrastructure a very popular product to bridge the gap between production and data lakes.

In addition, there is no need for software development and complex IT infrastructure to run the PI System, which is built on a self-service model. This reduces the need for large (costly) IT teams to make an OSIsoft PI System implementation a success. The majority of business innovation can be done by key business users (subject matter experts) themselves. Simple integration, no additional development, and simplicity in use will significantly drive down the TCO for this type of infrastructure. The combination of OSIsoft’s PI System, with capabilities supported by vendors like Element Analytics for data modeling and analyses, and the integration of all enterprise data in a data lake platform will provide an environment for easy implementation and fast realization of value from big data processing.

ABOUT KSG SOLUTIONS

KSG-Solutions is a service and consultancy company with a focus on industrial information systems. KSG-Solutions was founded with the objective of helping industrial companies generate more value out of their installed assets by implementing smart solutions based on off-the-shelf IT products. These solutions will help to drive higher asset availability, increased integrity, lower energy consumption, and higher overall productivity. 40 years of experience with Oil & Gas majors and 30 years of experience in Real-Time data processing and MES systems form the basis for the services provided by KSG-Solutions.

For information, please visit our website at www.ksg-solutions.nl

Copyright 2017 KSG Solutions
All companies, products, and brands mentioned are trademarks of their respective trademark owners.
WPLSEN-102617
