Transcription
Cloud Computing – Lecture 09
Data acquisition, migration and flow management
05 April 2022
Chinmaya Dehury
chinmaya.dehury@ut.ee
Outline
- Data Acquisition
- Data Migration
- Data Pipeline solutions
  - AWS Data Pipeline
  - Apache NiFi
[Images]
Img src: http://visioforce.com/smarthome.html
Img src: uting-3ef550c3d84e
Img: and-file-sharing-services
Data Acquisition
- Process of gathering, filtering, and cleaning data
- It is difficult to find the complete set of required data in one place
- Data sources:
  - Social media
  - IoT
  - Events
  - Logs
  - Linked
Img src: 3-319-21569-3 4.pdf
Data Acquisition
- Data can be:
  - Text
  - Audio
  - Video
- 5 V’s of the data:
  - Volume (size of the data)
  - Velocity (how fast is the data generated?)
  - Variety (structured, semi-structured, unstructured data)
  - Veracity (how messy is the data; what are its quality and accuracy?)
  - Value
Data Migration
- Transferring data from one computer storage system to another, e.g.:
  - Transferring images from your smartphone to your laptop
  - Transferring data from an old laptop to a new one
  - Transferring data from Google Drive to Dropbox
  - Sending data from CCTV to cloud storage
  - Sending sensor data to cloud storage
- Process of selecting, preparing, extracting, transforming, and transferring data
- Usually thousands of data sources are involved
- Generated data items are small
- Data is generated at a high frequency
Data Migration: Challenges & Risks
- Data loss
  - At source
  - At intermediate devices
  - Over the network
  - At target
- Knowing the data source
  - Can you identify duplicate, missing, or erroneous data?
- Data validation
  - Validation at source
  - Merged data validation
  - Tools validation
  - Integration validation, etc.
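The validation items above can be illustrated with a small check that compares source and target after a migration. This is only a sketch: the record IDs, the sample payloads, and the choice of SHA-256 checksums are assumptions made for illustration, not something the slides prescribe.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def checksum(payload: bytes) -> str:
    """Return a SHA-256 digest used to compare records across systems."""
    return hashlib.sha256(payload).hexdigest()


def validate_migration(source: Dict[str, bytes], target: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare source and target records keyed by a (hypothetical) record ID."""
    # Records that never arrived at the target.
    missing = sorted(k for k in source if k not in target)
    # Records that arrived but whose content differs from the source.
    corrupted = sorted(
        k for k in source
        if k in target and checksum(source[k]) != checksum(target[k])
    )
    # Target records sharing identical content are flagged as possible duplicates.
    digest_counts = Counter(checksum(v) for v in target.values())
    duplicates = sorted(k for k, v in target.items() if digest_counts[checksum(v)] > 1)
    return {"missing": missing, "corrupted": corrupted, "duplicates": duplicates}


source = {"img1": b"cat", "img2": b"dog", "img3": b"bird"}
target = {"img1": b"cat", "img2": b"d0g", "img4": b"cat"}
report = validate_migration(source, target)
print(report)
```

In this toy run, `img3` is reported as missing, `img2` as corrupted, and `img1`/`img4` as duplicates of each other.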
Data Migration: Challenges & Risks
- Compatibility issues:
  - Storage compatibility (e.g. S3 to DynamoDB, local hard disk to cloud storage)
  - Application compatibility (e.g. an old Excel file opened with Excel 2019)
  - Platform compatibility (e.g. from on-premises to cloud)
  - Cloud compatibility (e.g. AWS to Azure, UT’s OpenStack to AWS)
Data Migration: 2 broad categories
- Online:
  - Migrating data without disrupting other applications
  - E.g. live VM migration
- Offline:
  - This approach disrupts users and applications
  - E.g. data migration during scheduled maintenance, or for backup and restore purposes
Challenges in choosing a data migration method:
- Impact on downtime: estimating downtime
- Risk while migrating data online
- Emergency decision and rollback plan
Data Migration: Factors to consider
- Type of workload: databases, virtual machines (VMs), backups, etc.
- Amount of data
  - Imagine migrating several petabytes of data online
  - Imagine migrating a few GBs of data in offline mode
- Speed to completion:
  - For online migrations: amount of data
  - For offline migrations: shipping time
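To see why the amount of data matters for online migration, a back-of-the-envelope estimate helps. The sketch below assumes a 1 Gbit/s link of which 80% is usable; both numbers are illustrative assumptions, not figures from the lecture.

```python
def transfer_seconds(size_bytes: int, link_bits_per_s: float, efficiency: float = 0.8) -> float:
    """Time to move size_bytes over a link, assuming a fixed usable fraction of its bandwidth."""
    return (size_bytes * 8) / (link_bits_per_s * efficiency)


PB = 10**15          # decimal petabyte
GB = 10**9           # decimal gigabyte
GBIT_PER_S = 10**9   # assumed 1 Gbit/s link

online_pb_days = transfer_seconds(1 * PB, GBIT_PER_S) / 86_400
online_gb_seconds = transfer_seconds(5 * GB, GBIT_PER_S)

print(f"1 PB online: about {online_pb_days:.0f} days")
print(f"5 GB online: about {online_gb_seconds:.0f} seconds")
```

Under these assumptions, one petabyte needs roughly 116 days online, while a few gigabytes take under a minute, so shipping disks only pays off for very large volumes.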
Then what is a Data Pipeline?
Data Pipeline
Pipeline approach for computer instructions
Img src: sembly-line in automobile final.JPG
Data Pipeline
Pipeline approach in logistics:
Data Pipeline
Pipeline approach for handling data acquisition, migration, and its flow.
Data Pipeline (DP)
[Figure: a large data processing task (Data in -> Data processing -> Data out) split into pipeline stages: Data in -> Data processing 1 -> Data processing 2 -> Data processing 3 -> Data out]
Data Pipeline (DP)
- A system for moving data from one system to another
- Encompasses ETL as a subsystem
- Transformation of data is optional
- May process data in real time or in batches
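A minimal sketch of this idea, assuming a toy input of `sensor,value` text lines: the pipeline always moves data from input to output, and the processing stages in between are optional. The stage names and the sample data are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def extract(rows: Iterable[str]) -> Iterator[Record]:
    """Data in: parse raw 'sensor,value' lines into records."""
    for row in rows:
        sensor, value = row.split(",")
        yield {"sensor": sensor, "value": float(value)}


def drop_negative(records: Iterator[Record]) -> Iterator[Record]:
    """Processing stage 1: filter out invalid readings."""
    return (r for r in records if r["value"] >= 0)


def to_fahrenheit(records: Iterator[Record]) -> Iterator[Record]:
    """Processing stage 2: an optional transformation."""
    for r in records:
        yield {**r, "value": r["value"] * 9 / 5 + 32}


def run_pipeline(rows: Iterable[str], stages: List[Stage]) -> List[Record]:
    """Chain the stages lazily, then materialise the result (data out)."""
    records = extract(rows)
    for stage in stages:
        records = stage(records)
    return list(records)


raw = ["s1,20.0", "s2,-1.0", "s3,100.0"]
moved_only = run_pipeline(raw, [])                               # no transformation
transformed = run_pipeline(raw, [drop_negative, to_fahrenheit])  # ETL-style
print(moved_only)
print(transformed)
```

Because the stages are generators, records flow through one at a time; the same chain could be fed from a file (batch) or from a live source (closer to real time).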
Data Pipeline properties
1. Low event latency: query recent event data within minutes or seconds
2. Scalability: able to scale to billions of data points
3. Interactive querying: support both long-running batch queries and smaller interactive queries
4. Versioning
5. Monitoring
6. Testing
Types of data pipeline solutions
1. Batch:
   - Suitable for large volumes of data
   - Moves data at regular time intervals
2. Real-time:
   - Moves and processes data in real time
3. Cloud-native
4. Open source
5. Proprietary solution
Types of data pipeline solutions

Solution type          Solutions
Batch                  Apache Spark, Astera Centerprise, Hevo Data
Real-time              Apache Kafka, Apache Spark, Astera Centerprise, Hevo Data
Cloud-native           AWS Data Pipeline, Hevo Data, Blendo, Confluent
Open-source            Apache Spark, Apache Kafka, Apache NiFi
Proprietary solution   Astera Centerprise, Hevo Data
Data Pipeline Technologies
1. Amazon Data Pipeline
2. Apache NiFi
Amazon Data Pipeline
Img src: veloperGuide/images/dp-how-dp-worksv2.png
- A web service for reliable processing and movement of data
- Focus is on AWS compute and storage services
Amazon Data Pipeline
- Works with AWS services such as:
  - Storage services: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift
  - Compute services: Amazon EC2, Amazon EMR
- Data processing workloads can be:
  - fault tolerant
  - repeatable
  - highly available
Amazon Data Pipeline: An Example
Amazon Data Pipeline – Components
1. Major components
   I. DataNodes
   II. Activities
2. Additional components
   I. Schedules
   II. Preconditions
   III. Resources
Amazon Data Pipeline – Major components
1. Major components
   I. DataNodes: A data node specifies the name, location, and format of a data source such as Amazon S3, DynamoDB, etc.
      i. DynamoDBDataNode
      ii. MySqlDataNode
      iii. RedshiftDataNode
      iv. S3DataNode
      v. SqlDataNode
   II. Activities: Activities are the actions that run SQL queries on databases or transform and move data from one data source to another.
Amazon Data Pipeline – Major components
1. Major components
   I. DataNodes
   II. Activities
      i. CopyActivity
      ii. EmrActivity
      iii. HadoopActivity
      iv. HiveActivity
      v. HiveCopyActivity
      vi. ndActivity
      ix. SqlActivity
Amazon Data Pipeline – Additional components
1. Major components
   I. DataNodes
   II. Activities
2. Additional components
   I. Schedules: A schedule defines the timing of a scheduled event, such as when an activity runs.
Amazon Data Pipeline – Additional components
2. Additional components
   I. Schedules
   II. Preconditions: A condition that must be true before an activity can run, e.g. checking that the data is present at the source before attempting to run CopyActivity.
      A. System-managed preconditions:
         a) DynamoDBDataExists
         b) DynamoDBTableExists
         c) S3KeyExists, etc.
      B. User-managed preconditions:
         a) Exists: checks whether a data node exists.
         b) ShellCommandPrecondition: a Unix/Linux shell command that can be run as a precondition.
Amazon Data Pipeline – Additional components
2. Additional components
   I. Schedules
   II. Preconditions
   III. Resources: The computational resource that performs the work that a pipeline activity specifies.
      i. Ec2Resource: an EC2 instance
      ii. EmrCluster: an Amazon EMR cluster
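The components above come together in a pipeline definition. The sketch below builds one as a Python dictionary and checks that every reference points at a defined object. The object types are the ones named on the slides; the field names (`ref`, `runsOn`, `filePath`, `s3Key`, `period`, and so on) follow the JSON pipeline definition syntax as recalled from the AWS documentation and should be verified there. All IDs, bucket names, dates, and the instance type are hypothetical placeholders, and a real definition would also need IAM roles.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical pipeline: copy one CSV file between S3 locations once a day.
pipeline_definition: Dict[str, list] = {
    "objects": [
        # Schedule: when the activity runs.
        {"id": "DailySchedule", "type": "Schedule",
         "startDateTime": "2022-04-05T00:00:00", "period": "1 day"},
        # Precondition: the input must exist before the copy starts.
        {"id": "InputReady", "type": "S3KeyExists",
         "s3Key": "s3://example-source-bucket/input/data.csv"},
        # DataNodes: where the data lives.
        {"id": "S3Input", "type": "S3DataNode",
         "schedule": {"ref": "DailySchedule"},
         "filePath": "s3://example-source-bucket/input/data.csv"},
        {"id": "S3Output", "type": "S3DataNode",
         "schedule": {"ref": "DailySchedule"},
         "filePath": "s3://example-target-bucket/output/data.csv"},
        # Resource: the machine that performs the work.
        {"id": "Worker", "type": "Ec2Resource",
         "schedule": {"ref": "DailySchedule"},
         "instanceType": "t1.micro"},
        # Activity: the work itself.
        {"id": "CopyData", "type": "CopyActivity",
         "schedule": {"ref": "DailySchedule"},
         "precondition": {"ref": "InputReady"},
         "input": {"ref": "S3Input"},
         "output": {"ref": "S3Output"},
         "runsOn": {"ref": "Worker"}},
    ]
}


def unresolved_refs(definition: Dict[str, list]) -> List[Tuple[str, str, str]]:
    """Return (object id, field, target) for every reference to an undefined object."""
    ids = {obj["id"] for obj in definition["objects"]}
    problems = []
    for obj in definition["objects"]:
        for field_name, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                problems.append((obj["id"], field_name, value["ref"]))
    return problems


print(unresolved_refs(pipeline_definition))
print(json.dumps(pipeline_definition["objects"][-1], indent=2))
```

The local reference check is not part of AWS Data Pipeline itself; it only illustrates how activities tie data nodes, schedules, preconditions, and resources together.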
Trying Amazon Data Pipeline
If your AWS account is less than 12 months old, you are eligible to use the free tier. (url)
Other commercial data pipeline solutions
- Microsoft Azure Data Factory: [...]a-factory/
- Google Cloud Dataflow: https://cloud.google.com/dataflow
- IBM InfoSphere Virtual Data Pipeline: [...]re-virtual-data-pipeline
Data Pipeline Technologies
1. Amazon Data Pipeline
2. Apache NiFi
Apache NiFi Data Pipeline
- Open source, under the Apache Software Foundation
- Automates and manages the flow of data between systems
- Web-based user interface for creating, monitoring, and controlling data flows
- Clients [src]:
  - Micron: Semiconductor Manufacturing
  - Payoff: Financial Wellness (fintech)
  - Slovak Telekom: Telecommunications
  - Looker: SaaS & Analytics Software
  - Hastings Group: Insurance
  - and many more
- Latest version: 1.15.3 (as of April 2022)
Apache NiFi Data Pipeline: Key Features
Flow Management:
- Data Buffering
- Prioritized Queuing
- Guaranteed Delivery
Ease of Use:
- Flow Templates
- Data Provenance
- Fine-grained history
Apache NiFi Data Pipeline: Key Features
Security:
- System to System
- User to System
- Multi-tenant Authorization
Extensible Architecture:
- Extension (e.g. having a custom processor)
- Site-to-Site Communication Protocol
Apache NiFi Data Pipeline: NiFi Architecture
Src: https://www.tutorialspoint.com/apache_nifi/apache_nifi_basic_concepts.htm
Apache NiFi Data Pipeline: NiFi Architecture – Repositories
- Provenance Repository: what happened to a particular data object (FlowFile) is kept here; the history of each FlowFile is stored here.
- FlowFile Repository: stores the metadata of the FlowFiles during the active flow.
- Content Repository: holds the actual content of the FlowFiles.
Src: https://www.tutorialspoint.com/apache_nifi/apache_nifi_basic_concepts.htm
Apache NiFi: Key concepts
1. FlowFile
   - Represents each object moving through the system
   - Includes: data record (pointer to the data payload)
2. Processor
   - Processors actually perform the work
   - E.g. a processor to send email, upload data to an S3 bucket, read data from an FTP server, etc.
3. Process Group
   - Group of processors, connections, input/output ports, etc.
4. Event
5. Data provenance
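The relationship between FlowFiles, processors, and process groups can be modelled in a few lines. This is a conceptual sketch only, not the NiFi API (NiFi processors are Java components); the class names mirror the concepts above, and the two example processors are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class FlowFile:
    """One object moving through the flow: attributes plus a payload."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Processor:
    """Performs the actual work on each FlowFile pulled from its inbound queue."""
    name: str
    work: Callable[[FlowFile], FlowFile]

    def run(self, inbound: Deque[FlowFile], outbound: Deque[FlowFile]) -> None:
        while inbound:
            outbound.append(self.work(inbound.popleft()))


@dataclass
class ProcessGroup:
    """A named group of processors, run here as a simple linear chain."""
    name: str
    processors: List[Processor]

    def run(self, flowfiles: List[FlowFile]) -> List[FlowFile]:
        queue: Deque[FlowFile] = deque(flowfiles)
        for processor in self.processors:
            next_queue: Deque[FlowFile] = deque()  # connection to the next processor
            processor.run(queue, next_queue)
            queue = next_queue
        return list(queue)


def uppercase(ff: FlowFile) -> FlowFile:
    """Hypothetical processor: modify the content."""
    return FlowFile(ff.content.upper(), dict(ff.attributes))


def add_size(ff: FlowFile) -> FlowFile:
    """Hypothetical processor: modify the attributes only."""
    return FlowFile(ff.content, {**ff.attributes, "fileSize": str(len(ff.content))})


group = ProcessGroup("demo", [Processor("Uppercase", uppercase), Processor("AddSize", add_size)])
result = group.run([FlowFile(b"hello", {"filename": "a.txt"})])
print(result[0].content, result[0].attributes)
```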
Apache NiFi – An example
[Figure labels: IDE, Processors, Buffer, Connection]
Apache NiFi Data Pipeline
1. Major components
   I. Processors (execute the task)
   II. Queue (between processors)
2. Additional components
   I. Input Port
   II. Output Port
   III. Process Group (grouping of multiple components such as processors)
   IV. Remote Process Group
   V. Template
Apache NiFi – Processors
1. Major components
   I. Processors
283 processors
Apache NiFi – Processors
1. Major components
   I. Processors
Different states of a Processor: Start, Stop, Enable, and Disable
- A disabled Processor cannot be started.
- Disabling indicates that this Processor should be excluded when a group of Processors is started.
Apache NiFi – Processor Settings
1. Major components
   I. Processors
Configuring a Processor: Settings
- Penalty Duration: time to wait when the data cannot be processed for some reason.
- Yield Duration: time to wait when the Processor cannot make progress.
- Bulletin Level: the level of bulletins that NiFi displays in the user interface (e.g. warn, error, info, debug).
- Failure & Success
Apache NiFi – Processor Scheduling
1. Major components
   I. Processors
Configuring a Processor: Scheduling
- Scheduling strategy: timer driven vs. event driven vs. CRON driven
- Concurrent Tasks: the number of FlowFiles that this Processor should process at the same time.
Apache NiFi – Processor Properties
1. Major components
   I. Processors
Configuring a Processor: Properties
- Provide a mechanism to configure Processor-specific behavior.
- There are no default properties.
Apache NiFi – Processor categories
Different categories of processors:
- Data Ingestion Processors: GetFile, GetHTTP, GetFTP, etc.
- Routing and Mediation Processors: RouteOnAttribute, RouteOnContent, ControlRate, RouteText, etc.
- Database Access Processors: ExecuteSQL, PutSQL, PutDatabaseRecord, ListDatabaseTables, etc.
- Attribute Extraction Processors: UpdateAttribute, EvaluateJSONPath, ExtractText, AttributesToJSON, etc.
- System Interaction Processors: ExecuteScript, ExecuteProcess, ExecuteGroovyScript, ExecuteStreamCommand, etc.
Apache NiFi – Processor categories
Different categories of processors:
- Data Transformation Processors: ReplaceText, JoltTransformJSON, etc.
- Sending Data Processors: PutEmail, PutSFTP, PutFile, PutFTP, etc.
- Splitting and Aggregation Processors: SplitText, SplitJson, SplitXml, MergeContent, SplitContent, etc.
- HTTP Processors: InvokeHTTP, ListenHTTP, etc.
- AWS Processors: GetSQS, PutSNS, PutS3Object, FetchS3Object, etc.
Apache NiFi – Queue
1. Major components
   II. Queue
- Handles large amounts of incoming data.
- It is possible to see the content, ID, filename, file size, etc. of a FlowFile.
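As a rough model of a queue that buffers FlowFiles between processors and hands them out by priority (the "Prioritized Queuing" feature listed earlier), consider the sketch below. It is not NiFi code: the dictionary-based FlowFile, the `priority` attribute convention (lower number first), and the default priority of 10 are assumptions for illustration.

```python
import heapq
import itertools
from typing import Dict, List


class PrioritizedQueue:
    """Buffers FlowFiles between two processors; lower 'priority' values leave first."""

    def __init__(self) -> None:
        self._heap: list = []
        self._arrival = itertools.count()  # tie-breaker: first in, first out

    def put(self, flowfile: Dict[str, str]) -> None:
        priority = int(flowfile.get("priority", 10))  # assumed default priority
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def get(self) -> Dict[str, str]:
        return heapq.heappop(self._heap)[2]

    def listing(self) -> List[Dict[str, str]]:
        """Inspect queued FlowFiles (filename and size) without removing them."""
        return [
            {"filename": ff["filename"], "fileSize": ff["fileSize"]}
            for _, _, ff in sorted(self._heap, key=lambda entry: entry[:2])
        ]

    def __len__(self) -> int:
        return len(self._heap)


queue = PrioritizedQueue()
queue.put({"filename": "bulk.csv", "fileSize": "2048"})
queue.put({"filename": "alert.json", "fileSize": "64", "priority": "1"})
queue.put({"filename": "bulk2.csv", "fileSize": "4096"})

print(queue.listing())
first = queue.get()
print(first["filename"], len(queue))
```

The listing mirrors what the slide describes: queued FlowFiles can be inspected before the next processor consumes them.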
Apache NiFi – Flow Template
Templates:
- Can be thought of as a reusable sub-flow.
- Any properties that are identified as Sensitive Properties (such as a password configured in a Processor) are not added to the template.
[Screenshots: Download Template, Create Template]
Apache NiFi – Flow Template
Templates:
[Screenshots: Add Template, Upload Template]
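The rule that sensitive properties are left out of a template can be pictured as a filter applied when the flow is exported. The sketch below uses a plain dictionary as a stand-in for a flow; the structure, the property names, and the `sensitive` flag are hypothetical and do not reflect NiFi's actual template format.

```python
import copy
from typing import Dict

# Hypothetical in-memory description of a small flow.
flow: Dict[str, object] = {
    "name": "upload-to-s3",
    "processors": [
        {"type": "GetFile",
         "properties": {"Input Directory": {"value": "/data/in", "sensitive": False}}},
        {"type": "PutS3Object",
         "properties": {
             "Bucket": {"value": "example-bucket", "sensitive": False},
             "Secret Key": {"value": "s3cr3t", "sensitive": True},
         }},
    ],
}


def create_template(source_flow: Dict[str, object]) -> Dict[str, object]:
    """Copy the flow as a reusable sub-flow, leaving out every sensitive property."""
    template = copy.deepcopy(source_flow)  # never mutate the running flow
    for processor in template["processors"]:
        processor["properties"] = {
            name: prop
            for name, prop in processor["properties"].items()
            if not prop["sensitive"]
        }
    return template


template = create_template(flow)
print(sorted(template["processors"][1]["properties"]))
```

Whoever imports such a template has to re-enter the sensitive values, which is the point of excluding them.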
Apache NiFi – Data Provenance
- Snapshots of each FlowFile.
- Event type, FlowFile Lineage Graph, Provenance event details
- In-depth discovery of the chain of events.
Apache NiFi – Data Provenance
- List of events
- Provenance event details
- FlowFile Lineage Graph
Event types: RECEIVE, SEND, DROP, JOIN, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, FORK, CLONE, ROUTE, etc.
Apache NiFi – Data Provenance
Provenance event details
Apache NiFi – Data Provenance
FlowFile Lineage Graph
[Figure labels: Event; FlowFile; Event whose graph was selected (red color); Timestamp of the event]
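One way to picture how a lineage graph is derived from provenance events is the sketch below. It is a simplified model, not NiFi's provenance repository: events carry a sequence number in place of a timestamp, the FlowFile IDs are made up, and only a single parent per FlowFile is tracked.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    seq: int                       # stands in for the event timestamp
    event_type: str                # e.g. RECEIVE, CLONE, SEND
    flowfile_id: str
    parent_id: Optional[str] = None


class ProvenanceLog:
    """Append-only list of events, queryable per FlowFile."""

    def __init__(self) -> None:
        self.events: List[ProvenanceEvent] = []
        self._parents: Dict[str, str] = {}
        self._seq = count(1)

    def record(self, event_type: str, flowfile_id: str, parent_id: Optional[str] = None) -> None:
        if parent_id is not None:
            self._parents[flowfile_id] = parent_id  # child created by FORK/CLONE
        self.events.append(ProvenanceEvent(next(self._seq), event_type, flowfile_id, parent_id))

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """Events of the FlowFile and of all its ancestors, in recording order."""
        ids = set()
        current: Optional[str] = flowfile_id
        while current is not None:
            ids.add(current)
            current = self._parents.get(current)
        return [event for event in self.events if event.flowfile_id in ids]


log = ProvenanceLog()
log.record("RECEIVE", "ff-a")
log.record("CONTENT_MODIFIED", "ff-a")
log.record("RECEIVE", "ff-c")                 # unrelated FlowFile
log.record("CLONE", "ff-b", parent_id="ff-a")
log.record("SEND", "ff-b")

chain = [event.event_type for event in log.lineage("ff-b")]
print(chain)
```

Walking from a FlowFile back through its parents is what allows the chain of events to be reconstructed in depth.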
What next?
Let’s move to the lab session (Introduction to data pipelines using Apache NiFi)
References
1. Lyko, Klaus, Marcus Nitzschke, and Axel-Cyrille Ngonga Ngomo. "Big data acquisition." New Horizons for a Data-Driven Economy. Springer, Cham, 2016. 39-61.
2. Casale, G., Artač, M., van den Heuvel, W. et al. RADON: rational decomposition and orchestration for serverless computing. SICS Softw.-Inensiv. Cyber-Phys. Syst. (2019).
3. Lyko, K., Nitzschke, M., Ngonga Ngomo, AC. (2016). Big Data Acquisition. In: Cavanillas, J., Curry, E., Wahlster, W. (eds) New Horizons for a Data-Driven Economy. Springer, Cham. https://doi.org/10.1007/978-3-319-21569-3
Thank you