Cloud Computing Lecture 09 Data Acquisition, Migration And Flow Management

Transcription

Cloud Computing – Lecture 09
Data acquisition, migration and flow management
05 April 2022
Chinmaya Dehury
chinmaya.dehury@ut.ee

Outline
- Data Acquisition
- Data Migration
- Data Pipeline solutions
  - AWS Data Pipeline
  - Apache NiFi
Cloud Computing - Lecture 09: Data Acquisition, migration and flow management

Img src: http://visioforce.com/smarthome.html
Img src: uting-3ef550c3d84e
Img: and-file-sharing-services
Img: /

Data Acquisition
- The process of gathering, filtering, and cleaning data
- It is difficult to find the complete set of required data in one place
- Data sources:
  - Social media
  - IoT
  - Events
  - Logs
  - Linked
Img src: 3-319-21569-3 4.pdf

Data Acquisition
- Data can be:
  - Text
  - Audio
  - Video
- The 5 V's of data:
  - Volume (size of the data)
  - Velocity (how fast is the data generated?)
  - Variety (structured, semi-structured, unstructured data)
  - Veracity (how messy is the data; its quality and accuracy)
  - Value

Data Migration
- Transferring data from one computer storage system to another, e.g.:
  - Transferring images from your smartphone to your laptop
  - Transferring data from an old laptop to a new one
  - Transferring data from Google Drive to Dropbox
  - Sending data from CCTV to cloud storage
  - Sending sensor data to cloud storage
- The process of selecting, preparing, extracting, and transforming data, and then transferring it
- Usually thousands of data sources are involved
- The generated data items are small
- Data is generated at a high frequency

Data Migration: Challenges & Risks
- Data loss
  - At the source
  - At intermediate devices
  - Over the network
  - At the target
- Knowing the data source
  - Can you identify duplicate, missing, and erroneous data?
- Data validation
  - Validation at source
  - Merged data validation
  - Tools validation
  - Integration validation, etc.
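The question of identifying duplicate, missing, and erroneous data can be illustrated with a small check that compares records at the target against the source. This is only a sketch under assumptions: the record keys, the in-memory dictionaries, and the function names `checksum` and `validate_migration` are hypothetical, and SHA-256 is just one possible way to detect altered content.

```python
import hashlib
from collections import Counter


def checksum(payload: bytes) -> str:
    """Return a SHA-256 hex digest used to compare record contents."""
    return hashlib.sha256(payload).hexdigest()


def validate_migration(
    source: dict[str, bytes],
    target_records: list[tuple[str, bytes]],
) -> dict[str, list[str]]:
    """Compare what arrived at the target with what exists at the source.

    `source` maps a record key to its payload (hypothetical layout).
    `target_records` is the list of (key, payload) pairs found at the target;
    a key may appear more than once if a record was transferred twice.
    """
    counts = Counter(key for key, _ in target_records)
    # If a key is duplicated, dict() keeps the last payload seen for it.
    target = dict(target_records)

    missing = sorted(key for key in source if key not in target)
    duplicated = sorted(key for key, n in counts.items() if n > 1)
    corrupted = sorted(
        key
        for key in source
        if key in target and checksum(source[key]) != checksum(target[key])
    )
    return {"missing": missing, "duplicated": duplicated, "corrupted": corrupted}


if __name__ == "__main__":
    source = {"a": b"1", "b": b"2", "c": b"3"}
    target = [("a", b"1"), ("a", b"1"), ("b", b"X")]
    print(validate_migration(source, target))
```

In a real migration the same comparison would typically run on checksums exported by the source and target systems rather than on full payloads held in memory.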

Data Migration: Challenges & Risks
- Compatibility issues:
  - Storage compatibility (e.g. S3 - DynamoDB, local hard disk - cloud storage)
  - Application compatibility (e.g. an old Excel file with Excel 2019)
  - Platform compatibility (e.g. from on-premise to cloud)
  - Cloud compatibility (e.g. AWS - Azure, UT's OpenStack - AWS)

Data Migration
Two broad categories:
- Online:
  - Migrating data without disrupting other applications
  - e.g. live VM migration
- Offline:
  - This migration approach disrupts users and applications
  - e.g. data migration during scheduled maintenance, or for backup and restore purposes
Challenges in choosing a data migration method:
- Impact on downtime: estimating the downtime
- Risk while migrating data online
- Emergency decisions and a rollback plan

Data Migration
Factors to consider:
- Type of workload: databases, virtual machines (VMs), backups, etc.
- Amount of data
  - Imagine migrating several petabytes of data online
  - Imagine migrating a few GBs of data in offline mode
- Speed to completion:
  - For online migrations: amount of data
  - For offline migrations: shipping time
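The "amount of data" factor for online migration can be made concrete with a rough transfer-time estimate. The link speeds and the 80% link efficiency below are assumptions chosen for illustration, not figures from the lecture, and `online_transfer_days` is a hypothetical helper name.

```python
PB = 10**15  # bytes in a petabyte (decimal)
GB = 10**9   # bytes in a gigabyte (decimal)
SECONDS_PER_DAY = 86_400


def online_transfer_days(
    data_bytes: float,
    link_bits_per_second: float,
    efficiency: float = 0.8,  # assumed fraction of the link usable for payload
) -> float:
    """Estimate how many days an online transfer takes over one network link."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 1 PB over an assumed 1 Gbit/s link: roughly 116 days.
    print(f"1 PB @ 1 Gbit/s: {online_transfer_days(1 * PB, 1e9):.1f} days")
    # 5 GB over an assumed 100 Mbit/s link: roughly 500 seconds.
    seconds = online_transfer_days(5 * GB, 100e6) * SECONDS_PER_DAY
    print(f"5 GB @ 100 Mbit/s: {seconds:.0f} s")
```

Under these assumptions a petabyte-scale online migration takes months, which is why shipping time becomes the relevant measure for offline migration of very large datasets, while a few GBs are moved online in minutes.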

Then, what is a data pipeline?

Data Pipeline
Pipeline approach for computer instructions
Img: sembly-line in automobile final.JPG

Data Pipeline
Pipeline approach in logistics

Data Pipeline
Pipeline approach for handling data acquisition, data migration, and data flow.

Data Pipeline (DP)
[Diagram: a large data processing task (Data in -> Data processing -> Data out) split into a pipeline of stages: Data in -> Data processing 1 -> Data processing 2 -> Data processing 3 -> Data out]

Data Pipeline (DP)
- A system for moving data from one system to another
- Encompasses ETL as a subsystem
- Transformation of data is optional
- May process data in real time or in batches
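The idea of a pipeline that moves data from a source to a sink, with optional transformation stages in between, can be sketched with Python generators. All names here (`extract`, `drop_incomplete`, `to_fahrenheit`, `run_pipeline`) and the sensor records are hypothetical and only illustrate the concept; they are not part of any tool covered in the lecture.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def extract(rows: Iterable[Record]) -> Iterator[Record]:
    """Extract step: read records from a source (here an in-memory list)."""
    yield from rows


def drop_incomplete(records: Iterable[Record]) -> Iterator[Record]:
    """Transform step: filter out records with a missing measurement."""
    for record in records:
        if record.get("celsius") is not None:
            yield record


def to_fahrenheit(records: Iterable[Record]) -> Iterator[Record]:
    """Transform step: convert the measurement to another unit."""
    for record in records:
        yield {
            "sensor": record["sensor"],
            "fahrenheit": record["celsius"] * 9 / 5 + 32,
        }


def run_pipeline(source: Iterable[Record], stages: list[Stage], sink: list) -> None:
    """Chain the stages lazily and load every resulting record into the sink."""
    stream: Iterable[Record] = source
    for stage in stages:
        stream = stage(stream)
    for record in stream:
        sink.append(record)  # load step


if __name__ == "__main__":
    readings = [
        {"sensor": "s1", "celsius": 20.0},
        {"sensor": "s2", "celsius": None},
        {"sensor": "s3", "celsius": 100.0},
    ]
    sink: list = []
    run_pipeline(extract(readings), [drop_incomplete, to_fahrenheit], sink)
    print(sink)
```

Passing an empty list of stages moves the data unchanged, which matches the point that transformation is optional. Because the stages are generators, records flow through one at a time, so the same structure works for a finite batch or for a source that keeps producing records.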

Data Pipeline properties
1. Low event latency: query recent event data within minutes or seconds
2. Scalability: able to scale to billions of data points
3. Interactive querying: support both long-running batch queries and smaller interactive queries
4. Versioning
5. Monitoring
6. Testing

Types of data pipeline solutions
1. Batch:
   - Suitable for large volumes of data
   - Moves data at regular time intervals
2. Real-time: moves and processes data in real time
3. Cloud-native
4. Open-source
5. Proprietary solution

Types of data pipeline solutions
Solution type         | Solutions
Batch                 | Apache Spark, Astera Centerprise, Hevo Data
Real-time             | Apache Kafka, Apache Spark, Astera Centerprise, Hevo Data
Cloud-native          | AWS Data Pipeline, Hevo Data, Blendo, Confluent
Open-source           | Apache Spark, Apache Kafka, Apache NiFi
Proprietary solution  | Astera Centerprise, Hevo Data

Data Pipeline Technologies
1. Amazon Data Pipeline
2. Apache NiFi

Amazon Data Pipeline
- A web service for reliable processing and movement of data
- The focus is on AWS compute and storage services
Img src: veloperGuide/images/dp-how-dp-worksv2.png

Amazon Data Pipeline
- Works with AWS services such as:
  - Storage services: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift
  - Compute services: Amazon EC2, Amazon EMR
- Data processing workloads can be:
  - fault tolerant
  - repeatable
  - highly available

Amazon Data Pipeline: An Example

Amazon Data Pipeline – Components
1. Major components
   I. DataNodes
   II. Activities
2. Additional components
   I. Schedules
   II. Preconditions
   III. Resources

Amazon Data Pipeline – Major components
1. Major components
   I. DataNodes: a DataNode specifies the name, location, and format of a data source such as Amazon S3, DynamoDB, etc.
      i. DynamoDBDataNode
      ii. MySqlDataNode
      iii. RedshiftDataNode
      iv. S3DataNode
      v. SqlDataNode
   II. Activities: activities are the actions that run SQL queries on databases or transform data from one data source to another.

Amazon Data Pipeline – Major components
1. Major components
   I. DataNodes
   II. Activities
      i. CopyActivity
      ii. EmrActivity
      iii. HadoopActivity
      iv. HiveActivity
      v. HiveCopyActivity
      vi. ndActivity
      ix. SqlActivity

Amazon Data Pipeline - Additional components
1. Major components
   I. DataNodes
   II. Activities
2. Additional components
   I. Schedules: a schedule defines the timing of a scheduled event, such as when an activity runs.

Amazon Data Pipeline - Additional components
2. Additional components
   I. Schedules
   II. Preconditions: a condition that must be true before an activity can run. E.g., check whether the data is present at the source before attempting to run CopyActivity.
      A. System-managed preconditions:
         a) DynamoDBDataExists
         b) DynamoDBTableExists
         c) S3KeyExists, etc.
      B. User-managed preconditions:
         a) Exists: checks whether a data node exists.
         b) ShellCommandPrecondition: a Unix/Linux shell command that can be run as a precondition.

Amazon Data Pipeline - Additional components
2. Additional components
   I. Schedules
   II. Preconditions
   III. Resources: the computational resource that performs the work specified by a pipeline activity.
      i. Ec2Resource: an EC2 instance
      ii. EmrCluster: an Amazon EMR cluster
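How the components listed above fit together can be sketched as a pipeline definition held in a Python dictionary: a Schedule, an S3KeyExists precondition, two S3DataNodes, an Ec2Resource, and a CopyActivity that references them. This is a sketch only. The object types come from the slides, but the field names (`period`, `startDateTime`, `s3Key`, `filePath`, `instanceType`, `terminateAfter`, `runsOn`, and the `{"ref": ...}` convention) are written from memory of the AWS Data Pipeline definition syntax and should be checked against the Developer Guide; the bucket names, ids, and the helper `unresolved_refs` are hypothetical.

```python
import json

# Hypothetical pipeline definition; field names are assumptions to verify.
pipeline_definition = {
    "objects": [
        {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startDateTime": "2022-04-05T00:00:00"},
        {"id": "InputReady", "name": "InputReady", "type": "S3KeyExists",
         "s3Key": "s3://example-source-bucket/input/data.csv"},
        {"id": "SourceData", "name": "SourceData", "type": "S3DataNode",
         "filePath": "s3://example-source-bucket/input/data.csv",
         "precondition": {"ref": "InputReady"},
         "schedule": {"ref": "DailySchedule"}},
        {"id": "TargetData", "name": "TargetData", "type": "S3DataNode",
         "filePath": "s3://example-target-bucket/output/data.csv",
         "schedule": {"ref": "DailySchedule"}},
        {"id": "Worker", "name": "Worker", "type": "Ec2Resource",
         "instanceType": "t1.micro", "terminateAfter": "1 Hour",
         "schedule": {"ref": "DailySchedule"}},
        {"id": "CopyData", "name": "CopyData", "type": "CopyActivity",
         "input": {"ref": "SourceData"}, "output": {"ref": "TargetData"},
         "runsOn": {"ref": "Worker"}, "schedule": {"ref": "DailySchedule"}},
    ]
}


def unresolved_refs(definition: dict) -> list[tuple[str, str, str]]:
    """Return (object id, field, referenced id) for every dangling reference."""
    ids = {obj["id"] for obj in definition["objects"]}
    dangling = []
    for obj in definition["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                dangling.append((obj["id"], field, value["ref"]))
    return dangling


if __name__ == "__main__":
    print("dangling references:", unresolved_refs(pipeline_definition))
    print(json.dumps(pipeline_definition, indent=2)[:200], "...")
```

The local reference check runs without an AWS account and catches the most common mistake in such definitions, an activity pointing at a data node or resource that was never declared.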

Trying Amazon Data Pipeline
If your AWS account is less than 12 months old, you are eligible to use the free tier. (url)

Other commercial data pipeline solutions
- Microsoft Azure Data Factory: a-factory/
- Google Cloud Dataflow: https://cloud.google.com/dataflow
- IBM InfoSphere Virtual Data Pipeline: re-virtual-data-pipeline

Data Pipeline Technologies
1. Amazon Data Pipeline
2. Apache NiFi

Apache NiFi Data Pipeline
- Open-source, under the Apache Software Foundation
- Automates and manages the flow of data between systems
- Web-based user interface for creating, monitoring, and controlling data flows
- Clients [src]:
  - Micron: Semiconductor Manufacturing
  - Payoff: Financial Wellness (fintech)
  - Slovak Telekom: Telecommunications
  - Looker: SaaS & Analytics Software
  - Hastings Group: Insurance
  - and many more
- Latest version: 1.15.3 (as of April 2022)

Apache NiFi Data Pipeline
Key Features
- Flow Management:
  - Data buffering
  - Prioritized queuing
  - Guaranteed delivery
- Ease of Use:
  - Flow templates
  - Data provenance
  - Fine-grained history

Apache NiFi Data Pipeline
Key Features
- Security:
  - System to system
  - User to system
  - Multi-tenant authorization
- Extensible Architecture:
  - Extension (e.g. adding a custom processor)
  - Site-to-Site communication protocol

Apache NiFi Data Pipeline
NiFi Architecture
Src: https://www.tutorialspoint.com/apache_nifi/apache_nifi_basic_concepts.htm

Apache NiFi Data Pipeline
NiFi Architecture - Repositories
- Provenance Repository: records what happened to a particular data object (FlowFile); the history of each FlowFile is stored here.
- FlowFile Repository: stores the metadata of the FlowFiles during the active flow.
- Content Repository: holds the actual content of the FlowFiles.
Src: https://www.tutorialspoint.com/apache_nifi/apache_nifi_basic_concepts.htm

Apache NiFi Key concepts
1. FlowFile
   - Represents each object moving through the system
   - Includes: the data record (a pointer to the data payload)
2. Processor
   - Processors actually perform the work
   - E.g. a processor to send email, upload data to an S3 bucket, read data from an FTP server, etc.
3. Process Group
   - A group of processors, connections, inputs/outputs, etc.
4. Event
5. Data provenance
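The relationship between FlowFiles, Processors, and Process Groups can be illustrated with a toy model. This is not NiFi's actual API (NiFi processors are written in Java against its own framework); the classes `FlowFile`, `Processor`, and `ProcessGroup` below, and the two example work functions, are hypothetical stand-ins that only mirror the concepts.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FlowFile:
    """One object moving through the flow: content plus key/value attributes."""
    content: bytes
    attributes: dict[str, str] = field(default_factory=dict)


class Processor:
    """A named unit that performs one piece of work on a FlowFile."""

    def __init__(self, name: str, work: Callable[[FlowFile], FlowFile]):
        self.name = name
        self.work = work

    def process(self, flowfile: FlowFile) -> FlowFile:
        return self.work(flowfile)


class ProcessGroup:
    """A group of processors, applied here in a fixed order."""

    def __init__(self, name: str, processors: list[Processor]):
        self.name = name
        self.processors = processors

    def run(self, flowfile: FlowFile) -> FlowFile:
        for processor in self.processors:
            flowfile = processor.process(flowfile)
        return flowfile


def to_upper(flowfile: FlowFile) -> FlowFile:
    """Example work: modify the content, keep the attributes."""
    return FlowFile(flowfile.content.upper(), dict(flowfile.attributes))


def tag_size(flowfile: FlowFile) -> FlowFile:
    """Example work: keep the content, add an attribute."""
    attributes = {**flowfile.attributes, "size": str(len(flowfile.content))}
    return FlowFile(flowfile.content, attributes)


if __name__ == "__main__":
    group = ProcessGroup("demo", [Processor("ToUpper", to_upper),
                                  Processor("TagSize", tag_size)])
    print(group.run(FlowFile(b"hello", {"filename": "a.txt"})))
```

In NiFi itself the processors of a group are wired together by connections rather than run in a fixed sequence; the linear order here is a simplification.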

Apache NiFi – An example
[Screenshot labels: IDE, Processors, Buffer, Connection]

Apache NiFi Data Pipeline
1. Major components
   I. Processors (execute the tasks)
   II. Queue (between processors)
2. Additional components
   I. Input Port
   II. Output Port
   III. Process Group (a grouping of multiple components, such as processors)
   IV. Remote Process Group
   V. Template

Apache NiFi - Processors
1. Major components
   I. Processors
283 processors

Apache NiFi - Processors
1. Major components
   I. Processors
Different states of a Processor: Start, Stop, Enable, and Disable
- A disabled Processor cannot be started.
- When a group of Processors is started, a disabled Processor should be excluded.

Apache NiFi – Processor Settings
1. Major components
   I. Processors
Configuring a Processor: Settings
- Penalty Duration: the time to wait when the data cannot be processed for some reason.
- Yield Duration: the time to wait when the Processor cannot make progress.
- Bulletin Level: the level of bulletins that NiFi displays in the user interface (e.g. warn, error, info, debug).
- Failure & Success

Apache NiFi – Processor Scheduling
1. Major components
   I. Processors
Configuring a Processor: Scheduling
- Scheduling strategy: Timer driven vs. Event driven vs. CRON driven
- Concurrent Tasks: the number of FlowFiles that should be processed by this Processor at the same time.

Apache NiFi – Processor Properties
1. Major components
   I. Processors
Configuring a Processor: Properties
- Provides a mechanism to configure Processor-specific behavior.
- There are no default properties.

Apache NiFi – Processor categories
Different categories of processors:
- Data Ingestion Processors: GetFile, GetHTTP, GetFTP, etc.
- Routing and Mediation Processors: RouteOnAttribute, RouteOnContent, ControlRate, RouteText, etc.
- Database Access Processors: ExecuteSQL, PutSQL, PutDatabaseRecord, ListDatabaseTables, etc.
- Attribute Extraction Processors: UpdateAttribute, EvaluateJSONPath, ExtractText, AttributesToJSON, etc.
- System Interaction Processors: ExecuteScript, ExecuteProcess, ExecuteGroovyScript, ExecuteStreamCommand, etc.

Apache NiFi – Processor categories
Different categories of processors:
- Data Transformation Processors: ReplaceText, JoltTransformJSON, etc.
- Sending Data Processors: PutEmail, PutSFTP, PutFile, PutFTP, etc.
- Splitting and Aggregation Processors: SplitText, SplitJson, SplitXml, MergeContent, SplitContent, etc.
- HTTP Processors: InvokeHTTP, ListenHTTP, etc.
- AWS Processors: GetSQS, PutSNS, PutS3Object, FetchS3Object, etc.

Apache NiFi - Queue
1. Major components
   II. Queue
- Handles a large inflow of data.
- It is possible to inspect the content, ID, filename, file size, etc. of a FlowFile in the queue.
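The role of a queue between two processors (buffering inflow, ordering FlowFiles by priority, and letting the user list what is waiting) can be sketched as follows. This is a toy model, not NiFi code: the `Connection` class, the `priority` field (lower number means served first), and the `backpressure_limit` parameter are assumptions made for illustration.

```python
import heapq
import itertools
from typing import Optional


class Connection:
    """A bounded, prioritized buffer sitting between two processors."""

    def __init__(self, backpressure_limit: int):
        self.backpressure_limit = backpressure_limit
        self._heap: list[tuple[int, int, dict]] = []
        self._sequence = itertools.count()  # tie-breaker keeps FIFO order

    def is_full(self) -> bool:
        return len(self._heap) >= self.backpressure_limit

    def offer(self, flowfile: dict) -> bool:
        """Enqueue a FlowFile; return False if the queue is full."""
        if self.is_full():
            return False
        priority = int(flowfile.get("priority", 5))
        heapq.heappush(self._heap, (priority, next(self._sequence), flowfile))
        return True

    def poll(self) -> Optional[dict]:
        """Dequeue the highest-priority FlowFile, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def listing(self) -> list[dict]:
        """Show ID, filename, and size of queued FlowFiles in service order."""
        ordered = sorted(self._heap, key=lambda entry: entry[:2])
        return [
            {"id": ff["id"], "filename": ff["filename"], "size": ff["size"]}
            for _, _, ff in ordered
        ]


if __name__ == "__main__":
    queue = Connection(backpressure_limit=2)
    queue.offer({"id": "1", "filename": "a.csv", "size": 10, "priority": 5})
    queue.offer({"id": "2", "filename": "b.csv", "size": 20, "priority": 1})
    print(queue.offer({"id": "3", "filename": "c.csv", "size": 30}))
    print(queue.listing())
```

Rejecting new FlowFiles when the limit is reached is one simple way to model back pressure: the upstream processor has to wait until the downstream processor drains the queue.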

Apache NiFi - Flow Template
Templates:
- Can be thought of as a reusable sub-flow.
- Any properties that are identified as Sensitive Properties (such as a password that is configured in a Processor) will not be added to the template.
[Screenshot labels: Download Template, Create Template]

Apache NiFi - Flow Template
Templates:
[Screenshot labels: Add Template, Upload Template]

Apache NiFi – Data Provenance
- Snapshots of each FlowFile.
- Event type, FlowFile lineage graph, provenance event details.
- In-depth discovery of the chain of events.

Apache NiFi – Data Provenance
[Screenshot labels: List of events, Provenance event details, FlowFile lineage graph]
Event types: RECEIVE, SEND, DROP, JOIN, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, FORK, CLONE, ROUTE, etc.
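The idea of recording a provenance event for every step and then reading back the chain of events for one FlowFile can be sketched with a small in-memory log. This is a toy illustration, not NiFi's provenance repository: the classes `ProvenanceEvent` and `ProvenanceLog`, the component names, and the FlowFile ids are hypothetical; only the event type strings are taken from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str      # e.g. "RECEIVE", "CONTENT_MODIFIED", "SEND"
    flowfile_id: str
    component: str       # name of the processor that produced the event
    timestamp: datetime


class ProvenanceLog:
    """Append-only list of events with a per-FlowFile lineage lookup."""

    def __init__(self) -> None:
        self._events: list[ProvenanceEvent] = []

    def record(self, event_type: str, flowfile_id: str, component: str) -> ProvenanceEvent:
        event = ProvenanceEvent(
            event_type, flowfile_id, component, datetime.now(timezone.utc)
        )
        self._events.append(event)
        return event

    def lineage(self, flowfile_id: str) -> list[ProvenanceEvent]:
        """Return the events of one FlowFile in the order they were recorded."""
        return [e for e in self._events if e.flowfile_id == flowfile_id]


if __name__ == "__main__":
    log = ProvenanceLog()
    log.record("RECEIVE", "ff-1", "GetFile")
    log.record("RECEIVE", "ff-2", "GetFile")
    log.record("CONTENT_MODIFIED", "ff-1", "ReplaceText")
    log.record("SEND", "ff-1", "PutSFTP")
    log.record("DROP", "ff-2", "RouteOnAttribute")
    for event in log.lineage("ff-1"):
        print(event.timestamp.isoformat(), event.event_type, event.component)
```

A lineage graph is essentially this per-FlowFile event chain drawn as nodes and edges, with extra links between parent and child FlowFiles for events such as FORK, CLONE, and JOIN, which this sketch does not model.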

Apache NiFi – Data Provenance
Provenance event details

Apache NiFi – Data Provenance
FlowFile Lineage Graph
[Screenshot labels: Event; FlowFile; Event whose graph was selected (red color); Timestamp of the event]

What next?

Let's move to the lab session (Introduction to data pipelines using Apache NiFi)

References
1. Lyko, Klaus, Marcus Nitzschke, and Axel-Cyrille Ngonga Ngomo. "Big data acquisition." New Horizons for a Data-Driven Economy. Springer, Cham, 2016. 39-61.
2. Casale, G., Artač, M., van den Heuvel, W. et al. RADON: rational decomposition and orchestration for serverless computing. SICS Softw.-Intensiv. Cyber-Phys. Syst. (2019).
3. Lyko, K., Nitzschke, M., Ngonga Ngomo, AC. (2016). Big Data Acquisition. In: Cavanillas, J., Curry, E., Wahlster, W. (eds) New Horizons for a Data-Driven Economy. Springer, Cham. https://doi.org/10.1007/978-3-319-21569-3

Thank you
