Transcription
Tackling Big Data Using MATLAB
Alka Nair, Application Engineer
© 2015 The MathWorks, Inc.
Building Machine Learning Models with Big Data
- Access
- Preprocess, Exploration & Model Development
- Scale up & Integrate with Production Systems
Case study: Predict Air Quality
Factors Affecting Air Quality
My Weather Page: Temperature, Pressure, Relative Humidity, Dew Point, Wind Speed, Wind Direction
Challenges in Modeling and Deploying Big Data Applications
Access
- Distributed Data Storage
- Different Data Sources & Types
- Managing Different APIs for Data Sources and Data Formats
Preprocess, Exploration & Model Development
- Preprocessing and Visualizing Big Data
- Parallelizing Jobs and Scaling up Computations to Cluster
- Rewriting Algorithms to Use Big Data Platforms
- Parallelizing Code to Scale up to Use Cluster and Cloud Compute
Scale up & Integrate with Production Systems
- Enterprise level deployment
- Overhead in Moving the Algorithm to Production
Wouldn’t it be nice if you could:
- Easily access data however it is stored
- Prototype algorithms quickly using small data sets
- Scale up to big data sets running on large clusters
Using the same intuitive MATLAB syntax you are used to
Access and Manage Big Data
Different Data Types
- Text
- Images
- Spreadsheet
- Custom File Formats
Different Data Sources
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Windows Azure Blob Storage
- Relational Database
- HDFS on Hortonworks or Cloudera
Datastores
Different Applications
- MapReduce
- Image Segmentation
- Image Classification
- Denoising Images
- Predictive Maintenance
[Diagram labels: One or more files, Cluster of Machines, Memory]
Air Quality Data on Local Folder
Accessing and Processing Different Types of Data
(Image Collection, MDF Files)
- TabularTextDatastore: Text files containing column-oriented data, including CSV files
- ImageDatastore: Image files, including formats that are supported by imread such as JPEG and PNG
- SpreadsheetDatastore: Spreadsheet files with a supported Excel format such as .xlsx
- MDFDatastore: Datastore for collection of MDF files
- Custom Datastore: Datastore for custom or proprietary format
You have 1 TB of data you’ve never seen before. How do you access this data?

Historical files are on HDFS and real time data are available through an API
- Temperature
- Pressure
- Relative Humidity
- Dew Point
- Wind Speed
- Wind Direction
- Ozone
- CO
- NO2
- SO2
Access air quality data using datastore

Preview the data and adjust properties to best represent the data of interest
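The two steps above (create a datastore, then preview it and adjust its properties) might look roughly like the following. This is a minimal sketch, not the code shown in the talk: the file name, column names, and values are hypothetical stand-ins written to a temporary folder so the example is self-contained.

```matlab
% Write a tiny stand-in CSV so the sketch runs anywhere.
% File name, column names, and values are hypothetical, not from the talk.
folder = tempname;
mkdir(folder);
T = table([21.5; 23.1; 19.8], [1012; 1009; 1015], [40; 35; 52], ...
    'VariableNames', {'Temperature', 'Pressure', 'Ozone'});
writetable(T, fullfile(folder, 'airquality_sample.csv'));

% Point a datastore at every CSV file in the folder.
% Creating the datastore does not load the data into memory.
ds = datastore(fullfile(folder, '*.csv'));

% Look at the first few rows without reading the whole collection.
firstRows = preview(ds);
disp(firstRows)

% Adjust a property so that only the variables of interest are read.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};
subset = preview(ds);
disp(subset)
```

In principle only the location string passed to datastore would need to change to read the same kind of files from another storage system, which is the point the next slide makes.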
Access data from anywhere with minimal changes
Local disk
Datastores enable big data workflows
- Deep Learning
- Predictive Maintenance
- Fleet Analytics
Datastores: Access Big Data with Minimal Changes
Different Data Types
- Text
- Images
- Spreadsheet
- Custom File Formats
Different Data Sources
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Windows Azure Blob Storage
- Relational Database
- HDFS on Hortonworks or Cloudera
Different Applications
- MapReduce
- Image Segmentation
- Image Classification
- Denoising Images
- Predictive Maintenance
You have 1 TB of data you’ve never seen before. How do you visualize and process the data?

Use tall arrays to work with the data like any MATLAB array
- Introduction to Tall Arrays
- Tall Arrays for Big Data Visualization and Preprocessing
- Machine Learning for Big Data Using Tall Arrays
Tall arrays
- Data is in one or more files
- Files stacked vertically
- Typically tabular data
Challenge
- Data doesn’t fit into memory (even cluster memory)
- Takes a lot of time for even simple operations on data
[Diagram labels: Single Machine Memory, Cluster of Machines Memory]
Tall arrays (new R2016b)
Create tall table from datastore:
    ds = datastore('*.csv')
    tt = tall(ds)
Operate on whole tall table just like ordinary table:
    summary(tt)
    max(tt.EndTime - tt.StartTime)
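A runnable version of the slide's snippet could look like this. It is a sketch under stated assumptions: the StartTime and EndTime columns echo the slide, but the values and the file name are hypothetical, and mapreducer(0) is added here so that the tall computation runs in the local MATLAB session rather than on a parallel pool.

```matlab
% Run tall computations in the local session (no Parallel Computing
% Toolbox pool); this line is an addition for the sketch.
mapreducer(0);

% Hypothetical sample data standing in for a large file collection.
folder = tempname;
mkdir(folder);
StartTime = [0; 10; 25];
EndTime   = [5; 30; 31];
writetable(table(StartTime, EndTime), fullfile(folder, 'part1.csv'));

% Create a tall table from a datastore, as on the slide.
ds = datastore(fullfile(folder, '*.csv'));
tt = tall(ds);

% Operate on the tall table with ordinary table syntax.
% The result stays unevaluated until gather is called.
longest = max(tt.EndTime - tt.StartTime);
longest = gather(longest);
disp(longest)
```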
Tall arrays
- With Parallel Computing Toolbox, process several “chunks” at once
- Can scale up to clusters with MATLAB Distributed Computing Server
Use a Spark-enabled Hadoop cluster and MATLAB
Support for many other platforms through reference architectures
It’s easy to run MATLAB code on Spark Hadoop
- Spark Connection
- Cluster Config for Spark
- Hadoop Access
MATLAB Documentation for
Summary for tall arrays
- Local disk, Shared folders, Databases
- Process out-of-memory data on your Desktop to explore, analyze, gain insights and to develop analytics
- Use Parallel Computing Toolbox for increased performance
- Run on Compute Clusters or Spark Hadoop (HDFS), for large scale analysis
- MATLAB Distributed Computing Server, Spark Hadoop
- Develop your code locally using Tall Arrays or MapReduce only once
- Use the same code to scale up to cluster
Create a tall array for each datastore
ozone
Execution model makes operations more efficient on big data
tt : tall array
- Deferred evaluation
  – Commands are not executed right away
  – Operations are added to a queue
- Execution triggers include:
  – gather function
  – summary function
  – Machine learning models
  – Plotting
- Unnecessary results are not computed
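The deferred-evaluation behaviour described above can be observed on a very small example. This is an illustrative sketch, not code from the talk: it builds a tall array from an in-memory vector, queues two operations, and evaluates both with a single gather call. The variable names are arbitrary.

```matlab
% Keep the computation in the local session for this sketch.
mapreducer(0);

% A tall array can also be created from an in-memory array.
x = tall((1:10)');

% These operations are only queued; nothing is computed yet.
meanSquare = mean(x .^ 2);
total      = sum(x);

% The queued results are still tall (unevaluated) at this point.
stillDeferred = istall(meanSquare);

% gather triggers execution; asking for both results in one call
% lets them be evaluated together.
[meanSquareValue, totalValue] = gather(meanSquare, total);
disp([meanSquareValue, totalValue])
```

Gathering several results in one call is generally preferable to separate gather calls, since the queued operations can then share passes over the data.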
Explore Big Data with Tall Visualizations
(functions shown include histogram2 and ksdensity)
Get a summary of the data
tt: tall table

Use data types to best represent the data
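As one illustration of choosing a better data type, a text column with a small number of repeated values can be stored as categorical. The sketch below uses an ordinary in-memory table with hypothetical location names and ozone values; it is an assumption that the talk did something similar, and the same calls should carry over to a tall table.

```matlab
% Hypothetical sample: station location stored as text.
Location = {'Delhi'; 'Pune'; 'Delhi'; 'Mumbai'};
Ozone    = [41; 30; 38; 35];
t = table(Location, Ozone);

% Represent the repeated text labels as a categorical variable.
t.Location = categorical(t.Location);

% Categories are listed in sorted order, with a count per category.
cats   = categories(t.Location);
counts = countcats(t.Location);

% summary now reports the category counts instead of a cell array of text.
summary(t)
```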
Managing Big and Messy Time-stamped Data

Use the results of explorations to help make decisions
- Synchronize to daily data
- By location
Synchronize all data to daily times

Clean messy data using common preprocessing functions
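The two preprocessing steps named above, cleaning and synchronizing to daily times, can be sketched on a small in-memory timetable. The timestamps and ozone readings are hypothetical, and the choice of linear interpolation and a daily mean is an assumption for illustration; the talk does not state which methods were used on the tall data.

```matlab
% Hypothetical readings at irregular times over two days, one missing.
Time  = datetime(2017, 1, 1, 0, 0, 0) + hours([0; 6; 12; 30; 36]);
Ozone = [30; NaN; 34; 40; 44];
tt = timetable(Time, Ozone);

% Clean: fill the missing reading by linear interpolation over the row times.
cleaned = fillmissing(tt, 'linear');

% Synchronize: aggregate to one row per day using the daily mean.
daily = retime(cleaned, 'daily', 'mean');
disp(daily)
```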
Use familiar MATLAB functions on tall arrays
Functions Supported with Tall Arrays

You don’t need to leave MATLAB to monitor large jobs

Save preprocessed data
Predict air quality
- Air Quality Index: Regression
- Air Quality Label: Classification

How do you know which model to use? Try them all

Use apps for model exploration on a subset of data
- Air Quality Index: Regression Learner
- Air Quality Label: Classification Learner
Validate and Compare Machine Learning Models
Scale up with tall machine learning models
- Linear Regression (fitlm)
- Logistic & Generalized Linear Regression (fitglm)
- Discriminant Analysis Classification (fitcdiscr)
- K-means Clustering (kmeans)
- Principal Component Analysis (pca)
- Partition for Cross Validation (cvpartition)
- Linear Support Vector Machine (SVM) Classification (fitclinear)
- Naïve Bayes Classification (fitcnb)
- Random Forest Ensemble Classification (TreeBagger)
- Lasso Linear Regression (lasso)
- Linear Support Vector Machine (SVM) Regression (fitrlinear)
- Single Classification Decision Tree (fitctree)
- Linear SVM Classification with Random Kernel Expansion (fitckernel)
- Gaussian Kernel Regression (fitrkernel)
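To illustrate the first entry in the list, the sketch below fits a linear regression with fitlm on tall inputs. It assumes Statistics and Machine Learning Toolbox is installed. The predictor and response are synthetic values invented for the example (a roughly linear relation with slope 3 and intercept 5), not the air quality data from the talk.

```matlab
% Keep the computation in the local session for this sketch.
mapreducer(0);

% Synthetic, hypothetical data: response is about 3*Temperature + 5,
% with a small deterministic wiggle so the fit is not exact.
Temperature = (10:29)';
AQI = 3 * Temperature + 5 + 0.1 * sin(Temperature);

% Convert the in-memory arrays to tall arrays.
X = tall(Temperature);
y = tall(AQI);

% Fitting a model is one of the operations that triggers evaluation.
mdl = fitlm(X, y);

% Coefficients are ordered as intercept, then slope.
coef = mdl.Coefficients.Estimate;
disp(coef')
```

With real data the tall inputs would come from a datastore rather than from in-memory arrays, and the call to fitlm would stay the same.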
Training Machine Learning Model against Spark for Air Quality Classification

Train and validate with tall data for Air Quality Index Prediction

Select the most important features
Predict air quality for given location
[Screenshot: My Weather page (www.myweather.com/stats.html) showing current weather conditions for your location]
Use MATLAB model running on Spark in Python web framework
Integrate analytics with systems
Embedded Hardware
- C, C++
- HDL
- PLC
Enterprise Systems
- C/C++
- Java
- Python
- .NET
- Excel Add-in
- Hadoop/Spark
- Standalone Application
MATLAB Production Server
MATLAB Runtime
Package and test MATLAB code
Call MATLAB in production environment
AirQual.ctf
MATLAB Production Server
- Server software
  – Manages packaged MATLAB programs and worker pool
- MATLAB Runtime libraries
  – Single server can use runtimes from different releases
- RESTful JSON interface
- Lightweight client libraries
  – C/C++, .NET, Python, and Java
[Diagram labels: Enterprise Application, MPS Client Library, Applications/Database Servers, RESTful JSON, MATLAB Production Server, Request Broker & Program Manager, MATLAB Runtime]
MATLAB for Modeling and Deploying Big Data Applications
Access
- Challenges: Distributed Data Storage; Different Data Sources & Types
- Easily access data however/wherever it is stored using Datastore
Preprocess, Exploration & Model Development
- Challenges: Preprocessing and Visualizing Big Data; Parallelizing Jobs and Scaling up Computations to Cluster
- Prototype and easily scale up algorithms to Big Data platforms using the familiar MATLAB syntax with Tall Arrays
Scale up & Integrate with Production Systems
- Challenge: Enterprise level deployment
- Seamless integration with Enterprise level systems using MATLAB Production Server
How do you get started?
- Try Tall Array Based Processing on Your Own Set of Big Data
- Refer to the example mentioned below to get ml
- Other ne-learningeBook
MathWorks Training g/
Speaker Details
- Email: Alka.Nair@mathworks.in
- LinkedIn: https://www.linkedin.com/in/alka-nair-1820501a/
Contact MathWorks India
- Products/Training Enquiry Booth
- Call: 080-6632-6000
- Email: info@mathworks.in
Share your experience with MATLAB & Simulink on Social Media
- Use #MATLABEXPO
Share your session feedback
- Please fill in your feedback for this session in the feedback form