Tackling Big Data Using MATLAB - MathWorks

Transcription

Tackling Big Data Using MATLAB
Alka Nair, Application Engineer
© 2015 The MathWorks, Inc.

Building Machine Learning Models with Big Data
- Access
- Preprocess, Exploration & Model Development
- Scale up & Integrate with Production Systems

Case study: Predict Air Quality
Factors Affecting Air Quality (My Weather Page):
- Temperature
- Pressure
- Relative Humidity
- Dew Point
- Wind Speed
- Wind

Challenges in Modeling and Deploying Big Data Applications
Stages: Access; Preprocess, Exploration & Model Development; Scale up & Integrate with Production Systems
- Distributed Data Storage
- Preprocessing and Visualizing Big Data
- Different Data Sources & Types
- Parallelizing Jobs and Scaling up Computations to Cluster
- Managing Different APIs for Data Sources and Data Formats
- Rewriting Algorithms to Use Big Data Platforms
- Parallelizing Code to Scale up to Use Cluster and Cloud Compute
- Enterprise level deployment
- Overhead in Moving the Algorithm to Production

Wouldn’t it be nice if you could:
- Easily access data however it is stored
- Prototype algorithms quickly using small data sets
- Scale up to big data sets running on large clusters
- Using the same intuitive MATLAB syntax you are used to

Access and Manage Big Data
Datastores
Different Data Types:
- Text
- Images
- Spreadsheet
- Custom File Formats
Different Data Sources:
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Windows Azure Blob Storage
- Relational Database
- HDFS on Hortonworks or Cloudera
Different Applications:
- MapReduce
- Image Segmentation
- Image Classification
- Denoising Images
- Predictive Maintenance

[Diagram: one or more files; memory of a cluster of machines]

Air Quality Data on Local Folder

Accessing and Processing Different Types of Data
Image Collection, MDF Files
- TabularTextDatastore: Text files containing column-oriented data, including CSV files
- ImageDatastore: Image files, including formats that are supported by imread such as JPEG and PNG
- SpreadsheetDatastore: Spreadsheet files with a supported Excel format such as .xlsx
- MDFDatastore: Datastore for collection of MDF files
- Custom Datastore: Datastore for custom or proprietary format

You have 1 TB of data you’ve never seen before. How do you access this data?

Historical files are on HDFS and real time data are available through an API
- Temperature
- Pressure
- Relative Humidity
- Dew Point
- Wind Speed
- Wind Direction
- Ozone
- CO
- NO2
- SO2

Access air quality data using datastore

Preview the data and adjust properties to best represent the data of interest
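
The two steps above (create a datastore, then preview it and adjust its properties) might look like the following minimal sketch. The file name, column names, and values are made up for illustration and are not from the talk; the sketch writes a tiny CSV to a temporary folder so that it can run on its own.

```matlab
% Write a tiny CSV so the sketch is self-contained (hypothetical data).
csvFile = fullfile(tempdir, 'airquality_sample.csv');
fid = fopen(csvFile, 'w');
fprintf(fid, 'Temperature,RelativeHumidity,Ozone\n');
fprintf(fid, '21.5,40,30\n23.0,55,NA\n19.0,60,38\n');
fclose(fid);

% Create a datastore; for a CSV file this is a TabularTextDatastore.
% 'NA' entries in numeric columns are read as missing (NaN).
ds = datastore(csvFile, 'TreatAsMissing', 'NA');

% Look at the first rows without reading the whole data set.
firstRows = preview(ds);

% Keep only the variables of interest.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};

% For this tiny file it is safe to read everything into memory.
data = readall(ds);
```

Pointing the same code at a different location should only require changing the path passed to datastore, which is the "minimal changes" idea the slides describe.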

Access data from anywhere with minimal changes
Local disk

Datastores enable big data workflows: Deep Learning

Datastores enable big data workflows: Predictive Maintenance

Datastores enable big data workflows: Fleet Analytics

Datastores: Access Big Data with Minimal Changes
Different Data Types:
- Text
- Images
- Spreadsheet
- Custom File Formats
Different Data Sources:
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Windows Azure Blob Storage
- Relational Database
- HDFS on Hortonworks or Cloudera
Different Applications:
- MapReduce
- Image Segmentation
- Image Classification
- Denoising Images
- Predictive Maintenance

You have 1 TB of data you’ve never seen before. How do you visualize and process the data?

Use tall arrays to work with the data like any MATLAB array

- Introduction to Tall Arrays
- Tall Arrays for Big Data Visualization and Preprocessing
- Machine Learning for Big Data Using Tall Arrays

Tall arrays
- Data is in one or more files
- Files stacked vertically
- Typically tabular data
Challenge
- Data doesn’t fit into memory (even cluster memory)
- Takes a lot of time for even simple operations on data
[Diagram: single machine memory; cluster of machines memory]

Tall arrays (new R2016b)
Create tall table from datastore
    ds = datastore('*.csv')
    tt = tall(ds)
Operate on whole tall table just like ordinary table
    summary(tt)
    max(tt.EndTime - tt.StartTime)
[Diagram: datastore, tall array, process; single machine memory; cluster of machines memory]
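
A runnable variant of the slide's snippet could look like the sketch below. The StartTime and EndTime names come from the slide; the file name and the numeric values are invented, and a small CSV is written to a temporary folder so the example does not depend on any existing data.

```matlab
% Write a small CSV with made-up start and end times (hypothetical data).
csvFile = fullfile(tempdir, 'trips_sample.csv');
fid = fopen(csvFile, 'w');
fprintf(fid, 'StartTime,EndTime\n');
fprintf(fid, '0,5\n2,12\n7,9\n');
fclose(fid);

% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Create a tall table from the datastore.
ds = datastore(csvFile);
tt = tall(ds);

% Operate on the tall table like an ordinary table; the result stays tall.
longest = max(tt.EndTime - tt.StartTime);

% gather triggers the computation and brings the result into memory.
longest = gather(longest);
```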

tall array
- With Parallel Computing Toolbox, process several “chunks” at once
- Can scale up to clusters with MATLAB Distributed Computing Server
[Diagram: process; single machine memory; cluster of machines memory]

Use a Spark-enabled Hadoop cluster and MATLAB
Support for many other platforms through reference architectures

It’s easy to run MATLAB code on Spark + Hadoop
- Spark Connection
- Cluster Config for Spark
- Hadoop Access

MATLAB Documentation for

Summary for tall arrays
- Process out-of-memory data on your desktop to explore, analyze, gain insights and to develop analytics
- Local disk, shared folders, databases
- Use Parallel Computing Toolbox for increased performance
- Run on compute clusters or Spark + Hadoop (HDFS) for large scale analysis
- MATLAB Distributed Computing Server, Spark + Hadoop
- Develop your code locally using tall arrays or MapReduce only once
- Use the same code to scale up to cluster

Create a tall array for each datastore
ozone

Execution model makes operations more efficient on big data
tt: tall array
Deferred evaluation
- Commands are not executed right away
- Operations are added to a queue
Execution triggers include:
- gather function
- summary function
- Machine learning models
- Plotting

Execution model makes operations more efficient on big data
Unnecessary results are not computed
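
Deferred evaluation can be seen in a small sketch like the one below. It assumes a tall array built from an in-memory vector purely for illustration; with real data the tall array would come from a datastore. The variable names are hypothetical.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Convert a small in-memory vector to a tall array (illustration only).
A = (1:10)';
t = tall(A);

% These operations are queued, not executed; the results are still tall.
m = mean(t);
s = sum(t .^ 2);
stillDeferred = istall(m) && istall(s);

% A single gather call evaluates both queued results together.
[mVal, sVal] = gather(m, s);
```

Requesting several results in one gather call lets MATLAB combine the queued operations instead of reading the data once per result.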

Explore Big Data with Tall Visualizations
[Plot labels: histogram2, ksdensity]

Get a summary of the data
tt: tall table

Use data types to best represent the data

Managing Big and Messy Time-stamped Data

Use the results of explorations to help make decisions
- Synchronize to daily data
- By location

Synchronize all data to daily times
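
One way to bring differently sampled series onto daily times is the synchronize function for timetables. The sketch below is an assumption about how this step could be done, not the talk's actual code; it uses two small in-memory timetables with invented values rather than tall data.

```matlab
% Two hypothetical series sampled at different, irregular times.
ozoneTimes = datetime(2016, 1, 1) + hours([0; 6; 12; 30]);
ozone = timetable(ozoneTimes, [10; 20; 30; 40], 'VariableNames', {'Ozone'});

weatherTimes = datetime(2016, 1, 1) + hours([3; 27; 33]);
weather = timetable(weatherTimes, [5; 7; 9], 'VariableNames', {'Temperature'});

% Combine both timetables on a common daily time vector,
% aggregating the samples that fall within each day by their mean.
daily = synchronize(ozone, weather, 'daily', 'mean');
```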

Clean messy data using common preprocessing functions
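
The slide does not list which preprocessing functions were used. As a hedged illustration, the sketch below applies three common ones (standardizeMissing, fillmissing, rmmissing) to a small in-memory table; the -999 sentinel, the column names, and the values are all assumptions.

```matlab
% Hypothetical messy data: -999 is a sentinel for "no reading".
T = table([1; -999; 3; NaN], [10; 20; NaN; 40], ...
    'VariableNames', {'Ozone', 'Temperature'});

% Convert the sentinel value to the standard missing value (NaN).
T = standardizeMissing(T, -999);

% Fill gaps in Ozone with the previous valid reading.
T.Ozone = fillmissing(T.Ozone, 'previous');

% Drop any rows that still contain missing values.
T = rmmissing(T);
```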

Use familiar MATLAB functions on tall arrays
Functions Supported with Tall Arrays

You don’t need to leave MATLAB to monitor large jobs

Save preprocessed data

- Introduction to Tall Arrays
- Tall Arrays for Big Data Visualization and Preprocessing
- Machine Learning for Big Data Using Tall Arrays

Predict air quality
- Air Quality Index: Regression
- Air Quality Label: Classification

How do you know which model to use? Try them all

Use apps for model exploration on a subset of data
- Air Quality Index: Regression Learner
- Air Quality Label: Classification Learner

Validate and Compare Machine Learning Models

Scale up with tall machine learning models
- Linear Regression (fitlm)
- Logistic & Generalized Linear Regression (fitglm)
- Discriminant Analysis Classification (fitcdiscr)
- K-means Clustering (kmeans)
- Principal Component Analysis (pca)
- Partition for Cross Validation (cvpartition)
- Linear Support Vector Machine (SVM) Classification (fitclinear)
- Naïve Bayes Classification (fitcnb)
- Random Forest Ensemble Classification (TreeBagger)
- Lasso Linear Regression (lasso)
- Linear Support Vector Machine (SVM) Regression (fitrlinear)
- Single Classification Decision Tree (fitctree)
- Linear SVM Classification with Random Kernel Expansion (fitckernel)
- Gaussian Kernel Regression (fitrkernel)
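
As a small illustration of one entry in this list, the sketch below fits a linear regression with fitlm on tall inputs. It assumes Statistics and Machine Learning Toolbox is installed; the variable names and the synthetic relationship (index = 1 + 2 * temperature plus a small deterministic wiggle) are invented for the example and are not the talk's air quality model.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Synthetic data with a known linear relationship (hypothetical).
temperature = (1:100)';
airQualityIndex = 1 + 2 * temperature + 0.1 * sin(temperature);

% Convert to tall arrays; with real data these would come from a datastore.
tX = tall(temperature);
tY = tall(airQualityIndex);

% Fit the model; fitlm triggers evaluation of the tall inputs.
mdl = fitlm(tX, tY);

% Estimated coefficients: [intercept; slope].
coeffs = mdl.Coefficients.Estimate;
```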

Training Machine Learning Model against Spark for Air Quality Classification

Train and validate with tall data for Air Quality Index Prediction

Select the most important features

Predict air quality for given location
Use MATLAB model running on Spark in Python web framework
[Screenshot: weather page at www.myweather.com/stats.html showing current weather conditions]

Integrate analytics with systems
- Embedded Hardware: C, C++, HDL, PLC
- Enterprise Systems: Spark, C/C++, Java, Python, .NET
- MATLAB Production Server
- MATLAB Runtime

Package and test MATLAB code

Call MATLAB in production environment
AirQual.ctf

MATLAB Production Server
- Server software
  - Manages packaged MATLAB programs and worker pool
- MATLAB Runtime libraries
  - Single server can use runtimes from different releases
- RESTful JSON interface
- Lightweight client libraries
  - C/C++, .NET, Python, and Java
[Diagram: Enterprise Application with MPS Client Library; Applications/Database Servers; RESTful JSON; MATLAB Production Server with Request Broker & Program Manager and MATLAB Runtime]
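
As a rough illustration of the RESTful JSON interface listed above, the sketch below builds a request body of the kind MATLAB Production Server accepts for a synchronous call and decodes a hand-written sample reply. The host, port, function name predictAirQuality, and input values are hypothetical; only the archive name AirQual comes from the slides. The HTTP call itself is left commented out because it needs a running server.

```matlab
% Hypothetical endpoint: http://<host>:<port>/<archive>/<function>.
url = 'http://localhost:9910/AirQual/predictAirQuality';

% Request body: number of outputs wanted and the input arguments (rhs).
request = struct('nargout', 1, 'rhs', {{25, 60}});
body = jsonencode(request);

% The call would look like this (needs a running server, so not executed):
% reply = webwrite(url, body, weboptions('MediaType', 'application/json'));

% Hand-written sample reply; outputs are returned in the "lhs" field.
sampleReply = '{"lhs":[42]}';
result = jsondecode(sampleReply);
prediction = result.lhs;
```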

MATLAB for Modeling and Deploying Big Data Applications
Stages: Access; Preprocess, Exploration & Model Development; Scale up & Integrate with Production Systems
Challenges:
- Distributed Data Storage
- Preprocessing and Visualizing Big Data
- Different Data Sources & Types
- Parallelizing Jobs and Scaling up Computations to Cluster
- Enterprise level deployment
With MATLAB:
- Easily access data however/wherever it is stored using Datastore
- Prototype and easily scale up algorithms to Big Data platforms using the familiar MATLAB syntax with Tall Arrays
- Seamless integration with enterprise level systems using MATLAB Production Server

How do you get started?
- Try tall array based processing on your own set of big data
- Refer to the example mentioned below to get …
- Other … eBook

MathWorks Training

Speaker Details
Email: Alka.Nair@mathworks.in
LinkedIn: https://www.linkedin.com/in/alka-nair-1820501a/

Contact MathWorks India
Products/Training Enquiry Booth
Call: 080-6632-6000
Email: info@mathworks.in

- Share your experience with MATLAB & Simulink on social media: use #MATLABEXPO
- Share your session feedback: please fill in your feedback for this session in the feedback form
