Big Data And Machine Learning Using MATLAB - MathWorks

Transcription

Big Data and Machine Learning Using MATLAB
Seth DeLand & Amit Doshi, MathWorks
© 2015 The MathWorks, Inc.

Data Analytics
Turn large volumes of complex data into actionable information (source: Gartner)

Customer Example: Gas Natural Fenosa
Energy Production Optimization (User Story)
Opportunity
- Allocate demand among power plants to minimize generation costs
Analytics Use
- Data: Central database for historical power consumption and price data, weather forecasts, and parameters for each power plant
- Machine Learning: Develop price simulation scenarios
- Optimization: Minimize production cost
Benefit
- Reduced generation costs
- White-box solution for optimizing power generation

Unit Commitment: Predictive and Prescriptive Analytics
[Diagram labels: Historical Weather Data, Historical Load Data, Predictive Analytics, Load Forecast, Generator Parameters, Prescriptive Analytics, Unit Commitment Schedule]

Big Data Analytics Workflow
- Access and Explore Data: Files, Databases
- Preprocess Data: Working with Messy Data
- Develop Predictive Models: Model Creation, e.g. Machine Learning
- Integrate Analytics with Systems: Desktop Apps, Embedded Devices and Hardware

Example: Working with Big Data in MATLAB
- Objective: Create a model to predict the cost of a taxi ride in New York City
- Inputs:
  – Monthly taxi ride log files
  – The local data set is small (~20 MB)
  – The full data set is big (~21 GB)
- Approach:
  – Access data
  – Preprocess and explore data
  – Develop and validate predictive model (linear fit): work with a subset of data for prototyping, then run on Spark-enabled Hadoop with the full data
  – Integrate analytics into a web app

Example: Working with Big Data in MATLAB

Demo: Taxi Fare Predictor Web App

Big Data Analytics Workflow: Data Access and Pre-processing

Data Access and Pre-processing – Challenges
- Data aggregation
  – Different sources (files, web, etc.)
  – Different types (images, text, audio, etc.)
- Data clean up
  – Poorly formatted files
  – Irregularly sampled data
  – Redundant data, outliers, missing data, etc.
- Data specific processing
  – Signals: Smoothing, resampling, denoising, Wavelet transforms, etc.
  – Images: Image registration, morphological filtering, deblurring, etc.
- Dealing with out of memory data (big data)
Data preparation accounts for about 80% of the work of data scientists (Forbes)

Data Analytics Workflow: Big Data Access and Pre-processing

Next: Access Big Data from MATLAB
datastore
– Tabular text files
– Images
– Excel spreadsheets
– (SQL) Databases
– HDFS (Hadoop)
– S3 (Amazon)

Get data in MATLAB

What if the data is saved in HDFS?

Or data is stored in a database

Data Access: Summary
- Business and Transactional Data
  – Repositories: SQL, NoSQL, etc.
  – File I/O: Text, Spreadsheet, etc.
  – Web Sources: RESTful, JSON, etc.
- Engineering, Scientific and Field Data
  – Real-Time Sources: Sensors, GPS, etc.
  – File I/O: Image, Audio, etc.
  – Communication Protocols: OPC (OLE for Process Control), CAN (Controller Area Network), etc.
[Diagram labels: Servers and Databases; Hardware; Software (C, Java, Fortran, Python)]

Process data which doesn't fit into memory

Pre-processing Big Data: tall arrays
- New data type designed for data that doesn’t fit into memory
- Lots of observations (hence “tall”)
- Looks like a normal MATLAB array
  – Supports numeric types, tables, datetimes, strings, etc.
  – Supports several hundred functions for basic math, stats, indexing, etc.
  – Statistics and Machine Learning Toolbox support (clustering, classification, etc.)

tall arrays
- Automatically breaks data up into small “chunks” that fit in memory
- Tall arrays scan through the dataset one “chunk” at a time
- Processing code for tall arrays is the same as for ordinary arrays
[Diagram labels: Single Machine Memory, tall array, Process]
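The chunk-at-a-time idea can be illustrated outside MATLAB. The following is a minimal Python sketch, not MATLAB code and not a description of how tall arrays are implemented: it computes a mean while holding only one chunk and two running totals in memory. The function names and the generated fare values are hypothetical.

```python
from typing import Iterable, Iterator, List


def chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size values."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Mean computed one chunk at a time; only running totals persist."""
    total = 0.0
    count = 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Hypothetical fares standing in for a data set too large to load at once.
fares = (5.0 + (i % 10) for i in range(10_000))
print(chunked_mean(fares, chunk_size=256))
```

The caller writes the same kind of reduction it would write for an in-memory list; only the iteration strategy differs, which is the property the slide attributes to tall arrays.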

tall arrays
- With Parallel Computing Toolbox, process several “chunks” at once
- Can scale up to clusters with MATLAB Distributed Computing Server
[Diagram labels: Single Machine Memory, tall array, Cluster of Machines Memory]
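Processing several chunks at once amounts to computing a partial result per chunk and then combining the partials. A minimal Python sketch of that pattern follows; it is an illustration only, uses a thread pool for simplicity (a real deployment would use separate processes or machines), and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partial_stats(chunk: Sequence[float]) -> Tuple[float, int]:
    """Per-chunk partial result: (sum, count)."""
    return (sum(chunk), len(chunk))


def parallel_mean(chunk_list: List[Sequence[float]], workers: int = 4) -> float:
    """Compute per-chunk partials concurrently, then combine them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunk_list))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("no data")
    return total / count


data_chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(parallel_mean(data_chunks))
```

Because each partial depends only on its own chunk, the chunks can be handled in any order or on any worker before the final combine step.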

Demo: Working with Tall Arrays

Data Access and Pre-processing – Challenges and Solution
Challenges
- Data aggregation
  – Different sources (files, web, etc.)
  – Different types (images, text, audio, etc.)
- Data clean up
  – Poorly formatted files
  – Irregularly sampled data
  – Redundant data, outliers, missing data, etc.
- Data specific processing
  – Signals: Smoothing, resampling, denoising, Wavelet transforms, etc.
  – Images: Image registration, morphological filtering, deblurring, etc.
- Dealing with out of memory data (big data)
Solution: MATLAB makes it easy to work with business and engineering data
- Point and click tools to access variety of data sources
- High-performance environment for big data
- Built-in algorithms for data preprocessing including sensor, image, audio, video and other real-time data
[Icons: Databases, Files, Signals, Images]

Data Analytics Workflow: Develop Predictive Models using Big Data

Machine Learning
Machine learning uses data and produces a program to perform a task
Task: Human Activity Detection
- Standard Approach: Hand Written Program, or Formula or Equation
  – If X_acc > 0.5 then “SITTING”
  – If Y_acc < 4 and Z_acc > 5 then “STANDING”
  – $Y_{activity} = \beta_1 X_{acc} + \beta_2 Y_{acc} + \beta_3 Z_{acc}$
- Machine Learning Approach: Machine Learning produces the Computer Program
  – model: Inputs → Outputs
  – model is learned from (sensor data, activity)
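The contrast between a hand-written rule and a learned program can be shown with a deliberately tiny Python sketch. It is not the method used in the talk: the learner here simply places a threshold at the midpoint of the two class means, and the threshold value, labels and training samples are all hypothetical.

```python
from typing import Callable, List, Tuple


def hand_written(x_acc: float) -> str:
    """Standard approach: a person picks the threshold (hypothetical value)."""
    return "SITTING" if x_acc > 0.5 else "STANDING"


def learn_threshold_model(samples: List[Tuple[float, str]]) -> Callable[[float], str]:
    """Machine learning approach: derive the rule from labeled data.

    Fits a one-feature classifier whose threshold is the midpoint
    of the two class means.
    """
    sitting = [x for x, label in samples if label == "SITTING"]
    standing = [x for x, label in samples if label == "STANDING"]
    if not sitting or not standing:
        raise ValueError("need samples of both classes")
    mean_sit = sum(sitting) / len(sitting)
    mean_stand = sum(standing) / len(standing)
    threshold = (mean_sit + mean_stand) / 2
    sitting_is_high = mean_sit > mean_stand

    def model(x_acc: float) -> str:
        return "SITTING" if (x_acc > threshold) == sitting_is_high else "STANDING"

    return model


# Hypothetical labeled sensor readings: (x acceleration, activity).
training = [(0.9, "SITTING"), (0.8, "SITTING"), (0.1, "STANDING"), (0.2, "STANDING")]
model = learn_threshold_model(training)
print(model(0.85), model(0.15))
```

Both functions map a sensor reading to an activity; the difference is that the second one is produced from (sensor data, activity) pairs and would change if the data changed.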

Consider Machine/Deep Learning When
- Problem is too complex for hand written rules or equations, because algorithms can learn complex nonlinear relationships
  – Speech Recognition, Object Recognition, Engine Health Monitoring
- Program needs to adapt with changing data, because algorithms can update as more data becomes available
  – Weather Forecasting, Energy Load Forecasting, Stock Market Prediction
- Program needs to scale, because algorithms can learn efficiently from very large data sets
  – IoT Analytics, Taxi Availability, Airline Flight Delays

Different Types of Learning
- Supervised Learning: Develop predictive model based on both input and output data
  – Classification: Output is a choice between classes (True, False) (Red, Blue, Green)
  – Regression: Output is a real number (temperature, stock prices)
- Unsupervised Learning: Discover an internal representation from input data only
  – Clustering: No output; find natural groups and patterns from input data only
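To make the unsupervised case concrete, here is a minimal Python sketch of k-means clustering (Lloyd's algorithm) on one-dimensional data. It is an illustration only, not MATLAB's kmeans; the starting centers and the data points are hypothetical. No output labels are supplied: the groups emerge from the inputs alone.

```python
from typing import List


def kmeans_1d(points: List[float], centers: List[float], iterations: int = 10) -> List[float]:
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, centers=[0.0, 5.0]))
```

A supervised method would instead receive a target value for every point, as in the classification and regression rows above.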

Different Types of Learning: Categories of Algorithms
- Supervised Learning: Develop predictive model based on both input and output data
  – Classification: Discriminant Analysis, Naive Bayes
  – Regression: Linear Regression, GLM, SVR, GPR
- Unsupervised Learning: Discover an internal representation from input data only
  – Clustering: kMeans, kmedoids, Fuzzy C-Means, Gaussian Mixture, Hidden Markov Model
- Also listed: Decision Trees, Neural Networks

Machine Learning with Big Data
- Descriptive statistics (skewness, tabulate, crosstab, cov, grpstats, ...)
- Linear classification methods for SVM and logistic regression (fitclinear)
- K-means clustering (kmeans)
- Random forest ensembles of classification trees (TreeBagger)
- Visualization (ksdensity, binScatterPlot, histogram, histogram2)
- Dimensionality reduction (pca, pcacov, factoran)
- Naïve Bayes classification (fitcnb)
- Regularized regression (lasso)
- Prediction applied to tall arrays
- Linear and generalized linear regression (fitlm, fitglm)
- Discriminant analysis (fitcdiscr)

Demo: Training a Machine Learning Model

Regression Learner

Regression Learner
App to apply advanced regression methods to your data
- Added to Statistics and Machine Learning Toolbox in R2017a
- Point and click interface – no coding required
- Quickly evaluate, compare and select regression models
- Export and share MATLAB code or trained models

Classification Learner
App to apply advanced classification methods to your data
- Added to Statistics and Machine Learning Toolbox in R2015a
- Point and click interface – no coding required
- Quickly evaluate, compare and select classification models
- Export and share MATLAB code or trained models

and Many More MATLAB Apps for Data Analytics
- Distribution Fitting
- System Identification
- Signal Analysis
- Wavelet Design and Analysis
- Neural Net Fitting
- Neural Net Pattern Recognition
- Training Image Labeler
- and many more

Tuning Machine Learning Models
Get more accurate models in less time
- Automatically select best machine learning “features”
  – NCA: Neighborhood Component Analysis
  – Select best “features” to keep in model from over 400 candidates
- Automatically fine-tune machine learning parameters
  – Hyperparameter Tuning

Machine Learning Hyperparameters
- Tune a typical set of hyperparameters for this model
- Tune all hyperparameters for this model

Bayesian Optimization in Action
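What "tuning a hyperparameter" means can be sketched with a much simpler search than the Bayesian optimization shown in the demo. The Python sketch below uses a plain grid search with a held-out validation set, which is a swapped-in technique chosen for brevity. The model (a ridge-penalized slope with no intercept), the candidate values and the data are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_ridge_slope(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form slope w of y ~ w*x with L2 penalty lam (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def validation_mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the fitted slope on held-out data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(
    train: Tuple[List[float], List[float]],
    valid: Tuple[List[float], List[float]],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return (best penalty, its validation error) over the candidate grid."""
    best: Tuple[float, float] = (candidates[0], float("inf"))
    for lam in candidates:
        w = fit_ridge_slope(train[0], train[1], lam)
        err = validation_mse(w, valid[0], valid[1])
        if err < best[1]:
            best = (lam, err)
    return best


# Hypothetical data following y = 2x exactly.
train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0, 5.0], [8.0, 10.0])
print(grid_search(train, valid, candidates=[0.0, 1.0, 10.0]))
```

Grid search evaluates every candidate; Bayesian optimization instead uses the results so far to choose which setting to try next, which matters when each model fit is expensive.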

Big Data Analytics Workflow: Developing Predictive Models
Challenges
- Lack of data science expertise
- Feature Extraction: How to transform data to best represent the system?
  – Requires subject matter expertise
  – No right way of designing features
- Feature Selection: What attributes or subset of data to use?
  – Entails a lot of iteration (trial and error)
  – Difficult to evaluate features
- Model Development
  – Many different models
  – Model Validation and Tuning
- Time required to conduct the analysis
MATLAB enables domain experts to do Data Science
- Easy to use apps
- Wide breadth of tools to facilitate domain specific analysis
- Examples/videos to get started
- Automatic MATLAB code generation
- High speed processing of large data sets
[Icons: Apps, Language]

Back to our example: Working with Big Data in MATLAB
- Objective: Create a model to predict the cost of a taxi ride in New York City
- Inputs:
  – Monthly taxi ride log files
  – The local data set is small (~20 MB)
  – The full data set is big (~25 GB)
- Approach:
  – Access data
  – Preprocess and explore data
  – Develop and validate predictive model (linear fit): work with a subset of data for prototyping, then scale to the full data set on a cluster
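The "prototype on a subset, then validate" step can be sketched in a few lines. This Python example is an illustration only, not the demo's MATLAB code: it fits an ordinary least squares line to a training subset and checks it on held-out trips. The trip values are invented and are not real taxi data.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y ~ slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (distance in miles, fare in dollars) pairs.
trips: List[Tuple[float, float]] = [
    (1.0, 5.0), (2.0, 7.5), (3.0, 10.0), (4.0, 12.5), (5.0, 15.0), (6.0, 17.5),
]
train, held_out = trips[:4], trips[4:]

slope, intercept = fit_line([d for d, _ in train], [f for _, f in train])

# Root mean squared error on the trips that were not used for fitting.
rmse = (sum((f - (slope * d + intercept)) ** 2 for d, f in held_out) / len(held_out)) ** 0.5
print(slope, intercept, rmse)
```

Once the fit behaves on the small subset, the same computation can be pointed at the full data set, which is the scaling step the slide describes.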

Data Analytics Workflow: Develop Predictive Models using Big Data

Demo: Taxi Fare Predictor Web App

MATLAB Production Server
- Server software
  – Manages packaged MATLAB programs and worker pool
- MATLAB Runtime libraries
  – Single server can use runtimes from different releases
- RESTful JSON interface
- Lightweight client libraries
  – C/C++, .NET, Python, and Java
[Diagram labels: Enterprise Application, MPS Client Library, Applications/Database Servers, RESTful JSON, MATLAB Production Server, Request Broker & Program Manager, MATLAB Runtime]
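A client of the RESTful JSON interface essentially builds an HTTP POST with a JSON body. The Python sketch below constructs such a request without sending it, so it runs without a server. The URL layout (`/archive/function`), the `nargout` and `rhs` fields, the port, and the `lhs` field mentioned in the comment reflect my understanding of the interface and should be checked against the product documentation; the archive name `taxiFare`, the function name `predictFare` and the argument values are hypothetical.

```python
import json
import urllib.request
from typing import Sequence


def build_mps_request(
    base_url: str,
    archive: str,
    function: str,
    args: Sequence[object],
    nargout: int = 1,
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a deployed function.

    Assumed body layout: {"nargout": <int>, "rhs": [<inputs>]}.
    """
    payload = {"nargout": nargout, "rhs": list(args)}
    return urllib.request.Request(
        url=f"{base_url}/{archive}/{function}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address, archive and function names.
request = build_mps_request("http://localhost:9910", "taxiFare", "predictFare", [3.2, 14])
print(request.full_url)
print(request.data.decode("utf-8"))

# Sending is left out so the sketch runs without a server, for example:
# with urllib.request.urlopen(request) as response:
#     outputs = json.loads(response.read())["lhs"]  # assumed response field
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any server is involved.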

Integrate analytics with systems
MATLAB Analytics run anywhere
[Diagram labels: Enterprise Systems, Embedded Hardware, C, C++, C/C++, Java, Python, .NET, MATLAB Production Server, MATLAB Runtime]

Product Support for Spark
From MATLAB desktop:
- Access data from HDFS
- Run “tall” functions on Spark/Hadoop using MDCS
Integrate with applications:
- Deploy MATLAB programs using “tall”
- Develop deployable applications for Spark using MATLAB API for Spark
[Diagram labels: Web & Mobile Applications, Enterprise Applications, Development Tools, MATLAB Compiler, MATLAB Distributed Computing Server, Spark, MATLAB Runtime, YARN]

Deployment Offerings
- Deploy “tall” programs
  – Create Standalone Applications: MATLAB Compiler
- MATLAB API for Spark
  – Create Standalone Applications: MATLAB Compiler
  – Functionality beyond tall arrays
  – For advanced programmers familiar with Spark
  – Local install of Spark to run code in MATLAB: installed on same machine as MATLAB (single node, Linux)
- Since the standalone application must run on a Linux edge node, you must compile on Linux
[Diagram labels: Program using tall, Program using MATLAB API for Spark, MATLAB Compiler, Standalone Application, Edge Node, MATLAB Runtime, Spark, YARN: Data Operating System]

Data Analytics Workflow
1. MATLAB Analytics work with business and engineering data
2. MATLAB enables domain experts to do Data Science
3. MATLAB Analytics run anywhere

Resources to learn and get /big-data
eBook

MathWorks Services
- Consulting
  – Integration
  – Data analysis/visualization
  – Unify workflows, models, data
  www.mathworks.com/services/consulting/
- Training
  – Classroom, online, on-site
  – Data Processing, Visualization, Deployment, Parallel Computing
  www.mathworks.com/services/training/

MathWorks Training g/

Speaker Details
Email: seth.deland@mathworks.com
https://www.linkedin.com/in/seth-deland
Contact MathWorks India
Products/Training Enquiry Booth
Call: : oshi
Your feedback is valued. Please complete the feedback form provided to you.
