Big Data and Machine Learning Using MATLAB
  Seth DeLand & Amit Doshi
  MathWorks
  © 2015 The MathWorks, Inc.
Data Analytics
  Turn large volumes of complex data into actionable information
  Source: Gartner
Customer Example: Gas Natural Fenosa
  Energy Production Optimization (User Story)
  Opportunity
  – Allocate demand among power plants to minimize generation costs
  Analytics
  – Use Data: Central database for historical power consumption and price data, weather forecasts, and parameters for each power plant
  – Machine Learning: Develop price simulation scenarios
  – Optimization: Minimize production cost
  Benefit
  – Reduced generation costs
  – White-box solution for optimizing power generation
Unit Commitment: Predictive and Prescriptive Analytics
  [Diagram labels: Historical Weather Data, Historical Load Data, Predictive Analytics, Load Forecast, Generator Parameters, Prescriptive Analytics, Unit Commitment Schedule]
Big Data Analytics Workflow
  Access and Explore Data: Files, Databases
  Preprocess Data: Working with Messy Data
  Develop Predictive Models: Model Creation, e.g. Machine Learning
  Integrate Analytics with Systems: Desktop Apps, Embedded Devices and Hardware
Example: Working with Big Data in MATLAB
  Objective: Create a model to predict the cost of a taxi ride in New York City
  Inputs:
  – Monthly taxi ride log files
  – The local data set is small (~20 MB)
  – The full data set is big (~21 GB)
  Approach:
  – Access data
  – Preprocess and explore data
  – Develop and validate predictive model (linear fit)
    Work with a subset of the data for prototyping, then run on Spark-enabled Hadoop with the full data
  – Integrate analytics into a web app
Example: Working with Big Data in MATLAB
Demo: Taxi Fare Predictor Web App
Big Data Analytics Workflow: Data Access and Pre-process
Data Access and Pre-processing – Challenges
  Data aggregation
  – Different sources (files, web, etc.)
  – Different types (images, text, audio, etc.)
  Data clean up
  – Poorly formatted files
  – Irregularly sampled data
  – Redundant data, outliers, missing data, etc.
  Data specific processing
  – Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
  – Images: Image registration, morphological filtering, deblurring, etc.
  Dealing with out-of-memory data (big data)
  "Data preparation accounts for about 80% of the work of data scientists" - Forbes
Data Analytics Workflow: Big Data Access and Pre-processing
Next: Access Big Data from MATLAB
  datastore
  – Tabular text files
  – Images
  – Excel spreadsheets
  – (SQL) Databases
  – HDFS (Hadoop)
  – Amazon S3
Get data in MATLAB
What if the data is saved in HDFS?
Or the data is stored in a database
Data Access: Summary
  Business and Transactional Data
  – Repositories – SQL, NoSQL, etc.
  – File I/O – Text, Spreadsheet, etc.
  – Web Sources – RESTful, JSON, etc.
  Engineering, Scientific and Field Data
  – Real-Time Sources – Sensors, GPS, etc.
  – File I/O – Image, Audio, etc.
  – Communication Protocols – OPC (OLE for Process Control), CAN (Controller Area Network), etc.
  [Diagram labels: Servers and Databases; Hardware; Software (C, Java, Fortran, Python)]
Process data which doesn't fit into memory
Pre-processing Big Data
  tall arrays
  – New data type designed for data that doesn’t fit into memory
  – Lots of observations (hence “tall”)
  – Looks like a normal MATLAB array
    – Supports numeric types, tables, datetimes, strings, etc.
    – Supports several hundred functions for basic math, stats, indexing, etc.
    – Statistics and Machine Learning Toolbox support (clustering, classification, etc.)
tall arrays
  – Automatically breaks data up into small “chunks” that fit in memory
  – Tall arrays scan through the dataset one “chunk” at a time
  – Processing code for tall arrays is the same as for ordinary arrays
  [Diagram labels: tall array, Process, Single Machine Memory]
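The chunk-at-a-time idea can be sketched outside MATLAB. The Python below is only a language-neutral illustration, not how tall arrays are implemented: the helper names `iter_chunks` and `chunked_mean`, the chunk size, and the synthetic `fares` values are all assumptions made for this sketch. It computes a mean while holding one chunk plus a running sum and count in memory.

```python
from typing import Iterable, Iterator, List


def iter_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists holding at most chunk_size values."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Mean computed one chunk at a time; only a sum and a count persist."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Hypothetical stand-in for a column that is too large to load at once:
# a generator produces values lazily, so the full data set is never in memory.
fares = (5.0 + (i % 10) for i in range(100_000))
mean_fare = chunked_mean(fares, chunk_size=4096)
```

The caller's code looks the same whether the input is a small list or a lazy stream, which mirrors the point on the slide that processing code does not change for tall arrays.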
tall arrays
  – With Parallel Computing Toolbox, process several “chunks” at once
  – Can scale up to clusters with MATLAB Distributed Computing Server
  [Diagram labels: Single Machine Memory; Cluster of Machines Memory]
Demo: Working with Tall Arrays
Data Access and Pre-processing – Challenges and Solution
  1. MATLAB makes it easy to work with business and engineering data
  Challenges
  – Data aggregation
    – Different sources (files, web, etc.)
    – Different types (images, text, audio, etc.)
  – Data clean up
    – Poorly formatted files
    – Irregularly sampled data
    – Redundant data, outliers, missing data, etc.
  – Data specific processing
    – Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
    – Images: Image registration, morphological filtering, deblurring, etc.
  – Dealing with out-of-memory data (big data)
  Solution
  – Point and click tools to access a variety of data sources
  – High-performance environment for big data
  – Built-in algorithms for data preprocessing including sensor, image, audio, video and other real-time data
  [Diagram labels: Databases, Files, Signals, Images]
Data Analytics Workflow: Develop Predictive Models using Big Data
Machine Learning
  Machine learning uses data and produces a program to perform a task
  Task: Human Activity Detection
  Standard Approach: Computer Program
  – Hand Written Program
      If X_acc … 0.5 then “SITTING”
      If Y_acc … 4 and Z_acc … 5 then “STANDING”
  – Formula or Equation
      $Y_{activity} = \beta_1 X_{acc} + \beta_2 Y_{acc} + \beta_3 Z_{acc}$
  Machine Learning Approach
      model: Inputs …
      model … (sensor data, activity)
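The contrast between a hand-written program and a learned one can be shown with a toy sketch in Python. Everything specific here is an assumption for illustration: the direction of the comparison (readings above the threshold mean "SITTING"), the six labeled readings, and the helper names. The learned model is a one-feature threshold (a decision stump), chosen only because it is the smallest model that makes the point; the slide does not say which algorithm is used.

```python
from typing import List, Tuple

Sample = Tuple[float, str]  # (x_acc reading, activity label)


def handwritten_rule(x_acc: float) -> str:
    """Standard approach: a person fixes the threshold (0.5 assumed here)."""
    return "SITTING" if x_acc > 0.5 else "STANDING"


def predict(x_acc: float, threshold: float) -> str:
    """Same rule shape, but the threshold is a parameter."""
    return "SITTING" if x_acc > threshold else "STANDING"


def learn_threshold(train: List[Sample]) -> float:
    """Machine learning approach: pick the threshold with fewest training errors."""
    xs = sorted({x for x, _ in train})
    # Candidate thresholds are midpoints between neighbouring readings.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(threshold: float) -> int:
        return sum(predict(x, threshold) != label for x, label in train)

    return min(candidates, key=errors)


# Hypothetical labeled sensor data: the real boundary sits near 0.8, not 0.5.
training_data: List[Sample] = [
    (0.1, "STANDING"), (0.4, "STANDING"), (0.7, "STANDING"),
    (0.9, "SITTING"), (1.2, "SITTING"), (1.5, "SITTING"),
]
learned = learn_threshold(training_data)
rule_errors = sum(handwritten_rule(x) != y for x, y in training_data)
learned_errors = sum(predict(x, learned) != y for x, y in training_data)
```

On this made-up data the fixed rule misclassifies the 0.7 reading, while the learned threshold adapts to where the labels actually change.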
Consider Machine/Deep Learning When
  Problem is too complex for hand-written rules or equations
  – Because algorithms can learn complex nonlinear relationships
  – Examples: Speech Recognition, Object Recognition, Engine Health Monitoring
  Program needs to adapt with changing data
  – Because algorithms can update as more data becomes available
  – Examples: Weather Forecasting, Energy Load Forecasting, Stock Market Prediction
  Program needs to scale
  – Because algorithms can learn efficiently from very large data sets
  – Examples: IoT Analytics, Taxi Availability, Airline Flight Delays
Different Types of Learning
  Machine Learning
  – Supervised Learning: Develop predictive model based on both input and output data
    – Classification: Output is a choice between classes (True, False) (Red, Blue, Green)
    – Regression: Output is a real number (temperature, stock prices)
  – Unsupervised Learning: Discover an internal representation from input data only
    – Clustering: No output; find natural groups and patterns from input data only
Different Types of Learning: Categories of Algorithms
  Supervised Learning: Develop predictive model based on both input and output data
  Unsupervised Learning: Discover an internal representation from input data only
  [Diagram labels: Linear Regression, GLM, SVR, GPR, Discriminant Analysis, Naive Bayes, Decision Trees, Neural Networks, Clustering, kMeans, kmedoids, Fuzzy, Gaussian Mixture, Hidden Markov Model]
Machine Learning with Big Data
  – Descriptive statistics (skewness, tabulate, crosstab, cov, grpstats, ...)
  – K-means clustering (kmeans)
  – Visualization (ksdensity, binScatterPlot; histogram, histogram2)
  – Dimensionality reduction (pca, pcacov, factoran)
  – Linear and generalized linear regression (fitlm, fitglm)
  – Discriminant analysis (fitcdiscr)
  – Linear classification methods for SVM and logistic regression (fitclinear)
  – Random forest ensembles of classification trees (TreeBagger)
  – Naïve Bayes classification (fitcnb)
  – Regularized regression (lasso)
  – Prediction applied to tall arrays
Demo: Training a Machine Learning Model
Regression Learner
Regression Learner
  App to apply advanced regression methods to your data
  – Added to Statistics and Machine Learning Toolbox in R2017a
  – Point and click interface – no coding required
  – Quickly evaluate, compare and select regression models
  – Export and share MATLAB code or trained models
Classification Learner
  App to apply advanced classification methods to your data
  – Added to Statistics and Machine Learning Toolbox in R2015a
  – Point and click interface – no coding required
  – Quickly evaluate, compare and select classification models
  – Export and share MATLAB code or trained models
... and Many More MATLAB Apps for Data Analytics
  – Distribution Fitting
  – System Identification
  – Signal Analysis
  – Wavelet Design and Analysis
  – Neural Net Fitting
  – Neural Net Pattern Recognition
  – Training Image Labeler
  ... and many more
Tuning Machine Learning Models
  Get more accurate models in less time
  Automatically select best machine learning “features”
  – Select best “features” to keep in model from over 400 candidates
  – NCA: Neighborhood Component Analysis
  Automatically fine-tune machine learning parameters
  – Hyperparameter Tuning
Machine Learning Hyperparameters
  – Tune a typical set of hyperparameters for this model
  – Tune all hyperparameters for this model
Bayesian Optimization in Action
Big Data Analytics Workflow: Developing Predictive Models
  2. MATLAB enables domain experts to do Data Science
  Challenges
  – Lack of data science expertise
  – Feature Extraction – How to transform data to best represent the system?
    – Requires subject matter expertise
    – No right way of designing features
  – Feature Selection – What attributes or subset of data to use?
    – Entails a lot of iteration – Trial and error
    – Difficult to evaluate features
  – Model Development
    – Many different models
    – Model Validation and Tuning
  – Time required to conduct the analysis
  Solution
  – Easy to use apps
  – Wide breadth of tools to facilitate domain specific analysis
  – Examples/videos to get started
  – Automatic MATLAB code generation
  – High speed processing of large data sets
  [Diagram labels: Language, Apps]
Back to our example: Working with Big Data in MATLAB
  Objective: Create a model to predict the cost of a taxi ride in New York City
  Inputs:
  – Monthly taxi ride log files
  – The local data set is small (~20 MB)
  – The full data set is big (~25 GB)
  Approach:
  – Access data
  – Preprocess and explore data
  – Develop and validate predictive model (linear fit)
    Work with a subset of the data for prototyping
    Scale to the full data set on a cluster
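The "develop and validate predictive model (linear fit)" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the demo's code: the rides are synthetic (a made-up base fare of 2.5 plus 2.0 per distance unit and a small deterministic wobble), the single predictor is trip distance, and the helper names `fit_line` and `rmse` are hypothetical. It fits by ordinary least squares on half the rides and checks the error on the other half.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs: Sequence[float], ys: Sequence[float],
         slope: float, intercept: float) -> float:
    """Root-mean-square prediction error of the fitted line."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))


# Synthetic rides (assumption): (distance, fare) with a small deterministic wobble.
rides: List[Tuple[float, float]] = []
for i in range(1, 41):
    distance = 0.5 * i
    wobble = ((7 * i) % 5 - 2) * 0.1
    rides.append((distance, 2.5 + 2.0 * distance + wobble))

train, validate = rides[0::2], rides[1::2]  # prototype on a subset, hold out the rest
slope, intercept = fit_line([d for d, _ in train], [f for _, f in train])
validation_rmse = rmse([d for d, _ in validate], [f for _, f in validate],
                       slope, intercept)
```

Holding out rides that the fit never saw is what makes the error figure a validation rather than a description of the training data.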
Data Analytics Workflow: Develop Predictive Models using Big Data
Demo: Taxi Fare Predictor Web App
MATLAB Production Server
  – Server software
    – Manages packaged MATLAB programs and worker pool
  – MATLAB Runtime libraries
    – Single server can use runtimes from different releases
  – RESTful JSON interface
  – Lightweight client libraries
    – C/C++, .NET, Python, and Java
  [Diagram labels: Enterprise Application, MPS Client Library, Applications/Database Servers, Request Broker & Program Manager, RESTful JSON, MATLAB Runtime]
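A client call over the RESTful JSON interface might look roughly like the Python sketch below. The host, port, archive name `taxiFare`, function name `predictFare`, and the two input values are all hypothetical, and the request shape (a POST to `/<archive>/<function>` with `nargout` and `rhs` fields, answered with an `lhs` field) reflects our understanding of the interface and should be checked against the MATLAB Production Server documentation. The sketch only builds the request; sending it needs a running server.

```python
import json
import urllib.request
from typing import Any, Sequence


def build_request(base_url: str, archive: str, function: str,
                  args: Sequence[Any], nargout: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a deployed function."""
    url = f"{base_url.rstrip('/')}/{archive}/{function}"
    body = json.dumps({"nargout": nargout, "rhs": list(args)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical deployment: archive "taxiFare" exposing predictFare(distance, passengers).
req = build_request("http://localhost:9910", "taxiFare", "predictFare", [3.2, 2])

# Sending is left commented out because it requires a live server:
# with urllib.request.urlopen(req) as resp:
#     outputs = json.loads(resp.read())["lhs"]
```

Because the exchange is plain HTTP and JSON, a web app can call the deployed analytics without any MATLAB-specific client library.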
Integrate Analytics with Systems
  MATLAB Analytics run anywhere
  [Diagram labels: Enterprise Systems, Embedded Hardware, C, C++, C/C++, Java, Python, .NET, MATLAB Production Server, MATLAB Runtime]
Product Support for Spark
  From MATLAB desktop:
  – Access data from HDFS
  – Run “tall” functions on Spark/Hadoop using MDCS
  Integrate with applications:
  – Deploy MATLAB programs using “tall”
  – Develop deployable applications for Spark using MATLAB API for Spark
  [Diagram labels: Web & Mobile Applications, Enterprise Applications, Development Tools, MATLAB Compiler, MATLAB Distributed Computing Server, Spark, MATLAB Runtime, YARN]
Deployment Offerings
  Deploy “tall” programs
  – Create Standalone Applications: MATLAB Compiler
  MATLAB API for Spark
  – Create Standalone Applications: MATLAB Compiler
  – Functionality beyond tall arrays
  – For advanced programmers familiar with Spark
  – Local install of Spark to run code in MATLAB
    Installed on same machine as MATLAB – single node, Linux
  Since the Standalone must run on a Linux Edge Node, you must compile on Linux
  [Diagram labels: Program using tall; Program using MATLAB API for Spark; MATLAB Compiler; Standalone Application; Edge Node; Spark; MATLAB Runtime; YARN: Data Operating System]
Data Analytics Workflow
  Access and Explore Data, Preprocess Data, Develop Predictive Models, Integrate Analytics with Systems
  1. MATLAB Analytics work with business and engineering data
  2. MATLAB enables domain experts to do Data Science
  3. MATLAB Analytics run anywhere
Resources to learn and get /big-data eBook
MathWorks Services
  Consulting
  – Integration
  – Data analysis/visualization
  – Unify workflows, models, data
  www.mathworks.com/services/consulting/
  Training
  – Classroom, online, on-site
  – Data Processing, Visualization, Deployment, Parallel Computing
  www.mathworks.com/services/training/
MathWorks Training g/
Speaker Details
  Email: seth.deland@mathworks.com
  LinkedIn: https://www.linkedin.com/in/seth-deland
  Contact MathWorks India
  – Products/Training Enquiry Booth
  – Call: : oshi
  Your feedback is valued. Please complete the feedback form provided to you.