Operational Machine Learning
Using Microsoft Technologies for Applied Data Science
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK

Outline
- Introduction to Data Science
- From Experimental Data Science to Operational Machine Learning
- MS Technologies for Data Science & Advanced Analytics
- Demos & Screenshots
- Concluding Remarks

Introduction to Data Science and Machine Learning
Data Science and Machine Learning: What?
“Data mining, an interdisciplinary subfield of computer science, is the computational process of automatic discovering interesting and useful patterns in large data sets”
Other Related Technologies: Artificial Intelligence, Visualization, Big Data, Statistics, Databases, High Performance Computing, Cloud Computing, Others.
Diagram labels: Machine Learning, Data Science, Other Technologies

Data Science and Machine Learning: Why?
The objective of data science is to provide you with actionable insights to support decision making.
- Churn analysis
- Social tracking and services
- Vision Analytics
- Weather forecasting for business planning
- Legal discovery
- ...ng analysis
- Pricing analysis
- Fraud detection
- Personalized Insurance

Data Science and Machine Learning: How?
- Classification Learning: Build a model that can predict the target class of an input case
- Time Series Analysis: Analysis of temporal data to forecast future values
- Regression Modeling: Build a model that can estimate the response value given an input case
- Probabilistic Modeling: Compute the probability of an event to occur given a set of conditions
- Cluster Analysis: Discover natural groupings within the data points
- Similarity Analysis: Identify similar cases to a given input case based on the input features
- Association Rule Discovery: Extract frequent patterns present in the data
- Collaborative Filtering: Filtering of information using techniques involving collaboration viewpoints
Rule list illustration: IF . AND . AND . THEN A, ELSE IF . AND . THEN C, ELSE IF . AND . THEN B, ELSE C

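The IF ... THEN ... ELSE rule list on this slide can be read as an ordered decision list: rules are tried top to bottom, the first rule whose conditions all hold decides the class, and a default class closes the list. A minimal sketch in Python follows. The slide elides the actual conditions, so the feature names (age, income, tenure) and thresholds below are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]
Condition = Callable[[Case], bool]
# A rule is a list of conditions (joined by AND) plus the class it predicts.
Rule = Tuple[List[Condition], str]

# Hypothetical rules: feature names and thresholds are illustrative only.
RULES: List[Rule] = [
    ([lambda c: c["age"] < 30, lambda c: c["income"] > 50_000], "A"),
    ([lambda c: c["age"] >= 30, lambda c: c["tenure"] < 2], "C"),
    ([lambda c: c["income"] <= 50_000, lambda c: c["tenure"] >= 2], "B"),
]
DEFAULT_CLASS = "C"


def classify(case: Case, rules: List[Rule] = RULES, default: str = DEFAULT_CLASS) -> str:
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(cond(case) for cond in conditions):
            return label
    return default  # the final ELSE branch
```

Because the rules are ordered, a case that matches several rules always receives the class of the earliest one, which is what keeps such a model easy to read.
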
From Experimental Data Science to Operational Machine Learning
Data Science Activities: Experimentation vs. Operationalization
Exploratory Data Analysis
- Steps: Collect Data, Blend, Prepare, Visualize
- Interactive
- Easy to perform
- Rich Visualizations
- Outcome: Report of Visuals & Findings, leading to a Decision!
Data Analysis & Experimentation
- Learning Dataset
- ML Experiment: Algorithm Selection, Parameter Tuning, Training & Testing
- Outcome: Model

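The ML experiment loop named on this slide (parameter tuning, training and testing) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "model" is a single threshold on one numeric feature, and the function names split, accuracy and tune_threshold are hypothetical helpers, not part of any Microsoft tool.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, class label 0 or 1)


def split(data: Sequence[Example], test_fraction: float, seed: int) -> Tuple[List[Example], List[Example]]:
    """Shuffle the learning dataset and split it into training and testing parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(threshold: float, data: Sequence[Example]) -> float:
    """Fraction of cases where 'feature > threshold' matches the label."""
    correct = sum(1 for x, y in data if int(x > threshold) == y)
    return correct / len(data)


def tune_threshold(train: Sequence[Example], candidates: Sequence[float]) -> float:
    """Parameter tuning: keep the candidate that scores best on the training part."""
    return max(candidates, key=lambda t: accuracy(t, train))
```

A typical run would call split once, pass the training part to tune_threshold, and report accuracy on the held-out testing part so that the reported figure is not inflated by the tuning step.
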
Data Science Activities: Experimentation vs. Operationalization
Operational ML Pipelines
- Batch: Automated ML Pipeline; Pipelined (ETL Integration); Scalable
- Real-time: Deploy; Apps Integration
Diagram labels: Data Ingestion, Data Processing, Model Training, Scoring, Model, Web APIs, Export, Train, Predict, Online Apps

Microsoft Advanced Analytics Technologies
Microsoft Advanced Analytics: Cortana Intelligence Suite
https://gallery.cortanaintelligence.com/

Microsoft Advanced Analytics: Data Science, Machine Learning, & Intelligence
- Azure Machine Learning
- Microsoft R Server – SQL Server R Services
- Data Mining – SQL Server Analysis Services
- Spark ML – Azure HDInsight
- Cognitive Features – Azure Data Lake Analytics
- Azure Cognitive Services
- Microsoft Bot Framework

Microsoft Azure Machine Learning
Azure Machine Learning: MS Cloud-native Data Science
Build and deploy models in the cloud
- Cloud-based Machine Learning Services
- Interactive Data Science Studio
- Rich built-in functionality
- Imports data from everywhere
- Easy to develop and productionize – Web Services
- Extensible via R and Python scripts
Limitations
- Only Cloud-based (Data Regulations)
- Scalability – Maximum dataset size 10GB
- Microsoft R Open is not supported, yet
- No Source Control
Diagram labels: Import Data, Input, Publish, Web Services, Batch Scoring, Result, Retrain Model

Azure Machine Learning: Real-time Predictions
Diagram flow:
- App to Azure ML Web Service: Send Input, Receive Output
- App to Event Hub: Send Results (Input, Output)
- Stream Analytics: Consume messages from Event Hub
- Stream Analytics to Power BI: Send data points

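The real-time flow on this slide (the app sends input to the web service, receives the output, forwards the input and output pair to Event Hub, and a streaming job consumes the messages and emits data points for Power BI) can be traced with in-process stand-ins. The sketch below uses no Azure SDK calls: web_service_score, app_handle and stream_job are hypothetical names, the scoring rule is a made-up linear formula rather than a trained model, and a standard-library queue plays the role of Event Hub.

```python
import json
import queue
from typing import Dict, List

# In-process stand-in for Event Hub; not an Azure client.
event_hub: queue.Queue = queue.Queue()


def web_service_score(features: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical scoring endpoint: a fixed linear rule instead of a trained model."""
    score = 0.3 * features["usage"] + 0.7 * features["complaints"]
    return {"churn_score": round(score, 3)}


def app_handle(features: Dict[str, float]) -> Dict[str, float]:
    """App side: send input, receive output, then publish (input, output)."""
    output = web_service_score(features)
    event_hub.put(json.dumps({"input": features, "output": output}))
    return output


def stream_job(hub: queue.Queue) -> List[Dict[str, float]]:
    """Streaming side: consume pending messages and emit dashboard data points."""
    points = []
    while not hub.empty():
        message = json.loads(hub.get())
        points.append({
            "usage": message["input"]["usage"],
            "churn_score": message["output"]["churn_score"],
        })
    return points
```

Publishing the input together with the output is what lets the downstream job chart predictions against the features that produced them.
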
Azure Machine Learning: Built-in Features
Azure Machine Learning: Algorithms Cheat Sheet
Azure Machine Learning: ML Studio
Azure Machine Learning: Web Service
Azure Machine Learning: Stream Analytics Integration
Azure Machine Learning: AzureML R Library
Microsoft R Server
Microsoft R Server: R in Microsoft World
Microsoft R Open (MRO)
- Based on the latest Open Source R (3.2.2); built, tested, and distributed by Microsoft
- More efficient and multi-threaded computation
- Enhanced by Intel Math Kernel Library (MKL) to speed up linear algebra functions
- Compatible with all R-related software

Microsoft R Server: Comparison

               | CRAN                              | MRO                               | MRS
Data size      | In-memory                         | In-memory                         | In-memory & disk
Efficiency     | Single threaded                   | Multi-threaded                    | Multi-threaded, parallel processing 1:N servers
Support        | Community                         | Community                         | Community and Commercial
Functionality  | 7500 innovative analytic packages | 7500 innovative analytic packages | 7500 innovative packages and commercial parallel high-speed functions
Licence        | Open Source                       | Open Source                       | Commercial license

Microsoft R Server: Components and Compute Contexts
- Installed on Windows or Linux
- ScaleR – Optimized for parallel execution on Big Data, to eliminate memory limitations.
- ConnectR – Provides access to local file systems, HDFS, Hive, SQL Server, Teradata, etc.
- DistributeR – Adaptable parallel execution framework to enable running on different (distributed) compute contexts.
- Operationalization (mrsdeploy) – Deploy the model as a Web API.
Diagram labels: MS R Client (RStudio, RTVS), Microsoft R Server, CRAN & MS R, Operationalization (mrsdeploy), Scale & Deploy, Different Compute Contexts

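The claim that ScaleR eliminates memory limitations rests on a general idea: process the data chunk by chunk and combine small partial results, so the full dataset never has to sit in memory at once. The sketch below illustrates that idea in plain Python for a mean and a population variance. It is not the ScaleR API; read_chunks and chunked_mean_variance are hypothetical names, and an in-memory iterable stands in for blocks of rows read from disk.

```python
from typing import Iterable, Iterator, List, Tuple


def read_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks; stands in for reading blocks of rows from disk."""
    chunk: List[float] = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly shorter, chunk


def chunked_mean_variance(chunks: Iterable[List[float]]) -> Tuple[float, float]:
    """Combine per-chunk partial sums into a global mean and population variance."""
    n, total, total_sq = 0, 0.0, 0.0
    for chunk in chunks:
        # Only these three running totals are kept between chunks.
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance
```

Since each chunk contributes only a count and two sums, the per-chunk work could equally run on separate workers and be merged afterwards. The sum-of-squares formula is kept for brevity; it can lose precision on data with a large mean and small spread.
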
Microsoft R Server: ScaleR Example
- Check Environment
- Load XDF
- Prepare Data – Process XDF
- Build Predictive Model
- Perform Prediction

Microsoft R Server: ScaleR Functionality
SQL Server (in-database) R Services
SQL Server R Services: In-database Analytics
- R Services (in-database) – Keep your analytics close to the data
- T-SQL Script – Can be encapsulated in Stored Procedures
- Models are built, trained, saved as part of the ETL process (SSIS)
- Visual Studio SQL Database Project, Source Controlled, etc.
- Uses Microsoft ScaleR libraries
Limitations
- Used for batch prediction (as part of the ETL process)
- Not supported in Azure SQL DB/DW, yet
- Not suitable for Interactive Data Science
- Only R, no Python, yet.
Training Pipeline (ETL using SSIS): Data Sources, Process Data, Train R Model, Serialize, Store Models, Maintain Models
Prediction Pipeline (ETL using SSIS): Process Data, Load Model, Perform Prediction, Store Results
Both pipelines run R through EXECUTE sp_execute_external_script

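The two pipelines on this slide share one pattern: train a model, serialize it, store it in a table, and later load it from that table to score new rows and store the results. The sketch below shows that pattern only, under loud assumptions: Python's sqlite3 stands in for SQL Server, pickle stands in for R's serialization, a least-squares line stands in for the R model, and the table names models and predictions are hypothetical. It does not use T-SQL or sp_execute_external_script.

```python
import pickle
import sqlite3
from typing import Dict, List, Tuple


def train(rows: List[Tuple[float, float]]) -> Dict[str, float]:
    """Fit y = slope * x + intercept by ordinary least squares (stand-in model)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def training_pipeline(conn: sqlite3.Connection, name: str, rows: List[Tuple[float, float]]) -> None:
    """Train, serialize, and store the model under a name."""
    blob = pickle.dumps(train(rows))
    conn.execute("CREATE TABLE IF NOT EXISTS models (name TEXT PRIMARY KEY, model BLOB)")
    conn.execute("INSERT OR REPLACE INTO models VALUES (?, ?)", (name, blob))
    conn.commit()


def prediction_pipeline(conn: sqlite3.Connection, name: str, xs: List[float]) -> List[Tuple[float, float]]:
    """Load the stored model, score the inputs, and store the results."""
    (blob,) = conn.execute("SELECT model FROM models WHERE name = ?", (name,)).fetchone()
    model = pickle.loads(blob)  # only unpickle data written by a trusted process
    preds = [(x, model["slope"] * x + model["intercept"]) for x in xs]
    conn.execute("CREATE TABLE IF NOT EXISTS predictions (x REAL, y_hat REAL)")
    conn.executemany("INSERT INTO predictions VALUES (?, ?)", preds)
    conn.commit()
    return preds
```

Keeping the serialized model in the same database as the data means the prediction step needs nothing beyond a connection and a model name, which is what makes it easy to schedule as part of an ETL job.
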
SQL Server R Services: T-SQL Script
- Configure
- Build and Save Model
- Model Summary
- Prediction
- Prediction Output

Microsoft Analysis Services Data Mining
SQL Server Analysis Services: Data Mining
- Process data from many OLEDB and ODBC data sources
- Easy to build, interpret, deploy, and productionize
- SSIS Support – Tasks to Train & Predict
- Interactive Visuals for model interpretation
- Excel Integration – Data Mining Add-in
Limitations
- Limited Extensibility
- Limited Algorithms & Functionalities
- No Azure PaaS Service
Diagram labels: Azure SQL DW/DB, SQL Server, Analysis Services, Build Model, Online Apps, DMX Query, Result, Batch Scoring, Retrain Model, Explore/Interpret Model

SQL Server Analysis Services: Overview
Components: Data Source View, Mining Structure, Mining Algorithm, Mining Model
Mining algorithms:
- Decision Trees
- Naïve-Bayes
- Linear Regression
- Neural Networks
- Association Rules
- Clustering
- Sequence Clustering
- Time Series

SQL Server Analysis Services: Visualizing Models
SQL Server Analysis Services: Excel Data Mining Add-in
Azure Cognitive Services
Azure Cognitive Services: Ready-to-use Intelligence
Azure Cognitive Services: Set up a Cognitive Services API
https://www.microsoft.com/cognitive-services/
Cognitive Features in Azure Data Lake Analytics
Azure Data Lake Analytics: Cognitive Features
- Pre-built intelligence – Text & Image Analysis
- Integrated with your data processing pipelines (DLA)
- Used for batch recognition (not singleton real-time)
- Scheduled & Automated using Azure Data Factory
- R & Python Extensions!
- Scalable – Suitable for Big Data
Limitations
- Limited Features
- Not suitable for real-time scoring
Diagram labels: Source Data (Text, Images, etc.), Ingest, Data Lake Store (Input), Data Lake Analytics Jobs (Data Processing & Pattern Recognition), Data Lake Store (Output), Polybase, Azure SQL DW (Enterprise Data Warehouse), Azure Data Factory

Azure Data Lake Analytics: First-time Installation
Azure Data Lake Analytics: U-SQL Script
Azure Data Lake Analytics: Execution & Output
Spark ML on HDInsight
Spark ML on HDInsight: Scalable ML for Big Data
- Rich Spark ML Libraries
- Scalable, distributed, in-memory
- Extensible – Python, R, Java, Scala
- Suitable for Big Data – Batch Model Training and Scoring
- Spark Streaming for Real-time predictions
- Scheduled & Automated Using Azure Data Factory
Limitations
- Expensive to keep it up & running
- Slow to spin-up
Diagram labels: Source Data, Ingest, HDInsight (Load, Process Data, Build Model, Save Model, Load Model, Perform Predictions, Save Results), Save, Polybase, Azure SQL DW (Enterprise Data Warehouse), Azure Data Factory

Spark ML on HDInsight: Spark ML Pipelines
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple tasks into a single pipeline, or workflow.
- Transformers – used for data pre-processing. Input: DataFrame. Output: DataFrame.
- Estimators – ML algorithm used to build a predictive model. Input: DataFrame. Output: Model.
- Parameters – Configurations for Transformers and Estimators
- Pipeline – Chains Transformers and Estimators
Diagram labels: Dataset (DataFrame), ML Pipeline [Transformer A (pre-processing) ... Transformer Z (pre-processing), Estimator (ML Learning Algorithm), Parameters], Model, Evaluation

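The four concepts on this slide (Transformers, Estimators, Parameters, Pipeline) fit together as sketched below. This is a toy re-implementation in plain Python meant to show the contract only; it is not the Spark API. A list of dicts stands in for a DataFrame, and the Binarizer and MajorityLabelEstimator classes are simplified, hypothetical stand-ins whose constructor arguments play the role of parameters.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, float]
DataFrame = List[Row]  # toy stand-in for a Spark DataFrame


class Binarizer:
    """Transformer: DataFrame in, DataFrame out (adds a 0.0/1.0 column)."""

    def __init__(self, input_col: str, output_col: str, threshold: float):
        self.input_col, self.output_col, self.threshold = input_col, output_col, threshold

    def transform(self, df: DataFrame) -> DataFrame:
        return [{**r, self.output_col: float(r[self.input_col] > self.threshold)} for r in df]


class MajorityLabelModel:
    """Model returned by the estimator; it is itself a transformer."""

    def __init__(self, feature_col: str, table: Dict[float, float], default: float):
        self.feature_col, self.table, self.default = feature_col, table, default

    def transform(self, df: DataFrame) -> DataFrame:
        return [{**r, "prediction": self.table.get(r[self.feature_col], self.default)} for r in df]


class MajorityLabelEstimator:
    """Estimator: DataFrame in, Model out (most frequent label per feature value)."""

    def __init__(self, feature_col: str, label_col: str):
        self.feature_col, self.label_col = feature_col, label_col

    def fit(self, df: DataFrame) -> MajorityLabelModel:
        counts = defaultdict(Counter)
        for r in df:
            counts[r[self.feature_col]][r[self.label_col]] += 1
        table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        overall = Counter(r[self.label_col] for r in df).most_common(1)[0][0]
        return MajorityLabelModel(self.feature_col, table, overall)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages: list):
        self.stages = stages

    def transform(self, df: DataFrame) -> DataFrame:
        for stage in self.stages:
            df = stage.transform(df)
        return df


class Pipeline:
    """Chains transformers and estimators; fit() turns estimators into models."""

    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, df: DataFrame) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(df)  # estimator becomes a model
            df = stage.transform(df)   # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train_df = [
    {"amount": 20.0, "label": 0.0},
    {"amount": 50.0, "label": 0.0},
    {"amount": 150.0, "label": 1.0},
    {"amount": 300.0, "label": 1.0},
]
pipeline = Pipeline([
    Binarizer("amount", "high_amount", threshold=100.0),
    MajorityLabelEstimator("high_amount", "label"),
])
model = pipeline.fit(train_df)
scored = model.transform([{"amount": 500.0}, {"amount": 10.0}])
```

The point of the design is that the fitted pipeline replays exactly the same pre-processing at scoring time as at training time, so the two cannot drift apart.
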
Spark ML on HDInsight: Spark ML Functionality
Transformers
- Text Feature Extraction: TF-IDF (HashingTF and IDF), Word2Vec, CountVectorizer, Tokenizer, StopWordsRemover, n-gram
- Feature Selection: VectorSlicer, RFormula, ChiSqSelector
- Dimensionality Reduction: PCA
- Features Vector Preparation: VectorAssembler, VectorIndexer, StringIndexer, IndexToString
- Feature Type Conversion: Binarizer, Discrete Cosine Transform (DCT), OneHotEncoder, Bucketizer, QuantileDiscretizer
- Feature Scaling: Normalizer, StandardScaler, MinMaxScaler
- Feature Construction: SQLTransformer, ElementwiseProduct, PolynomialExpansion
Estimators (supervised)
- Classification: Decision Trees – Ensembles, Naïve-Bayes, SVM
- Regression: Linear Regression, SVM
Other (Unsupervised)
- Clustering
- Collaborative Filtering
- Frequent Pattern Mining

Spark ML on HDInsight: Spark ML Example
Spark ML on HDInsight: BigDL – Intel’s Distributed Deep Learning
Concluding Remarks
- Interactive Data Science Studio: Azure ML
- Extensibility: Spark on HDI, Azure ML, Microsoft R Server
- Pre-built Intelligence: Azure Cognitive Services, Azure Data Lake Analytics
- ML Pipelining: Spark on HDI, Azure Data Lake Analytics, SQL Server R Services, Data Mining SSAS
- Built-in Features: Azure ML, Spark on HDI
- Integration with Operational Apps: Azure ML, Azure Cognitive Services, Microsoft R Operationalization
- Rich Model Interpretability: SSAS Data Mining, Microsoft R Server
- Scalability (Big Data): Microsoft R Server, Spark on HDI

My Background: Applying Computational Intelligence in Data Mining
- Honorary Research Fellow, School of Computing, University of Kent.
- Ph.D. Computer Science, University of Kent, Canterbury, UK.
- 28 published journal and conference papers in the fields of AI and ML
https://www.researchgate.net/profile/Khalid
2017