Machine Learning with MATLAB -- Classification

Transcription

Machine Learning with MATLAB -- Classification
Stanley Liang, PhD, York University

Classification: the definition
- In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Steps for classification
1. Data preparation -- preprocessing, creating the training / test sets (see the sketch below)
2. Training
3. Cross validation
4. Model deployment
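A minimal sketch of step 1, creating a holdout training / test split. The built-in Iris data and the 30% holdout fraction are illustrative assumptions, not choices prescribed by the slides.

    % Step 1 sketch: split the data into training and test sets with a 30% holdout.
    load fisheriris                                   % built-in Iris data: meas (150x4), species (labels)
    cv = cvpartition(species, 'HoldOut', 0.3);        % reserve 30% of the rows for testing
    XTrain = meas(training(cv), :);  yTrain = species(training(cv));
    XTest  = meas(test(cv), :);      yTest  = species(test(cv));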

Our data sets
- Titanic disaster dataset
  - 891 rows; binary classification
  - Features / predictors: Class (cabin class), Sex (gender of the passenger), Age, Fare
  - Label / response: Survived (0 - dead, 1 - survived)
- Iris dataset
  - 150 rows; multi-class (3) classification
  - Features / predictors: Sepal Length, Sepal Width, Petal Length, Petal Width
  - Label / response: Species (string)
- Pima Indians Diabetes Data (NIDDK)
  - 768 rows; binary classification (diabetes or not)
  - Features / predictors (8): preg (number of times pregnant), plas (plasma glucose concentration), pres (diastolic blood pressure, mmHg), skin (triceps skinfold thickness, mm), test (2-hour serum insulin, mu U/ml), mass (body mass index), pedi (diabetes pedigree function, numeric), age
  - Label / response: 1 - diabetes, 0 - no diabetes
- Wholesale Customers dataset
  - 440 rows; binary / multiclass classification (either of the 2 categorical variables can serve as the response)
  - Continuous variables (6), the monetary units (m.u.) spent on each product category: Fresh (fresh products), Milk (dairy products), Grocery (grocery products), Frozen (frozen products), Detergents Paper (detergents and paper products), Delicatessen (delicatessen products)
  - Categorical variables (2): Channel (1 - Horeca, 2 - Retail), Region (1 - Lisbon, 2 - Oporto, 3 - Other)

The workflow of Classification
(workflow diagram)

Optimizing a model
- Because of the prior knowledge you have about the data, or after looking at the classification results, you may want to customize the classifier.
- You can update and customize the model by setting different options with the fitting functions.
- Set the options by providing additional inputs for the option name and the option value (see the sketch below):
  - 'optionName' -- name of the option, e.g., 'Cost'
  - optionValue -- value to be set for the specified option, e.g., [0 10; 2 0] to change the cost matrix
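A hedged sketch of passing an option name / value pair to a fitting function. The decision tree fitter, the variable names dataTrain and 'Survived', and the specific cost values are assumptions made for illustration.

    % Penalize misclassifying class 1 more heavily via the 'Cost' option.
    costMatrix = [0 10; 2 0];                   % rows: true class, columns: predicted class
    mdl = fitctree(dataTrain, 'Survived', ...   % hypothetical table of predictors + response column
                   'Cost', costMatrix);         % option name followed by its value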

k-Nearest Neighbor
- Function: fitcknn
- Performance
  - Fit time: fast
  - Prediction time: fast, proportional to (data size)^2
  - Memory overhead: small
- Common properties
  - 'NumNeighbors' -- number of neighbors used for classification
  - 'Distance' -- metric used for calculating distances between neighbors
  - 'DistanceWeight' -- weighting given to different neighbors
- Special notes
  - For normalizing the data, use the 'Standardize' option.
  - The cosine distance metric works well for "wide" data (more predictors than observations) and data with many predictors.

Decision Trees
- Function: fitctree
- Performance
  - Fit time: proportional to the size of the data
  - Prediction time: fast
  - Memory overhead: small
- Common properties
  - 'SplitCriterion' -- formula used to determine the optimal split at each level
  - 'MinLeafSize' -- minimum number of observations in each leaf node
  - 'MaxNumSplits' -- maximum number of splits allowed in the decision tree
- Special notes
  - Trees are a good choice when there is a significant amount of missing data.
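A minimal k-NN sketch using the properties listed above; the built-in Iris data and the choice of 5 neighbors are illustrative assumptions.

    load fisheriris                                   % meas (150x4 predictors), species (labels)
    knnModel = fitcknn(meas, species, ...
                       'NumNeighbors', 5, ...         % neighbors used for voting (assumed value)
                       'Standardize', true);          % normalize predictors before distance calculation
    label = predict(knnModel, [5.0 3.4 1.5 0.2]);     % classify one new observation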

Naïve Bayes
- k-NN and decision trees do not make any assumptions about the distribution of the underlying data.
- If we assume that the data comes from a certain underlying distribution, we can treat the data as a statistical sample. This can reduce the influence of outliers on our model.
- A naïve Bayes classifier assumes the independence of the predictors within each class. This classifier is a good choice for relatively simple problems.
- Function: fitcnb
- Performance
  - Fit time: normal distribution - fast; kernel distribution - slow
  - Prediction time: normal distribution - fast; kernel distribution - slow
  - Memory overhead: normal distribution - small; kernel distribution - moderate to large
- Common properties
  - 'DistributionNames' -- distribution used to calculate probabilities
  - 'Width' -- width of the smoothing window (when the distribution is set to 'kernel')
  - 'Kernel' -- type of kernel to use (when the distribution is set to 'kernel')
- Special notes
  - Naïve Bayes is a good choice when there is a significant amount of missing data.

Discriminant Analysis
- Similar to naïve Bayes, discriminant analysis works by assuming that the observations in each prediction class can be modeled with a normal probability distribution.
- There is no assumption of independence between the predictors: a multivariate normal distribution is fitted to each class.
- Function: fitcdiscr
- Performance
  - Fit time: fast, proportional to the size of the data
  - Prediction time: fast, proportional to the size of the data
  - Memory overhead: linear DA - small; quadratic DA - moderate to large, growing with the number of predictors
- Common properties
  - 'DiscrimType' -- type of boundary used
  - 'Delta' -- coefficient threshold for including predictors in a linear boundary (default 0)
  - 'Gamma' -- regularization to use when estimating the covariance matrix for linear DA
- Linear discriminant analysis
  - The default classification assumes that the covariance for each response class is the same, which results in linear boundaries between classes.
  - daModel = fitcdiscr(dataTrain, 'response');
- Quadratic discriminant analysis
  - Giving up the equal-covariance assumption, a quadratic boundary is drawn between classes.
  - daModel = fitcdiscr(dataTrain, 'response', 'DiscrimType', 'quadratic');
- Linear discriminant analysis works well for "wide" data (more predictors than observations).
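A hedged naïve Bayes sketch using the kernel options listed above; the Iris data, the box kernel type, and the smoothing width value are assumptions made for illustration only.

    load fisheriris
    nbModel = fitcnb(meas, species, ...
                     'DistributionNames', 'kernel', ...  % kernel density instead of a normal fit
                     'Kernel', 'box', ...                % kernel type (assumed choice)
                     'Width', 0.5);                      % smoothing window width (assumed value)
    resubLoss(nbModel)                                   % training-set misclassification rate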

Support Vector Machines
- An SVM calculates the boundary that separates different groups of data with the maximum margin, i.e., as far as possible from the closest training points.
- Performance
  - Fit time: fast, proportional to the square of the size of the data
  - Prediction time: very fast, proportional to the square of the size of the data
  - Memory overhead: moderate
- Common properties
  - 'KernelFunction' -- variable transformation to apply
  - 'KernelScale' -- scaling applied before the kernel transformation
  - 'BoxConstraint' -- regularization parameter controlling the misclassification penalty
- Multiclass support vector machines
  - The underlying calculations for classification with support vector machines are binary by nature. You can perform multiclass SVM classification by creating an error-correcting output codes (ECOC) classifier.
  - First, create a template for a binary classifier.
  - Second, create the multiclass SVM classifier using the function fitcecoc (see the sketch below).
- Special notes
  - SVMs use a distance-based algorithm. For data that is not normalized, use the 'Standardize' option.
  - Linear SVMs work well for "wide" data (more predictors than observations). Gaussian SVMs often work better on "tall" data (more observations than predictors).

Cross Validation
- To compare model performance, we can calculate the loss for each method and pick the method with the minimum loss.
- The loss is calculated on a specific test set. It is possible that a learning algorithm performs well on that particular test data but does not generalize well to other data.
- The general idea of cross validation is to repeat the above process by creating different training and test sets, fitting the model to each training set, and calculating the loss on the corresponding test set.
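A sketch of the two-step multiclass SVM workflow described in the SVM slide above; the built-in Iris data and the Gaussian kernel are illustrative assumptions.

    load fisheriris
    % Step 1: create a template for the underlying binary SVM learners.
    t = templateSVM('KernelFunction', 'gaussian', ...   % kernel transformation to apply
                    'Standardize', true);               % normalize the predictors
    % Step 2: combine the binary learners with error-correcting output codes.
    ecocModel = fitcecoc(meas, species, 'Learners', t);
    resubLoss(ecocModel)                                 % training-set misclassification rate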

Keyword-value pairs for cross validation
- Cross validation can be requested directly from the fitting function by adding one of the following option name / value pairs:
  - 'CrossVal', 'on' -- 10-fold cross validation
  - 'Holdout', scalar from 0 to 1 -- holdout with the given fraction reserved for validation
  - 'KFold', k (scalar) -- k-fold cross validation
  - 'Leaveout', 'on' -- leave-one-out cross validation
- If you already have a partition created using the cvpartition function, you can also provide that to the fitting function with the 'CVPartition' option:
    part = cvpartition(y, 'KFold', k);
- To evaluate a cross-validated model, use the kfoldLoss function to compute the loss: kfoldLoss(mdl) (see the sketch below)

Strategies to reduce predictors
- High-dimensional data: machine learning problems often involve high-dimensional data with hundreds or thousands of predictors, e.g. facial recognition or weather prediction.
- Learning algorithms are often computation intensive, and reducing the number of predictors can have significant benefits in calculation time and memory consumption.
- Reducing the number of predictors results in simpler models which generalize better and are easier to interpret.
- Two common approaches:
  - Feature transformation -- transform the coordinate space of the observed variables.
  - Feature selection -- choose a subset of the observed variables.
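A minimal sketch tying together the cross-validation name-value pairs listed above; the Iris data and the k-NN learner are assumptions, since the slides do not prescribe a specific model here.

    load fisheriris
    part  = cvpartition(species, 'KFold', 5);              % 5-fold partition of the labels
    cvMdl = fitcknn(meas, species, 'CVPartition', part);   % returns a cross-validated model
    kfoldLoss(cvMdl)                                        % average misclassification rate across folds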

Feature Transformation
- Principal Component Analysis (PCA) transforms an n-dimensional feature space into a new n-dimensional space of orthogonal components. The components are ordered by the variation explained in the data.
- PCA can therefore be used for dimensionality reduction by discarding the components beyond a chosen threshold of explained variance.
- In the following example (reconstructed as a sketch below), the input X has 11 columns, but the first 9 principal components explain more than 95% of the variance.

Feature Selection
- The data often contains predictors which do not have any relationship with the response. These predictors should not be included in a model. For example, the patient id in the heart health data does not have any relationship with the risk of heart disease.
- For decision tree models, the predictorImportance method can be used to identify the predictor variables that are important for creating an accurate model.
- Sequential feature selection incrementally adds predictors to the model as long as there is a reduction in the prediction error.
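A hedged reconstruction of the PCA example referenced above: the data matrix X (observations by predictors, 11 columns) is assumed; only the 95% explained-variance threshold comes from the slide.

    [coeff, score, ~, ~, explained] = pca(X);        % components and percentage of variance explained
    nKeep    = find(cumsum(explained) >= 95, 1);     % smallest number of components covering 95%
    Xreduced = score(:, 1:nKeep);                    % reduced feature space to use for training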

Ensemble Learning
- Classification trees are considered weak learners, meaning that they are highly sensitive to the data used to train them. Thus, two slightly different sets of training data can produce two completely different trees and, consequently, different predictions.
- However, this weakness can be harnessed as a strength by creating several trees (or, following the analogous naming, a forest). New observations can then be applied to all the trees and the resulting predictions compared.
- To improve the classifier, we can use ensemble learning methods (see the sketch below).
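A sketch of a bagged-tree ensemble (a "forest" of classification trees) in MATLAB; fitcensemble and the Iris data are assumptions here, as the slides do not name a specific ensemble function.

    load fisheriris
    ensModel = fitcensemble(meas, species, ...
                            'Method', 'Bag', ...        % bootstrap-aggregated trees (a forest)
                            'NumLearningCycles', 100);  % number of trees to grow (assumed value)
    resubLoss(ensModel)                                 % training-set misclassification rate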
