Introduction to Machine Learning - Knight Foundation School of ...

Transcription

Lecture Slides for Introduction to Machine Learning, 2nd Edition, Ethem Alpaydın, The MIT Press (~ethem/i2ml2e); modified by Leonardo Bobadilla, with some parts from http://www.cs.tau.ac.il/~apartzin/MachineLearning/

Outline
This class: Ch 5: Multivariate Methods
- Multivariate Data
- Parameter Estimation
- Estimation of Missing Values
- Multivariate Classification

CHAPTER 5: Multivariate Methods

Multivariate Distribution
- Assume all members of a class came from a joint distribution.
- Can learn the distribution from data: P(x|C).
- Assign a new instance to the most probable class P(C|x) using Bayes' rule.
- An instance is described by a vector of correlated parameters: the realm of multivariate distributions, e.g., the multivariate normal.

Multivariate Data
- Multiple measurements (sensors)
- d inputs/features/attributes: d-variate
- N instances/observations/examples

X = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}

Multivariate Parameters

Mean: E[x] = \mu = [\mu_1, \ldots, \mu_d]^T

Covariance: \sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)

Correlation: \mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}

\Sigma \equiv \mathrm{Cov}(X) = E[(X - \mu)(X - \mu)^T] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}

Parameter Estimation

Sample mean m: m_i = \frac{1}{N} \sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d

Covariance matrix S: s_{ij} = \frac{1}{N} \sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)

Correlation matrix R: r_{ij} = \frac{s_{ij}}{s_i s_j}
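A minimal NumPy sketch of these estimators; the data values below are made up for illustration:

```python
import numpy as np

# Toy data: N = 5 instances, d = 2 features (age, income); values are illustrative.
X = np.array([[25, 30000.0],
              [32, 42000.0],
              [47, 61000.0],
              [51, 58000.0],
              [62, 70000.0]])

N, d = X.shape

m = X.mean(axis=0)                 # sample mean vector, m_i
Xc = X - m                         # centered data
S = (Xc.T @ Xc) / N                # covariance matrix, s_ij (divides by N as on the slide)
s = np.sqrt(np.diag(S))            # per-feature standard deviations
R = S / np.outer(s, s)             # correlation matrix, r_ij = s_ij / (s_i s_j)

print(m)
print(S)
print(R)                           # agrees with np.corrcoef(X, rowvar=False)
```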

Estimation of Missing Values
- What to do if certain instances have missing attributes?
- Ignore those instances: not a good idea if the sample is small.
- Use 'missing' as an attribute value: may give information.
- Imputation: fill in the missing value.
  - Mean imputation: use the most likely value (e.g., the mean).
  - Imputation by regression: predict the missing value from the other attributes.
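A short sketch of mean imputation with NumPy, assuming (for illustration only) that missing entries are encoded as NaN:

```python
import numpy as np

# Data matrix with missing entries marked as NaN (illustrative values).
X = np.array([[25.0, 30000.0],
              [32.0,  np.nan],
              [np.nan, 61000.0],
              [51.0, 58000.0]])

col_means = np.nanmean(X, axis=0)            # per-feature mean, ignoring missing entries
missing = np.isnan(X)
X_imputed = np.where(missing, col_means, X)  # mean imputation: fill NaNs with the feature mean

print(X_imputed)
```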

Multivariate Normal
- Have d attributes.
- Often we can assume each one is distributed normally.
- Attributes might be dependent/correlated.
- Joint distribution of several correlated variables: P(X_1 = x_1, X_2 = x_2, ..., X_d = x_d) = ?
- X_i is normally distributed with mean \mu_i and variance \sigma_i^2.

Multivariate Normal

x \sim N_d(\mu, \Sigma)

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]

- Mahalanobis distance: (x - \mu)^T \Sigma^{-1} (x - \mu)
- If two variables are strongly correlated, their covariance is large; because the distance uses the inverse covariance, such deviations contribute less to the Mahalanobis distance and therefore more to the probability.
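A small NumPy sketch of the density and the Mahalanobis distance; the mean, covariance, and query point are illustrative, and SciPy is used only as a cross-check:

```python
import numpy as np
from scipy.stats import multivariate_normal  # only used as a cross-check

mu = np.array([38.0, 50000.0])
Sigma = np.array([[150.0, 40000.0],
                  [40000.0, 2.0e8]])
x = np.array([45.0, 62000.0])

d = len(mu)
diff = x - mu
Sigma_inv = np.linalg.inv(Sigma)

maha2 = diff @ Sigma_inv @ diff               # squared Mahalanobis distance
p = np.exp(-0.5 * maha2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

print(maha2, p)
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # same value as p
```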

Bivariate Normal (figure)

Multivariate Normal Distribution
- Mahalanobis distance: (x - \mu)^T \Sigma^{-1} (x - \mu) measures the distance from x to \mu in terms of \Sigma (it normalizes for differences in variances and correlations).
- Bivariate case (d = 2):

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}

p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2(1-\rho^2)} \left( z_1^2 - 2\rho z_1 z_2 + z_2^2 \right) \right]

where z_i = (x_i - \mu_i)/\sigma_i.

Bivariate Normal (figure)

Bivariate Normal (figure)

Independent Inputs: Naive Bayes
- If the x_i are independent, the off-diagonal entries of \Sigma are 0 and the Mahalanobis distance reduces to a weighted (by 1/\sigma_i) Euclidean distance:

p(x) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left[ -\frac{1}{2} \sum_{i=1}^{d} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right]

- If the variances are also equal, it reduces to the Euclidean distance.
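A quick sketch showing that the product of univariate normals equals a multivariate normal with diagonal covariance; the means, standard deviations, and query point are made up:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.8, 1.5, 2.0])        # per-feature standard deviations
x = np.array([1.3, -1.1, 2.0])

# Product of univariate normals (Naive Bayes factorization) ...
p_naive = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# ... equals the multivariate normal with a diagonal covariance.
p_mvn = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma ** 2))

print(p_naive, p_mvn)                    # same value
```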

Projection Distribution
- Example: a vector of 3 features with a multivariate normal distribution.
- Its projection onto a 2-dimensional space (e.g., the XY plane) gives vectors of 2 features.
- The projection is also a multivariate normal distribution.
- In general, the projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal.

1D Projection (figure)

Multivariate Classification
- Assume the members of a class come from a single multivariate distribution.
- The multivariate normal is a good choice:
  - Easy to analyze.
  - Models many natural phenomena.
  - Models a class as having a single prototype source (the mean) that is slightly randomly perturbed.

Example
- Matching cars to customers.
- Each car defines a class of matching customers.
- Customers are described by (age, income).
- There is a correlation between age and income.
- Assume each class is multivariate normal.
- Need to learn P(x|C) from data.
- Use Bayes' rule to compute P(C|x).

Parametric Classification

If p(x|C_i) \sim N(\mu_i, \Sigma_i):

p(x|C_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]

The discriminant functions are

g_i(x) = \log P(C_i|x) = \log \frac{p(x|C_i) P(C_i)}{P(x)} = \log p(x|C_i) + \log P(C_i) - \log P(x)
= -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log P(C_i) - \log P(x)

- Need to know the covariance matrix and the mean to compute the discriminant functions.
- Can ignore P(x) as it is the same for all classes.
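A minimal sketch of evaluating these discriminants and picking the argmax class; the means, covariances, and priors below are made-up estimates, and the terms shared by all classes are dropped:

```python
import numpy as np

def discriminant(x, mean, cov, prior):
    """g_i(x) = -0.5*log|S_i| - 0.5*(x - m_i)^T S_i^{-1} (x - m_i) + log P(C_i),
    dropping the constant -(d/2)*log(2*pi) and log P(x) shared by all classes."""
    diff = x - mean
    sign, logdet = np.linalg.slogdet(cov)     # numerically safer than log(det(cov))
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff) + np.log(prior)

# Illustrative parameters for two classes over (age, income).
params = [
    (np.array([30.0, 35000.0]), np.array([[60.0, 1.5e4], [1.5e4, 4.0e7]]), 0.6),
    (np.array([50.0, 65000.0]), np.array([[80.0, 2.0e4], [2.0e4, 9.0e7]]), 0.4),
]

x = np.array([42.0, 52000.0])
scores = [discriminant(x, m, S, p) for (m, S, p) in params]
print(scores, "-> class", int(np.argmax(scores)))
```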

Estimation of Parameters

\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}

m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}

S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}

g_i(x) = -\frac{1}{2}\log|S_i| - \frac{1}{2}(x - m_i)^T S_i^{-1} (x - m_i) + \log \hat{P}(C_i)

(where r_i^t = 1 if x^t belongs to C_i and 0 otherwise)

Covariance Matrix per Class
- Quadratic discriminant:

g_i(x) = -\frac{1}{2}\log|S_i| - \frac{1}{2}\left( x^T S_i^{-1} x - 2 x^T S_i^{-1} m_i + m_i^T S_i^{-1} m_i \right) + \log \hat{P}(C_i)
= x^T W_i x + w_i^T x + w_{i0}

where

W_i = -\frac{1}{2} S_i^{-1}
w_i = S_i^{-1} m_i
w_{i0} = -\frac{1}{2} m_i^T S_i^{-1} m_i - \frac{1}{2}\log|S_i| + \log \hat{P}(C_i)

- Requires estimating K \cdot d(d+1)/2 parameters for the covariance matrices.
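A quick sketch checking that the expanded quadratic form matches the direct discriminant; the class parameters are illustrative:

```python
import numpy as np

# Illustrative class parameters (the estimates are made up).
m = np.array([30.0, 35000.0])
S = np.array([[60.0, 1.5e4],
              [1.5e4, 4.0e7]])
prior = 0.6

S_inv = np.linalg.inv(S)
_, logdet = np.linalg.slogdet(S)

W = -0.5 * S_inv                                            # quadratic term
w = S_inv @ m                                               # linear term
w0 = -0.5 * m @ S_inv @ m - 0.5 * logdet + np.log(prior)    # constant term

x = np.array([42.0, 52000.0])
g_quadratic = x @ W @ x + w @ x + w0

# Same value as the direct form -0.5*log|S| - 0.5*(x-m)^T S^{-1} (x-m) + log P(C).
diff = x - m
g_direct = -0.5 * logdet - 0.5 * diff @ S_inv @ diff + np.log(prior)
print(g_quadratic, g_direct)
```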

Figure: likelihoods and the posterior for C1, with the discriminant at P(C1|x) = 0.5

Common Covariance Matrix S
- If there is not enough data, we can assume all classes share a common sample covariance matrix S:

S = \sum_i \hat{P}(C_i) S_i

- The discriminant reduces to a linear discriminant (x^T S^{-1} x is common to all discriminants and can be dropped):

g_i(x) = -\frac{1}{2}(x - m_i)^T S^{-1} (x - m_i) + \log \hat{P}(C_i)

g_i(x) = w_i^T x + w_{i0}, \quad \text{where } w_i = S^{-1} m_i, \quad w_{i0} = -\frac{1}{2} m_i^T S^{-1} m_i + \log \hat{P}(C_i)
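A sketch of the shared-covariance (linear) classifier built from these formulas, with made-up per-class estimates:

```python
import numpy as np

# Per-class estimates (illustrative) and priors.
means  = [np.array([30.0, 35000.0]), np.array([50.0, 65000.0])]
covs   = [np.array([[60.0, 1.5e4], [1.5e4, 4.0e7]]),
          np.array([[80.0, 2.0e4], [2.0e4, 9.0e7]])]
priors = [0.6, 0.4]

# Pooled (shared) covariance: S = sum_i P(C_i) * S_i.
S = sum(p * Si for p, Si in zip(priors, covs))
S_inv = np.linalg.inv(S)

def linear_discriminant(x, m, prior):
    w = S_inv @ m
    w0 = -0.5 * m @ S_inv @ m + np.log(prior)
    return w @ x + w0

x = np.array([42.0, 52000.0])
scores = [linear_discriminant(x, m, p) for m, p in zip(means, priors)]
print(scores, "-> class", int(np.argmax(scores)))
```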

Common Covariance Matrix S (figure)

Diagonal S
- When the x_j, j = 1, \ldots, d, are independent, \Sigma is diagonal:

p(x|C_i) = \prod_j p(x_j|C_i) \quad \text{(Naive Bayes' assumption)}

g_i(x) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)

- Classify based on the weighted Euclidean distance (in s_j units) to the nearest mean.

Diagonal S (figure): variances may be different

Diagonal S, equal variances
- Nearest mean classifier: classify based on the Euclidean distance to the nearest mean:

g_i(x) = -\frac{\|x - m_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2} \sum_{j=1}^{d} (x_j - m_{ij})^2 + \log \hat{P}(C_i)

- Each mean can be considered a prototype or template, and this is template matching.
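A tiny sketch of the nearest mean classifier; it additionally assumes equal priors so that maximizing g_i(x) reduces to minimizing the distance, and the prototypes are illustrative:

```python
import numpy as np

# Class prototypes (means); illustrative values for two classes over (age, income).
means = np.array([[30.0, 35000.0],
                  [50.0, 65000.0]])

def nearest_mean(x, means):
    # With equal priors and equal spherical variances, maximizing g_i(x)
    # is the same as minimizing the Euclidean distance to each mean.
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

x = np.array([42.0, 52000.0])
print(nearest_mean(x, means))
```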

Diagonal S, equal variances (figure)

Model Selection

Assumption | Covariance matrix | No. of parameters
Shared, hyperspheric | S_i = S = s^2 I | 1
Shared, axis-aligned | S_i = S, with s_ij = 0 | d
Shared, hyperellipsoidal | S_i = S | d(d+1)/2
Different, hyperellipsoidal | S_i | K d(d+1)/2

- As we increase complexity (a less restricted S), bias decreases and variance increases.
- Assume simple models (allow some bias) to control variance (regularization).

Model Selection
- A different covariance matrix for each class means many parameters to estimate: small bias, large variance.
- Common covariance matrices, diagonal covariance, etc. reduce the number of parameters: increased bias but controlled variance.
- Are there in-between states?

Regularized Discriminant Analysis (RDA)
- Write the class covariance as the combination S_i' = a \sigma^2 I + b S + (1 - a - b) S_i:
  - a = b = 0: quadratic classifier
  - a = 0, b = 1: shared covariance, linear classifier
  - a = 1, b = 0: diagonal covariance
- Choose the best a, b by cross-validation.
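A minimal sketch of this regularized covariance combination; the matrices are illustrative and the choice of sigma^2 below is just one simple option, with a and b meant to be picked by cross-validation:

```python
import numpy as np

def regularized_cov(S_i, S_shared, sigma2, a, b):
    """RDA-style covariance: a*sigma^2*I + b*S + (1 - a - b)*S_i.
    a = b = 0 keeps the class covariance (quadratic classifier);
    a = 0, b = 1 uses the shared covariance (linear classifier);
    a = 1, b = 0 shrinks to sigma^2 * I."""
    d = S_i.shape[0]
    return a * sigma2 * np.eye(d) + b * S_shared + (1 - a - b) * S_i

# Illustrative matrices; in practice a and b are chosen by cross-validation.
S_i = np.array([[2.0, 0.8], [0.8, 1.5]])
S_shared = np.array([[1.5, 0.3], [0.3, 1.2]])
sigma2 = float(np.trace(S_shared)) / 2      # one simple choice of overall variance scale

print(regularized_cov(S_i, S_shared, sigma2, a=0.2, b=0.5))
```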

Model Selection: Example (figure)

Model Selection (figure)

Discrete Features
- Binary features: p_{ij} \equiv p(x_j = 1 | C_i)
- If the x_j are independent (Naive Bayes'):

p(x|C_i) = \prod_{j=1}^{d} p_{ij}^{x_j} (1 - p_{ij})^{(1 - x_j)}

- The discriminant is linear:

g_i(x) = \log p(x|C_i) + \log P(C_i) = \sum_j \left[ x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij}) \right] + \log P(C_i)

- Estimated parameters:

\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t x_j^t \; r_i^t / x_j^t} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}
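A small Bernoulli Naive Bayes sketch of these formulas on toy binary data; note that Laplace (add-one) smoothing is added here to avoid log(0), whereas the slide's estimator is the plain ratio:

```python
import numpy as np

# Toy binary data: rows are instances, columns are binary features; labels in {0, 1}.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 0])

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
# Laplace smoothing (add-one) to avoid log(0); the slide's estimator is the unsmoothed ratio.
p = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def discriminant(x):
    return np.log(priors) + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)

x_new = np.array([1, 0, 1])
g = discriminant(x_new)
print(g, "-> class", classes[np.argmax(g)])
```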

Multivariate Regression

r^t = g(x^t | w_0, w_1, \ldots, w_d) + \varepsilon

- Multivariate linear model:

g(x^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t

E(w_0, w_1, \ldots, w_d | X) = \frac{1}{2} \sum_t \left[ r^t - (w_0 + w_1 x_1^t + \cdots + w_d x_d^t) \right]^2
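A least-squares sketch of fitting this linear model with NumPy; the data values are made up, and minimizing the error with or without the 1/2 factor gives the same weights:

```python
import numpy as np

# Toy regression data (illustrative): N = 5 instances, d = 2 features.
X = np.array([[25, 30000.0],
              [32, 42000.0],
              [47, 61000.0],
              [51, 58000.0],
              [62, 70000.0]])
r = np.array([1.2, 1.9, 3.1, 3.0, 4.2])

# Augment with a column of ones so w_0 is the intercept.
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares minimizer of E(w | X); lstsq is preferred over forming (A^T A)^{-1} directly.
w, *_ = np.linalg.lstsq(A, r, rcond=None)
print(w)          # [w_0, w_1, w_2]
print(A @ w)      # fitted values
```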


CHAPTER 6: Dimensionality Reduction

Dimensionality of Input
- Number of observables (e.g., age and income).
- If the number of observables is increased:
  - More time to compute.
  - More memory to store inputs and intermediate results.
  - More complicated explanations (knowledge from learning): no simple visualization (a 2D vs. a 10D graph); regression with 100 vs. 2 parameters.
  - Need much more data (curse of dimensionality): 1M 1-d inputs is not equal to one input of dimension 1M.

Dimensionality Reduction
- Some features (dimensions) bear little or no useful information (e.g., hair color for car selection):
  - Can drop some features.
  - Have to estimate which features can be dropped from the data.
- Several features can be combined together without loss, or even with a gain, of information (e.g., the incomes of all family members for a loan application):
  - Have to estimate which features to combine from the data.

Feature Selection vs Extraction
- Feature selection: choose k < d important features, ignoring the remaining d - k.
  - Subset selection algorithms.
- Feature extraction: project the original x_i, i = 1, \ldots, d dimensions to new k < d dimensions z_j, j = 1, \ldots, k.
  - Principal Components Analysis (PCA)
  - Linear Discriminant Analysis (LDA)
  - Factor Analysis (FA)

Usage
- Have data of dimension d.
- Reduce the dimensionality to k < d:
  - Discard unimportant features.
  - Combine several features into one.
- Use the resulting k-dimensional data set for:
  - Learning a classification problem (e.g., the parameters of the probabilities P(x|C)).
  - Learning a regression problem (e.g., the parameters of a model y = g(x|\theta)).

Subset Selection
- Have an initial set of features of size d.
- There are 2^d possible subsets.
- Need a criterion to decide which subset is the best.
- Need a way to search over the possible subsets.
- Can't go over all 2^d possibilities; need some heuristics.

"Goodness" of a Feature Set
- Supervised:
  - Train using the selected subset.
  - Estimate the error on a validation data set.
- Unsupervised:
  - Look at the inputs only (e.g., age, income, and savings).
  - Select the subset of 2 that bears most of the information about the person.

Mutual Information
- Have 3 random variables (features) X, Y, Z and have to select the 2 which give the most information.
- If X and Y are "correlated", then much of the information about Y is already in X.
- It makes sense to select features which are "uncorrelated".
- Mutual information (a Kullback-Leibler divergence) is a more general measure of dependence.
- It can be extended to n variables (the information that variables x_1, \ldots, x_n have about variable x_{n+1}).

Subset Selection
- Forward search:
  - Start from an empty set of features.
  - Try each of the remaining features.
  - Estimate the classification/regression error for adding each specific feature.
  - Select the feature that gives the maximum improvement in validation error.
  - Stop when there is no significant improvement.
- Backward search:
  - Start with the original set of size d.
  - Drop the feature with the smallest impact on error.

Subset Selection
- There are 2^d subsets of d features.
- Forward search: add the best feature at each step (a sketch follows this list).
  - The set of features F is initially Ø.
  - At each iteration, find the best new feature: j = \arg\min_i E(F \cup x_i).
  - Add x_j to F if E(F \cup x_j) < E(F).
  - Hill-climbing, O(d^2) algorithm.
- Backward search: start with all features and remove one at a time, if possible.
- Floating search: add k, remove l.
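A self-contained sketch of forward selection using a validation-set error; the synthetic data, the least-squares error criterion, and the stopping rule are illustrative choices, not the only options:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only features 0 and 2 actually matter (a synthetic setup for illustration).
X = rng.normal(size=(200, 5))
r = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
X_tr, X_va, r_tr, r_va = X[:150], X[150:], r[:150], r[150:]

def validation_error(features):
    """Validation MSE of a least-squares fit using only the given feature subset."""
    if not features:
        return np.mean((r_va - r_tr.mean()) ** 2)     # predict the mean if no features
    A_tr = np.hstack([np.ones((len(X_tr), 1)), X_tr[:, features]])
    A_va = np.hstack([np.ones((len(X_va), 1)), X_va[:, features]])
    w, *_ = np.linalg.lstsq(A_tr, r_tr, rcond=None)
    return np.mean((r_va - A_va @ w) ** 2)

selected, best_err = [], validation_error([])
while True:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    if not candidates:
        break
    errs = {j: validation_error(selected + [j]) for j in candidates}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:          # stop when there is no improvement
        break
    selected.append(j_best)
    best_err = errs[j_best]

print(selected, best_err)                 # expected to pick features 0 and 2
```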

Floating Search
- Forward and backward search are "greedy" algorithms:
  - They select the best option at each single step.
  - They do not always achieve the optimal value.
- Floating search:
  - Two types of steps: add k, remove l.
  - More computation.

Feature Extraction
- Face recognition problem:
  - Training data input: pairs of (image, label/name).
  - Classifier input: image.
  - Classifier output: label (name).
- Image: a matrix of 256x256 = 65536 values in the range 0..256.
- Each pixel bears little information, so we can't select the 100 best ones.
- An average of the pixels around specific positions may give an indication about, e.g., eye color.

Projection
- Find a projection matrix W from d-dimensional to k-dimensional vectors that keeps the error low.

PCA: Motivation
- Assume that the d observables are linear combinations of k < d vectors:

z_i = w_{i1} x_1 + \cdots + w_{id} x_d

- We would like to work with this basis as it has lower dimension and retains (almost) all the required information.
- What we expect from such a basis:
  - Uncorrelated components, or otherwise it can be reduced further.
  - Large variance, or otherwise it bears no information.

PCA: Motivation (figure)

PCA: Motivation
- Choose directions such that the total variance of the data is maximized:
  - Maximize total variance.
  - Minimize correlation.
- Choose directions that are orthogonal.
- Choose k < d orthogonal directions which maximize total variance.

PCA
- Choosing only directions (unit-length vectors): maximize the variance subject to a constraint using Lagrange multipliers.
- Taking derivatives gives an eigenvector equation (see the sketch below); since we want to maximize the variance, we choose the eigenvector with the largest eigenvalue.
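A brief sketch of that derivation for the first principal component, written out as the standard Lagrangian steps:

\max_{w_1}\; w_1^T \Sigma w_1 - \alpha\,(w_1^T w_1 - 1)
\quad\Rightarrow\quad
2\Sigma w_1 - 2\alpha w_1 = 0
\quad\Rightarrow\quad
\Sigma w_1 = \alpha w_1

So w_1 is an eigenvector of \Sigma, and since the projected variance is w_1^T \Sigma w_1 = \alpha w_1^T w_1 = \alpha, it is maximized by the eigenvector with the largest eigenvalue.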

PCA
- Work in the d-dimensional feature space.
- Estimate the d x d symmetric covariance matrix \Sigma from the samples.
- Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors.
- The first eigenvector is the direction with the largest variance.

What PCA Does

z = W^T(x - m)

where the columns of W are the eigenvectors of \Sigma and m is the sample mean.
- Centers the data at the origin and rotates the axes.
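A NumPy sketch of PCA by eigen-decomposition of the sample covariance, including the proportion-of-variance computation used on the next slide; the synthetic data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 3 features where the third is nearly a combination of the first two.
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.0, 0.8],
                                          [0.0, 1.0, 0.6]]) + 0.05 * rng.normal(size=(500, 3))

m = X.mean(axis=0)
S = np.cov(X - m, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix; sort by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

k = 2
Z = (X - m) @ W[:, :k]          # z = W^T (x - m): centered, rotated, truncated to k dims

pov = np.cumsum(eigvals) / np.sum(eigvals)   # proportion of variance explained (next slide)
print(pov)                      # e.g., choose the smallest k with PoV > 0.9
```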

How to Choose k?
- Proportion of variance (PoV) explained:

\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}

where the \lambda_i are sorted in descending order.
- Typically, stop at PoV > 0.9.
- A scree graph plots PoV vs. k; stop at the "elbow".


PCA
- PCA is unsupervised (it does not take class information into account).
- Can take classes into account: Karhunen-Loève expansion.
  - Estimate the covariance per class.
  - Take the average weighted by the priors.
- Common principal components:
  - Assume all classes have the same eigenvectors (directions) but different variances.

PCA
- Does not try to explain noise: large noise can become a new dimension / the largest PC.
- Interested in the resulting uncorrelated variables which explain a large portion of the total sample variance.
- Sometimes we are interested instead in explained shared variance (common factors) that affect the data.
