Lecture 10 Pattern Recognition & Learning II

Transcription

Lecture 10: Pattern Recognition & Learning II. UW CSE vision faculty.

Flashback: Sigmoidal Networks. Output v = g(w^T u) = g(Σ_i w_i u_i), with weight vector w and input nodes u = (u1, u2, u3)^T. The most commonly used differentiable function is the sigmoid function: g(a) = 1 / (1 + e^(−βa)). It is a non-linear “squashing” function: it squashes its input to lie between 0 and 1, and the parameter β controls the slope.
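As a small illustration (not from the original slides), here is a minimal NumPy sketch of the sigmoid unit just described; the function and variable names are placeholders:

```python
import numpy as np

def sigmoid(a, beta=1.0):
    # g(a) = 1 / (1 + e^(-beta * a)); beta controls the slope
    return 1.0 / (1.0 + np.exp(-beta * a))

# Output of a sigmoidal unit: v = g(w^T u), squashed into (0, 1)
u = np.array([0.5, -1.0, 2.0])   # input nodes (u1, u2, u3)^T
w = np.array([0.1, 0.4, -0.3])   # weight vector
v = sigmoid(w @ u)
print(v)
```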

How do we learn the weights? Given training examples (u^m, d^m) (m = 1, ..., N), define a sum-of-squared-output-errors function (also called a cost function or “energy” function): E(w) = (1/2) Σ_m (d^m − v^m)^2, where v^m = g(w^T u^m).

How do we learn the weights? We would like to choose the w that minimizes E(w) = (1/2) Σ_m (d^m − v^m)^2, where v^m = g(w^T u^m). How?

Gradient-Descent Learning (“Hill-Climbing”). Idea: change w in proportion to −dE/dw (why does this work?): w ← w − ε dE/dw, where dE/dw = −Σ_m (d^m − v^m) dv^m/dw = −Σ_m (d^m − v^m) g′(w^T u^m) u^m. The factor g′(w^T u^m) is the derivative of the sigmoid.
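A minimal sketch of this batch gradient-descent rule for a single sigmoidal unit, assuming a learning rate eps and the sigmoid above (names and values are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(a, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * a))

def sigmoid_deriv(a, beta=1.0):
    g = sigmoid(a, beta)
    return beta * g * (1.0 - g)            # g'(a) for the sigmoid

def batch_gradient_descent(U, d, eps=0.1, n_iters=1000):
    """U: N x n matrix whose rows are the inputs u^m; d: length-N vector of targets d^m."""
    w = np.zeros(U.shape[1])
    for _ in range(n_iters):
        a = U @ w                          # w^T u^m for every example m
        v = sigmoid(a)                     # v^m = g(w^T u^m)
        grad = -((d - v) * sigmoid_deriv(a)) @ U   # dE/dw = -sum_m (d^m - v^m) g'(.) u^m
        w = w - eps * grad                 # w <- w - eps * dE/dw
    return w
```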

“Stochastic” Gradient Descent. What if the inputs only arrive one-by-one? Stochastic gradient descent approximates the sum over all inputs with an “on-line” running sum: w ← w − ε dE_1/dw, where dE_1/dw = −(d^m − v^m) g′(w^T u^m) u^m. The quantity (d^m − v^m) is the “delta” error, so this update is also known as the “delta rule” or the “LMS (least mean square) rule”.
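The corresponding on-line update (“delta rule”) touches one example at a time; a sketch under the same assumptions as the batch version above:

```python
import numpy as np

def delta_rule_update(w, u, d, eps=0.1, beta=1.0):
    """One on-line step for a single example (u, d): w <- w - eps * dE_1/dw."""
    v = 1.0 / (1.0 + np.exp(-beta * (w @ u)))   # v = g(w^T u)
    delta = d - v                               # the "delta" error (d^m - v^m)
    g_prime = beta * v * (1.0 - v)              # sigmoid derivative at w^T u
    return w + eps * delta * g_prime * u        # i.e. w - eps * (-(d - v) g'(w^T u) u)
```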

Recall from Last Time: Classification Problem. Given a training dataset of (input image, output class) pairs, build a classifier that outputs a class for any new input image.

Example: Face Detection. How do we build a classifier to distinguish between faces and other objects?

The Classification Problem. In the training data, faces are labeled +1 and other objects are labeled −1. Idea: find a separating hyperplane (a line in this case).

Recall from Last Time: Perceptron. An artificial “neuron” with input vector x, output value v, weight vector w, and threshold b. The inputs x1, ..., xn are multiplied by the weights w1, ..., wn and summed; the weighted sum is then thresholded: output v = +1 if Σ_i w_i x_i ≥ b, and v = −1 if Σ_i w_i x_i < b.

Perceptron. Equivalently: v = sign(Σ_i w_i x_i − b) = sign(w · x − b), where sign(z) = +1 if z ≥ 0 and −1 if z < 0.
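A tiny sketch of the perceptron's decision rule (the weights, threshold, and input here are made-up values for illustration):

```python
import numpy as np

def perceptron_output(w, b, x):
    # v = sign(w . x - b): +1 if the weighted sum reaches the threshold b, else -1
    return 1 if w @ x - b >= 0 else -1

w = np.array([0.5, -0.2, 0.1])   # weights (illustrative)
b = 0.3                          # threshold
x = np.array([1.0, 0.5, 2.0])    # input vector
print(perceptron_output(w, b, x))
```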

Perceptrons and Classification. The weighted sum forms a linear hyperplane: Σ_i w_i x_i − b = 0. Due to the threshold function, everything on one side of this hyperplane is labeled as class 1 (output +1) and everything on the other side is labeled as class 2 (output −1).

Separating Hyperplane. A hyperplane w · x + b = 0 separates the faces (+1) from the other objects (−1). We need to choose w and b based on the training data.

Separating Hyperplanes. Different choices of w and b give different hyperplanes. (This and the next few slides adapted from Andrew Moore’s.)

Which hyperplane is best?

How about the one right in the middle? Intuitively, this boundary seems good because it is robust to minor perturbations of data points near the boundary (the output does not switch from +1 to −1).

Margin. Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Maximum Margin and Support Vector Machine. Support vectors are the data points that the margin pushes up against. The maximum margin classifier is called a Support Vector Machine (in this case, a linear SVM or LSVM).

Why Maximum Margin? It is robust to small perturbations of data points near the boundary; there is theory showing it is best for generalization to new points (see the online tutorial on the class webpage); and empirically it works great.

Support Vector Machines. Suppose the training data points (x_i, y_i) satisfy: w · x_i + b ≥ +1 for y_i = +1, and w · x_i + b ≤ −1 for y_i = −1. This can be rewritten as y_i (w · x_i + b) ≥ 1. We can always achieve this by rescaling w and b, without affecting the separating hyperplane w · x + b = 0.

Estimating the Margin. The margin is given by m = 2 / ||w|| (see the Burges tutorial online): it is the distance between the hyperplane touching Class 1 (w · x + b = +1) and the hyperplane touching Class 2 (w · x + b = −1). The margin can be calculated from the expression for the distance from a point to a line (see, e.g., imensional.html).

Learning the Maximum Margin Classifier. We want to maximize the margin 2 / ||w|| subject to y_i (w · x_i + b) ≥ 1 for all i. This is equivalent to finding the w and b that minimize (1/2) ||w||^2 subject to y_i (w · x_i + b) ≥ 1 for all i: a constrained optimization problem that can be solved using the Lagrange multiplier method.
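To make the constrained problem concrete, here is a sketch that hands min (1/2)||w||^2 subject to y_i (w · x_i + b) ≥ 1 to SciPy's general-purpose SLSQP solver on made-up 2-D separable data; a real SVM package would be used in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative): x_i in R^2, y_i in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:2]
    return 0.5 * (w @ w)                   # minimize (1/2)||w||^2

constraints = [                            # y_i (w . x_i + b) - 1 >= 0 for every i
    {"type": "ineq", "fun": lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```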

Learning the Maximum Margin Classifier. Using the Lagrangian formulation with Lagrange multipliers α_i, we get (see the Burges tutorial online): w = Σ_i α_i y_i x_i, where the α_i are obtained by maximizing Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem, so a global maximum can always be found.
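For reference, the dual can be passed to a generic QP solver. This is only a sketch, assuming the third-party cvxopt package is available; X is an N x n data matrix and y a vector of ±1 labels:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_alphas(X, y):
    """Maximize sum_i a_i - (1/2) sum_ij a_i a_j y_i y_j (x_i . x_j)
    subject to a_i >= 0 and sum_i a_i y_i = 0, written as a standard-form QP."""
    N = X.shape[0]
    K = X @ X.T                                       # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(N)) # quadratic term (tiny ridge for stability)
    q = matrix(-np.ones(N))                           # minus sign turns max into min
    G = matrix(-np.eye(N))                            # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))        # equality: sum_i a_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alphas = np.ravel(sol["x"])
    w = (alphas * y) @ X                              # w = sum_i a_i y_i x_i
    return alphas, w
```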

Geometrical Interpretation. The x_i with non-zero α_i are called support vectors. In the example figure, only three points have non-zero multipliers (α_1 = 0.8, α_6 = 1.4, α_8 = 0.6); all the other α_i (α_2, α_3, α_4, α_5, α_7, α_9, α_10) are 0.

What if data is not linearly separable?

Approach 1: Soft Margin SVMs. Allow errors ξ_i (deviations from the margin) and trade off the margin against the errors. Minimize (1/2) ||w||^2 + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
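Off-the-shelf SVM libraries expose the trade-off constant C directly; a minimal scikit-learn sketch on made-up data (the arrays and C value are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one overlapping point, so a perfect separation is impossible
X_train = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5], [2.5, 2.5]])
y_train = np.array([1, 1, -1, -1, -1])

# Larger C penalizes the slack errors xi_i more heavily (closer to a hard margin);
# smaller C tolerates more violations in exchange for a wider margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_, clf.support_)
```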

What if data is not linearly separable: Other Ideas? Consider the XOR problem with inputs u1, u2 and outputs as follows: (u1, u2) = (−1, −1) → 1, (1, −1) → −1, (−1, 1) → −1, (1, 1) → 1. Plotted in the (u1, u2) plane, the two classes cannot be separated by a single line. Can we do something to the inputs?

Another Example. Not linearly separable.

What if data is not linearly separable? Approach 2: map the original input space to a higher-dimensional feature space, Φ: x → φ(x), and use a linear classifier in the higher-dimensional space.

Problem with high-dimensional spaces. With the mapping Φ: x → φ(x), computation in the high-dimensional feature space can be costly, and the high-dimensional projection function φ(x) may be too complicated to compute. The kernel trick to the rescue!

The Kernel Trick. Recall the SVM optimization problem: maximize Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0. Insight: the data points only appear as inner products, so there is no need to compute φ(x) explicitly! Just replace the inner product x_i · x_j with a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j). E.g., the Gaussian kernel K(x_i, x_j) = exp(−||x_i − x_j||^2 / 2σ^2), or the polynomial kernel K(x_i, x_j) = (x_i · x_j + 1)^d.
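The two kernels mentioned above are easy to compute directly from the data points; a small sketch (the inputs are arbitrary example vectors):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, d=2):
    # K(xi, xj) = (xi . xj + 1)^d
    return (xi @ xj + 1.0) ** d

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(gaussian_kernel(xi, xj), polynomial_kernel(xi, xj))
```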

An Example for φ(.) and K(.,.). Suppose φ(.) maps x = (x1, x2) to φ(x) = (1, √2·x1, √2·x2, x1^2, x2^2, √2·x1·x2). An inner product in the feature space is then φ(x) · φ(z) = (1 + x1·z1 + x2·z2)^2 = (1 + x · z)^2. So, if we define the kernel function as K(x, z) = (1 + x · z)^2, there is no need to compute φ(.) explicitly. This use of a kernel function to avoid computing φ(.) explicitly is known as the kernel trick.
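The identity can be checked numerically; a sketch for this degree-2 case in two dimensions (the vectors are arbitrary test values):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x = (x1, x2)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    # Degree-2 polynomial kernel: (x . z + 1)^2
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(phi(x) @ phi(z))   # inner product computed in the feature space
print(K(x, z))           # same value, without ever forming phi(.)
```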

Summary: Steps for Classification using SVMs. (1) Prepare the data matrix. (2) Select the kernel function to use. (3) Select the parameters of the kernel function: you can use the values suggested by the SVM software, or use cross-validation. (4) Execute the training algorithm to obtain the α_i. (5) Classify new data using the learned α_i.
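These steps map directly onto a standard library workflow; a hedged scikit-learn sketch in which the dataset and the parameter grid are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 1. Prepare the data matrix (synthetic placeholder data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2-3. Select a kernel and pick its parameters by cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# 4. Run the training algorithm (the learned alphas live in best_estimator_.dual_coef_)
search.fit(X_train, y_train)

# 5. Classify new data with the trained classifier
print(search.best_params_, search.score(X_test, y_test))
```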

Face Detection using SVMs. Kernel used: polynomial of degree 2. (Osuna, Freund, Girosi, 1998)

Support Vectors

Another Problem: Skin Detection. Skin pixels have a distinctive range of colors, corresponding to region(s) in RGB color space (for visualization, only the R and G components are shown above). Skin classifier: a pixel X = (R, G, B) is skin if it is in the skin region. But how do we find this region? (This and the next few slides adapted from Steve Seitz’s slides.)

Skin detection as a classification problem. Learn the skin region from labeled examples: manually label pixels in one or more “training images” as skin or not skin, then plot the training data in RGB space (skin pixels shown in orange, non-skin pixels shown in blue). Some skin pixels may fall outside the region and some non-skin pixels inside it. Why? Skin classifier: given X = (R, G, B), determine whether it is skin or not.

Skin classification techniques. Possible classification techniques: nearest neighbor (or K-NN), which finds the labeled pixel closest to X; finding a plane or curve that separates the two classes, e.g., Support Vector Machines (SVMs); or a probabilistic approach, which fits a probability density/distribution model to each class.

Probability. Basic probability: X is a random variable, and P(X) is the probability that X achieves a certain value. P(X) is called a probability distribution function (for discrete X) or a probability density function, PDF (for continuous X). Conditional probability: P(X | Y) is the probability of X given that we already know Y.

Probabilistic skin classification. Now we can model uncertainty: each pixel has a probability of being skin or not skin. Skin classifier: given X = (R, G, B), how do we determine if it is skin or not? Choose the interpretation of highest probability: set X to be a skin pixel if and only if P(skin | X) > P(¬skin | X). Where do we get P(skin | X) and P(¬skin | X)?

Learning conditional PDF’s. We can calculate P(R | skin) from a set of training images: it is simply a histogram over the pixels in the training images, where each bin R_i contains the proportion of skin pixels with color R_i. This doesn’t work as well in higher-dimensional spaces. Why not? Approach: fit parametric PDF functions; a common choice is a rotated Gaussian, specified by a center (mean) and a covariance.

Learning conditional PDF’s. We can calculate P(R | skin) from a set of training images: it is simply a histogram over the pixels in the training images, where each bin R_i contains the proportion of skin pixels with color R_i. But this isn’t quite what we want. Why not? To determine whether a pixel is skin, we want P(skin | R), not P(R | skin). How can we get it?
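A minimal sketch of the histogram estimate of P(R | skin); the pixel arrays are stand-ins for values gathered from manually labeled training images:

```python
import numpy as np

def histogram_likelihood(values, n_bins=32, value_range=(0, 256)):
    """Estimate P(R | class) as a normalized histogram over labeled pixel values."""
    counts, edges = np.histogram(values, bins=n_bins, range=value_range)
    return counts / counts.sum(), edges    # each bin holds a proportion of the pixels

# Placeholder "labeled" R values; real ones would come from the training images
skin_R = np.random.randint(120, 256, size=5000)
nonskin_R = np.random.randint(0, 256, size=20000)
p_R_given_skin, edges = histogram_likelihood(skin_R)
p_R_given_nonskin, _ = histogram_likelihood(nonskin_R)
```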

Bayes rule. In terms of our problem: P(skin | R) = P(R | skin) P(skin) / P(R), where P(R | skin) is what we measure (the likelihood), P(skin | R) is what we want (the posterior), P(skin) is domain knowledge (the prior), and P(R) is a normalization term. What could we use for the prior P(skin)? We could use domain knowledge: P(skin) may be larger if we know the image contains a person, and for a portrait, P(skin) may be higher for pixels in the center. Or we could learn the prior from the training set. How? P(skin) may be the proportion of skin pixels in the training set.

Bayesian estimation. (Figure: the likelihoods P(R | skin) and P(R | ¬skin) and the corresponding unnormalized posteriors.) Bayesian estimation minimizes the probability of misclassification: the goal is to choose the label (skin or ¬skin) that maximizes the posterior. This is called Maximum A Posteriori (MAP) estimation.

Bayesian estimation. (Figure: the likelihoods P(R | skin) and P(R | ¬skin) and the corresponding unnormalized posteriors.) Bayesian estimation minimizes the probability of misclassification: the goal is to choose the label (skin or ¬skin) that maximizes the posterior, which is called Maximum A Posteriori (MAP) estimation. Suppose the prior is uniform: P(skin) = P(¬skin) = 0.5. In this case P(skin | R) is proportional to P(R | skin), so maximizing the posterior is equivalent to maximizing the likelihood: P(R | skin) > P(R | ¬skin) if and only if P(skin | R) > P(¬skin | R). This is called Maximum Likelihood (ML) estimation.
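Putting Bayes' rule and the histogram likelihoods together, here is a sketch of the MAP decision rule, continuing the hypothetical arrays from the histogram sketch above; with a uniform prior it reduces to the ML comparison:

```python
import numpy as np

def classify_pixel(r, p_R_given_skin, p_R_given_nonskin, edges,
                   prior_skin=0.5, prior_nonskin=0.5):
    """MAP rule: label skin iff P(R|skin) P(skin) > P(R|not skin) P(not skin).
    P(R) cancels in the comparison; a 0.5/0.5 prior gives the ML rule."""
    b = np.clip(np.digitize(r, edges) - 1, 0, len(p_R_given_skin) - 1)  # histogram bin of r
    return p_R_given_skin[b] * prior_skin > p_R_given_nonskin[b] * prior_nonskin

# Example with a prior learned as the proportion of skin pixels in the training set:
# is_skin = classify_pixel(200, p_R_given_skin, p_R_given_nonskin, edges,
#                          prior_skin=0.2, prior_nonskin=0.8)
```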

Skin detection results. (Jones & Rehg, 1999)

Next Time: Color. Things to do: work on Project 2; read Chap. 6. Bayes rules! Reverend Thomas Bayes, Nonconformist minister (1702-1761).
