CS 188: Artificial Intelligence - University Of California, Berkeley

Transcription

CS 188: Artificial Intelligence - Perceptrons and Logistic Regression
Nathan Lambert, University of California, Berkeley

Last Times
§ Classification: given inputs x, predict labels (classes) y ∈ Y
§ Naïve Bayes
§ Parameter estimation:
§ MLE, MAP, priors
§ Laplace smoothing
§ Training set, held-out set, test set
[figure: Naïve Bayes model with features F1, F2, ..., Fn]

Linear Classifiers

Feature Vectors
[figure: spam example "Hello, do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..." with features # free: 2, YOUR NAME: 0, MISSPELLED: 2, FROM FRIEND: 0, mapped to SPAM; digit example with features PIXEL-7,12: 1, PIXEL-7,13: 0, NUM LOOPS: 1, mapped to "2"]

Some (Simplified) Biology
§ Very loose inspiration: human neurons

Linear Classifiers
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation
§ If the activation is:
§ Positive, output +1
§ Negative, output -1
[figure: perceptron unit with inputs f1, f2, f3, weights w1, w2, w3, a summation node Σ, and a > 0? threshold]

Weights
§ Binary case: compare features to a weight vector
§ Learning: figure out the weight vector from examples
[figure: weight vector with # free: 4, YOUR NAME: -1, MISSPELLED: 1, FROM FRIEND: -3; dot product positive means the positive class; example feature vectors # free: 2, YOUR NAME: 0, MISSPELLED: 2, FROM FRIEND: 0 and # free: 0, YOUR NAME: 1, MISSPELLED: 1, FROM FRIEND: 1]
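
A minimal sketch of this decision rule in code, using the example weights from the slide (# free: 4, YOUR NAME: -1, MISSPELLED: 1, FROM FRIEND: -3) and the two example feature vectors; everything else is just plain Python.

```python
# Binary linear classifier: classify by the sign of the dot product w . f(x).
weights = {"# free": 4.0, "YOUR NAME": -1.0, "MISSPELLED": 1.0, "FROM FRIEND": -3.0}

def activation(weights, features):
    """Sum of weight * feature value over all features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 (positive class, e.g. SPAM) if the activation is positive, else -1."""
    return +1 if activation(weights, features) > 0 else -1

email_1 = {"# free": 2, "YOUR NAME": 0, "MISSPELLED": 2, "FROM FRIEND": 0}
email_2 = {"# free": 0, "YOUR NAME": 1, "MISSPELLED": 1, "FROM FRIEND": 1}

print(activation(weights, email_1), classify(weights, email_1))   # 10.0  +1
print(activation(weights, email_2), classify(weights, email_2))   # -3.0  -1
```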

Decision Rules

Binary Decision Rule
§ In the space of feature vectors
§ Examples are points
§ Any weight vector is a hyperplane
§ One side corresponds to Y = +1
§ Other corresponds to Y = -1
[figure: free vs. money plane with weights BIAS: -3, free: 4, money: 2; one side of the boundary labeled +1 SPAM, the other -1 HAM]
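
A small sketch of checking which side of this hyperplane a point falls on, using the slide's weights (BIAS: -3, free: 4, money: 2); the two example points are made up for illustration.

```python
# Which side of the hyperplane w . f(x) = 0 does a point fall on?
w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

def side(features):
    score = sum(w[name] * features.get(name, 0.0) for name in w)
    return "+1 (SPAM)" if score > 0 else "-1 (HAM)"

print(side({"BIAS": 1, "free": 1, "money": 1}))   # -3 + 4 + 2 =  3 -> +1 (SPAM)
print(side({"BIAS": 1, "free": 0, "money": 1}))   # -3 + 0 + 2 = -1 -> -1 (HAM)
```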

Weight Updates

Learning: Binary Perceptron
§ Start with weights = 0
§ For each training instance:
§ Classify with current weights
§ If correct (i.e., y = y*), no change!
§ If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
    w = w + y* · f
§ Effect on this example's score. Before: w · f. After: (w + y* · f) · f = w · f + y* (f · f)
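
A minimal sketch of this training loop; feature vectors are plain Python dicts and the toy data at the bottom is hypothetical.

```python
# Binary perceptron: start at zero weights, and on each mistake
# add y* . f(x) to the weight vector (subtract f(x) when y* = -1).
from collections import defaultdict

def dot(w, f):
    return sum(w[name] * value for name, value in f.items())

def train_binary_perceptron(data, passes=10):
    """data: list of (feature_dict, label) pairs with label in {+1, -1}."""
    w = defaultdict(float)                     # start with weights = 0
    for _ in range(passes):
        for f, y_star in data:
            y = +1 if dot(w, f) >= 0 else -1   # classify with current weights
            if y != y_star:                    # if wrong, adjust the weight vector
                for name, value in f.items():
                    w[name] += y_star * value
    return w

# Hypothetical toy data: "free"-heavy emails are spam (+1).
data = [({"BIAS": 1, "free": 2, "money": 1}, +1),
        ({"BIAS": 1, "free": 0, "money": 0}, -1)]
print(dict(train_binary_perceptron(data)))
```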

Examples: Perceptron
§ Separable Case

Multiclass Decision Rule
§ If we have multiple classes:
§ A weight vector for each class: w_y
§ Score (activation) of a class y: w_y · f(x)
§ Prediction: highest score wins, y = argmax_y w_y · f(x)
§ Binary = multiclass where the negative class has weight zero

Learning: Multiclass Perceptron
§ Start with all weights = 0
§ Pick up training examples one by one
§ Predict with current weights
§ If correct, no change!
§ If wrong: lower score of wrong answer, raise score of right answer

Example: Multiclass Perceptron
§ Training sentences and feature vectors over [BIAS, win, game, vote, the]:
§ "win the vote" = [1 1 0 1 1]
§ "win the election" = [1 1 0 0 1]
§ "win the game" = [1 1 1 0 1]
[figure: per-class weight tables over BIAS, win, game, vote, the, updated after each training example]
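
A minimal sketch of the multiclass perceptron update, run on the slide's three sentences; the class labels (POLITICS, SPORTS) are assumed for illustration, since they are not recoverable from the transcription.

```python
# Multiclass perceptron: predict argmax_y w_y . f(x); on a mistake,
# lower the score of the wrong answer and raise the score of the right one.
FEATURES = ["BIAS", "win", "game", "vote", "the"]

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def train_multiclass_perceptron(data, classes, passes=10):
    weights = {y: [0.0] * len(FEATURES) for y in classes}   # all weights start at 0
    for _ in range(passes):
        for f, y_star in data:
            y_pred = max(classes, key=lambda y: dot(weights[y], f))
            if y_pred != y_star:
                weights[y_pred] = [wi - fi for wi, fi in zip(weights[y_pred], f)]
                weights[y_star] = [wi + fi for wi, fi in zip(weights[y_star], f)]
    return weights

# Feature vectors from the slide; the labels below are hypothetical.
data = [([1, 1, 0, 1, 1], "POLITICS"),   # "win the vote"
        ([1, 1, 0, 0, 1], "POLITICS"),   # "win the election"
        ([1, 1, 1, 0, 1], "SPORTS")]     # "win the game"
print(train_multiclass_perceptron(data, ["POLITICS", "SPORTS"]))
```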

Properties of Perceptrons
§ Separability: true if some parameters get the training set perfectly correct
§ Convergence: if the training data is separable, the perceptron will eventually converge (binary case)
§ Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
[figure: separable vs. non-separable point sets]

Problems with the Perceptron
§ Noise: if the data isn't separable, weights might thrash
§ Averaging weight vectors over time can help (averaged perceptron)
§ Mediocre generalization: finds a "barely" separating solution
§ Overtraining: test / held-out accuracy usually rises, then falls
§ Overtraining is a kind of overfitting
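
A small sketch of the averaged-perceptron idea mentioned above: keep a running sum of the weight vector over training and predict with the average. Exactly when to accumulate varies between implementations; this version accumulates after every example and is only an illustration.

```python
# Averaged perceptron (binary): train as usual, but return the average of the
# weight vectors over time, which tends to be more stable on noisy data.
from collections import defaultdict

def dot(w, f):
    return sum(w[name] * value for name, value in f.items())

def train_averaged_perceptron(data, passes=10):
    w = defaultdict(float)
    w_sum = defaultdict(float)
    count = 0
    for _ in range(passes):
        for f, y_star in data:
            y = +1 if dot(w, f) >= 0 else -1
            if y != y_star:
                for name, value in f.items():
                    w[name] += y_star * value
            for name in w:                     # accumulate after every example
                w_sum[name] += w[name]
            count += 1
    return {name: total / count for name, total in w_sum.items()}
```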

Improving the Perceptron

Non-Separable Case: Deterministic Decision
Even the best linear boundary makes at least one mistake

Non-Separable Case: Probabilistic Decision
[figure: the same non-separable data, with class probabilities 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, 0.1 | 0.9 labeling bands on either side of the boundary]

How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → want probability going to 1
§ If z = w · f(x) is very negative → want probability going to 0
§ Sigmoid function:
    φ(z) = 1 / (1 + e^(-z))
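
A minimal sketch of turning a perceptron score into a probability with the sigmoid defined above; exp comes from Python's math module.

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)): maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(w, f):
    """P(y = +1 | x; w) for a binary logistic model, with z = w . f(x)."""
    z = sum(w[name] * f.get(name, 0.0) for name in w)
    return sigmoid(z)

# Very positive scores give probabilities near 1; very negative near 0.
print(sigmoid(4.0), sigmoid(0.0), sigmoid(-4.0))   # ~0.982, 0.5, ~0.018
```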

A 1D Example
[figure: 1D data labeled "definitely blue", "not sure", "definitely red"; the probability increases exponentially as we move away from the boundary, divided by a normalizer]

The Soft Max

Best w?
§ Maximum likelihood estimation:
    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
  with:
    P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(-w · f(x^(i))))
    P(y^(i) = -1 | x^(i); w) = 1 - 1 / (1 + e^(-w · f(x^(i))))
  = Logistic Regression
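
A small sketch of evaluating this log-likelihood for a candidate weight vector w; it reuses the sigmoid from above, and the tiny dataset is hypothetical. (How to actually maximize ll(w) is the next lecture's topic.)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, data):
    """ll(w) = sum_i log P(y_i | x_i; w) for binary logistic regression,
    where P(+1 | x; w) = sigmoid(w . f(x)) and P(-1 | x; w) = 1 - P(+1 | x; w)."""
    total = 0.0
    for f, y in data:
        z = sum(w[name] * f.get(name, 0.0) for name in w)
        p_pos = sigmoid(z)
        total += math.log(p_pos if y == +1 else 1.0 - p_pos)
    return total

# Hypothetical weights and data, just to show the computation.
w = {"BIAS": -1.0, "free": 2.0}
data = [({"BIAS": 1, "free": 2}, +1), ({"BIAS": 1, "free": 0}, -1)]
print(log_likelihood(w, data))
```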

Separable Case: Deterministic Decision - Many Options

Separable Case: Probabilistic Decision - Clear Preference
[figure: two candidate boundaries on the same separable data, each with probability bands 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7 around it]

Multiclass Logistic Regression
§ Recall Perceptron:
§ A weight vector for each class: w_y
§ Score (activation) of a class y: w_y · f(x)
§ Prediction: highest score wins
§ How to make the scores into probabilities?
    z1, z2, z3 → original activations
    e^(z1) / (e^(z1) + e^(z2) + e^(z3)), e^(z2) / (e^(z1) + e^(z2) + e^(z3)), e^(z3) / (e^(z1) + e^(z2) + e^(z3)) → softmax activations
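
A minimal sketch of the softmax step above, turning raw activations into probabilities; subtracting the maximum activation first is a common numerical-stability trick, not something from the slide.

```python
import math

def softmax(activations):
    """Map raw scores z_1..z_k to probabilities e^(z_i) / sum_j e^(z_j)."""
    shifted = [z - max(activations) for z in activations]   # stability trick (assumption)
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, -1.0]))   # roughly [0.705, 0.259, 0.035]
```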

Best w?
§ Maximum likelihood estimation:
    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
  with:
    P(y^(i) | x^(i); w) = e^(w_{y^(i)} · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))
  = Multi-Class Logistic Regression
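
A small sketch of the multi-class version: the probability of a class is the softmax of the per-class scores w_y · f(x), and ll(w) sums the log-probabilities of the true labels. The weights and the single example below are hypothetical.

```python
import math

def class_probabilities(weights, f):
    """P(y | x; w) = e^(w_y . f(x)) / sum_y' e^(w_y' . f(x)) for every class y."""
    scores = {y: sum(w[name] * f.get(name, 0.0) for name in w)
              for y, w in weights.items()}
    total = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / total for y, s in scores.items()}

def log_likelihood(weights, data):
    return sum(math.log(class_probabilities(weights, f)[y]) for f, y in data)

# Hypothetical per-class weight vectors and a single labeled example.
weights = {"SPORTS":   {"BIAS": 0.0, "win": 1.0, "game": 2.0},
           "POLITICS": {"BIAS": 0.0, "win": 1.0, "vote": 2.0}}
example = ({"BIAS": 1, "win": 1, "game": 1}, "SPORTS")
print(class_probabilities(weights, example[0]))
print(log_likelihood(weights, [example]))
```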

Next Lecture
§ Optimization
§ i.e., how do we solve:
    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
