Transcription
CS 188: Artificial Intelligence
Perceptrons and Logistic Regression
Nathan Lambert, University of California, Berkeley
Last Times
§ Classification: given inputs x, predict labels (classes) y ∈ Y
§ Naïve Bayes
§ Parameter estimation: MLE, MAP, priors
§ Laplace smoothing
§ Training set, held-out set, test set
[Figure: Naïve Bayes model with features F1, F2, …, Fn]
Linear Classifiers
Feature Vectors
[Example spam email, misspellings intentional: "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just …"]
§ The email is mapped to feature values, e.g. # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, … and labeled SPAM
§ Similarly, a digit image is mapped to pixel features (PIXEL-7,12 : 1, PIXEL-7,13 : 0, NUM_LOOPS : 1, …) and labeled, e.g., "2"
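As a sketch of how such a feature vector might be built in code (the function name and the exact feature set here are illustrative, not from the lecture):

```python
# Hypothetical feature extractor: raw email text -> dict of feature values.
def extract_features(text):
    words = text.lower().split()
    return {
        "BIAS": 1,                            # always-on bias feature
        "free": words.count("free"),          # how often "free" appears
        # a couple of known misspellings, as a stand-in for MISSPELLED
        "misspelled": sum(w in ("printr", "cartriges") for w in words),
    }

f = extract_features("Do you want free printr cartriges? Just free")
# f["free"] is 2: the word "free" appears twice in this email
```

A classifier never sees the raw text, only this dictionary of feature values.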
Some (Simplified) Biology
§ Very loose inspiration: human neurons
Linear Classifiers
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
[Figure: features f1, f2, f3 with weights w1, w2, w3 feeding a sum Σ, then a threshold "> 0?"]
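The decision rule above can be sketched in a few lines of Python (function names are illustrative; features and weights as plain lists):

```python
# Weighted sum of feature values: the activation w . f(x).
def activation(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

# Positive activation -> output +1, negative -> output -1.
def classify(w, f):
    return 1 if activation(w, f) >= 0 else -1

z = activation([0.5, -1.0, 2.0], [1, 1, 1])   # 0.5 - 1.0 + 2.0 = 1.5
y = classify([0.5, -1.0, 2.0], [1, 1, 1])     # positive activation -> +1
```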
Weights
§ Binary case: compare features to a weight vector
§ Learning: figure out the weight vector from examples
[Figure: weight vector w (# free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3) dotted with example feature vectors such as (2, 0, 2, 0) and (0, 1, 1, 1); a positive dot product means the positive class]
Decision Rules
Binary Decision Rule
§ In the space of feature vectors:
  § Examples are points
  § Any weight vector is a hyperplane
  § One side corresponds to Y = +1 (SPAM)
  § Other corresponds to Y = -1 (HAM)
[Figure: weights BIAS : -3, free : 4, money : 2.2; in the (free, money) plane, the line w · f = 0 separates SPAM (w · f > 0) from HAM (w · f < 0)]
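Plugging in the slide's weights (BIAS : -3, free : 4, money : 2.2), the decision rule becomes this minimal sketch (dict representation is illustrative):

```python
# Weight vector from the slide; the BIAS feature is always 1.
w = {"BIAS": -3, "free": 4, "money": 2.2}

def score(w, f):
    return sum(w[k] * f.get(k, 0) for k in w)

# An email containing "free" once and "money" once:
f = {"BIAS": 1, "free": 1, "money": 1}
label = "SPAM" if score(w, f) >= 0 else "HAM"   # score = -3 + 4 + 2.2 = 3.2
```

An email with "money" but not "free" scores -3 + 2.2 = -0.8, so it lands on the HAM side of the hyperplane.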
Weight Updates
Learning: Binary Perceptron
§ Start with weights w = 0
§ For each training instance f, y*:
  § Classify with current weights
  § If correct (i.e., y = y*), no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1: w ← w + y* · f
§ Why this helps:
  Before: w · f
  After: (w + y* · f) · f = w · f + y* · (f · f)
  Since f · f ≥ 0, the update moves the score in the direction of the correct label.
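The steps above can be sketched as a small training loop (a minimal illustration, assuming features are plain lists and labels are ±1):

```python
def perceptron_train(data, n_features, epochs=10):
    """Binary perceptron: data is a list of (f, y*) pairs, y* in {+1, -1}."""
    w = [0.0] * n_features                     # start with weights 0
    for _ in range(epochs):
        for f, y_star in data:
            z = sum(wi * fi for wi, fi in zip(w, f))
            y = 1 if z >= 0 else -1            # classify with current weights
            if y != y_star:                    # if wrong: add/subtract features
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
    return w

# Tiny separable example: first feature means +1, second means -1.
w = perceptron_train([([1, 0], 1), ([0, 1], -1)], n_features=2)
```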
Examples: Perceptron
§ Separable Case
Multiclass Decision Rule
§ If we have multiple classes:
  § A weight vector for each class: w_y
  § Score (activation) of a class y: w_y · f(x)
  § Prediction: highest score wins, y = argmax_y w_y · f(x)
§ Binary = multiclass where the negative class has weight zero
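The arg-max prediction rule can be sketched as follows (the dict-of-dicts weight representation is an assumption for illustration):

```python
# weights: {class: {feature: weight}}, f: {feature: value}
def predict(weights, f):
    def score(w):
        return sum(w[k] * f.get(k, 0) for k in w)
    # highest score wins
    return max(weights, key=lambda y: score(weights[y]))

weights = {
    "SPORTS":   {"win": 1, "game": 2},
    "POLITICS": {"win": 1, "vote": 2},
}
y = predict(weights, {"win": 1, "vote": 1})   # POLITICS scores 3, SPORTS scores 1
```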
Learning: Multiclass Perceptron
§ Start with all weights = 0
§ Pick up training examples one by one
§ Predict with current weights
§ If correct, no change!
§ If wrong: lower score of wrong answer, raise score of right answer:
  w_y ← w_y - f(x)
  w_{y*} ← w_{y*} + f(x)
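The multiclass update can be sketched as (names illustrative; features as lists):

```python
def multiclass_perceptron(data, classes, n_features, epochs=10):
    """data: list of (f, y*) pairs; one all-zero weight vector per class."""
    w = {y: [0.0] * n_features for y in classes}
    for _ in range(epochs):
        for f, y_star in data:
            # predict with current weights (highest score wins)
            y = max(classes,
                    key=lambda c: sum(wi * fi for wi, fi in zip(w[c], f)))
            if y != y_star:
                # lower score of wrong answer, raise score of right answer
                w[y] = [wi - fi for wi, fi in zip(w[y], f)]
                w[y_star] = [wi + fi for wi, fi in zip(w[y_star], f)]
    return w

w = multiclass_perceptron([([1, 0], "a"), ([0, 1], "b")], ["a", "b"], 2)
```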
Example: Multiclass Perceptron
§ Training sentences and feature vectors over (BIAS, win, game, vote, the):
  "win the vote" → [1 1 0 1 1]
  "win the election" → [1 1 0 0 1]
  "win the game" → [1 1 1 0 1]
[Figure: one weight vector per class over BIAS, win, game, vote, the, all starting at 0; after each mistake, the features are added to the correct class's weights and subtracted from the wrongly predicted class's weights]
Properties of Perceptrons
§ Separability: true if some parameters get the training set perfectly correct
§ Convergence: if the training data is separable, the perceptron will eventually converge (binary case)
§ Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
[Figures: separable vs. non-separable data]
Problems with the Perceptron
§ Noise: if the data isn't separable, weights might thrash
  § Averaging weight vectors over time can help (averaged perceptron)
§ Mediocre generalization: finds a "barely" separating solution
§ Overtraining: test / held-out accuracy usually rises, then falls
  § Overtraining is a kind of overfitting
Improving the Perceptron
Non-Separable Case: Deterministic Decision
§ Even the best linear boundary makes at least one mistake
Non-Separable Case: Probabilistic Decision
[Figure: instead of a hard boundary, points get class probabilities that shift smoothly across it: 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, 0.1 | 0.9]
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → want probability going to 1
§ If z = w · f(x) is very negative → want probability going to 0
§ Sigmoid function: φ(z) = 1 / (1 + e^{-z})
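The sigmoid is one line of Python:

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)): squashes any activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(0) is exactly 0.5: on the boundary, the two classes are equally likely.
```

Very positive activations give probabilities near 1, very negative ones near 0, exactly the behavior asked for above.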
A 1D Example
[Figure: points along one feature axis, ranging from "definitely red" through "not sure" to "definitely blue"; the probability increases exponentially as we move away from the boundary, divided by a normalizer]
The Soft Max
Best w?
§ Maximum likelihood estimation:
  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
  with:
  P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{-w · f(x^(i))})
  P(y^(i) = -1 | x^(i); w) = 1 - 1 / (1 + e^{-w · f(x^(i))})
§ = Logistic Regression
Separable Case: Deterministic Decision – Many Options
Separable Case: Probabilistic Decision – Clear Preference
[Figure: two separating boundaries compared; the probabilistic view (0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7 near each boundary) clearly prefers the one that separates the data with more margin]
Multiclass Logistic Regression
§ Recall Perceptron:
  § A weight vector for each class: w_y
  § Score (activation) of a class y: w_y · f(x)
  § Prediction: highest score wins
§ How to make the scores into probabilities?
  z1, z2, z3 → e^{z1} / (e^{z1} + e^{z2} + e^{z3}), e^{z2} / (e^{z1} + e^{z2} + e^{z3}), e^{z3} / (e^{z1} + e^{z2} + e^{z3})
  original activations → softmax activations
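A sketch of the softmax (subtracting the max before exponentiating is a standard numerical-stability trick, not shown on the slide; it does not change the result):

```python
import math

def softmax(zs):
    """Turn a list of raw activations into probabilities that sum to 1."""
    m = max(zs)                              # for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Equal activations -> equal probabilities (1/3 each for three classes).
p = softmax([1.0, 1.0, 1.0])
```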
Best w?
§ Maximum likelihood estimation:
  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
  with:
  P(y^(i) | x^(i); w) = e^{w_{y^(i)} · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}
§ = Multi-Class Logistic Regression
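The class probabilities in this formula can be sketched directly (the dict representation and function name are illustrative):

```python
import math

def class_probabilities(weights, f):
    """P(y | x; w) = exp(w_y . f(x)) / sum_y' exp(w_y' . f(x)).
    weights: {class: {feature: weight}}, f: {feature: value}."""
    scores = {y: sum(w[k] * f.get(k, 0) for k in w)
              for y, w in weights.items()}
    m = max(scores.values())                 # stability shift, as in softmax
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

weights = {"a": {"x": 1.0}, "b": {"x": 0.0}}
p = class_probabilities(weights, {"x": 0.0})   # equal scores -> 0.5 / 0.5
```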
Next Lecture
§ Optimization
§ i.e., how do we solve: max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)