CS 330 - Machine Learning - Pacific Lutheran University

Transcription

CS 330 - Machine Learning - Intro and Bayesian decision theory. Instructor: Renzhi Cao, Computer Science Department, Pacific Lutheran University, Fall 2018. Special appreciation to Ian Goodfellow, Yoshua Bengio, Aaron Courville, Michael Nielsen, Andrew Ng, Katie Malone, Sebastian Thrun, Ethem Alpaydin, Christopher Bishop, Geoffrey Hinton, Tom Mitchell.

Announcement: Lab 1 is due today. Updates on groups: check the course website.

Previous class. Questions about the labs?

Bayesian decision theory - Probability Basics. Example: We have two six-sided dice. When they are rolled, the following events could occur: (A) die 1 lands on side "1", (B) die 2 lands on side "4", and (C) the two dice sum to eight. Answer the following questions:
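A small Python sketch (not from the slides) that checks probabilities like these by enumerating the 36 equally likely outcomes and counting:

    from itertools import product
    from fractions import Fraction

    # Enumerate all 36 equally likely outcomes of rolling two fair six-sided dice.
    outcomes = list(product(range(1, 7), range(1, 7)))

    def prob(event):
        """Probability of an event, given as a predicate over (die1, die2)."""
        favorable = sum(1 for o in outcomes if event(o))
        return Fraction(favorable, len(outcomes))

    A = lambda o: o[0] == 1          # die 1 lands on "1"
    B = lambda o: o[1] == 4          # die 2 lands on "4"
    C = lambda o: o[0] + o[1] == 8   # the two dice sum to eight

    print(prob(A))   # 1/6
    print(prob(B))   # 1/6
    print(prob(C))   # 5/36
    # Conditional probability P(C | A): sum is 8 given die 1 shows 1 -> impossible
    print(prob(lambda o: A(o) and C(o)) / prob(A))   # 0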

Bayesian decision theory - Probability Basics. "Probability theory is nothing but common sense reduced to calculation."

Bayesian decision theory. Example: tossing a coin. How do we describe it? With a random variable X: X is 1 when the coin lands head up, 0 when it lands tail up. What kind of distribution does X follow? A Bernoulli distribution.

Bayesian decision theory. Example: tossing a coin. Randomly sample N tosses and observe t head-ups; p is the probability of head-up. The likelihood of the sample is P(X) = (1 - p)^(N - t) * p^t. How do we calculate p? We can count the frequency: p ≈ #heads / #tosses.
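A short sketch of the frequency estimate, assuming a simulated coin whose true head probability of 0.7 is made up for illustration:

    import random

    def estimate_p(n_tosses, true_p=0.7, seed=0):
        """Estimate the head-up probability by counting frequencies."""
        rng = random.Random(seed)
        heads = sum(1 for _ in range(n_tosses) if rng.random() < true_p)
        return heads / n_tosses   # p ~ #heads / #tosses

    def likelihood(p, n, t):
        """Likelihood of observing t heads in n tosses for a given p."""
        return (1 - p) ** (n - t) * p ** t

    print(estimate_p(10))      # noisy with few tosses
    print(estimate_p(10000))   # close to 0.7 with many tosses
    print(likelihood(0.7, 10, 7) > likelihood(0.5, 10, 7))  # True: p = 0.7 explains 7 heads better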

Bayesian decision theory. Example: tossing a coin. We can make predictions based on p: guess head-up if p > 0.5, guess tail-up if p < 0.5, and guess randomly if p = 0.5.

Bayesian decision theory. P(x = head) = 0.5, P(x = tail) = 0.5: not cheating. P(x = head) = 0.3, P(x = tail) = 0.7: cheating.

Bayesian decision theory. x describes the coin type, y describes whether a toss is head or tail. Given a coin x, we can calculate the probability of a toss, P(y | x). Not very interesting. We often find ourselves in the opposite situation: we know P(y | x), but we need to know P(x | y). For example, we observe Head, Head, Tail, Head, Head, Head, Tail. Cheating? Not cheating?
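One way to make this concrete (not from the slides): assume the two coins from the previous slide (fair with P(head) = 0.5, cheating with P(head) = 0.3) and an assumed 50/50 prior, then apply Bayes' rule to the observed sequence:

    # Which coin (x) best explains the observed tosses (y)?
    sequence = ["H", "H", "T", "H", "H", "H", "T"]

    def seq_likelihood(p_head, seq):
        """P(sequence | coin) under independent tosses."""
        out = 1.0
        for toss in seq:
            out *= p_head if toss == "H" else (1 - p_head)
        return out

    prior_cheat = 0.5                        # assumed prior, not given in the slides
    like_fair = seq_likelihood(0.5, sequence)
    like_cheat = seq_likelihood(0.3, sequence)

    # Bayes' rule: P(cheating | sequence) = P(seq | cheat) P(cheat) / P(seq)
    evidence = like_cheat * prior_cheat + like_fair * (1 - prior_cheat)
    posterior_cheat = like_cheat * prior_cheat / evidence
    print(round(posterior_cheat, 3))   # ~0.13, so "not cheating" is more likely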

Bayesian decision theory

Bayesian decision theory. Bayes' rule: P(x | y) = P(y | x) P(x) / P(y). Here P(x | y) is the posterior probability, P(y | x) is the likelihood, P(x) is the prior probability, and P(y) is the evidence.

Bayesian decision theory. Given two events x and y, suppose that P(y) > 0. Then P(x | y) = P(y | x) P(x) / P(y).
Conditional probability table:
              Cancer (1%)   Not Cancer (99%)
  Positive    0.9           0.09
  Negative    0.1           0.91
1% of people have cancer. If the patient has cancer, there is a 90% probability that the test is positive; if not, there is a 9% probability that the test is positive.

Bayesian decision theory. x describes whether the patient has cancer, y describes the test result. P(x = 1 | y = 1): given that the patient's test result is positive, what is the probability that the patient has cancer?

Bayesian decision theory. Using the conditional probability table above:
P(x = 1, y = 1) = P(x = 1) * P(y = 1 | x = 1) = 0.01 * 0.9 = 0.009
P(x = 1, y = 0) = P(x = 1) * P(y = 0 | x = 1) = 0.01 * 0.1 = 0.001
P(x = 0, y = 1) = P(x = 0) * P(y = 1 | x = 0) = 0.99 * 0.09 = 0.0891
P(x = 0, y = 0) = P(x = 0) * P(y = 0 | x = 0) = 0.99 * 0.91 = 0.9009

Bayesian decision theory. New table (joint probabilities):
              Cancer (1%)          Not Cancer (99%)
  Positive    True Pos = 0.009     False Pos = 0.0891
  Negative    False Neg = 0.001    True Neg = 0.9009
P(x = 1 | y = 1) = 0.009 / (0.009 + 0.0891) = 0.0917. Only around 9%.
Intuition: suppose we have 100 people. Only 1 has cancer, and that person will mostly get a positive test (90%). Of the remaining 99 people, 99 * 0.09 ≈ 8.9 get a positive test. So around 10 people get a positive test, and only one of them has cancer. That is about 10%, close to the 0.0917 we calculated.

Bayesian decision theory. Directly using Bayes' rule with the conditional probability table above: P(x = 1 | y = 1) = P(y = 1 | x = 1) P(x = 1) / P(y = 1) = (0.9 * 0.01) / (0.9 * 0.01 + 0.09 * 0.99) ≈ 0.0917.
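A short Python sketch of the same computation, using the numbers from the slides:

    # Cancer-test posterior, via the joint table and directly with Bayes' rule.
    p_cancer = 0.01
    p_pos_given_cancer = 0.90
    p_pos_given_healthy = 0.09

    # Joint probabilities (the "new table")
    true_pos = p_cancer * p_pos_given_cancer            # 0.009
    false_pos = (1 - p_cancer) * p_pos_given_healthy    # 0.0891

    # P(cancer | positive) = P(pos | cancer) P(cancer) / P(pos)
    p_pos = true_pos + false_pos
    posterior = true_pos / p_pos
    print(round(posterior, 4))   # 0.0917 -- only about 9%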

Previous class. Joint probability table:
              Cancer                 Not Cancer
  Positive    True Pos = 0.009       False Pos = 0.0891
  Negative    False Neg = 0.001      True Neg = 0.9009
True positive: correctly identified. False positive: incorrectly identified. True negative: correctly rejected. False negative: incorrectly rejected.

Bayesian decision theory - Practice.
R: it's raining outside. T: the tree is wet.
P(R) = 0.9
P(T | R) table:
          R      !R
  T       0.7    0.2
  !T      0.3    0.8
What is P(R | T)?

Bayesian decision theory.
R: it's raining outside. T: the tree is wet.
P(R) = 0.9
P(T | R) table:
          R      !R
  T       0.7    0.2
  !T      0.3    0.8
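A quick numeric check for the practice problem, assuming a direct application of Bayes' rule (one possible solution path, not necessarily the one shown in class):

    p_rain = 0.9
    p_wet_given_rain = 0.7
    p_wet_given_dry = 0.2

    p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)   # P(T)
    p_rain_given_wet = p_wet_given_rain * p_rain / p_wet                  # Bayes' rule
    print(round(p_rain_given_wet, 3))   # ~0.969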

Bayesian decision theory. Understand the examples. Don't get confused between the joint probability P(A, B) and the conditional probability P(A | B).

Bayesian decision theory. Bayes' Rule: More Complicated. Suppose that the events B1, B2, ..., Bk are mutually exclusive and together cover the whole sample space, with P(Bi) > 0 and P(A) > 0. Then P(Bi | A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + ... + P(A | Bk) P(Bk)].

Bayesian decision theory. Bayes' Rule: More Complicated. Which Bi should we choose? The Maximum A Posteriori (MAP) classification rule: choose Bt where t = argmax_i P(Bi | A).

Bayesian decision theoryNaive Bayes Classifier

Bayesian decision theoryNaive Bayes Classifier

Bayesian decision theoryNaive Bayes algorithm
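In its standard form, the Naive Bayes classifier assumes the features x1, ..., xn are conditionally independent given the class C, so P(C | x1, ..., xn) ∝ P(C) * P(x1 | C) * ... * P(xn | C); the MAP rule then chooses the class C that maximizes this product.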

Bayesian decision theory Example: Play Tennis

Bayesian decision theory. Learning Phase: estimate the conditional probability tables from the 14 training examples.
Outlook:        P(. | Play=Yes)   P(. | Play=No)
  Sunny         2/9               3/5
  Overcast      4/9               0/5
  Rain          3/9               2/5
Temperature:
  Hot           2/9               2/5
  Mild          4/9               2/5
  Cool          3/9               1/5
Humidity:
  High          3/9               4/5
  Normal        6/9               1/5
Wind:
  Strong        3/9               3/5
  Weak          6/9               2/5
P(Play = Yes) = 9/14, P(Play = No) = 5/14

Bayesian decision theory. Prediction Phase: given a new instance, predict its label.
x' = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
Look up the tables obtained in the learning phase:
P(Outlook = Sunny | Play = Yes) = 2/9, P(Outlook = Sunny | Play = No) = 3/5
P(Temperature = Cool | Play = Yes) = 3/9, P(Temperature = Cool | Play = No) = 1/5
P(Humidity = High | Play = Yes) = 3/9, P(Humidity = High | Play = No) = 4/5
P(Wind = Strong | Play = Yes) = 3/9, P(Wind = Strong | Play = No) = 3/5
P(Play = Yes) = 9/14, P(Play = No) = 5/14
Decision making with the MAP rule:
P(Yes | x') ∝ [P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes)] P(Play = Yes) ≈ 0.0053
P(No | x') ∝ [P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No)] P(Play = No) ≈ 0.0206
Since P(Yes | x') < P(No | x'), we label x' as "No".
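A minimal Python sketch that reproduces the two MAP scores above from the looked-up table values:

    from math import prod

    p_x_given_yes = [2/9, 3/9, 3/9, 3/9]   # Sunny, Cool, High, Strong | Play = Yes
    p_x_given_no = [3/5, 1/5, 4/5, 3/5]    # Sunny, Cool, High, Strong | Play = No
    p_yes, p_no = 9/14, 5/14

    score_yes = prod(p_x_given_yes) * p_yes   # ~0.0053
    score_no = prod(p_x_given_no) * p_no      # ~0.0206
    print(round(score_yes, 4), round(score_no, 4))
    print("No" if score_no > score_yes else "Yes")   # MAP decision -> "No"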

Bayesian decision theory. Naive Bayes algorithm for continuous features. Algorithm: continuous-valued features.
– A continuous-valued feature can take on an unlimited number of values, so we cannot build a lookup table of counts.
– The conditional probability is often modeled with the normal (Gaussian) distribution.
– Learning Phase: output n * L normal distributions, where n is the number of features and L is the number of classes.
– Test Phase: given an unknown instance x', instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase, then apply the MAP rule to assign a label (the same as in the discrete case).

Bayesian decision theory. Naive Bayes algorithm for continuous features. Example: continuous-valued features.
– Temperature is naturally a continuous value.
  Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
  No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class.
– Learning Phase: output two Gaussian models for P(temp | C).
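A minimal sketch of this learning phase: fit one Gaussian per class to the temperature values above, then evaluate P(temp | C) for a hypothetical new reading (using the population standard deviation is an assumption here):

    from statistics import mean, pstdev
    from math import exp, pi, sqrt

    temps = {
        "Yes": [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8],
        "No":  [27.3, 30.1, 17.4, 29.5, 15.1],
    }

    def gaussian_pdf(x, mu, sigma):
        """Density of a normal distribution with mean mu and std dev sigma."""
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # Learning phase: one (mean, std dev) pair per class
    models = {c: (mean(v), pstdev(v)) for c, v in temps.items()}

    # Test phase: conditional probability density of a new temperature per class
    new_temp = 22.0
    for c, (mu, sigma) in models.items():
        print(c, round(gaussian_pdf(new_temp, mu, sigma), 4))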

Bayesian decision theory. Issue: zero conditional probabilities. If a feature value never occurs with a class in the training data, its estimated conditional probability is 0, and the whole product for that class becomes 0. For example: P(Outlook = Overcast | Play = No) = 0 in the play-tennis dataset.

Bayesian decision theoryIssues

Bayesian decision theory. Example: P(Outlook = Overcast | Play = No) = 0 in the play-tennis dataset.
– Remedy: add m "virtual" examples (m: up to 1% of the number of training examples). In this dataset, the number of training examples for the "No" class is 5, so we can only add m = 1 "virtual" example in our m-estimate remedy.
– The "Outlook" feature can take only 3 values, so p = 1/3.
– Re-estimate P(Outlook = Overcast | Play = No) with the m-estimate, as sketched below.
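A sketch of the re-estimate, assuming the standard m-estimate formula P = (n_c + m * p) / (n + m):

    def m_estimate(n_c, n, m, p):
        """Smoothed estimate of P(feature value | class) = (n_c + m*p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # n = 5 "No" examples, n_c = 0 of them have Outlook = Overcast,
    # m = 1 virtual example, prior p = 1/3 (three possible Outlook values).
    print(round(m_estimate(0, 5, 1, 1/3), 4))   # ~0.0556 instead of 0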

Bayesian decision theory - Applications. Example: Bayesian spam filtering. One clever application of Bayes' theorem is in spam filtering. We have: Event A: the message is spam. Test X: the message contains certain words (X).
1. Given the "test result" (the presence of certain words), predict spam using Bayes' theorem.
2. Spam filtering based on a blacklist is flawed; Bayes gives a probability instead of a simple yes-or-no decision.
3. As the filter gets trained with more and more messages, it updates the probabilities that certain words lead to spam messages.
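A toy sketch of the idea (the word probabilities and the 0.4 spam prior are made up for illustration, and word occurrences are treated as independent, Naive Bayes style):

    p_spam = 0.4                                  # assumed prior fraction of mail that is spam
    p_word_given_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.01}
    p_word_given_ham = {"free": 0.03, "winner": 0.005, "meeting": 0.10}

    def p_spam_given_words(words):
        """Posterior probability of spam given the words observed in a message."""
        like_spam, like_ham = p_spam, 1 - p_spam
        for w in words:
            like_spam *= p_word_given_spam.get(w, 1.0)
            like_ham *= p_word_given_ham.get(w, 1.0)
        return like_spam / (like_spam + like_ham)

    print(round(p_spam_given_words(["free", "winner"]), 3))   # high -> likely spam
    print(round(p_spam_given_words(["meeting"]), 3))          # low  -> likely not spam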

Bayesian decision theory - Practice. Hand-out, due next Tuesday.
