CSC 411: Lecture 06: Decision Trees

Transcription

CSC 411: Lecture 06: Decision Trees
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto

Today
Decision Trees:
- entropy
- information gain

Another Classification Idea
We learned about linear classification (e.g., logistic regression) and nearest neighbors. Any other idea?
- Pick an attribute, do a simple test
- Conditioned on that choice, pick another attribute, do another test
- In the leaves, assign a class with majority vote
- Do the same for the other branches

Another Classification Idea
Gives axis-aligned decision boundaries

Decision Tree: Example
[Figure: an example decision tree; internal nodes branch on Yes/No tests and leaves give the class]

Decision Tree: Classification
[Figure]

Example with Discrete Inputs
What if the attributes are discrete?
Attributes: [table of attributes for the restaurant-waiting example]

Decision Tree: Example with Discrete Inputs
The tree to decide whether to wait (T) or not (F)
[Figure: decision tree for the restaurant example]

Decision Trees
[Figure: example tree with Yes/No branches]
- Internal nodes test attributes
- Branching is determined by attribute value
- Leaf nodes are outputs (class assignments)

Decision Tree: Algorithm
- Choose an attribute on which to descend at each level
- Condition on earlier (higher) choices
- Generally, restrict only one dimension at a time
- Declare an output value when you get to the bottom
In the orange/lemon example, we only split each dimension once, but that is not required

Decision Tree: Classification and Regression
Each path from root to a leaf defines a region R_m of input space
Let \{(x^{(m_1)}, t^{(m_1)}), \ldots, (x^{(m_k)}, t^{(m_k)})\} be the training examples that fall into R_m
Classification tree:
- discrete output
- leaf value y^m typically set to the most common value in \{t^{(m_1)}, \ldots, t^{(m_k)}\}
Regression tree:
- continuous output
- leaf value y^m typically set to the mean value in \{t^{(m_1)}, \ldots, t^{(m_k)}\}
Note: We will only talk about classification
[Slide credit: S. Russell]
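
A minimal Python sketch of the two leaf rules above; the helper name leaf_value and the toy inputs are illustrative, not from the slides:

    from collections import Counter
    import numpy as np

    def leaf_value(targets, task="classification"):
        """Leaf value for the targets t^(m_1), ..., t^(m_k) that fall into a region R_m."""
        if task == "classification":
            return Counter(targets).most_common(1)[0][0]   # most common class
        return float(np.mean(targets))                     # regression: mean target

    print(leaf_value(["lemon", "orange", "orange"]))           # orange
    print(leaf_value([1.0, 2.0, 4.5], task="regression"))      # 2.5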

Expressiveness
Discrete-input, discrete-output case:
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf
Continuous-input, continuous-output case:
- Can approximate any function arbitrarily closely
Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
Need some kind of regularization to ensure more compact decision trees
[Slide credit: S. Russell]

How do we Learn a Decision Tree?
How do we construct a useful decision tree?

Learning Decision Trees
Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check: Hyafil & Rivest '76]
Resort to a greedy heuristic:
- Start from an empty decision tree
- Split on the next best attribute
- Recurse
What is the best attribute? We use information theory to guide us
[Slide credit: D. Sontag]

Choosing a Good Attribute
Which attribute is better to split on, X_1 or X_2?
Idea: Use counts at leaves to define probability distributions, so we can measure uncertainty

Choosing a Good Attribute
Which attribute is better to split on, X_1 or X_2?
- Deterministic: good (all are true or false; just one class in the leaf)
- Uniform distribution: bad (all classes in the leaf equally probable)
- What about distributions in between?
Note: Let's take a slight detour and remember concepts from information theory
[Slide credit: D. Sontag]

We Flip Two Different Coins
Sequence 1: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?
Sequence 2: 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?
[Figure: count histograms. Sequence 1 has 16 zeros versus 2 ones; sequence 2 has 8 zeros versus 10 ones]

Quantifying Uncertainty
Entropy H:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
For the two coins above (empirical distributions (8/9, 1/9) and (4/9, 5/9)):
-\frac{8}{9}\log_2\frac{8}{9} - \frac{1}{9}\log_2\frac{1}{9} \approx 0.50
-\frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9} \approx 0.99
How surprised are we by a new value in the sequence? How much information does it convey?
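
A small Python sketch of the entropy formula, reproducing the two coin values (numpy and the helper name entropy are our choices, not from the slides):

    import numpy as np

    def entropy(p):
        """Entropy in bits of a discrete distribution given as a sequence of probabilities."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                       # convention: 0 * log2(0) = 0
        return float(-np.sum(p * np.log2(p)))

    print(entropy([8/9, 1/9]))   # ~0.50 bits (coin 1)
    print(entropy([4/9, 5/9]))   # ~0.99 bits (coin 2)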

Quantifying Uncertainty
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
[Figure: entropy of a coin flip as a function of the probability p of heads; it is 0 at p = 0 and p = 1 and peaks at 1 bit at p = 0.5]

Entropy
"High Entropy":
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable
"Low Entropy":
- Distribution of variable has many peaks and valleys
- Histogram has many lows and highs
- Values sampled from it are more predictable
This slide seems wrong: a low-entropy distribution is better described as concentrated on a few values (a peaked histogram) than as having "many peaks and valleys"
[Slide credit: Vibhav Gogate]

Entropy of a Joint Distribution
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not cloudy
Raining       24/100    1/100
Not raining   25/100    50/100

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
        = -\frac{24}{100}\log_2\frac{24}{100} - \frac{1}{100}\log_2\frac{1}{100} - \frac{25}{100}\log_2\frac{25}{100} - \frac{50}{100}\log_2\frac{50}{100}
        \approx 1.56 bits
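
The joint entropy can be checked numerically; this sketch reuses the entropy helper from above, and the array name joint is ours:

    # Joint distribution p(x, y): rows = raining / not raining, columns = cloudy / not cloudy
    joint = np.array([[24.0,  1.0],
                      [25.0, 50.0]]) / 100.0

    print(entropy(joint.flatten()))   # ~1.56 bits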

Specific Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy} (same table as above)
What is the entropy of cloudiness Y, given that it is raining?
H(Y | X = x) = -\sum_{y \in Y} p(y | x) \log_2 p(y | x)
             = -\frac{24}{25}\log_2\frac{24}{25} - \frac{1}{25}\log_2\frac{1}{25}
             \approx 0.24 bits
We used: p(y | x) = \frac{p(x, y)}{p(x)}, and p(x) = \sum_y p(x, y) (sum in a row)
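
Continuing the sketch, H(Y | X = raining) is just the entropy of the normalized "raining" row of the joint table:

    raining = joint[0]                        # [0.24, 0.01]
    print(entropy(raining / raining.sum()))   # ~0.24 bits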

Conditional Entropy
(same table as above)
The expected conditional entropy:
H(Y | X) = \sum_{x \in X} p(x) H(Y | X = x)
         = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y | x)

Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy} (same table as above)
What is the entropy of cloudiness, given the knowledge of whether or not it is raining?
H(Y | X) = \sum_{x \in X} p(x) H(Y | X = x)
         = \frac{1}{4} H(cloudy | is raining) + \frac{3}{4} H(cloudy | not raining)
         \approx 0.75 bits
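
A sketch of the expected conditional entropy, again reusing entropy() and the joint array from above (the helper name conditional_entropy is ours):

    def conditional_entropy(joint):
        """H(Y | X) = sum_x p(x) H(Y | X = x), with x indexing the rows of `joint`."""
        p_x = joint.sum(axis=1)     # p(raining) = 1/4, p(not raining) = 3/4
        return float(sum(p * entropy(row / row.sum()) for p, row in zip(p_x, joint)))

    print(conditional_entropy(joint))    # 1/4 * 0.24 + 3/4 * 0.92 ~ 0.75 bits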

Conditional Entropy
Some useful properties:
- H is always non-negative
- Chain rule: H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X)
- If X and Y are independent, then X doesn't tell us anything about Y: H(Y | X) = H(Y)
- But Y tells us everything about Y: H(Y | Y) = 0
- By knowing X, we can only decrease uncertainty about Y: H(Y | X) \le H(Y)
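
The chain rule and the "conditioning never increases entropy" property can be verified numerically on the rain/cloud table, assuming the helpers sketched above:

    H_xy = entropy(joint.flatten())           # H(X, Y) ~ 1.56 bits
    H_x  = entropy(joint.sum(axis=1))         # H(X), marginal over rain
    H_y  = entropy(joint.sum(axis=0))         # H(Y), marginal over cloudiness
    H_y_given_x = conditional_entropy(joint)  # H(Y | X) ~ 0.75 bits

    print(np.isclose(H_xy, H_y_given_x + H_x))   # chain rule: H(X, Y) = H(Y | X) + H(X)
    print(H_y_given_x <= H_y)                    # knowing X can only decrease uncertainty about Y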

Information Gain
(same table as above)
How much information about cloudiness do we get by discovering whether it is raining?
IG(Y | X) = H(Y) - H(Y | X) \approx 0.25 bits
Also called information gain in Y due to X
If X is completely uninformative about Y: IG(Y | X) = 0
If X is completely informative about Y: IG(Y | X) = H(Y)
How can we use this to construct our decision tree?
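
Information gain then falls out of the two helpers above:

    p_y = joint.sum(axis=0)                           # marginal distribution of cloudiness
    ig = entropy(p_y) - conditional_entropy(joint)    # IG(Y | X) = H(Y) - H(Y | X)
    print(ig)                                         # ~0.25 bits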

Constructing Decision Trees
[Figure: the fruit decision tree with Yes/No branches]
I made the fruit data partitioning just by eyeballing it.
We can use the information gain to automate the process.
At each level, one must choose:
1. Which variable to split.
2. Possibly where to split it.
Choose them based on how much information we would gain from the decision! (choose the attribute that gives the highest gain)

Decision Tree Construction Algorithm
Simple, greedy, recursive approach; builds up the tree node by node (a code sketch follows below):
1. Pick an attribute to split at a non-terminal node
2. Split examples into groups based on attribute value
3. For each group:
   - if no examples: return majority from parent
   - else if all examples are in the same class: return class
   - else loop to step 1
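
A minimal sketch of the greedy recursive procedure for discrete attributes, assuming examples are dicts mapping attribute names to values and reusing the entropy helper from above (the names build_tree, best_attribute and the node representation are ours, not the slides'):

    from collections import Counter

    def best_attribute(examples, attributes, target):
        """Attribute with the highest information gain on this subset."""
        def H(rows):
            counts = Counter(r[target] for r in rows)
            n = sum(counts.values())
            return entropy([c / n for c in counts.values()])
        def gain(a):
            groups = {}
            for r in examples:
                groups.setdefault(r[a], []).append(r)
            remainder = sum(len(g) / len(examples) * H(g) for g in groups.values())
            return H(examples) - remainder
        return max(attributes, key=gain)

    def build_tree(examples, attributes, target, parent_majority=None):
        if not examples:                                   # no examples: majority from parent
            return parent_majority
        classes = [r[target] for r in examples]
        majority = Counter(classes).most_common(1)[0][0]
        if len(set(classes)) == 1 or not attributes:       # pure leaf, or nothing left to split on
            return majority
        a = best_attribute(examples, attributes, target)   # step 1: pick an attribute
        children = {}
        for value in set(r[a] for r in examples):          # step 2: split on attribute value
            subset = [r for r in examples if r[a] == value]
            rest = [b for b in attributes if b != a]
            children[value] = build_tree(subset, rest, target, majority)  # step 3: recurse
        return (a, children)                               # internal node: attribute + branches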

Back to Our Example
Attributes: [table of attributes for the restaurant example, from: Russell & Norvig]

Attribute Selection
IG(Y | X) = H(Y) - H(Y | X)
IG(type) = 1 - \left[ \frac{2}{12} H(Y | Fr.) + \frac{2}{12} H(Y | It.) + \frac{4}{12} H(Y | Thai) + \frac{4}{12} H(Y | Bur.) \right] = 0
IG(Patrons) = 1 - \left[ \frac{2}{12} H(0, 1) + \frac{4}{12} H(1, 0) + \frac{6}{12} H\left(\frac{2}{6}, \frac{4}{6}\right) \right] \approx 0.541
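
A hypothetical numeric check of these numbers with the entropy helper from above; the per-branch class splits (an even T/F split for every restaurant type, and 2 of 6 waiting in the "Full" Patrons branch) are read off the H(·) terms on the slide:

    ig_type = 1 - (2/12 * entropy([1/2, 1/2]) + 2/12 * entropy([1/2, 1/2])
                   + 4/12 * entropy([1/2, 1/2]) + 4/12 * entropy([1/2, 1/2]))
    ig_patrons = 1 - (2/12 * entropy([0, 1]) + 4/12 * entropy([1, 0]) + 6/12 * entropy([2/6, 4/6]))

    print(ig_type)      # 0.0
    print(ig_patrons)   # ~0.541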

Which Tree is Better?
[Figure: two candidate trees for the restaurant data]

What Makes a Good Tree?
Not too small: need to handle important but possibly subtle distinctions in data
Not too big:
- Computational efficiency (avoid redundant, spurious attributes)
- Avoid over-fitting training examples
Occam's Razor: find the simplest hypothesis (smallest tree) that fits the observations
Inductive bias: small trees with informative nodes near the root

Decision Tree Miscellany
Problems:
- You have exponentially less data at lower levels
- Too big of a tree can overfit the data
- Greedy algorithms don't necessarily yield the global optimum
In practice, one often regularizes the construction process to try to get small but highly informative trees
Decision trees can also be used for regression on real-valued outputs, but it requires a different formalism

Comparison to k-NN
k-Nearest Neighbors:
- Decision boundaries: piece-wise linear
- Test complexity: non-parametric, few parameters besides the (all?) training examples
Decision Trees:
- Decision boundaries: axis-aligned, tree-structured
- Test complexity: attributes and splits

Applications of Decision Trees: XBox!
Decision trees are in XBox
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. CVPR'11]

Applications of Decision Trees: XBox!
Decision trees are in XBox: classifying body parts

Applications of Decision Trees: XBox!
Trained on million(s) of examples

Applications of Decision Trees: XBox!
Trained on million(s) of examples
Results: [Figure]

Applications of Decision Trees
Can express any Boolean function, but most useful when the function depends critically on a few attributes
Bad on: parity, majority functions; also not well-suited to continuous attributes
Practical Applications:
- Flight simulator: 20 state variables; 90K examples based on an expert pilot's actions; auto-pilot tree
- Yahoo Ranking Challenge
- Random Forests: Microsoft Kinect pose estimation