Applied Analytics & Predictive Modeling

Transcription

Applied Analytics and Predictive Modeling, Spring 2021, Lecture 9. Lydia Manikonda (manikl@rpi.edu). Some of the slides adapted from Introduction to Data Mining, Tan et al., 2nd edition.

Today's agenda
– Decision trees
– Class exercises on building a decision tree manually

Decision Trees

Example of a Decision Tree
[Slide shows the loan training data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) next to the induced model.] Splitting attributes: the root tests Home Owner (Yes → NO); the No branch tests Marital Status (Married → NO; Single, Divorced → test Annual Income: < 80K → NO, >= 80K → YES).

Another Example of a Decision Tree
[Slide shows a second tree fit to the same training data.] There could be more than one tree that fits the same data!

Apply Model to Test Data
Start from the root of the tree. Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
Follow the Home Owner = No branch, then the Marital Status = Married branch, which reaches a leaf labeled NO. Assign Defaulted Borrower = "No".

Decision Tree Classification Task
[Slide shows the general framework: a learning algorithm performs induction on a training set to learn a model (the decision tree); the model is then applied to a test set (deduction) to predict class labels.]

Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t. General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
[Slide shows the loan training data and a node labeled Dt.]
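Below is a minimal Python sketch of this procedure, assuming records are dicts with categorical attribute values plus a 'class' key. The attribute chosen at each split is simply the next unused one here, a placeholder for the impurity-based selection discussed later in the lecture; all names are illustrative.

    from collections import Counter

    def hunts(records, attributes):
        """Grow a tree with a simplified Hunt's procedure.
        records    -- list of dicts with attribute values and a 'class' label
        attributes -- attribute names still available for splitting
        Returns a class label (leaf) or {'attr': name, 'children': {value: subtree}}.
        """
        labels = [r['class'] for r in records]
        if len(set(labels)) == 1:            # all records in D_t share one class -> leaf
            return labels[0]
        if not attributes:                   # nothing left to split on -> majority class
            return Counter(labels).most_common(1)[0][0]
        attr = attributes[0]                 # placeholder choice; real algorithms pick by impurity
        children = {}
        for value in set(r[attr] for r in records):
            subset = [r for r in records if r[attr] == value]
            children[value] = hunts(subset, [a for a in attributes if a != attr])
        return {'attr': attr, 'children': children}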

Hunt's Algorithm (applied to the loan data)
[Slide sequence (a)–(d) shows the tree growing step by step:]
(a) a single leaf predicting Defaulted = No (the majority class);
(b) split on Home Owner: Yes → Defaulted = No; No → a still-impure node;
(c) the Home Owner = No branch is split on Marital Status: Married → Defaulted = No; Single, Divorced → a still-impure node;
(d) the Single/Divorced branch is split on Annual Income: < 80K → Defaulted = No; >= 80K → Defaulted = Yes.

Design Issues of Decision Tree Induction
How should training records be split?
– Method for specifying the test condition, depending on attribute types
– Measure for evaluating the goodness of a test condition
How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination

Methods for Expressing Test Conditions
Depends on attribute types:
– Binary
– Nominal
– Ordinal
– Continuous
Depends on the number of ways to split:
– 2-way split
– Multi-way split

Test Condition for Nominal Attributes
– Multi-way split: use as many partitions as distinct values (e.g., Marital Status → Single / Married / Divorced).
– Binary split: divides values into two subsets (e.g., {Married} vs. {Single, Divorced}, or {Single, Married} vs. {Divorced}).

Test Condition for Ordinal Attributes
– Multi-way split: use as many partitions as distinct values (e.g., Shirt Size → Small / Medium / Large / Extra Large).
– Binary split: divides values into two subsets while preserving the order property among attribute values (e.g., {Small, Medium} vs. {Large, Extra Large}, or {Small} vs. {Medium, Large, Extra Large}).
– A grouping such as {Small, Large} vs. {Medium, Extra Large} violates the order property.

Test Condition for Continuous Attributes
(i) Binary split: a threshold test such as Annual Income > 80K? (Yes / No).
(ii) Multi-way split: discretize into ranges, e.g., < 10K, [10K, 25K), [25K, 50K), [50K, 80K), >= 80K.

Splitting Based on Continuous Attributes
Different ways of handling:
– Discretization to form an ordinal categorical attribute. Ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering. Static: discretize once at the beginning. Dynamic: repeat at each node.
– Binary decision: (A < v) or (A >= v). Consider all possible splits and find the best cut; this can be more compute intensive (a small sketch follows below).
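A small illustrative sketch of the binary-decision approach: sort the attribute values, treat midpoints between consecutive distinct values as candidate cuts v, and keep the cut with the lowest weighted impurity of the two children. The Gini index (defined on a later slide) is used as the impurity measure here; the function names are made up for the example, and the sample values are illustrative rather than the exact slide table.

    def gini(labels):
        """Gini index of a list of class labels: 1 - sum_j p_j^2."""
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_cut(values, labels):
        """Return the threshold v minimizing the weighted Gini of (A < v) vs (A >= v)."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_v, best_m = None, float('inf')
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                                  # equal values give no valid cut point
            v = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate cut
            left = [lab for x, lab in pairs if x < v]
            right = [lab for x, lab in pairs if x >= v]
            m = len(left) / n * gini(left) + len(right) / n * gini(right)
            if m < best_m:
                best_v, best_m = v, m
        return best_v, best_m

    # illustrative annual incomes (in K) and default labels, not the exact slide table
    print(best_cut([125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
                   ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']))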

How to determine the best split
Before splitting: 10 records of class C0 and 10 records of class C1.
[Slide compares candidate test conditions, e.g., one yielding children (C0: 6, C1: 4) and (C0: 4, C1: 6); one yielding (C0: 1, C1: 3), (C0: 8, C1: 0), (C0: 1, C1: 7); and one yielding many nearly pure single-record children c1, ..., c20.]
Which test condition is the best?

How to determine the best split
Greedy approach: nodes with a purer class distribution are preferred.
Need a measure of node impurity: a node with (C0: 5, C1: 5) has a high degree of impurity; a node with (C0: 9, C1: 1) has a low degree of impurity.

Measures of Node Impurity
– Gini index: GINI(t) = 1 − Σ_j [p(j|t)]²
– Entropy: Entropy(t) = −Σ_j p(j|t) log p(j|t)
– Misclassification error: Error(t) = 1 − max_i P(i|t)
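The same three measures, written as small Python functions over the per-class record counts at a node (a sketch; `counts` is assumed to be a list such as [5, 5]).

    import math

    def gini(counts):
        """GINI(t) = 1 - sum_j p(j|t)^2"""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 log 0 taken as 0."""
        n = sum(counts)
        return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

    def classification_error(counts):
        """Error(t) = 1 - max_i P(i|t)"""
        n = sum(counts)
        return 1.0 - max(counts) / n

    print(gini([5, 5]), entropy([5, 5]), classification_error([5, 5]))  # 0.5 1.0 0.5 (impure)
    print(gini([9, 1]), entropy([9, 1]), classification_error([9, 1]))  # lower impurity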

Finding the best split
1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting: compute the impurity of each child node; M is the weighted impurity of the children.
3. Choose the attribute test condition that produces the highest gain, Gain = P − M, or equivalently, the lowest impurity measure after splitting (M).

Measure of Impurity: Entropy
Entropy at a given node t: Entropy(t) = −Σ_j p(j|t) log p(j|t)
(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Maximum (log n_c) when records are equally distributed among all classes, implying least information.
– Minimum (0.0) when all records belong to one class, implying most information.
– Entropy-based computations are quite similar to the GINI index computations.

Computing Entropy of a Single Node
Entropy(t) = −Σ_j p(j|t) log₂ p(j|t)
– C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1. Entropy = −0 log 0 − 1 log 1 = 0.
– C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6. Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65.
– C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6. Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92.
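These three values can be checked numerically with a few lines of Python (base-2 logarithm, with 0 log 0 treated as 0):

    import math

    def entropy(counts):
        n = sum(counts)
        return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

    print(round(entropy([0, 6]), 2))  # 0.0  (all records in one class)
    print(round(entropy([1, 5]), 2))  # 0.65
    print(round(entropy([2, 4]), 2))  # 0.92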

Computing Information Gain after Splitting
Information gain: GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)
Parent node p is split into k partitions; n_i is the number of records in partition i.
Choose the split that achieves the most reduction (maximizes GAIN).
Used in the ID3 and C4.5 decision tree algorithms.
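A sketch of GAIN_split as a function of the parent's class counts and a list of per-child class counts:

    import math

    def entropy(counts):
        n = sum(counts)
        return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

    def gain_split(parent_counts, children_counts):
        """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
        n = sum(parent_counts)
        weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
        return entropy(parent_counts) - weighted

    # a parent with 10 records of each class, split into two still-mixed children
    print(gain_split([10, 10], [[6, 4], [4, 6]]))   # about 0.03, a weak split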

Class exercise
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Example from Han & Kamber, Data Mining: Concepts and Techniques.

Attribute Selection by Information Gain Computation
Class P: buys_computer = "yes" (9 records); Class N: buys_computer = "no" (5 records).
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
age <=30:   p_i = 2, n_i = 3, I(p_i, n_i) = 0.971
age 31..40: p_i = 4, n_i = 0, I(p_i, n_i) = 0
age >40:    p_i = 3, n_i = 2, I(p_i, n_i) = 0.971
E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
Here (5/14) I(2, 3) means "age <=30" covers 5 of the 14 samples, with 2 yes's and 3 no's.
Hence Gain(age) = I(p, n) − E(age) = 0.246.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
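The Gain(age) number can be reproduced with a short script (a sketch; only the age column and the class label are needed for this calculation):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

    # (age bucket, buys_computer) pairs from the class-exercise table
    rows = [('<=30', 'no'), ('<=30', 'no'), ('31..40', 'yes'), ('>40', 'yes'),
            ('>40', 'yes'), ('>40', 'no'), ('31..40', 'yes'), ('<=30', 'no'),
            ('<=30', 'yes'), ('>40', 'yes'), ('<=30', 'yes'), ('31..40', 'yes'),
            ('31..40', 'yes'), ('>40', 'no')]

    labels = [y for _, y in rows]
    parent = entropy(labels)                                   # I(9, 5) = 0.940
    e_age = sum(len([y for a, y in rows if a == b]) / len(rows)
                * entropy([y for a, y in rows if a == b])
                for b in {'<=30', '31..40', '>40'})            # E(age) = 0.694
    print(round(parent, 3), round(e_age, 3), round(parent - e_age, 3))
    # prints 0.94 0.694 0.247 (the slide's 0.246 comes from rounding 0.940 - 0.694)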

Output: A Decision Tree for "buys_computer"
Root: age?
– age <=30 → student? (no → no; yes → yes)
– age 31..40 → yes
– age >40 → credit_rating? (excellent → no; fair → yes)

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (if continuous-valued, they are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
– There are no samples left.
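For comparison, here is a hedged sketch of inducing a tree on the class-exercise data with scikit-learn (assuming scikit-learn and pandas are available; this is not the algorithm from the slides but the library's CART-style implementation, and the categorical attributes are one-hot encoded because the estimator expects numeric features, so the resulting tree uses binary indicator splits rather than multi-way splits):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # class-exercise data (Han & Kamber); 'buys_computer' is the class label
    data = pd.DataFrame({
        'age':           ['<=30', '<=30', '31..40', '>40', '>40', '>40', '31..40',
                          '<=30', '<=30', '>40', '<=30', '31..40', '31..40', '>40'],
        'income':        ['high', 'high', 'high', 'medium', 'low', 'low', 'low',
                          'medium', 'low', 'medium', 'medium', 'medium', 'high', 'medium'],
        'student':       ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes',
                          'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'],
        'credit_rating': ['fair', 'excellent', 'fair', 'fair', 'fair', 'excellent', 'excellent',
                          'fair', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'excellent'],
        'buys_computer': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'],
    })

    X = pd.get_dummies(data.drop(columns='buys_computer'))   # one-hot encode categoricals
    y = data['buys_computer']

    tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # text rendering of the tree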

Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner):
– All attributes are assumed continuous-valued.
– Assume there exist several possible split values for each attribute.
– May need other tools, such as clustering, to get the possible split values.
– Can be modified for categorical attributes.

GINI Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the Gini index gini(T) is defined as
gini(T) = 1 − Σ_{j=1}^{n} p_j²
where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the Gini index of the split data is defined as
gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
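A small sketch of gini_split for a two-way split, where T1 and T2 are given as lists of class labels (the demo values are illustrative only):

    def gini(labels):
        """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T."""
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def gini_split(t1, t2):
        """gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)"""
        n = len(t1) + len(t2)
        return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

    # a smaller gini_split means a better split
    print(gini_split(['yes'] * 4, ['yes'] * 2 + ['no'] * 4))   # about 0.267
    print(gini_split(['yes', 'no'] * 3, ['yes', 'no'] * 2))    # 0.5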
