CSE 5243 INTRO. TO DATA MINING

Transcription

CSE 5243 INTRO. TO DATA MINING
Classification (Basic Concepts)
Yu Su, CSE@The Ohio State University
Slides adapted from UIUC CS412 by Prof. Jiawei Han and OSU CSE 5243 by Prof. Huan Sun

Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation and Selection
- Practical Issues of Classification
- Bayes Classification Methods
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary
(The topics are split between this class and the next.)

Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation and Selection
- Practical Issues of Classification
- Bayes Classification Methods
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary

Decision Tree Induction: An Example
- Training data set: buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).
- Resulting tree: the root tests age?; the <=30 branch tests student? (no -> no, yes -> yes); the 31..40 branch predicts yes; the >40 branch tests credit_rating? (excellent -> no, fair -> yes).

Training data (buys_computer):

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31..40   high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31..40   low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31..40   medium   no        excellent       yes
  31..40   high     yes       fair            yes
  >40      medium   no        excellent       no

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left.

Algorithm Outline
Split(node, {data tuples}):
- A <- the best attribute for splitting the {data tuples}
- Decision attribute for this node <- A
- For each value of A, create a new child node
- For each child node / subset:
  - If one of the stopping conditions is satisfied: STOP
  - Else: Split(child node, {subset})
ID3 algorithm: how it works: https://www.youtube.com/watch?v=XhOdSLlE5c
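As a rough illustration of the outline above, here is a minimal Python sketch of the recursive Split procedure. The data representation (tuples as dicts with a "label" key) and the helper names (best_attribute, majority_label) are assumptions made for illustration, not part of the slides; best_attribute stands for the selection heuristic (e.g., information gain) discussed next.

```python
from collections import Counter

def majority_label(tuples):
    """Majority voting over the class labels of a set of tuples."""
    return Counter(t["label"] for t in tuples).most_common(1)[0][0]

def split(tuples, attributes, best_attribute):
    """Recursive Split as outlined above. `attributes` is a set of attribute names;
    `best_attribute(tuples, attributes)` picks the splitting attribute."""
    labels = [t["label"] for t in tuples]
    # Stopping conditions from the slide: all tuples share one class,
    # or no attributes remain (fall back to majority voting).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return majority_label(tuples)
    A = best_attribute(tuples, attributes)      # e.g., highest information gain
    node = {"attribute": A, "children": {}}
    for value in {t[A] for t in tuples}:        # one child node per value of A
        subset = [t for t in tuples if t[A] == value]
        node["children"][value] = split(subset, attributes - {A}, best_attribute)
    return node
```

Because each child is created only for values that actually occur in the current subset, the "no samples left" stopping condition never arises in this particular formulation.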

Brief Review of Entropy
- Entropy (information theory): a measure of the uncertainty associated with a random variable.
- Calculation: for a discrete random variable Y taking m distinct values {y1, y2, ..., ym} with probabilities p1, ..., pm:
  H(Y) = - Σ_{i=1..m} p_i log2(p_i)
- Interpretation: higher entropy means higher uncertainty; lower entropy means lower uncertainty.
- Conditional entropy: H(Y|X) = Σ_x P(X=x) H(Y|X=x)
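To make the definition concrete, a few lines of Python (the function name is my own):

```python
from math import log2

def entropy(probs):
    """H(Y) = -sum_i p_i * log2(p_i) for a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -> maximum uncertainty for 2 outcomes
print(entropy([0.9, 0.1]))   # ~0.469 -> lower uncertainty
print(entropy([1.0]))        # 0.0   -> no uncertainty
```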

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)

Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples). Training data: the buys_computer data set shown earlier.
- Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- Look at "age":

  age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31..40   4     0     0
  >40      3     2     0.971

- Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Attribute Selection: Information Gain (cont.)
- Info(D) = I(9,5) = 0.940 and Info_age(D) = 0.694, as computed on the previous slide.
- Gain(age) = Info(D) - Info_age(D) = 0.246
- Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
- age has the highest information gain, so it is selected as the splitting attribute.
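To make the calculation concrete, here is a small Python sketch that recomputes these gains from the training table above. The function names are my own, and small rounding differences from the slide's figures are expected.

```python
from collections import Counter
from math import log2

# The buys_computer training data from the earlier slide (14 tuples):
# (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(data, attr_idx):
    """Information gain of splitting `data` on the attribute at `attr_idx`."""
    labels = [row[-1] for row in data]
    parts = {}
    for row in data:
        parts.setdefault(row[attr_idx], []).append(row[-1])
    info_a = sum(len(p) / len(data) * info(p) for p in parts.values())
    return info(labels) - info_a

for name, idx in ATTRS.items():
    print(f"Gain({name}) = {gain(DATA, idx):.3f}")
# Expected (cf. the slide): age ~0.246, income ~0.029, student ~0.151, credit_rating ~0.048
```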

Recursive Procedure
1. After selecting age at the root node, we create three child nodes.
2. One child node is associated with the red data tuples (a subset of the training data highlighted on the slide).
3. How to continue for this child node? Make D = {red data tuples} and then select the best attribute to further split D.
This is a recursive procedure.
(The slide shows the buys_computer training data again, with the relevant subset highlighted.)

How to Select the Test Attribute?
- Depends on the attribute type: nominal, ordinal, continuous.
- Depends on the number of ways to split: 2-way split, multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}.
- Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g., CarType -> {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}.

Splitting Based on Continuous Attributes
- (i) Binary split: e.g., Taxable Income > 80K? (Yes / No).
- (ii) Multi-way split: e.g., Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), >= 80K.

How to Determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred; ideally, the data tuples at a node all belong to the same class.
- Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity).

Rethinking Decision Tree Classification
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- We therefore need a measure of node impurity: e.g., C0: 5, C1: 5 is non-homogeneous (high degree of impurity), while C0: 9, C1: 1 is homogeneous (low degree of impurity).

Measures of Node Impurity
- Entropy: higher entropy means higher uncertainty and higher node impurity (this is why entropy is used in information gain).
- Gini index
- Misclassification error

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
- SplitInfo_A(D) is the entropy of the partitioning, i.e., the potential information generated by splitting D into v partitions.
- GainRatio(A) = Gain(A) / SplitInfo_A(D) (normalizing the information gain).

Gain Ratio for Attribute Selection (C4.5) (cont.)
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Example: Gain(income) = 0.029 (from last class, slide 27) and SplitInfo_income(D) = 1.557, so gain_ratio(income) = 0.029 / 1.557 = 0.019.
- The attribute with the maximum gain ratio is selected as the splitting attribute.
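A quick check of this example in Python, using the income counts from the buys_computer table above (4 "high", 6 "medium", 4 "low"):

```python
from math import log2

# Income values for the 14 buys_computer tuples.
counts = {"high": 4, "medium": 6, "low": 4}
n = sum(counts.values())

# SplitInfo_income(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)
split_info = -sum((c / n) * log2(c / n) for c in counts.values())
gain_income = 0.029                      # Gain(income), from the earlier slide
print(f"SplitInfo_income(D) = {split_info:.3f}")                 # ~1.557
print(f"GainRatio(income)   = {gain_income / split_info:.3f}")   # ~0.019
```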

Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1..n} p_j^2,
  where p_j is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the gini index after the split is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
- Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
- The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node.
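A small Python sketch of these two formulas (function names are my own); the example counts correspond to splitting the buys_computer table above on student:

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, computed from class counts in D."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """gini_A(D): weighted Gini of the partitions produced by splitting on A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Parent node: 9 'yes' and 5 'no' tuples; split on student -> [6,1] and [3,4].
parent = [9, 5]
children = [[6, 1], [3, 4]]
print(gini(parent))                          # ~0.459
print(gini_split(children))                  # weighted Gini after the split
print(gini(parent) - gini_split(children))   # reduction in impurity, Δgini(student)
```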

Binary Attributes: Computing Gini Index
- A binary attribute splits the data into two partitions.
- Effect of weighting partitions: larger and purer partitions are sought.
- gini(D) = 1 - Σ_j p_j^2
- Example: the parent node has class counts C1 = 6, C2 = 6; a split on B? produces node N1 (Yes branch) and node N2 (No branch). What is the Gini index of this split?

Binary Attributes: Computing Gini Index (cont.)
- Parent node: C1 = 6, C2 = 6, Gini = 0.500.
- After the split on B?: N1 has C1 = 5, C2 = 2; N2 has C1 = 1, C2 = 4.
- Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
- Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
- Weighted Gini of the split = ?

Binary Attributes: Computing Gini Index (cont.)
- Prefer larger and purer partitions.
- Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
- Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
- Weighted Gini of the children = (7/12) × 0.408 + (5/12) × 0.320 = 0.371
- Since 0.371 < 0.500 (the parent's Gini), splitting on B reduces impurity.
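For completeness, a few lines of Python reproducing the worked numbers above:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = [5, 2], [1, 4]                 # class counts (C1, C2) in N1 and N2
g1, g2 = gini(n1), gini(n2)
weighted = 7/12 * g1 + 5/12 * g2        # weights = partition sizes / 12
print(round(g1, 3), round(g2, 3), round(weighted, 3))   # 0.408 0.32 0.371
print(round(gini([6, 6]), 3))                           # parent Gini: 0.5
```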

Categorical Attributes: Computing Gini Index
- For each distinct value, gather the counts for each class in the data set.
- Use the count matrix to make decisions.
- Two options for CarType (values Family, Sports, Luxury): a multi-way split with one partition per value, or a two-way split that finds the best binary partition of the values.
- Example from the slide: the two-way split {Sports} vs. {Family, Luxury}, with class counts C1: 2, 2 and C2: 1, 5, has Gini = 0.419.

Continuous Attributes: Computing Gini Index or Information Gain
- To discretize the attribute values, use binary decisions based on one splitting value.
- Several choices for the splitting value: the number of possible splitting values = the number of distinct values - 1.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient because it repeats work.
(The slide shows the example training data with attributes Tid, Refund, Marital Status, Taxable Income and the class label, plus the binary test Taxable Income > 80K? with Yes/No branches.)

Continuous Attributes: Computing Gini Index or Expected Information Requirement
First decide the splitting value to discretize the attribute. For efficient computation, for each attribute:
- Step 1: Sort the attribute values.
(The slide lists the Taxable Income values in sorted order, with the candidate splitting values, the midpoints between adjacent values, shown beneath them.)

Continuous Attributes: Computing Gini Index or Expected Information Requirement (cont.)
First decide the splitting value to discretize the attribute. For efficient computation, for each attribute:
- Step 1: Sort the attribute values.
- Step 2: Linearly scan these values, each time updating the count matrix.
For each candidate splitting value v (the slide walks through v = 65, 72, 80, ..., 172), get its count matrix: how many data tuples have (a) Taxable Income < v with class label "Yes", (b) Taxable Income < v with class label "No", (c) Taxable Income >= v with class label "Yes", and (d) Taxable Income >= v with class label "No".

Continuous Attributes: Computing Gini Index or Expected Information Requirement (cont.)
First decide the splitting value to discretize the attribute. For efficient computation, for each attribute:
- Step 1: Sort the attribute values.
- Step 2: Linearly scan these values, each time updating the count matrix.
- Step 3: Compute the Gini index for each split position and choose the position with the least Gini index.
For each splitting value v (e.g., 65 or 72), compute its Gini index:
  gini_{Taxable Income}(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2),
where D1 and D2 are the two partitions based on v: D1 has Taxable Income < v and D2 has Taxable Income >= v.

Continuous Attributes: Computing Gini Index or Expected Information Requirement (cont.)
- Step 1: Sort the attribute values. Step 2: Linearly scan these values, updating the count matrix. Step 3: Compute the Gini index for each split position.
- Choose the splitting value with the least Gini index (here, 97) to discretize Taxable Income.

Continuous Attributes: Computing Gini Index or Expected Information Requirement (cont.)
- If information gain is used for attribute selection, Step 3 instead computes the expected information requirement and chooses the split position with the least value.
- Similarly to calculating the Gini index, for each splitting value compute
  Info_{Taxable Income}(D) = Σ_{j=1..2} (|D_j| / |D|) × Info(D_j).

Continuous Attributes: Computing Gini Index or Expected Information Requirement (cont.)
- Choose the splitting value (97 here) with the least Gini index or expected information requirement to discretize Taxable Income.
- At each level of the decision tree, for attribute selection: (1) first, discretize a continuous attribute by deciding its splitting value; (2) then, compare the discretized attribute with the other attributes in terms of Gini index reduction or information gain.
- With this sort-and-scan procedure, each attribute's data tuples are scanned only once. See the sketch below.
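To make the sort-and-scan idea concrete, here is a minimal Python sketch. The values and class labels below are assumed for illustration (they follow the commonly used Taxable Income example; the slide's own table is not reproduced in full here), as are the variable and function names.

```python
# Find the best binary split value for a continuous attribute by sorting once
# and scanning the candidate midpoints, maintaining left/right class counts.
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]            # assumed data
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pairs = sorted(zip(values, labels))
left = {"Yes": 0, "No": 0}
right = {"Yes": 0, "No": 0}
for _, y in pairs:                       # initially everything is on the right
    right[y] += 1

n = len(pairs)
best_v, best_gini = None, float("inf")
for i in range(n - 1):
    v, y = pairs[i]
    left[y] += 1                         # move one tuple from right to left
    right[y] -= 1
    mid = (v + pairs[i + 1][0]) / 2      # candidate splitting value (midpoint)
    g = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
    if g < best_gini:
        best_v, best_gini = mid, g

print(f"best split value = {best_v}, weighted Gini = {best_gini:.3f}")
# With this data: best split value = 97.5, weighted Gini = 0.300
# (the slide reports this split position as 97).
```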

Another Impurity Measure: Misclassification Error
- Classification error at a node t: Error(t) = 1 - max_i P(i | t), where P(i | t) is the relative frequency of class i at node t.
- It measures the misclassification error made by a node:
  - Maximum (1 - 1/n_c) when records are equally distributed among all n_c classes, implying the most impurity.
  - Minimum (0.0) when all records belong to one class, implying the least impurity.

Examples for Misclassification Error
Error(t) = 1 - max_i P(i | t)
- Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 - max(0, 1) = 1 - 1 = 0.
- Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
- Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.

Comparison Among Impurity Measures
For a 2-class problem, with p the fraction of records in one class:
- Entropy = -p log2(p) - (1 - p) log2(1 - p)
- Gini = 1 - p^2 - (1 - p)^2
- Error = 1 - max(p, 1 - p)
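A few lines of Python to tabulate the three measures for a 2-class node (the function names are my own). All three peak at p = 0.5 (most impure) and vanish at p = 0 or 1 (pure node):

```python
from math import log2

def entropy(p): return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)
def gini(p):    return 1 - p**2 - (1 - p)**2
def error(p):   return 1 - max(p, 1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  error={error(p):.3f}")
```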

Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm, with a measure based on the χ2 test for independence.
- C-SEP: performs better than information gain and the Gini index in certain cases.
- G-statistic: has a close approximation to the χ2 distribution.
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree.
- Multivariate splits (partitions based on combinations of multiple variables): e.g., CART finds multivariate splits based on a linear combination of attributes.
Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.

Decision Tree Based Classification
Advantages:
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Accuracy is comparable to other classification techniques for many simple data sets.

Example: C4.5
- Simple depth-first construction.
- Uses information gain.
- Sorts continuous attributes at each node.
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets; that would require out-of-core sorting.
- You can download the software online, e.g., http://www2.cs.uregina.ca/ dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html

Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation and Selection
- Practical Issues of Classification
- Bayes Classification Methods
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Classifier Evaluation Metrics: Confusion Matrix
- Confusion matrix:

  Actual class \ Predicted class   C1                     ¬C1
  C1                               True Positives (TP)    False Negatives (FN)
  ¬C1                              False Positives (FP)   True Negatives (TN)

- Example confusion matrix:

  Actual class \ Predicted class   buy_computer=yes   buy_computer=no   Total
  buy_computer=yes                 6954               46                7000
  buy_computer=no                  412                2588              3000
  Total                            7366               2634              10000

- Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
- The matrix may have extra rows/columns to provide totals.

Classifier Evaluation Metrics: Accuracy, Error Rate
- Confusion matrix with totals:

  Actual \ Predicted   C     ¬C    Total
  C                    TP    FN    P
  ¬C                   FP    TN    N
  Total                P'    N'    All

- Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified. Accuracy = (TP + TN) / All.
- Error rate: 1 - accuracy, or Error rate = (FP + FN) / All.

Limitation of Accuracy
- Consider a 2-class problem: the number of Class 0 examples is 9990 and the number of Class 1 examples is 10.
- If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading here because the model does not detect any Class 1 example.

Cost Matrix
- A cost matrix assigns a cost C(i|j) to predicting class i when the true class is j:

                        PREDICTED CLASS
  C(i|j)                Class=Yes      Class=No
  ACTUAL   Class=Yes    C(Yes|Yes)     C(No|Yes)
  CLASS    Class=No     C(Yes|No)      C(No|No)

- C(i|j): the cost of misclassifying a class j example as class i.

Computing Cost of Classification
- Cost matrix:

  C(i|j)       PREDICTED +   PREDICTED -
  ACTUAL +     -1            100
  ACTUAL -     1             0

- Model M1:

               PREDICTED +   PREDICTED -
  ACTUAL +     150           40
  ACTUAL -     60            250

  Accuracy = 80%, Cost = 150(-1) + 40(100) + 60(1) + 250(0) = 3910

- Model M2:

               PREDICTED +   PREDICTED -
  ACTUAL +     250           45
  ACTUAL -     5             200

  Accuracy = 90%, Cost = 250(-1) + 45(100) + 5(1) + 200(0) = 4255

- Although M2 has the higher accuracy, M1 has the lower cost.
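A few lines of Python recomputing the two costs (the dictionary layout is my own):

```python
# cost[i][j] = C(i|j): the cost of predicting class i when the actual class is j.
cost = {"+": {"+": -1, "-": 1}, "-": {"+": 100, "-": 0}}

def total_cost(conf):
    # conf[actual][predicted] = number of such test records
    return sum(conf[a][p] * cost[p][a] for a in conf for p in conf[a])

m1 = {"+": {"+": 150, "-": 40}, "-": {"+": 60, "-": 250}}
m2 = {"+": {"+": 250, "-": 45}, "-": {"+": 5, "-": 200}}
print(total_cost(m1), total_cost(m2))   # 3910 4255
```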

Cost-Sensitive Measures
- Precision (p) = a / (a + c)
- Recall (r) = a / (a + b)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
  (a, b, c, d are the confusion-matrix counts: a = TP, b = FN, c = FP, d = TN.)
- Precision is biased towards C(Yes|Yes) & C(Yes|No); Recall is biased towards C(Yes|Yes) & C(No|Yes); F-measure is biased towards all except C(No|No).
- Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
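A minimal sketch of these formulas in Python, reusing the buy_computer confusion matrix from the earlier slide (the function name is my own):

```python
def precision_recall_f(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN, as in the confusion matrix above."""
    p = a / (a + c)
    r = a / (a + b)
    f = 2 * r * p / (r + p)          # equivalently 2a / (2a + b + c)
    return p, r, f

# buy_computer example: TP=6954, FN=46, FP=412, TN=2588.
p, r, f = precision_recall_f(6954, 46, 412, 2588)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```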

Classifier Evaluation Metrics: Sensitivity and Specificity
- Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive; there is a significant majority of the negative class and a minority of the positive class.
- Using the confusion-matrix totals (P = actual positives, N = actual negatives, All = P + N): Accuracy = (TP + TN) / All and Error rate = (FP + FN) / All, as before.
- Sensitivity: the true positive recognition rate, Sensitivity = TP / P.
- Specificity: the true negative recognition rate, Specificity = TN / N.

Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- The performance of a model may depend on factors other than the learning algorithm: class distribution, cost of misclassification, and the sizes of the training and test sets.

Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- It requires a sampling schedule for creating the curve, e.g., arithmetic sampling (Langley et al.) or geometric sampling (Provost et al.).
- Effect of small sample size: bias in the estimate and variance of the estimate.

Methods of Estimation
- Holdout: e.g., reserve 2/3 for training and 1/3 for testing.
- Random subsampling: repeated holdout.
- Cross-validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n.
- Stratified sampling: oversampling vs. undersampling.
- Bootstrap: sampling with replacement.

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
- Holdout method:
  - The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
  - Random sampling: a variation of holdout; repeat holdout k times and take the accuracy as the average of the accuracies obtained.
- Cross-validation (k-fold, where k = 10 is most popular):
  - Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size.
  - At the i-th iteration, use Di as the test set and the others as the training set.
  - Leave-one-out: k folds where k = the number of tuples, for small-sized data.
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
(A k-fold sketch follows below.)
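A minimal, classifier-agnostic sketch of plain (unstratified) k-fold cross-validation. The arguments train_fn and accuracy_fn are placeholders for whatever classifier is being evaluated; they are not part of the slides.

```python
import random

def k_fold_accuracy(data, k, train_fn, accuracy_fn, seed=0):
    """Partition the data into k folds, train on k-1 of them, test on the
    held-out fold, and average the k accuracies."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]      # k roughly equal partitions
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)
        accuracies.append(accuracy_fn(model, test))
    return sum(accuracies) / k
```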

Evaluating Classifier Accuracy: Bootstrap
- Bootstrap works well with small data sets. It samples the given training tuples uniformly with replacement: each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
- There are several bootstrap methods; a common one is the .632 bootstrap:
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
  - Repeat the sampling procedure k times; the overall accuracy of the model is the average
    Acc(M) = (1/k) Σ_{i=1..k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set).
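A minimal sketch of the .632 bootstrap described above. As in the cross-validation sketch, train_fn and accuracy_fn are placeholder names, and the sketch assumes d is large enough that some tuples are left out of each sample.

```python
import random

def bootstrap_632_accuracy(data, k, train_fn, accuracy_fn, seed=0):
    """Each round: sample d tuples with replacement as the training set; tuples
    never drawn form the test set; combine the two accuracies with weights
    0.632 and 0.368, and average over k rounds."""
    rng = random.Random(seed)
    data = list(data)
    d = len(data)
    total = 0.0
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]   # sample d times with replacement
        chosen = set(idx)
        train = [data[i] for i in idx]
        test = [data[i] for i in range(d) if i not in chosen]   # ~36.8% of tuples
        model = train_fn(train)
        total += 0.632 * accuracy_fn(model, test) + 0.368 * accuracy_fn(model, train)
    return total / k
```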

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic) Curve
- The ROC curve plots (False Positive Rate, True Positive Rate):
  - (0,0): declare everything to be the negative class.
  - (1,1): declare everything to be the positive class.
  - (0,1): ideal.
- The diagonal line corresponds to random guessing; below the diagonal line, the prediction is the opposite of the true class.

Using ROC for Model Comparison
- No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
- Area under the ROC curve (AUC): ideal classifier: area = 1; random guess (diagonal line): area = 0.5.
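A small sketch of computing the area under an ROC curve from (FPR, TPR) points with the trapezoidal rule; the example points are made up for illustration.

```python
def auc(roc_points):
    """Area under an ROC curve given as (FPR, TPR) points."""
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc([(0, 0), (1, 1)]))                           # 0.5 -> random guessing
print(auc([(0, 0), (0, 1), (1, 1)]))                   # 1.0 -> ideal classifier
print(auc([(0, 0), (0.2, 0.7), (0.5, 0.9), (1, 1)]))   # somewhere in between
```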

Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation and Selection
- Practical Issues of Classification
- Bayes Classification Methods
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary
