
Licentiate Thesis

Predictive Techniques and Methods for Decision Support in Situations with Poor Data Quality

RIKARD KÖNIG
Technology

UNIVERSITY OF BORÅS
School of Business and Informatics

UNIVERSITY OF SKÖVDE
Informatics Research Center

UNIVERSITY OF ÖREBRO
School of Science and Technology


Abstract

Today, decision support systems based on predictive modeling are becoming more common, since organizations often collect more data than decision makers can handle manually. Predictive models are used to find potentially valuable patterns in the data, or to predict the outcome of some event. There are numerous predictive techniques, ranging from simple techniques such as linear regression to complex, powerful ones like artificial neural networks. Complex models usually obtain better predictive performance, but are opaque and thus cannot be used to explain predictions or discovered patterns. The design choice of which predictive technique to use becomes even harder since no technique outperforms all others over a large set of problems. It is even difficult to find the best parameter values for a specific technique, since these settings are also problem dependent. One way to simplify this vital decision is to combine several models, possibly created with different settings and techniques, into an ensemble. Ensembles are known to be more robust and powerful than individual models, and ensemble diversity can be used to estimate the uncertainty associated with each prediction.

In real-world data mining projects, data is often imprecise, contains uncertainties or is missing important values, making it impossible to create models with sufficient performance for fully automated systems. In these cases, predictions need to be manually analyzed and adjusted. Here, opaque models like ensembles have a disadvantage, since the analysis requires understandable models. To overcome this deficiency of opaque models, researchers have developed rule extraction techniques that try to extract comprehensible rules from opaque models, while retaining sufficient accuracy.

This thesis suggests a straightforward but comprehensive method for predictive modeling in situations with poor data quality.
First, ensembles are used for the actual modeling, since they are powerful, robust and require few design choices. Next, ensemble uncertainty estimations pinpoint predictions that need special attention from a decision maker. Finally, rule extraction is performed to support the analysis of uncertain predictions. Using this method, ensembles can be used for predictive modeling, in spite of their opacity and sometimes insufficient global performance, while the involvement of a decision maker is minimized.

The main contributions of this thesis are three novel techniques that enhance the performance of the proposed method. The first technique deals with ensemble uncertainty estimation and is based on a successful approach often used in weather forecasting. The other two are improvements of a rule extraction technique, resulting in increased comprehensibility and more accurate uncertainty estimations.
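The first two steps of the suggested method can be illustrated with a small, self-contained sketch (not the thesis implementation): a bagged ensemble of simple linear models whose member spread serves as the uncertainty estimate, flagging predictions that deserve a decision maker's attention. All function names and the toy data below are illustrative assumptions.

```python
import random
import statistics

def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

def build_ensemble(data, n_members=25, seed=0):
    """Bagging: train each member on a bootstrap resample of the data."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(data) for _ in data]) for _ in range(n_members)]

def predict(ensemble, x):
    """Ensemble prediction is the member mean; the spread (standard
    deviation) of member outputs estimates the prediction's uncertainty."""
    outputs = [a * x + b for a, b in ensemble]
    return statistics.mean(outputs), statistics.stdev(outputs)

# Toy data: y = 2x + 1 plus noise, observed only for x in [0, 10).
rng = random.Random(1)
data = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(100)]
ensemble = build_ensemble(data)

# Members agree where data was plentiful and diverge when extrapolating,
# so a high spread marks predictions to route to a decision maker.
_, spread_inside = predict(ensemble, 5.0)
_, spread_outside = predict(ensemble, 50.0)
needs_review = spread_outside > spread_inside
```

The design choice mirrors the abstract: the ensemble itself stays opaque, but its internal disagreement is cheap to compute and singles out the predictions that require manual analysis.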


Contents

1 Introduction
1.1 Key Observations
1.2 Problem Statement
1.3 Thesis Objectives
1.4 Main Contributions
1.5 Thesis Outline
2 Information Fusion
2.1 OODA-Loop
2.1.1 The OODA-Loop in Relation to Predictive Modeling
3 Knowledge Discovery in Databases
3.1 Data Mining
3.1.1 Data Mining Tasks
3.1.2 Machine Learning vs. Data Mining
3.1.3 Statistics vs. Data Mining
3.2 Predictive Modeling
3.3 Data
3.3.1 Missing Values
3.3.2 Outliers
3.4 Preprocessing
3.4.1 Normalization
4 Evaluation of Predictive Models
4.1 Holdout Sample
4.2 Random Subsampling
4.3 Cross Validation
4.4 Comparing Models
4.4.1 Wilcoxon Signed Rank Test
4.4.2 Friedman Test
4.4.3 Nemenyi Test
4.5 Error Metrics for Classification Tasks

4.5.1 Accuracy (ACC)
4.5.2 Area Under ROC-curve (AUC)
4.5.3 Brier Score (BRI)
4.6 Error Metrics for Estimation Tasks
4.6.1 Mean Absolute Deviation (MAD)
4.6.2 Coefficient of Determination (r2)
4.6.3 Root Mean Square Error (RMSE)
4.6.4 Geometric Mean of the Relative Absolute Error (GMRAE)
5 Data Mining Techniques
5.1 Linear Regression
5.2 Decision Trees
5.2.1 Creation
5.2.2 Pruning
5.2.3 Probability Estimation
5.3 Artificial Neural Networks
5.3.1 Basics of a Neural Network
5.4 Evolutionary Algorithms
5.4.1 Genetic Algorithms
5.4.2 Genetic Programming
5.5 Ensembles
5.5.1 Calculating Member Weights
5.5.2 Ensemble Creation
5.5.3 Member Selection
5.5.4 Estimating Ensemble Uncertainty
5.6 Rule Extraction
5.6.1 Rule Size
6 Previous Work - Genetic Rule Extraction (G-REX)
6.1 Fitness Functions
6.2 Representation Languages
6.3 Empirical Results

7 Motivation of Case Studies
8 Data Sets Used for Evaluation
8.1 Classification Data Sets
8.2 Estimation Data Sets
9 Case Studies
9.1 Increasing Rule Extraction Comprehensibility
9.1.1 Summary
9.1.2 Background
9.1.3 Method
9.1.4 Experiments
9.1.5 Results
9.1.6 Discussion
9.1.7 Conclusions
9.1.8 Considerations
9.2 Using GP to Increase Rule Quality
9.2.1 Summary
9.2.2 Background
9.2.3 Experiments
9.2.4 Results
9.2.5 Conclusions
9.2.6 Considerations
9.3 Instance Ranking Using Ensemble Spread
9.3.1 Summary
9.3.2 Background
9.3.3 Method
9.3.4 Experiments
9.3.5 Results
9.3.6 Conclusions
9.4 Inconsistency – Friend or Foe
9.4.1 Summary

9.4.2 Background
9.4.3 Method
9.4.4 Results
9.4.5 Conclusions
9.4.6 Discussion and Future Work
9.4.7 Considerations
9.5 Genetic Programming – A Tool for Flexible Rule Extraction
9.5.1 Summary
9.5.2 Background
9.5.3 Method
9.5.4 Results
9.5.5 Considerations
10 Conclusions
10.1 Concluding Remarks
11 Future Work
12 References

List of Figures

Figure 1 - The role of predictive modeling tasks in the OODA-loop
Figure 2 - Phases of CRISP-DM
Figure 3 - Possible model of the balloon data set in Table 1
Figure 4 - Predictive Modeling
Figure 5 - Gaussian distribution with mean of 0 and STD of 1
Figure 6 - Sample ROC curve
Figure 7 - Simple Decision Tree
Figure 8 - Correct classifications / instances reaching each node
Figure 9 - A simple MLP
Figure 10 - GA crossover
Figure 11 - GP crossover
Figure 12 - Rule extraction
Figure 13 - Pedagogical rule extraction
Figure 14 - Transparent but not comprehensible model
Figure 15 - Complex-IF rule evolved by G-REX
Figure 16 - IF-REG rule evolved by G-REX
Figure 17 - Simple rule with introns
Figure 18 - The simplification process
Figure 19 - Extracted unsimplified rule for CLE
Figure 20 - Simplified rule for CLE
Figure 21 - Bins affected by outliers
Figure 22 - Trimmed bins
Figure 23 - Gain chart for fold 1 of SER
Figure 24 - Sample rule for LINREG
Figure 25 - Sample rule for FREE

List of Tables

Table 1 - The balloons data set
Table 2 - Critical values of the Wilcoxon test at 0.05 significance level
Table 3 - F-distribution critical values for p = 0.05
Table 4 - Critical values for the Nemenyi test
Table 5 - Critical values for the Bonferroni-Dunn test
Table 6 - BNF for the Boolean representation language
Table 7 - BNF for the IF-ELSE representation language
Table 8 - BNF for the Complex-IF representation language
Table 9 - BNF for the IF-REG representation language
Table 10 - Contribution of case studies
Table 11 - Characteristics of classification data sets
Table 12 - Characteristics of estimation data sets
Table 13 - Accuracy on individual data sets (CD 1.09)
Table 14 - Size of extracted rules (CD 1.09)
Table 15 - Rule size after simplification (CD .792)
Table 16 - G-REX settings used in all experiments
Table 17 - BNF for the Adjusted IF-ELSE representation language
Table 18 - Average ACC over 10 folds (CD 1.21)
Table 19 - Average AUC over 10 folds (CD 1.21)
Table 20 - Average BRE over 10 folds (CD 0.879)
Table 21 - Ensemble MSE
Table 22 - PDE using mode scores (CD 2.15)
Table 23 - PDE using support scores (CD 2.15)
Table 24 - G-REX settings
Table 25 - Intra-model consistency (ZOO)
Table 26 - Inter-model consistency (PID)
Table 27 - Average consistency over all pairs and folds
Table 28 - Accuracy comparison between G-REX and CART
Table 29 - Comparison between methods (CD 1.38)
Table 30 - BNF for LINREG
Table 31 - Free representation (FREE)
Table 32 - GMRAE on individual data sets (CD 1.97)

List of Publications

Journal Papers

R. Konig, U. Johansson and L. Niklasson, Increasing Rule Extraction Comprehensibility, International Journal of Information Technology and Intelligent Computing, 1, pp. 303-314, 2006.

U. Johansson, T. Lofstrom, R. Konig and L. Niklasson, Why Not Use an Oracle When You Got One?, Neural Information Processing - Letters and Reviews, pp. 227-236, 2006.

International Conference Papers

R. Konig, U. Johansson and L. Niklasson, Using Genetic Programming to Increase Rule Quality, International Florida Artificial Intelligence Research Society Conference, Coconut Grove, Florida, USA, pp. 288-293, 2008.

R. Konig, U. Johansson and L. Niklasson, G-REX: A Versatile Framework for Evolutionary Data Mining, IEEE International Conference on Data Mining Workshops, Pisa, Italy, pp. 971-974, 2008.

R. Konig, U. Johansson and L. Niklasson, Evolving a Locally Optimized Instance Based Learner, International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 124-129, 2008.

C. Sonstrod, U. Johansson, R. Konig and L. Niklasson, Genetic Decision Lists for Concept Description, International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 450-457, 2008.

U. Johansson, H. Bostrom and R. Konig, Extending Nearest Neighbor Classification with Spheres of Confidence, Proceedings of the Twenty-First International FLAIRS Conference, Coconut Grove, Florida, USA, pp. 282-287, 2008.

U. Johansson, R. Konig, T. Lofstrom and L. Niklasson, Increasing Rule Extraction Accuracy by Post-processing GP Trees, IEEE Congress on Evolutionary Computation, Hong Kong, China, pp. 3010-3015, 2008.

R. Konig, U. Johansson and L. Niklasson, Instance Ranking Using Ensemble Spread, International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 73-78, 2007.

R. Konig, U. Johansson and L. Niklasson, Genetic Programming – a Tool for Flexible Rule Extraction, IEEE Congress on Evolutionary Computation, Singapore, pp. 1304-1310, 2007.

U. Johansson, R. Konig and L. Niklasson, Inconsistency – Friend or Foe, International Joint Conference on Neural Networks, Orlando, Florida, USA, pp. 1383-1388, 2007.

C. Sonstrod, U. Johansson and R. Konig, Towards a Unified View on Concept Description, International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 59-65, 2007.

U. Johansson, T. Lofstrom, R. Konig and L. Niklasson, Building Neural Network Ensembles using Genetic Programming, International Joint Conference on Neural Networks, Vancouver, Canada, pp. 1260-1265, 2006.

U. Johansson, T. Lofstrom, R. Konig, C. Sonstrod and L. Niklasson, Rule Extraction from Opaque Models - A Slightly Different Perspective, International Conference on Machine Learning and Applications, Orlando, Florida, USA, pp. 22-27, 2006.

U. Johansson, T. Lofstrom, R. Konig and L. Niklasson, Genetically Evolved Trees Representing Ensembles, International Conference - Artificial Intelligence and Soft Computing, pp. 613-622, 2006.

U. Johansson, T. Lofstrom, R. Konig and L. Niklasson, Introducing GEMS - A Novel Technique for Ensemble Creation, International Florida Artificial Intelligence Research Society Conference, pp. 700-705, 2006.

T. Lofstrom, R. Konig, U. Johansson, L. Niklasson, M. Strand and T. Ziemke, Benefits of Relating the Retail Domain and Information Fusion, International Conference on Information Fusion, Florence, Italy, pp. 129-134, 2006.

U. Johansson, R. Konig and L. Niklasson, Automatically Balancing Accuracy …, International Conference on Information Fusion, Philadelphia, PA, USA, 2005.

U. Johansson, L. Niklasson and R. Konig, Accuracy vs. Comprehensibility in Data Mining Models, International Conference on Information Fusion, Stockholm, Sweden, pp. 295-300, 2004.

U. Johansson, R. Konig and L. Niklasson, The Truth is in There - Rule Extraction from Opaque Models Using Genetic Programming, International Florida Artificial Intelligence Research Society Conference, Miami Beach, Florida, USA, pp. 658-663, 2004.

U. Johansson, C. Sonstrod, R. Konig and L. Niklasson, Neural Networks and Rule Extraction for Prediction and Explanation in the Marketing Domain, International Joint Conference on Neural Networks, Portland, Oregon, USA, pp. 2866-2871, 2003.

U. Johansson, R. Konig and L. Niklasson, Rule Extraction from Trained Neural Networks using Genetic Programming, International Conference on Artificial Neural Networks and International Conference on Neural Information Processing, Istanbul, Turkey, pp. 13-16, 2003.

U. Johansson, C. Sonstrod and R. Konig, Cheating by Sharing Information – The Doom of Online Poker?, International Conference on Application and Development of Computer Games in the 21st Century, Hong Kong, China, pp. 16-22, 2003.

Acknowledgments

First of all I want to thank my supervisor, Professor Lars "The Dark Horse" Niklasson, University of Skövde, for all the help, support and encouragement that made this thesis possible.

Special thanks to my assistant supervisor, Dr. Ulf "Hawkeye" Johansson, who motivated me to start my PhD studies and then supervised my daily work with keen eyes. This thesis would definitely not have happened without your help and support. Are you sure that it wasn't your eye, there on top of Barad-dûr?

Obviously I'm very grateful to the University of Borås, the University of Skövde, ICA Handlarnas AB and the Knowledge Foundation for funding my PhD studies. Especially Rolf Appelqvist and Mikael Lind at the University of Borås, and Lars Niklasson and Mattias Strand at the University of Skövde, who together made my studies possible. At ICA Handlarnas AB, I especially want to thank Thomas Khanestad, Johan Petterson, Christer Nyman and Anna Byberg for interesting discussions and valuable insights about problems in the "real world". With some luck, my PhD thesis will contain some solutions that will make your lives a bit easier.

I especially want to thank Cecilia Sönströd, who helped proooofread this thesis; I will buy you a set of new red pens, since yours surely must be out of ink. I also want to thank Andrèas Stråhle (who helped give birth to G-REX), Marcus Hedblom (who showed me that this was possible) and Tuve Löfström (who is a good discussion partner and a superb spiritual guide).

Finally I want to thank my family, Lorena, Birgitta, Leif, Susanna and Olof, who supported and believed in me without having a clue of what I was doing. I wouldn't recommend you to read this thesis, as it will probably bore you to death and make everything even more confusing. Special thanks to Lorena, who kicked me out of bed every morning and encouraged me when I needed it the most.

1 Introduction

Today decision makers often have access to numerous information sources containing vast amounts of data and information, i.e. processed data. Often the data is collected using several sensors to gain as much information as possible about activities related to a decision. [BPR02] argue that this approach is not necessarily the best one to support a decision maker. First, all information may not be required to adequately execute the task. Secondly, the resources required to process all this information might exceed the human information processing capabilities. This problem is also recognized by [HL01] and [SHB 03], as they point out that a decision maker will need an automated or semi-automated decision support system (DSS) to make use of large quantities of data, especially if the data is of a heterogeneous nature. A DSS needs to be flexible, easy to use and robust enough to perform well over a large set of problems to be useful in the ever-changing environment of today.

Information fusion (IF) is a research area that focuses on achieving synergetic effects by combining different information sources. There are numerous definitions of IF (for examples see [HL01], [CGC 05], [Das01], [HW01] and [GB05]), but what they all have in common is the emphasis on maximizing the useful information content acquired from heterogeneous sources. IF techniques are sometimes implemented in a fully automatic process or in a human-aiding process for analysis and/or decision support [HL01].

A typical situation that requires a DSS is when the result of some future situation needs to be predicted based on a large set of examples. In information fusion and the related area of data mining this task is called predictive modeling, i.e., the task of predicting an unknown (often future) value of a specific variable based on a model learned from known examples.

A predictive model is created from variables that are thought to provide information about the behavior of the dependent (predicted) variable.
These variables are called independent or causal variables and are assumed to have known constant values. Predictive modeling techniques try to create models that maximize a score function, which is most often designed to optimize the performance metric used for evaluation.

For predictive modeling the primary performance goal is to achieve high accuracy, i.e. a low error between the predicted value and the real value, when the model is applied to novel data. For certain problems, a simple model, e.g. a linear regression model, is enough to achieve a sufficient level of accuracy. Other problems require a more complicated model, e.g. a non-linear neural network or an ensemble. It is well known that ensembles, i.e. a combination of different

models, are able to improve both accuracy and robustness of predictions compared to single models [KV95]. Another advantage of ensemble techniques is that they can eliminate a lot of the parameter tweaking which is often needed for creating and finding the best single model.

The choice of a highly complex model such as an ensemble, however, entails that the model is hard or impossible to interpret, a dilemma often called the accuracy versus comprehensibility tradeoff. Neither the model as a whole nor the reasons behind a single prediction can be explained when using a complex opaque model.

Sometimes it is impossible to obtain sufficient performance even with a complex model, since no model can be better than the data it was created from. Data can contain uncertainties, low precision or noise, or simply lack relevant data, making it impossible to create a model with sufficient accuracy. In these cases it is crucial that the model is comprehensible, so that manual adjustments can be made in a rational way [Goo02]. Even if a complex opaque model would be better than a transparent model with regard to accuracy, it may not be good enough to be trusted in practice. Without any explanation of an uncertain prediction it cannot be adjusted or verified, and thus cannot be trusted and used for decision support.

Other motivations for manual adjustments, and therefore comprehensible models, are that all relevant data cannot always be presented in a structured manner, that all data is not always available at prediction time, or that a drastic environmental change may have occurred. Fur
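One way around the accuracy versus comprehensibility tradeoff, explored later in this thesis, is pedagogical rule extraction: the opaque model is queried as an oracle, and a transparent rule is chosen to maximize fidelity, i.e. agreement with the oracle's predictions. The sketch below is a deliberately minimal, hypothetical illustration: the "opaque" classifier is a stand-in, and an exhaustive search over one-split rules replaces the genetic programming used by G-REX.

```python
def opaque_model(x):
    # Stand-in "opaque" classifier: a decision surface we pretend
    # we cannot inspect, only query.
    return 1 if (x * 1.7 + 0.3) > 5.0 else 0

inputs = [i / 10 for i in range(100)]            # unlabeled data, x in [0, 10)
oracle_labels = [opaque_model(x) for x in inputs]

def fidelity(threshold):
    """Fraction of instances where 'IF x > threshold THEN 1 ELSE 0'
    agrees with the opaque model's predictions."""
    rule_labels = [1 if x > threshold else 0 for x in inputs]
    return sum(r == o for r, o in zip(rule_labels, oracle_labels)) / len(inputs)

# Exhaustive search over candidate thresholds (GP would search a much
# richer rule space); the result is a comprehensible one-line rule.
best_t = max(inputs, key=fidelity)
# Extracted rule: IF x > best_t THEN class 1 ELSE class 0
```

Because the extracted rule is transparent, an uncertain prediction can now be traced to a single condition and adjusted or verified by a decision maker, which is exactly what the opaque model alone does not allow.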
