Data Mining: Concepts and Techniques


Data Mining: Concepts and Techniques
Second Edition

Jiawei Han and Micheline Kamber
University of Illinois at Urbana-Champaign

AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO

Publisher Diane Cerra
Publishing Services Manager Simon Crump
Editorial Assistant Asma Stephan
Cover Design
Cover Image
Cover Illustration
Text Design
Composition diacriTech
Technical Illustration Dartmouth Publishing, Inc.
Copyeditor Multiscience Press
Proofreader Multiscience Press
Indexer Multiscience Press
Interior printer Maple-Vail Book Manufacturing Group
Cover printer Phoenix Color

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2006 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting "Customer Support" and then "Obtaining Permissions."

Library of Congress Cataloging-in-Publication Data
Application submitted
ISBN 13: 978-1-55860-901-3
ISBN 10: 1-55860-901-6

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com

Printed in the United States of America
06 07 08 09 10    5 4 3 2 1

Dedication

To Y. Dora and Lawrence for your love and encouragement
J.H.

To Erik, Kevan, Kian, and Mikael for your love and inspiration
M.K.

Contents

About the Author
Foreword
Preface

Chapter 1 Introduction
1.1 What Motivated Data Mining? Why Is It Important?
1.2 So, What Is Data Mining?
1.3 Data Mining—On What Kind of Data?
1.3.1 Relational Databases
1.3.2 Data Warehouses
1.3.3 Transactional Databases
1.3.4 Advanced Data and Information Systems and Advanced Applications
1.4 Data Mining Functionalities—What Kinds of Patterns Can Be Mined?
1.4.1 Concept/Class Description: Characterization and Discrimination
1.4.2 Mining Frequent Patterns, Associations, and Correlations
1.4.3 Classification and Prediction
1.4.4 Cluster Analysis
1.4.5 Outlier Analysis
1.4.6 Evolution Analysis
1.5 Are All of the Patterns Interesting?
1.6 Classification of Data Mining Systems
1.7 Data Mining Task Primitives
1.8 Integration of a Data Mining System with a Database or Data Warehouse System
1.9 Major Issues in Data Mining
1.10 Summary
Exercises
Bibliographic Notes

Chapter 2 Data Preprocessing
2.1 Why Preprocess the Data?
2.2 Descriptive Data Summarization
2.2.1 Measuring the Central Tendency
2.2.2 Measuring the Dispersion of Data
2.2.3 Graphic Displays of Basic Descriptive Data Summaries
2.3 Data Cleaning
2.3.1 Missing Values
2.3.2 Noisy Data
2.3.3 Data Cleaning as a Process
2.4 Data Integration and Transformation
2.4.1 Data Integration
2.4.2 Data Transformation
2.5 Data Reduction
2.5.1 Data Cube Aggregation
2.5.2 Attribute Subset Selection
2.5.3 Dimensionality Reduction
2.5.4 Numerosity Reduction
2.6 Data Discretization and Concept Hierarchy Generation
2.6.1 Discretization and Concept Hierarchy Generation for Numerical Data
2.6.2 Concept Hierarchy Generation for Categorical Data
2.7 Summary
Exercises
Bibliographic Notes

Chapter 3 Data Warehouse and OLAP Technology: An Overview
3.1 What Is a Data Warehouse?
3.1.1 Differences between Operational Database Systems and Data Warehouses
3.1.2 But, Why Have a Separate Data Warehouse?
3.2 A Multidimensional Data Model
3.2.1 From Tables and Spreadsheets to Data Cubes
3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
3.2.3 Examples for Defining Star, Snowflake, and Fact Constellation Schemas
3.2.4 Measures: Their Categorization and Computation
3.2.5 Concept Hierarchies
3.2.6 OLAP Operations in the Multidimensional Data Model
3.2.7 A Starnet Query Model for Querying Multidimensional Databases
3.3 Data Warehouse Architecture
3.3.1 Steps for the Design and Construction of Data Warehouses
3.3.2 A Three-Tier Data Warehouse Architecture
3.3.3 Data Warehouse Back-End Tools and Utilities
3.3.4 Metadata Repository
3.3.5 Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
3.4 Data Warehouse Implementation
3.4.1 Efficient Computation of Data Cubes
3.4.2 Indexing OLAP Data
3.4.3 Efficient Processing of OLAP Queries
3.5 From Data Warehousing to Data Mining
3.5.1 Data Warehouse Usage
3.5.2 From On-Line Analytical Processing to On-Line Analytical Mining
3.6 Summary
Exercises
Bibliographic Notes

Chapter 4 Data Cube Computation and Data Generalization
4.1 Efficient Methods for Data Cube Computation
4.1.1 A Road Map for the Materialization of Different Kinds of Cubes
4.1.2 Multiway Array Aggregation for Full Cube Computation
4.1.3 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
4.1.4 Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure
4.1.5 Precomputing Shell Fragments for Fast High-Dimensional OLAP
4.1.6 Computing Cubes with Complex Iceberg Conditions
4.2 Further Development of Data Cube and OLAP Technology
4.2.1 Discovery-Driven Exploration of Data Cubes
4.2.2 Complex Aggregation at Multiple Granularity: Multifeature Cubes
4.2.3 Constrained Gradient Analysis in Data Cubes
4.3 Attribute-Oriented Induction—An Alternative Method for Data Generalization and Concept Description
4.3.1 Attribute-Oriented Induction for Data Characterization
4.3.2 Efficient Implementation of Attribute-Oriented Induction
4.3.3 Presentation of the Derived Generalization
4.3.4 Mining Class Comparisons: Discriminating between Different Classes
4.3.5 Class Description: Presentation of Both Characterization and Comparison
4.4 Summary
Exercises
Bibliographic Notes

Chapter 5 Mining Frequent Patterns, Associations, and Correlations
5.1 Basic Concepts and a Road Map
5.1.1 Market Basket Analysis: A Motivating Example
5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
5.1.3 Frequent Pattern Mining: A Road Map
5.2 Efficient and Scalable Frequent Itemset Mining Methods
5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
5.2.2 Generating Association Rules from Frequent Itemsets
5.2.3 Improving the Efficiency of Apriori
5.2.4 Mining Frequent Itemsets without Candidate Generation
5.2.5 Mining Frequent Itemsets Using Vertical Data Format
5.2.6 Mining Closed Frequent Itemsets
5.3 Mining Various Kinds of Association Rules
5.3.1 Mining Multilevel Association Rules
5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
5.4 From Association Mining to Correlation Analysis
5.4.1 Strong Rules Are Not Necessarily Interesting: An Example
5.4.2 From Association Analysis to Correlation Analysis
5.5 Constraint-Based Association Mining
5.5.1 Metarule-Guided Mining of Association Rules
5.5.2 Constraint Pushing: Mining Guided by Rule Constraints
5.6 Summary
Exercises
Bibliographic Notes

Chapter 6 Classification and Prediction
6.1 What Is Classification? What Is Prediction?
6.2 Issues Regarding Classification and Prediction
6.2.1 Preparing the Data for Classification and Prediction
6.2.2 Comparing Classification and Prediction Methods
6.3 Classification by Decision Tree Induction
6.3.1 Decision Tree Induction
6.3.2 Attribute Selection Measures
6.3.3 Tree Pruning
6.3.4 Scalability and Decision Tree Induction
6.4 Bayesian Classification
6.4.1 Bayes' Theorem
6.4.2 Naïve Bayesian Classification
6.4.3 Bayesian Belief Networks
6.4.4 Training Bayesian Belief Networks
6.5 Rule-Based Classification
6.5.1 Using IF-THEN Rules for Classification
6.5.2 Rule Extraction from a Decision Tree
6.5.3 Rule Induction Using a Sequential Covering Algorithm
6.6 Classification by Backpropagation
6.6.1 A Multilayer Feed-Forward Neural Network
6.6.2 Defining a Network Topology
6.6.3 Backpropagation
6.6.4 Inside the Black Box: Backpropagation and Interpretability
6.7 Support Vector Machines
6.7.1 The Case When the Data Are Linearly Separable
6.7.2 The Case When the Data Are Linearly Inseparable
6.8 Associative Classification: Classification by Association Rule Analysis
6.9 Lazy Learners (or Learning from Your Neighbors)
6.9.1 k-Nearest-Neighbor Classifiers
6.9.2 Case-Based Reasoning
6.10 Other Classification Methods
6.10.1 Genetic Algorithms
6.10.2 Rough Set Approach
6.10.3 Fuzzy Set Approaches
6.11 Prediction
6.11.1 Linear Regression
6.11.2 Nonlinear Regression
6.11.3 Other Regression-Based Methods
6.12 Accuracy and Error Measures
6.12.1 Classifier Accuracy Measures
6.12.2 Predictor Error Measures
6.13 Evaluating the Accuracy of a Classifier or Predictor
6.13.1 Holdout Method and Random Subsampling
6.13.2 Cross-validation
6.13.3 Bootstrap
6.14 Ensemble Methods—Increasing the Accuracy
6.14.1 Bagging
6.14.2 Boosting
6.15 Model Selection
6.15.1 Estimating Confidence Intervals
6.15.2 ROC Curves
6.16 Summary
Exercises
Bibliographic Notes

Chapter 7 Cluster Analysis
7.1 What Is Cluster Analysis?
7.2 Types of Data in Cluster Analysis
7.2.1 Interval-Scaled Variables
7.2.2 Binary Variables
7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
7.2.4 Variables of Mixed Types
7.2.5 Vector Objects
7.3 A Categorization of Major Clustering Methods
7.4 Partitioning Methods
7.4.1 Classical Partitioning Methods: k-Means and k-Medoids
7.4.2 Partitioning Methods in Large Databases: From k-Medoids to CLARANS
7.5 Hierarchical Methods
7.5.1 Agglomerative and Divisive Hierarchical Clustering
7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
7.5.3 ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
7.5.4 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
7.6 Density-Based Methods
7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
7.6.2 OPTICS: Ordering Points to Identify the Clustering Structure
7.6.3 DENCLUE: Clustering Based on Density Distribution Functions
7.7 Grid-Based Methods
7.7.1 STING: STatistical INformation Grid
7.7.2 WaveCluster: Clustering Using Wavelet Transformation
7.8 Model-Based Clustering Methods
7.8.1 Expectation-Maximization
7.8.2 Conceptual Clustering
7.8.3 Neural Network Approach
7.9 Clustering High-Dimensional Data
7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
7.9.3 Frequent Pattern–Based Clustering Methods
7.10 Constraint-Based Cluster Analysis
7.10.1 Clustering with Obstacle Objects
7.10.2 User-Constrained Cluster Analysis
7.10.3 Semi-Supervised Cluster Analysis
7.11 Outlier Analysis
7.11.1 Statistical Distribution-Based Outlier Detection
7.11.2 Distance-Based Outlier Detection
7.11.3 Density-Based Local Outlier Detection
7.11.4 Deviation-Based Outlier Detection
7.12 Summary
Exercises
Bibliographic Notes

Chapter 8 Mining Stream, Time-Series, and Sequence Data
8.1 Mining Data Streams
8.1.1 Methodologies for Stream Data Processing and Stream Data Systems
8.1.2 Stream OLAP and Stream Data Cubes
8.1.3 Frequent-Pattern Mining in Data Streams
8.1.4 Classification of Dynamic Data Streams
8.1.5 Clustering Evolving Data Streams
8.2 Mining Time-Series Data
8.2.1 Trend Analysis
8.2.2 Similarity Search in Time-Series Analysis
8.3 Mining Sequence Patterns in Transactional Databases
8.3.1 Sequential Pattern Mining: Concepts and Primitives
8.3.2 Scalable Methods for Mining Sequential Patterns
8.3.3 Constraint-Based Mining of Sequential Patterns
8.3.4 Periodicity Analysis for Time-Related Sequence Data
8.4 Mining Sequence Patterns in Biological Data
8.4.1 Alignment of Biological Sequences
8.4.2 Hidden Markov Model for Biological Sequence Analysis
8.5 Summary
Exercises
Bibliographic Notes

Chapter 9 Graph Mining, Social Network Analysis, and Multirelational Data Mining
9.1 Graph Mining
9.1.1 Methods for Mining Frequent Subgraphs
9.1.2 Mining Variant and Constrained Substructure Patterns
9.1.3 Applications: Graph Indexing, Similarity Search, Classification, and Clustering
9.2 Social Network Analysis
9.2.1 What Is a Social Network?
9.2.2 Characteristics of Social Networks
9.2.3 Link Mining: Tasks and Challenges
9.2.4 Mining on Social Networks
9.3 Multirelational Data Mining
9.3.1 What Is Multirelational Data Mining?
9.3.2 ILP Approach to Multirelational Classification
9.3.3 Tuple ID Propagation
9.3.4 Multirelational Classification Using Tuple ID Propagation
9.3.5 Multirelational Clustering with User Guidance
9.4 Summary
Exercises
Bibliographic Notes

Chapter 10 Mining Object, Spatial, Multimedia, Text, and Web Data
10.1 Multidimensional Analysis and Descriptive Mining of Complex Data Objects
10.1.1 Generali
