Data Mining - Elsevier


Data Mining
Third Edition

The Morgan Kaufmann Series in Data Management Systems (Selected Titles)

Joe Celko’s Data, Measurements, and Standards in SQL
  Joe Celko
Information Modeling and Relational Databases, 2nd Edition
  Terry Halpin, Tony Morgan
Joe Celko’s Thinking in Sets
  Joe Celko
Business Metadata
  Bill Inmon, Bonnie O’Neil, Lowell Fryman
Unleashing Web 2.0
  Gottfried Vossen, Stephan Hagemann
Enterprise Knowledge Management
  David Loshin
The Practitioner’s Guide to Data Quality Improvement
  David Loshin
Business Process Change, 2nd Edition
  Paul Harmon
IT Manager’s Handbook, 2nd Edition
  Bill Holtsnider, Brian Jaffe
Joe Celko’s Puzzles and Answers, 2nd Edition
  Joe Celko
Architecture and Patterns for IT Service Management, 2nd Edition, Resource Planning and Governance
  Charles Betz
Joe Celko’s Analytics and OLAP in SQL
  Joe Celko
Data Preparation for Data Mining Using SAS
  Mamdouh Refaat
Querying XML: XQuery, XPath, and SQL/XML in Context
  Jim Melton, Stephen Buxton
Data Mining: Concepts and Techniques, 3rd Edition
  Jiawei Han, Micheline Kamber, Jian Pei
Database Modeling and Design: Logical Design, 5th Edition
  Toby J. Teorey, Sam S. Lightstone, Thomas P. Nadeau, H. V. Jagadish
Foundations of Multidimensional and Metric Data Structures
  Hanan Samet
Joe Celko’s SQL for Smarties: Advanced SQL Programming, 4th Edition
  Joe Celko
Moving Objects Databases
  Ralf Hartmut Güting, Markus Schneider
Joe Celko’s SQL Programming Style
  Joe Celko
Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration
  Earl Cox

Data Modeling Essentials, 3rd Edition
  Graeme C. Simsion, Graham C. Witt
Developing High Quality Data Models
  Matthew West
Location-Based Services
  Jochen Schiller, Agnes Voisard
Managing Time in Relational Databases: How to Design, Update, and Query Temporal Data
  Tom Johnston, Randall Weis
Database Modeling with Microsoft® Visio for Enterprise Architects
  Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean
Designing Data-Intensive Web Applications
  Stephano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, Maristella Matera
Mining the Web: Discovering Knowledge from Hypertext Data
  Soumen Chakrabarti
Advanced SQL:1999 - Understanding Object-Relational and Other Advanced Features
  Jim Melton
Database Tuning: Principles, Experiments, and Troubleshooting Techniques
  Dennis Shasha, Philippe Bonnet
SQL:1999 - Understanding Relational Language Components
  Jim Melton, Alan R. Simon
Information Visualization in Data Mining and Knowledge Discovery
  Edited by Usama Fayyad, Georges G. Grinstein, Andreas Wierse
Transactional Information Systems
  Gerhard Weikum, Gottfried Vossen
Spatial Databases
  Philippe Rigaux, Michel Scholl, and Agnes Voisard
Managing Reference Data in Enterprise Databases
  Malcolm Chisholm
Understanding SQL and Java Together
  Jim Melton, Andrew Eisenberg
Database: Principles, Programming, and Performance, 2nd Edition
  Patrick and Elizabeth O’Neil
The Object Data Standard
  Edited by R. G. G. Cattell, Douglas Barry
Data on the Web: From Relations to Semistructured Data and XML
  Serge Abiteboul, Peter Buneman, Dan Suciu
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 3rd Edition
  Ian Witten, Eibe Frank, Mark A. Hall
Joe Celko’s Data and Databases: Concepts in Practice
  Joe Celko
Developing Time-Oriented Database Applications in SQL
  Richard T. Snodgrass
Web Farming for the Data Warehouse
  Richard D. Hackathorn

Management of Heterogeneous and Autonomous Database Systems
  Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, Amit Sheth
Object-Relational DBMSs, 2nd Edition
  Michael Stonebraker, Paul Brown, with Dorothy Moore
Universal Database Management: A Guide to Object/Relational Technology
  Cynthia Maro Saracco
Readings in Database Systems, 3rd Edition
  Edited by Michael Stonebraker, Joseph M. Hellerstein
Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM
  Jim Melton
Principles of Multimedia Database Systems
  V. S. Subrahmanian
Principles of Database Query Processing for Advanced Applications
  Clement T. Yu, Weiyi Meng
Advanced Database Systems
  Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, Roberto Zicari
Principles of Transaction Processing, 2nd Edition
  Philip A. Bernstein, Eric Newcomer
Using the New DB2: IBM’s Object-Relational Database System
  Don Chamberlin
Distributed Algorithms
  Nancy A. Lynch
Active Database Systems: Triggers and Rules for Advanced Database Processing
  Edited by Jennifer Widom, Stefano Ceri
Migrating Legacy Systems: Gateways, Interfaces, and the Incremental Approach
  Michael L. Brodie, Michael Stonebraker
Atomic Transactions
  Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete
Query Processing for Advanced Database Systems
  Edited by Johann Christoph Freytag, David Maier, Gottfried Vossen
Transaction Processing
  Jim Gray, Andreas Reuter
Database Transaction Models for Advanced Applications
  Edited by Ahmed K. Elmagarmid
A Guide to Developing Client/Server SQL Applications
  Setrag Khoshafian, Arvola Chan, Anna Wong, Harry K. T. Wong

Data Mining
Concepts and Techniques
Third Edition

Jiawei Han
University of Illinois at Urbana–Champaign

Micheline Kamber

Jian Pei
Simon Fraser University

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

Morgan Kaufmann Publishers is an imprint of Elsevier.
225 Wyman Street, Waltham, MA 02451, USA

© 2012 by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data

Han, Jiawei.
Data mining : concepts and techniques / Jiawei Han, Micheline Kamber, Jian Pei. – 3rd ed.
p. cm.
ISBN 978-0-12-381479-1
1. Data mining. I. Kamber, Micheline. II. Pei, Jian. III. Title.
QA76.9.D343H36 2011
006.3′12–dc22
2011010635

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library.

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com

Printed in the United States of America
11 12 13 14 15    10 9 8 7 6 5 4 3 2 1

To Y. Dora and Lawrence for your love and encouragement
J.H.

To Erik, Kevan, Kian, and Mikael for your love and inspiration
M.K.

To my wife, Jennifer, and daughter, Jacqueline
J.P.

Contents

Foreword
Foreword to Second Edition
Preface
Acknowledgments
About the Authors

Chapter 1 Introduction
1.1 Why Data Mining?
  1.1.1 Moving toward the Information Age
  1.1.2 Data Mining as the Evolution of Information Technology
1.2 What Is Data Mining?
1.3 What Kinds of Data Can Be Mined?
  1.3.1 Database Data
  1.3.2 Data Warehouses
  1.3.3 Transactional Data
  1.3.4 Other Kinds of Data
1.4 What Kinds of Patterns Can Be Mined?
  1.4.1 Class/Concept Description: Characterization and Discrimination
  1.4.2 Mining Frequent Patterns, Associations, and Correlations
  1.4.3 Classification and Regression for Predictive Analysis
  1.4.4 Cluster Analysis
  1.4.5 Outlier Analysis
  1.4.6 Are All Patterns Interesting?
1.5 Which Technologies Are Used?
  1.5.1 Statistics
  1.5.2 Machine Learning
  1.5.3 Database Systems and Data Warehouses
  1.5.4 Information Retrieval
1.6 Which Kinds of Applications Are Targeted?
  1.6.1 Business Intelligence
  1.6.2 Web Search Engines
1.7 Major Issues in Data Mining
  1.7.1 Mining Methodology
  1.7.2 User Interaction
  1.7.3 Efficiency and Scalability
  1.7.4 Diversity of Database Types
  1.7.5 Data Mining and Society
1.8 Summary
1.9 Exercises
1.10 Bibliographic Notes

Chapter 2 Getting to Know Your Data
2.1 Data Objects and Attribute Types
  2.1.1 What Is an Attribute?
  2.1.2 Nominal Attributes
  2.1.3 Binary Attributes
  2.1.4 Ordinal Attributes
  2.1.5 Numeric Attributes
  2.1.6 Discrete versus Continuous Attributes
2.2 Basic Statistical Descriptions of Data
  2.2.1 Measuring the Central Tendency: Mean, Median, and Mode
  2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
  2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
2.3 Data Visualization
  2.3.1 Pixel-Oriented Visualization Techniques
  2.3.2 Geometric Projection Visualization Techniques
  2.3.3 Icon-Based Visualization Techniques
  2.3.4 Hierarchical Visualization Techniques
  2.3.5 Visualizing Complex Data and Relations
2.4 Measuring Data Similarity and Dissimilarity
  2.4.1 Data Matrix versus Dissimilarity Matrix
  2.4.2 Proximity Measures for Nominal Attributes
  2.4.3 Proximity Measures for Binary Attributes
  2.4.4 Dissimilarity of Numeric Data: Minkowski Distance
  2.4.5 Proximity Measures for Ordinal Attributes
  2.4.6 Dissimilarity for Attributes of Mixed Types
  2.4.7 Cosine Similarity
2.5 Summary
2.6 Exercises
2.7 Bibliographic Notes

Chapter 3 Data Preprocessing
3.1 Data Preprocessing: An Overview
  3.1.1 Data Quality: Why Preprocess the Data?
  3.1.2 Major Tasks in Data Preprocessing
3.2 Data Cleaning
  3.2.1 Missing Values
  3.2.2 Noisy Data
  3.2.3 Data Cleaning as a Process
3.3 Data Integration
  3.3.1 Entity Identification Problem
  3.3.2 Redundancy and Correlation Analysis
  3.3.3 Tuple Duplication
  3.3.4 Data Value Conflict Detection and Resolution
3.4 Data Reduction
  3.4.1 Overview of Data Reduction Strategies
  3.4.2 Wavelet Transforms
  3.4.3 Principal Components Analysis
  3.4.4 Attribute Subset Selection
  3.4.5 Regression and Log-Linear Models: Parametric Data Reduction
  3.4.6 Histograms
  3.4.7 Clustering
  3.4.8 Sampling
  3.4.9 Data Cube Aggregation
3.5 Data Transformation and Data Discretization
  3.5.1 Data Transformation Strategies Overview
  3.5.2 Data Transformation by Normalization
  3.5.3 Discretization by Binning
  3.5.4 Discretization by Histogram Analysis
  3.5.5 Discretization by Cluster, Decision Tree, and Correlation Analyses
  3.5.6 Concept Hierarchy Generation for Nominal Data
3.6 Summary
3.7 Exercises
3.8 Bibliographic Notes

Chapter 4 Data Warehousing and Online Analytical Processing
4.1 Data Warehouse: Basic Concepts
  4.1.1 What Is a Data Warehouse?
  4.1.2 Differences between Operational Database Systems and Data Warehouses
  4.1.3 But, Why Have a Separate Data Warehouse?
  4.1.4 Data Warehousing: A Multitiered Architecture
  4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
  4.1.6 Extraction, Transformation, and Loading
  4.1.7 Metadata Repository
4.2 Data Warehouse Modeling: Data Cube and OLAP
  4.2.1 Data Cube: A Multidimensional Data Model
  4.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
  4.2.3 Dimensions: The Role of Concept Hierarchies
  4.2.4 Measures: Their Categorization and Computation
  4.2.5 Typical OLAP Operations
  4.2.6 A Starnet Query Model for Querying Multidimensional Databases
4.3 Data Warehouse Design and Usage
  4.3.1 A Business Analysis Framework for Data Warehouse Design
  4.3.2 Data Warehouse Design Process
  4.3.3 Data Warehouse Usage for Information Processing
  4.3.4 From Online Analytical Processing to Multidimensional Data Mining
4.4 Data Warehouse Implementation
  4.4.1 Efficient Data Cube Computation: An Overview
  4.4.2 Indexing OLAP Data: Bitmap Index and Join Index
  4.4.3 Efficient Processing of OLAP Queries
  4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
4.5 Data Generalization by Attribute-Oriented Induction
  4.5.1 Attribute-Oriented Induction for Data Characterization
  4.5.2 Efficient Implementation of Attribute-Oriented Induction
  4.5.3 Attribute-Oriented Induction for Class Comparisons
4.6 Summary
4.7 Exercises
4.8 Bibliographic Notes

Chapter 5 Data Cube Technology
5.1 Data Cube Computation: Preliminary Concepts
  5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell
  5.1.2 General Strategies for Data Cube Computation
5.2 Data Cube Computation Methods
  5.2.1 Multiway Array Aggregation for Full Cube Computation
  5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
  5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
  5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP
5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
  5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data
  5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries
5.4 Multidimensional Data Analysis in Cube Space
  5.4.1 Prediction Cubes: Prediction Mining in Cube Space
  5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities
  5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration
5.5 Summary
5.6 Exercises
5.7 Bibliographic Notes

Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
6.1 Basic Concepts
  6.1.1 Market Basket Analysis: A Motivating Example
  6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
6.2 Frequent Itemset Mining Methods
  6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
  6.2.2 Generating Association Rules from Frequent Itemsets
  6.2.3 Improving the Efficiency of Apriori
  6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
  6.2.5 Mining Frequent Itemsets Using Vertical Data Format
  6.2.6 Mining Closed and Max Patterns
6.3 Which Patterns Are Interesting? Pattern Evaluation Methods
  6.3.1 Strong Rules Are Not Necessarily Interesting
  6.3.2 From Association Analysis to Correlation Analysis
  6.3.3 A Comparison of Pattern Evaluation Measures
6.4 Summary
6.5 Exercises
6.6 Bibliographic Notes

Chapter 7 Advanced Pattern Mining
7.1 Pattern Mining: A Road Map
7.2 Pattern Mining in Multilevel, Multidimensional Space
  7.2.1 Mining Multilevel Associations
  7.2.2 Mining Multidimensional Associations
  7.2.3 Mining Quantitative Association Rules
  7.2.4 Mining Rare Patterns and Negative Patterns
7.3 Constraint-Based Frequent Pattern Mining
  7.3.1 Metarule-Guided Mining of Association Rules
  7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space
7.4 Mining High-Dimensional Data and Colossal Patterns
  7.4.1 Mining Colossal Patterns by Pattern-Fusion
7.5 Mining Compressed or Approximate Patterns
  7.5.1 Mining Compressed Patterns by Pattern Clustering
  7.5.2 Extracting Redundancy-Aware Top-k Patterns
7.6 Pattern Exploration and Application
  7.6.1 Semantic Annotation of Frequent Patterns
  7.6.2 Applications of Pattern Mining
7.7 Summary
7.8 Exercises
7.9 Bibliographic Notes

Chapter 8 Classification: Basic Concepts
8.1 Basic Concepts
  8.1.1 What Is Classification?
  8.1.2 General Approach to Classification
8.2 Decision Tree Induction
  8.2.1 Decision Tree Induction
  8.2.2 Attribute Selection Measures
  8.2.3 Tree Pruning
  8.2.4 Scalability and Decision Tree Induction
  8.2.5 Visual Mining for Decision Tree Induction
8.3 Bayes Classification Methods
  8.3.1 Bayes’ Theorem
  8.3.2 Naïve Bayesian Classification
8.4 Rule-Based Classification
  8.4.1 Using IF-THEN Rules for Classification
  8.4.2 Rule Extraction from a Decision Tree
  8.4.3 Rule Induction Using a Sequential Covering Algorithm
8.5 Model Evaluation and Selection
  8.5.1 Metrics for Evaluating Classifier Performance
  8.5.2 Holdout Method and Random Subsampling
  8.5.3 Cross-Validation
  8.5.4 Bootstrap
  8.5.5 Model Selection Using Statistical Tests of Significance
  8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC Curves
8.6 Techniques to Improve Classification Accuracy
  8.6.1 Introducing Ensemble Methods
  8.6.2 Bagging
  8.6.3 Boosting and AdaBoost
  8.6.4 Random Forests
  8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
8.7 Summary
8.8 Exercises
8.9 Bibliographic Notes

Chapter 9 Classification: Advanced Methods
9.1 Bayesian Belief Networks
  9.1.1 Concepts and Mechanisms
  9.1.2 Training Bayesian Belief Networks
9.2 Classification by Backpropagation
  9.2.1 A Multilayer Feed-Forward Neural Network
  9.2.2 Defining a Network Topology
  9.2.3 Backpropagation
  9.2.4 Inside the Black Box: Backpropagation and Interpretability
9.3 Support Vector Machines
  9.3.1 The Case When the Data Are Linearly Separable
  9.3.2 The Case When the Data Are Linearly Inseparable
9.4 Classification Using Frequent Patterns
  9.4.1 Associative Classification
  9.4.2 Discriminative Frequent Pattern–Based Classification
9.5 Lazy Learners (or Learning from Your Neighbors)
  9.5.1 k-Nearest-Neighbor Classifiers
  9.5.2 Case-Based Reasoning
9.6 Other Classification Methods
  9.6.1 Genetic Algorithms
  9.6.2 Rough Set Approach
  9.6.3 Fuzzy Set Approaches
9.7 Additional Topics Regarding Classification
  9.7.1 Multiclass Classification
  9.7.2 Semi-Supervised Classification
  9.7.3 Active Learning
  9.7.4 Transfer Learning
9.8 Summary
9.9 Exercises
9.10 Bibliographic Notes

Chapter 10 Cluster Analysis: Basic Concepts and Methods
10.1 Cluster Analysis
  10.1.1 What Is Cluster Analysis?
  10.1.2 Requirements for Cluster Analysis
  10.1.3 Overview of Basic Clustering Methods
10.2 Partitioning Methods
  10.2.1 k-Means: A Centroid-Based Technique
  10.2.2 k-Medoids: A Representative Object-Based Technique
10.3 Hierarchical Methods
  10.3.1 Agglomerative versus Divisive Hierarchical Clustering
  10.3.2 Distance Measures in Algorithmic Methods
  10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
  10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
  10.3.5 Probabilistic Hierarchical Clustering
10.4 Density-Based Methods
  10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density
  10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure
  10.4.3 DENCLUE: Clustering Based on Density Distribution Functions
10.5 Grid-Based Methods
  10.5.1 STING: STatistical INformation Grid
  10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method
10.6 Evaluation of Clustering
  10.6.1 Assessing Clustering Tendency
  10.6.2 Determining the Number of Clusters
  10.6.3 Measuring Clustering Quality
10.7 Summary
10.8 Exercises
10.9 Bibliographic Notes

Chapter 11 Advanced Cluster Analysis
11.1 Probabilistic Model-Based Clustering
  11.1.1 Fuzzy Clusters
  11.1.2 Probabilistic Model-Based Clusters
  11.1.3 Expectation-Maximization Algorithm
11.2 Clustering High-Dimensional Data
  11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies
  11.2.2 Subspace Clustering Methods
  11.2.3 Biclustering
  11.2.4 Dimensionality Reduction Methods and Spectral Clustering
11.3 Clustering Graph and Network Data
  11.3.1 Applications and Challenges
  11.3.2 Similarity Measures
  11.3.3 Graph Clustering Methods
11.4 Clustering with Constraints
  11.4.1 Categorization of Constraints
  11.4.2 Methods for Clustering with Constraints
11.5 Summary
11.6 Exercises
11.7 Bibliographic Notes

Chapter 12 Outlier Detection
12.1 Outliers and Outlier Analysis
  12.1.1 What Are Outliers?
  12.1.2 Types of Outliers
  12.1.3 Challenges of Outlier Detection
12.2 Outlier Detection Methods
  12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods
  12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
12.3 Statistical Approaches
  12.3.1 Parametric Methods
  12.3.2 Nonparametric Methods
12.4 Proximity-Based Approaches
  12.4.1 Distance-Based Outlier Detection and a Nested Loop Method
  12.4.2 A Grid-Based Method
  12.4.3 Density-Based Outlier Detection
12.5 Clustering-Based Approaches
12.6 Classification-Based Approaches
12.7 Mining Contextual and Collective Outliers
  12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection
  12.7.2 Modeling Normal Behavior with Respect to Contexts
  12.7.3 Mining Collective Outliers
12.8 Outlier Detection in High-Dimensional Data
  12.8.1 Extending Conventional Outlier Detection
  12.8.2 Finding Outliers in Subspaces
  12.8.3 Modeling High-Dimensional Outliers
12.9 Summary
12.10 Exercises
12.11 Bibliographic Notes

Chapter 13 Data Mining Trends and Research Frontiers
13.1 Mining Complex Data Types
  13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences
  13.1.2 Mining Graphs and Networks
  13.1.3 Mining Other Kinds of Data
13.2 Other Methodologies of Data Mining
  13.2.1 Statistical Data Mining
  13.2.2 Views on Data Mining Foundations
  13.2.3 Visual and Audio Data Mining
13.3 Data Mining Applications
  13.3.1 Data Mining for Financial Data Analysis
  13.3.2 Data Mining for Retail and Telecommunication Industries
  13.3.3 Data Mining in Science and Engineering
  13.3.4 Data Mining for Intrusion Detection and Prevention
  13.3.5 Data Mining and Recommender Systems
13.4 Data Mining and Society
  13.4.1 Ubiquitous and Invisible Data Mining
  13.4.2 Privacy, Security, and Social Impacts of Data Mining
13.5 Data Mining Trends
13.6 Summary
13.7 Exercises
13.8 Bibliographic Notes

Bibliography
Index

Foreword

Analyzing large amounts of data is a necessity. Even popular science books, like “supercrunchers,” give compelling cases where large amounts of data yield discoveries and intuitions that surprise even experts. Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with cybersecurity and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and many more. Storage is inexpensive and getting even less so, as are data sensors. Thus, collecting and storing data is easier than ever before.

The problem then becomes how to analyze the data. This is exactly the focus of this Third Edition of the book. Jiawei, Micheline, and Jian give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines).

The exposition is extremely accessible to beginners and advanced readers alike. The book gives the fundamental material first and the more advanced material in follow-up chapters. It also has numerous rhetorical questions, which I found extremely helpful for maintaining focus.

We have used the first two editions as textbooks in data mining courses at Carnegie Mellon and plan to continue to do so with this Third Edition. The new version has significant additions: Notably, it has more than 100 citations to works from 2006 onward, focusing on more recent material such as graphs and social networks, sensor networks, and outlier detection. This book has a new section for visualization, has expanded outlier detection into a whole chapter, and has separate chapters for advanced methods, for example, pattern mining with top-k patterns and more, and clustering methods with biclustering and graph clustering.

Overall, it is an excellent book on classic and modern data mining methods, and it is ideal not only for teaching but also as a reference book.

Christos Faloutsos
Carnegie Mellon University

Foreword to Second Edition

We are deluged by data: scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades.

Six years ago, Jiawei Han’s and Micheline Kamber’s seminal textbook organized and presented Data Mining. It heralded a golden age of innovation in the field. This revision of their book reflects that progress; more than half of the references and historical notes are to recent work. The field has matured with many new and improved algorithms, and has broadened to include many more datatypes: streams, sequences, graphs, time-series, geospatial, audio, images, and video. We are certainly not at the end of the golden age (indeed, research and commercial interest in data mining continues to grow), but we are all fortunate to have this modern compendium.

The book gives quick introductions to database and data mining concepts with particular emphasis on data analysis. It then covers in a chapter-by-chapter tour the concepts and techniques that underlie classification, prediction, association, and clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and with pragmatic rules of thumb about when to apply each technique. The Socratic presentation style is both very readable and very informative. I certainly learned a lot from reading the first edition and got re-educated and updated in reading the second edition.

Jiawei Han and Micheline Kamber have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed on the field. The field is evolving very rapidly, but this book is a quick way to learn the basic ideas, and to understand where the field is today. I found it very informative and stimulating, and believe you will too.

Jim Gray
In his memory

Preface

The computerization of our society has substantially enhanced our capabilities for both generating and collecting data from diverse sources. A tremendous amount of data has flooded almost every aspect of our lives. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge. This has led to the generation of a promising and flourishing frontier in computer science called data mining, and its various applications. Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.

This book explores the concepts and techniques of knowledge discovery and data mining. As a multidisciplinary field, data mining draws on work from areas including statistics, machine learning, pattern recognition, database technology, information retrieval, network science, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We focus on issues relating to the feasibility, usefulness, effectiveness, and scalability of techniques for the discovery of patterns hidden in large data sets. As a result, this book is not intended as an introduction to statistics, machine learning, database systems, or other such areas, although we do provide some background knowledge to facilitate the reader’s comprehension of their respective roles in data mining. Rather, the book is a comprehensive introduction to data mining. It is useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines previously listed.

Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium. This book presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this book was the need to build an organized framework for the study of data mining, a challenging task, owing to the extensive multidiscip
