Data Mining - Hs-wismar.de

Transcription

Data MiningThird Edition

This page intentionally left blank

Data MiningPractical Machine LearningTools and TechniquesThird EditionIan H. WittenEibe FrankMark A. HallAMSTERDAM BOSTON HEIDELBERG LONDONNEW YORK OXFORD PARIS SAN DIEGOSAN FRANCISCO SINGAPORE SYDNEY TOKYOMorgan Kaufmann Publishers is an imprint of Elsevier

Morgan Kaufmann Publishers is an imprint of Elsevier30 Corporate Drive, Suite 400, Burlington, MA 01803, USAThis book is printed on acid-free paper.Copyright 2011 Elsevier Inc. All rights reserved.No part of this publication may be reproduced or transmitted in any form or by any means,electronic or mechanical, including photocopying, recording, or any information storageand retrieval system, without permission in writing from the publisher. Details on how toseek permission, further information about the Publisher’s permissions policies and ourarrangements with organizations such as the Copyright Clearance Center and the CopyrightLicensing Agency, can be found at our website: www.elsevier.com/permissions.This book and the individual contributions contained in it are protected under copyrightby the Publisher (other than as may be noted herein).NoticesKnowledge and best practice in this field are constantly changing. As new research andexperience broaden our understanding, changes in research methods, professional practices,or medical treatment may become necessary.Practitioners and researchers must always rely on their own experience and knowledgein evaluating and using any information, methods, compounds, or experiments describedherein. In using such information or methods they should be mindful of their own safetyand the safety of others, including parties for whom they have a professional responsibility.To the fullest extent of the law, neither the Publisher nor the authors, contributors, oreditors, assume any liability for any injury and/or damage to persons or property as amatter of products liability, negligence or otherwise, or from any use or operation of anymethods, products, instructions, or ideas contained in the material herein.Library of Congress Cataloging-in-Publication DataWitten, I. H. (Ian H.)Data mining : practical machine learning tools and techniques.—3rd ed. /Ian H. Witten, Frank Eibe, Mark A. Hall.p. cm.—(The Morgan Kaufmann series in data management systems)ISBN 978-0-12-374856-0 (pbk.)1. Data mining. I. Hall, Mark A. II. Title.QA76.9.D343W58 2011006.3′12—dc222010039827British Library Cataloguing-in-Publication DataA catalogue record for this book is available from the British Library.For information on all Morgan Kaufmann publications, visit ourwebsite at www.mkp.com or www.elsevierdirect.comPrinted in the United States11 12 13 14 15 10 9 8 7 6 5 4 3 2 1Working together to growlibraries in developing countrieswww.elsevier.com www.bookaid.org www.sabre.org

ContentsLIST OF FIGURES. xvLIST OF TABLES.xixPREFACE.xxiUpdated and Revised Content. xxvSecond Edition. xxvThird Edition.xxviACKNOWLEDGMENTS.xxixABOUT THE AUTHORS.xxxiiiPART IINTRODUCTION TO DATA MININGCHAPTER 1 What’s It All About?. 31.11.21.31.41.51.61.7Data Mining and Machine Learning. 3Describing Structural Patterns. 5Machine Learning. 7Data Mining. 8Simple Examples: The Weather Problem and Others. 9The Weather Problem. 9Contact Lenses: An Idealized Problem. 12Irises: A Classic Numeric Dataset. 13CPU Performance: Introducing Numeric Prediction.15Labor Negotiations: A More Realistic Example. 15Soybean Classification: A Classic Machine Learning Success. 19Fielded Applications. 21Web Mining.21Decisions Involving Judgment. 22Screening Images. 23Load Forecasting. 24Diagnosis. 25Marketing and Sales. 26Other Applications. 27Machine Learning and Statistics. 28Generalization as Search . 29Data Mining and Ethics. 33Reidentification. 33Using Personal Information. 34Wider Issues. 35Further Reading. 36v

viContentsCHAPTER 2 Input: Concepts, Instances, and Attributes. 392.12.22.32.42.5What’s a Concept?. 40What’s in an Example?. 42Relations. 43Other Example Types. 46What’s in an Attribute?. 49Preparing the Input. 51Gathering the Data Together. 51ARFF Format. 52Sparse Data. 56Attribute Types. 56Missing Values. 58Inaccurate Values. 59Getting to Know Your Data. 60Further Reading. 60CHAPTER 3 Output: Knowledge Representation. 613.13.23.33.43.53.63.7Tables. 61Linear Models. 62Trees. 64Rules. 67Classification Rules. 69Association Rules. 72Rules with Exceptions. 73More Expressive Rules. 75Instance-Based Representation. 78Clusters. 81Further Reading. 83CHAPTER 4 Algorithms: The Basic Methods. 854.14.24.3Inferring Rudimentary Rules. 86Missing Values and Numeric Attributes. 87Discussion. 89Statistical Modeling. 90Missing Values and Numeric Attributes . . 94Naïve Bayes for Document Classification.97Discussion. 99Divide-and-Conquer: Constructing Decision Trees. 99Calculating Information. 103Highly Branching Attributes. 105Discussion. 107

Contents 4.4Covering Algorithms: Constructing Rules. 108Rules versus Trees. 109A Simple Covering Algorithm. 110Rules versus Decision Lists. 1154.5 Mining Association Rules. 116Item Sets. 116Association Rules. 119Generating Rules Efficiently. 122Discussion. 1234.6 Linear Models. 124Numeric Prediction: Linear Regression. 124Linear Classification: Logistic Regression. 125Linear Classification Using the Perceptron. 127Linear Classification Using Winnow. 1294.7 Instance-Based Learning. 131Distance Function. 131Finding Nearest Neighbors Efficiently. 132Discussion. 1374.8 Clustering. 138Iterative Distance-Based Clustering. 139Faster Distance Calculations. 139Discussion. 1414.9 Multi-Instance Learning. 141Aggregating the Input. 142Aggregating the Output. 142Discussion. 1424.10 Further Reading. 1434.11 Weka Implementations. 145CHAPTER 5 Credibility: Evaluating What’s Been Learned. 1475.15.25.35.45.55.6Training and Testing. 148Predicting Performance. 150Cross-Validation. 152Other Estimates. 154Leave-One-Out Cross-Validation. 154The Bootstrap. 155Comparing Data Mining Schemes. 156Predicting Probabilities. 159Quadratic Loss Function. 160Informational Loss Function. 161Discussion. 162vii

viiiContents5.7Counting the Cost. 163Cost-Sensitive Classification. 166Cost-Sensitive Learning. 167Lift Charts. 168ROC Curves.

Data mining : practical machine learning tools and techniques.—3rd ed. / Ian H. Witten, Frank Eibe, Mark A. Hall. p. cm.—(The Morgan Kaufmann series in data management systems) ISBN 978-0-12-374856-0 (pbk.) 1. Data mining. I. Hall, Mark A. II. Title. QA76.9.D343W58 2011 006.3′12—dc22 2010039827 British Library Cataloguing-in-Publication .