Exploratory Data Analysis With MATLAB

Computer Science and Data Analysis Series

Exploratory Data Analysis with MATLAB

© 2005 by CRC Press LLC

Chapman & Hall/CRC
Series in Computer Science and Data Analysis

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical and probabilistic methods by publishing a broad range of reference works, textbooks and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Queen’s University Belfast
Padhraic Smyth, University of California Irvine

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC Press UK
23-25 Blades Court
London SW15 2NU
UK

Published Titles
Bayesian Artificial Intelligence
    Kevin B. Korb and Ann E. Nicholson
Exploratory Data Analysis with MATLAB
    Wendy L. Martinez and Angel R. Martinez

Forthcoming Titles
Correspondence Analysis and Data Coding with JAVA and R
    Fionn Murtagh
R Graphics
    Paul Murrell
Nonlinear Dimensionality Reduction
    Vin de Silva and Carrie Grimes

Computer Science and Data Analysis Series

Exploratory Data Analysis with MATLAB

Wendy L. Martinez
Angel R. Martinez

CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton London New York Washington, D.C.

Library of Congress Cataloging-in-Publication Data

Martinez, Wendy L.
Exploratory data analysis with MATLAB / Wendy L. Martinez, Angel R. Martinez.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-366-9 (alk. paper)
1. Multivariate analysis. 2. MATLAB. 3. Mathematical statistics. I. Martinez, Angel R. II. Title.
QA278.M3735 2004
519.5'35--dc22
2004058245

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press for such copying.

Direct all inquiries to CRC Press, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2005 by Chapman & Hall/CRC Press
No claim to original U.S. Government works
International Standard Book Number 1-58488-366-9
Library of Congress Card Number 2004058245
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper

This book is dedicated to our children:

Angel and Ochida
Deborah and Nataniel
Jeff and Lynn
and
Lisa (Principessa)

Table of Contents

Preface

Part I
Introduction to Exploratory Data Analysis

Chapter 1
Introduction to Exploratory Data Analysis
1.1 What is Exploratory Data Analysis
1.2 Overview of the Text
1.3 A Few Words About Notation
1.4 Data Sets Used in the Book
    1.4.1 Unstructured Text Documents
    1.4.2 Gene Expression Data
    1.4.3 Oronsay Data Set
    1.4.4 Software Inspection
1.5 Transforming Data
    1.5.1 Power Transformations
    1.5.2 Standardization
    1.5.3 Sphering the Data
1.6 Further Reading
Exercises

Part II
EDA as Pattern Discovery

Chapter 2
Dimensionality Reduction - Linear Methods
2.1 Introduction
2.2 Principal Component Analysis - PCA
    2.2.1 PCA Using the Sample Covariance Matrix
    2.2.2 PCA Using the Sample Correlation Matrix
    2.2.3 How Many Dimensions Should We Keep?
2.3 Singular Value Decomposition - SVD
2.4 Factor Analysis
2.5 Intrinsic Dimensionality
2.6 Summary and Further Reading
Exercises

Chapter 3
Dimensionality Reduction - Nonlinear Methods
3.1 Multidimensional Scaling - MDS
    3.1.1 Metric MDS
    3.1.2 Nonmetric MDS
3.2 Manifold Learning
    3.2.1 Locally Linear Embedding
    3.2.2 Isometric Feature Mapping - ISOMAP
    3.2.3 Hessian Eigenmaps
3.3 Artificial Neural Network Approaches
    3.3.1 Self-Organizing Maps - SOM
    3.3.2 Generative Topographic Maps - GTM
3.4 Summary and Further Reading
Exercises

Chapter 4
Data Tours
4.1 Grand Tour
    4.1.1 Torus Winding Method
    4.1.2 Pseudo Grand Tour
4.2 Interpolation Tours
4.3 Projection Pursuit
4.4 Projection Pursuit Indexes
    4.4.1 Posse Chi-Square Index
    4.4.2 Moment Index
4.5 Summary and Further Reading
Exercises

Chapter 5
Finding Clusters
5.1 Introduction
5.2 Hierarchical Methods
5.3 Optimization Methods - k-Means
5.4 Evaluating the Clusters
    5.4.1 Rand Index
    5.4.2 Cophenetic Correlation
    5.4.3 Upper Tail Rule
    5.4.4 Silhouette Plot
    5.4.5 Gap Statistic
5.5 Summary and Further Reading
Exercises

Chapter 6
Model-Based Clustering
6.1 Overview of Model-Based Clustering
6.2 Finite Mixtures
    6.2.1 Multivariate Finite Mixtures
    6.2.2 Component Models - Constraining the Covariances
6.3 Expectation-Maximization Algorithm
6.4 Hierarchical Agglomerative Model-Based Clustering
6.5 Model-Based Clustering
6.6 Generating Random Variables from a Mixture Model
6.7 Summary and Further Reading
Exercises

Chapter 7
Smoothing Scatterplots
7.1 Introduction
7.2 Loess
7.3 Robust Loess
7.4 Residuals and Diagnostics
    7.4.1 Residual Plots
    7.4.2 Spread Smooth
    7.4.3 Loess Envelopes - Upper and Lower Smooths
7.5 Bivariate Distribution Smooths
    7.5.1 Pairs of Middle Smoothings
    7.5.2 Polar Smoothing
7.6 Curve Fitting Toolbox
7.7 Summary and Further Reading
Exercises

Part III
Graphical Methods for EDA

Chapter 8
Visualizing Clusters
8.1 Dendrogram
8.2 Treemaps
8.3 Rectangle Plots
8.4 ReClus Plots
8.5 Data Image
8.6 Summary and Further Reading
Exercises

Chapter 9
Distribution Shapes
9.1 Histograms
    9.1.1 Univariate Histograms
    9.1.2 Bivariate Histograms
9.2 Boxplots
    9.2.1 The Basic Boxplot
    9.2.2 Variations of the Basic Boxplot
9.3 Quantile Plots
    9.3.1 Probability Plots
    9.3.2 Quantile-quantile Plot
    9.3.3 Quantile Plot
9.4 Bagplots
9.5 Summary and Further Reading
Exercises

Chapter 10
Multivariate Visualization
10.1 Glyph Plots
10.2 Scatterplots
    10.2.1 2-D and 3-D Scatterplots
    10.2.2 Scatterplot Matrices
    10.2.3 Scatterplots with Hexagonal Binning
10.3 Dynamic Graphics
    10.3.1 Identification of Data
    10.3.2 Linking
    10.3.3 Brushing
10.4 Coplots
10.5 Dot Charts
    10.5.1 Basic Dot Chart
    10.5.2 Multiway Dot Chart
10.6 Plotting Points as Curves
    10.6.1 Parallel Coordinate Plots
    10.6.2 Andrews’ Curves
    10.6.3 More Plot Matrices
10.7 Data Tours Revisited
    10.7.1 Grand Tour
    10.7.2 Permutation Tour
10.8 Summary and Further Reading
Exercises

Appendix A
Proximity Measures
A.1 Definitions
    A.1.1 Dissimilarities
    A.1.2 Similarity Measures
    A.1.3 Similarity Measures for Binary Data
    A.1.4 Dissimilarities for Probability Density Functions
A.2 Transformations
A.3 Further Reading

Appendix B
Software Resources for EDA
B.1 MATLAB Programs
B.2 Other Programs for EDA
B.3 EDA Toolbox

Appendix C
Description of Data Sets

Appendix D
Introduction to MATLAB
D.1 What Is MATLAB?
D.2 Getting Help in MATLAB
D.3 File and Workspace Management
D.4 Punctuation in MATLAB
D.5 Arithmetic Operators
D.6 Data Constructs in MATLAB
    Basic Data Constructs
    Building Arrays
    Cell Arrays
    Structures
D.7 Script Files and Functions
D.8 Control Flow
    for Loop
    while Loop
    if-else Statements
    switch Statement
D.9 Simple Plotting
D.10 Where to get MATLAB Information

Appendix E
MATLAB Functions
E.1 MATLAB
E.2 Statistics Toolbox - Versions 4 and 5
E.3 Exploratory Data Analysis Toolbox

References

Preface

One of the goals of our first book, Computational Statistics Handbook with MATLAB [2002], was to show some of the key concepts and methods of computational statistics and how they can be implemented in MATLAB.¹ A core component of computational statistics is the discipline known as exploratory data analysis or EDA. Thus, we see this book as a complement to the first one with similar goals: to make exploratory data analysis techniques available to a wide range of users.

Exploratory data analysis is an area of statistics and data analysis, where the idea is to first explore the data set, often using methods from descriptive statistics, scientific visualization, data tours, dimensionality reduction, and others. This exploration is done without any (hopefully!) pre-conceived notions or hypotheses. Indeed, the idea is to use the results of the exploration to guide and to develop the subsequent hypothesis tests, models, etc. It is closely related to the field of data mining, and many of the EDA tools discussed in this book are part of the toolkit for knowledge discovery and data mining.

This book is intended for a wide audience that includes scientists, statisticians, data miners, engineers, computer scientists, biostatisticians, social scientists, and any other discipline that must deal with the analysis of raw data. We also hope this book can be useful in a classroom setting at the senior undergraduate or graduate level. Exercises are included with each chapter, making it suitable as a textbook or supplemental text for a course in exploratory data analysis, data mining, computational statistics, machine learning, and others. Readers are encouraged to look over the exercises, because new concepts are sometimes introduced in them. Exercises are computational and exploratory in nature, so there is often no unique answer!

As for the background required for this book, we assume that the reader has an understanding of basic linear algebra. For example, one should have a familiarity with the notation of linear algebra, array multiplication, a matrix inverse, determinants, an array transpose, etc. We also assume that the reader has had introductory probability and statistics courses. Here one should know about random variables, probability distributions and density functions, basic descriptive measures, regression, etc.

In a spirit similar to the first book, this text is not focused on the theoretical aspects of the methods. Rather, the main focus of this book is on the use of the EDA methods. Implementation of the methods is secondary, but where feasible, we show students and practitioners the implementation through algorithms, procedures, and MATLAB code. Many of the methods are complicated, and the details of the MATLAB implementation are not important. In these instances, we show how to use the functions and techniques. The interested reader (or programmer) can consult the M-files for more information. Thus, readers who prefer to use some other programming language should be able to implement the algorithms on their own.

While we do not delve into the theory, we would like to emphasize that the methods described in the book have a theoretical basis. Therefore, at the end of each chapter, we provide additional references and resources, so those readers who would like to know more about the underlying theory will know where to find the information.

MATLAB code in the form of an Exploratory Data Analysis Toolbox is provided with the text. This includes the functions, GUIs, and data sets that are described in the book. This is available for download ociates.com

Please review the readme file for installation instructions and information on any changes. M-files that contain the MATLAB commands for the exercises are also available for download.

We also make the disclaimer that our MATLAB code is not necessarily the most efficient way to accomplish the task. In many cases, we sacrificed efficiency for clarity. Please refer to the example M-files for alternative MATLAB code, courtesy of Tom Lane of The MathWorks, Inc.

We describe the EDA Toolbox in greater detail in Appendix B. We also provide website information for other tools that are available for download (at no cost). Some of these toolboxes and functions are used in the book and others are provided for informational purposes. Where possible and appropriate, we include some of this free MATLAB code with the EDA Toolbox to make it easier for the reader to follow along with the examples and exercises.

We assume that the reader has the Statistics Toolbox (Version 4 or higher) from The MathWorks, Inc. Where appropriate, we specify whether the function we are using is in the main MATLAB software package, Statistics Toolbox, or the EDA Toolbox. The development of the EDA Toolbox was mostly accomplished with MATLAB Version 6.5 (Statistics Toolbox, Version 4), so the code should work if this is what you have. However, a new release of MATLAB and the Statistics Toolbox was introduced in the middle of writing this book, so we also incorporate information about new functionality provided in these versions.

We would like to acknowledge the invaluable help of the reviewers: Chris Fraley, David Johannsen, Catherine Loader, Tom Lane, David Marchette, and Jeff Solka. Their many helpful comments and suggestions resulted in a better book. Any shortcomings are the sole responsibility of the authors. We owe a special thanks to Jeff Solka for programming assistance with finite mixtures and to Richard Johnson for allowing us to use his Data Visualization Toolbox and updating his functions. We would also like to acknowledge all of those researchers who wrote MATLAB code for methods described in this book and also made it available for free. We thank the editors of the book series in Computer Science and Data Analysis for including this text. We greatly appreciate the help and patience of those at CRC Press: Bob Stern, Rob Calver, Jessica Vakili, and Andrea Demby. Finally, we are indebted to Naomi Fernandes and Tom Lane at The MathWorks, Inc. for their special assistance with MATLAB.

Disclaimers

1. Any MATLAB programs and data sets that are included with the book are provided in good faith. The authors, publishers, or distributors do not guarantee their accuracy and are not responsible for the consequences of their use.

2. Some of the MATLAB functions provided with the EDA Toolbox were written by other researchers, and they retain the copyright. References are given in Appendix B and in the help section of each function. Unless otherwise specified, the EDA Toolbox is provided under the GNU gpl.html

3. The views expressed in this book are those of the authors and do not necessarily represent the views of the United States Department of Defense or its components.

Wendy L. and Angel R. Martinez
October 2004

¹ MATLAB and Handle Graphics are registered trademarks of The MathWorks, Inc.

Part I
Introduction to Exploratory Data Analysis

Chapter 1
Introduction to Exploratory Data Analysis

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.

T. S. Eliot, “Little Gidding” (the last of his Four Quartets)

The purpose of this chapter is to provide some introductory and background information. First, we cover the philosophy of exploratory data analysis and discuss how this fits in with other data analysis techniques and objectives. This is followed by an overview of the text, which includes the software that will be used and the background necessary to understand the methods. We then present several data sets that will be employed throughout the book to illustrate the concepts and ideas. Finally, we conclude the chapter with some information on data transforms, which will be important in some of the methods presented in the text.

1.1 What is Exploratory Data Analysis

John W. Tukey [1977] was one of the first statisticians to provide a detailed description of exploratory data analysis (EDA). He defined it as “detective work - numerical detective work - or counting detective work - or graphical detective work.” [Tukey, 1977, page 1] It is mostly a philosophy of data analysis where the researcher examines the data without any pre-conceived ideas in order to discover what the data can tell him about the
