PATTERN CLASSIFICATION - CERN Document Server

Transcription

PATTERN CLASSIFICATION
Second Edition

Richard O. Duda
Peter E. Hart
David G. Stork

A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York / Chichester / Weinheim / Brisbane / Singapore / Toronto

CONTENTS

PREFACE, xvii

1 INTRODUCTION, 1
  1.1 Machine Perception, 1
  1.2 An Example, 1
      1.2.1 Related Fields, 8
  1.3 Pattern Recognition Systems, 9
      1.3.1 Sensing, 9
      1.3.2 Segmentation and Grouping, 9
      1.3.3 Feature Extraction, 11
      1.3.4 Classification, 12
      1.3.5 Post Processing, 13
  1.4 The Design Cycle, 14
      1.4.1 Data Collection, 14
      1.4.2 Feature Choice, 14
      1.4.3 Model Choice, 15
      1.4.4 Training, 15
      1.4.5 Evaluation, 15
      1.4.6 Computational Complexity, 16
  1.5 Learning and Adaptation, 16
      1.5.1 Supervised Learning, 16
      1.5.2 Unsupervised Learning, 17
      1.5.3 Reinforcement Learning, 17
  1.6 Conclusion, 17
  Summary by Chapters, 17
  Bibliographical and Historical Remarks, 18
  Bibliography, 19

2 BAYESIAN DECISION THEORY, 20
  2.1 Introduction, 20
  2.2 Bayesian Decision Theory - Continuous Features, 24
      2.2.1 Two-Category Classification, 25
  2.3 Minimum-Error-Rate Classification, 26
      *2.3.1 Minimax Criterion, 27

      *2.3.2 Neyman-Pearson Criterion, 28
  2.4 Classifiers, Discriminant Functions, and Decision Surfaces, 29
      2.4.1 The Multicategory Case, 29
      2.4.2 The Two-Category Case, 30
  2.5 The Normal Density, 31
      2.5.1 Univariate Density, 32
      2.5.2 Multivariate Density, 33
  2.6 Discriminant Functions for the Normal Density, 36
      2.6.1 Case 1: Σi = σ²I, 36
      2.6.2 Case 2: Σi = Σ, 39
      2.6.3 Case 3: Σi arbitrary, 41
      Example: Decision Regions for Two-Dimensional Gaussian Data, 41
  *2.7 Error Probabilities and Integrals, 45
  *2.8 Error Bounds for Normal Densities, 46
      2.8.1 Chernoff Bound, 46
      2.8.2 Bhattacharyya Bound, 47
      Example: Error Bounds for Gaussian Distributions, 48
      2.8.3 Signal Detection Theory and Operating Characteristics, 48
  2.9 Bayes Decision Theory - Discrete Features, 51
      2.9.1 Independent Binary Features, 52
      Example: Bayesian Decisions for Three-Dimensional Binary Data, 53
  *2.10 Missing and Noisy Features, 54
      2.10.1 Missing Features, 54
      2.10.2 Noisy Features, 55
  *2.11 Bayesian Belief Networks, 56
      Example: Belief Network for Fish, 59
  *2.12 Compound Bayesian Decision Theory and Context, 62
  Summary, 63
  Bibliographical and Historical Remarks, 64
  Problems, 65
  Computer exercises, 80
  Bibliography, 82

3 MAXIMUM-LIKELIHOOD AND BAYESIAN PARAMETER ESTIMATION, 84
  3.1 Introduction, 84
  3.2 Maximum-Likelihood Estimation, 85
      3.2.1 The General Principle, 85
      3.2.2 The Gaussian Case: Unknown μ, 88
      3.2.3 The Gaussian Case: Unknown μ and Σ, 88
      3.2.4 Bias, 89
  3.3 Bayesian Estimation, 90
      3.3.1 The Class-Conditional Densities, 91
      3.3.2 The Parameter Distribution, 91
  3.4 Bayesian Parameter Estimation: Gaussian Case, 92
      3.4.1 The Univariate Case: p(μ | D), 92
      3.4.2 The Univariate Case: p(x | D), 95
      3.4.3 The Multivariate Case, 95

  3.5 Bayesian Parameter Estimation: General Theory, 97
      Example: Recursive Bayes Learning, 98
      3.5.1 When Do Maximum-Likelihood and Bayes Methods Differ?, 100
      3.5.2 Noninformative Priors and Invariance, 101
      3.5.3 Gibbs Algorithm, 102
  *3.6 Sufficient Statistics, 102
      3.6.1 Sufficient Statistics and the Exponential Family, 106
  3.7 Problems of Dimensionality, 107
      3.7.1 Accuracy, Dimension, and Training Sample Size, 107
      3.7.2 Computational Complexity, 111
      3.7.3 Overfitting, 113
  *3.8 Component Analysis and Discriminants, 114
      3.8.1 Principal Component Analysis (PCA), 115
      3.8.2 Fisher Linear Discriminant, 117
      3.8.3 Multiple Discriminant Analysis, 121
  *3.9 Expectation-Maximization (EM), 124
      Example: Expectation-Maximization for a 2D Normal Model, 126
  3.10 Hidden Markov Models, 128
      3.10.1 First-Order Markov Models, 128
      3.10.2 First-Order Hidden Markov Models, 129
      3.10.3 Hidden Markov Model Computation, 129
      3.10.4 Evaluation, 131
      Example: Hidden Markov Model, 133
      3.10.5 Decoding, 135
      Example: HMM Decoding, 136
      3.10.6 Learning, 137
  Summary, 139
  Bibliographical and Historical Remarks, 139
  Problems, 140
  Computer exercises, 155
  Bibliography, 159

4 NONPARAMETRIC TECHNIQUES, 161
  4.1 Introduction, 161
  4.2 Density Estimation, 161
  4.3 Parzen Windows, 164
      4.3.1 Convergence of the Mean, 167
      4.3.2 Convergence of the Variance, 167
      4.3.3 Illustrations, 168
      4.3.4 Classification Example, 168
      4.3.5 Probabilistic Neural Networks (PNNs), 172
      4.3.6 Choosing the Window Function, 174
  4.4 k-Nearest-Neighbor Estimation, 174
      4.4.1 k-Nearest-Neighbor and Parzen-Window Estimation, 176
      4.4.2 Estimation of A Posteriori Probabilities, 177
  4.5 The Nearest-Neighbor Rule, 177
      4.5.1 Convergence of the Nearest Neighbor, 179
      4.5.2 Error Rate for the Nearest-Neighbor Rule, 180
      4.5.3 Error Bounds, 180
      4.5.4 The k-Nearest-Neighbor Rule, 182

      4.5.5 Computational Complexity of the k-Nearest-Neighbor Rule, 184
  4.6 Metrics and Nearest-Neighbor Classification, 187
      4.6.1 Properties of Metrics, 187
      4.6.2 Tangent Distance, 188
  *4.7 Fuzzy Classification, 192
  *4.8 Reduced Coulomb Energy Networks, 195
  4.9 Approximations by Series Expansions, 197
  Summary, 199
  Bibliographical and Historical Remarks, 200
  Problems, 201
  Computer exercises, 209
  Bibliography, 213

5 LINEAR DISCRIMINANT FUNCTIONS, 215
  5.1 Introduction, 215
  5.2 Linear Discriminant Functions and Decision Surfaces, 216
      5.2.1 The Two-Category Case, 216
      5.2.2 The Multicategory Case, 218
  5.3 Generalized Linear Discriminant Functions, 219
  5.4 The Two-Category Linearly Separable Case, 223
      5.4.1 Geometry and Terminology, 224
      5.4.2 Gradient Descent Procedures, 224
  5.5 Minimizing the Perceptron Criterion Function, 227
      5.5.1 The Perceptron Criterion Function, 227
      5.5.2 Convergence Proof for Single-Sample Correction, 229
      5.5.3 Some Direct Generalizations, 232
  5.6 Relaxation Procedures, 235
      5.6.1 The Descent Algorithm, 235
      5.6.2 Convergence Proof, 237
  5.7 Nonseparable Behavior, 238
  5.8 Minimum Squared-Error Procedures, 239
      5.8.1 Minimum Squared-Error and the Pseudoinverse, 240
      Example: Constructing a Linear Classifier by Matrix Pseudoinverse, 241
      5.8.2 Relation to Fisher's Linear Discriminant, 242
      5.8.3 Asymptotic Approximation to an Optimal Discriminant, 243
      5.8.4 The Widrow-Hoff or LMS Procedure, 245
      5.8.5 Stochastic Approximation Methods, 246
  5.9 The Ho-Kashyap Procedures, 249
      5.9.1 The Descent Procedure, 250
      5.9.2 Convergence Proof, 251
      5.9.3 Nonseparable Behavior, 253
      5.9.4 Some Related Procedures, 253
  *5.10 Linear Programming Algorithms, 256
      5.10.1 Linear Programming, 256
      5.10.2 The Linearly Separable Case, 257
      5.10.3 Minimizing the Perceptron Criterion Function, 258
  *5.11 Support Vector Machines, 259
      5.11.1 SVM Training, 263
      Example: SVM for the XOR Problem, 264

  5.12 Multicategory Generalizations, 265
      5.12.1 Kesler's Construction, 266
      5.12.2 Convergence of the Fixed-Increment Rule, 266
      5.12.3 Generalizations for MSE Procedures, 268
  Summary, 269
  Bibliographical and Historical Remarks, 270
  Problems, 271
  Computer exercises, 278
  Bibliography, 281

6 MULTILAYER NEURAL NETWORKS, 282
  6.1 Introduction, 282
  6.2 Feedforward Operation and Classification, 284
      6.2.1 General Feedforward Operation, 286
      6.2.2 Expressive Power of Multilayer Networks, 287
  6.3 Backpropagation Algorithm, 288
      6.3.1 Network Learning, 289
      6.3.2 Training Protocols, 293
      6.3.3 Learning Curves, 295
  6.4 Error Surfaces, 296
      6.4.1 Some Small Networks, 296
      6.4.2 The Exclusive-OR (XOR), 298
      6.4.3 Larger Networks, 298
      6.4.4 How Important Are Multiple Minima?, 299
  6.5 Backpropagation as Feature Mapping, 299
      6.5.1 Representations at the Hidden Layer - Weights, 302
  6.6 Backpropagation, Bayes Theory and Probability, 303
      6.6.1 Bayes Discriminants and Neural Networks, 303
      6.6.2 Outputs as Probabilities, 304
  *6.7 Related Statistical Techniques, 305
  6.8 Practical Techniques for Improving Backpropagation, 306
      6.8.1 Activation Function, 307
      6.8.2 Parameters for the Sigmoid, 308
      6.8.3 Scaling Input, 308
      6.8.4 Target Values, 309
      6.8.5 Training with Noise, 310
      6.8.6 Manufacturing Data, 310
      6.8.7 Number of Hidden Units, 310
      6.8.8 Initializing Weights, 311
      6.8.9 Learning Rates, 312
      6.8.10 Momentum, 313
      6.8.11 Weight Decay, 314
      6.8.12 Hints, 315
      6.8.13 On-Line, Stochastic or Batch Training?, 316
      6.8.14 Stopped Training, 316
      6.8.15 Number of Hidden Layers, 317
      6.8.16 Criterion Function, 318
  *6.9 Second-Order Methods, 318
      6.9.1 Hessian Matrix, 318
      6.9.2 Newton's Method, 319

      6.9.3 Quickprop, 320
      6.9.4 Conjugate Gradient Descent, 321
      Example: Conjugate Gradient Descent, 322
  *6.10 Additional Networks and Training Methods, 324
      6.10.1 Radial Basis Function Networks (RBFs), 324
      6.10.2 Special Bases, 325
      6.10.3 Matched Filters, 325
      6.10.4 Convolutional Networks, 326
      6.10.5 Recurrent Networks, 328
      6.10.6 Cascade-Correlation, 329
  6.11 Regularization, Complexity Adjustment and Pruning, 330
  Summary, 333
  Bibliographical and Historical Remarks, 333
  Problems, 335
  Computer exercises, 343
  Bibliography, 347

7 STOCHASTIC METHODS, 350
  7.1 Introduction, 350
  7.2 Stochastic Search, 351
      7.2.1 Simulated Annealing, 351
      7.2.2 The Boltzmann Factor, 352
      7.2.3 Deterministic Simulated Annealing, 357
  7.3 Boltzmann Learning, 360
      7.3.1 Stochastic Boltzmann Learning of Visible States, 360
      7.3.2 Missing Features and Category Constraints, 365
      7.3.3 Deterministic Boltzmann Learning, 366
      7.3.4 Initialization and Setting Parameters, 367
  *7.4 Boltzmann Networks and Graphical Models, 370
      7.4.1 Other Graphical Models, 372
  *7.5 Evolutionary Methods, 373
      7.5.1 Genetic Algorithms, 373
      7.5.2 Further Heuristics, 377
      7.5.3 Why Do They Work?, 378
  *7.6 Genetic Programming, 378
  Summary, 381
  Bibliographical and Historical Remarks, 381
  Problems, 383
  Computer exercises, 388
  Bibliography, 391

8 NONMETRIC METHODS, 394
  8.1 Introduction, 394
  8.2 Decision Trees, 395
  8.3 CART, 396
      8.3.1 Number of Splits, 397
      8.3.2 Query Selection and Node Impurity, 398
      8.3.3 When to Stop Splitting, 402
      8.3.4 Pruning, 403

      8.3.5 Assignment of Leaf Node Labels, 404
      Example: A Simple Tree, 404
      8.3.6 Computational Complexity, 406
      8.3.7 Feature Choice, 407
      8.3.8 Multivariate Decision Trees, 408
      8.3.9 Priors and Costs, 409
      8.3.10 Missing Attributes, 409
      Example: Surrogate Splits and Missing Attributes, 410
  8.4 Other Tree Methods, 411
      8.4.1 ID3, 411
      8.4.2 C4.5, 411
      8.4.3 Which Tree Classifier Is Best?, 412
  *8.5 Recognition with Strings, 413
      8.5.1 String Matching, 415
      8.5.2 Edit Distance, 418
      8.5.3 Computational Complexity, 420
      8.5.4 String Matching with Errors, 420
      8.5.5 String Matching with the "Don't-Care" Symbol, 421
  8.6 Grammatical Methods, 421
      8.6.1 Grammars, 422
      8.6.2 Types of String Grammars, 424
      Example: A Grammar for Pronouncing Numbers, 425
      8.6.3 Recognition Using Grammars, 426
  8.7 Grammatical Inference, 429
      Example: Grammatical Inference, 431
  *8.8 Rule-Based Methods, 431
      8.8.1 Learning Rules, 433
  Summary, 434
  Bibliographical and Historical Remarks, 435
  Problems, 437
  Computer exercises, 446
  Bibliography, 450

9 ALGORITHM-INDEPENDENT MACHINE LEARNING, 453
  9.1 Introduction, 453
  9.2 Lack of Inherent Superiority of Any Classifier, 454
      9.2.1 No Free Lunch Theorem, 454
      Example: No Free Lunch for Binary Data, 457
      *9.2.2 Ugly Duckling Theorem, 458
      9.2.3 Minimum Description Length (MDL), 461
      9.2.4 Minimum Description Length Principle, 463
      9.2.5 Overfitting Avoidance and Occam's Razor, 464
  9.3 Bias and Variance, 465
      9.3.1 Bias and Variance for Regression, 466
      9.3.2 Bias and Variance for Classification, 468
  9.4 Resampling for Estimating Statistics, 471
      9.4.1 Jackknife, 472
      Example: Jackknife Estimate of Bias and Variance of the Mode, 473
      9.4.2 Bootstrap, 474
  9.5 Resampling for Classifier Design, 475

      9.5.1 Bagging, 475
      9.5.2 Boosting, 476
      9.5.3 Learning with Queries, 480
      9.5.4 Arcing, Learning with Queries, Bias and Variance, 482
  9.6 Estimating and Comparing Classifiers, 482
      9.6.1 Parametric Models, 483
      9.6.2 Cross-Validation, 483
      9.6.3 Jackknife and Bootstrap Estimation of Classification Accuracy, 485
      9.6.4 Maximum-Likelihood Model Comparison, 486
      9.6.5 Bayesian Model Comparison, 487
      9.6.6 The Problem-Average Error Rate, 489
      9.6.7 Predicting Final Performance from Learning Curves, 492
      9.6.8 The Capacity of a Separating Plane, 494
  9.7 Combining Classifiers, 495
      9.7.1 Component Classifiers with Discriminant Functions, 496
      9.7.2 Component Classifiers without Discriminant Functions, 498
  Summary, 499
  Bibliographical and Historical Remarks, 500
  Problems, 502
  Computer exercises, 508
  Bibliography, 513

10 UNSUPERVISED LEARNING AND CLUSTERING, 517
  10.1 Introduction, 517
  10.2 Mixture Densities and Identifiability, 518
  10.3 Maximum-Likelihood Estimates, 519
  10.4 Application to Normal Mixtures, 521
      10.4.1 Case 1: Unknown Mean Vectors, 522
      10.4.2 Case 2: All Parameters Unknown, 524
      10.4.3 k-Means Clustering, 526
      *10.4.4 Fuzzy k-Means Clustering, 528
  10.5 Unsupervised Bayesian Learning, 530
      10.5.1 The Bayes Classifier, 530
      10.5.2 Learning the Parameter Vector, 531
      Example: Unsupervised Learning of Gaussian Data, 534
      10.5.3 Decision-Directed Approximation, 536
  10.6 Data Description and Clustering, 537
      10.6.1 Similarity Measures, 538
  10.7 Criterion Functions for Clustering, 542
      10.7.1 The Sum-of-Squared-Error Criterion, 542
      10.7.2 Related Minimum Variance Criteria, 543
      10.7.3 Scatter Criteria, 544
      Example: Clustering Criteria, 546
  *10.8 Iterative Optimization, 548
  10.9 Hierarchical Clustering, 550
      10.9.1 Definitions, 551
      10.9.2 Agglomerative Hierarchical Clustering, 552
      10.9.3 Stepwise-Optimal Hierarchical Clustering, 555
      10.9.4 Hierarchical Clustering and Induced Metrics, 556
  10.10 The Problem of Validity, 557

  *10.11 On-Line Clustering, 559
      10.11.1 Unknown Number of Clusters, 561
      10.11.2 Adaptive Resonance, 563
      10.11.3 Learning with a Critic, 565
  *10.12 Graph-Theoretic Methods, 566
  10.13 Component Analysis, 568
      10.13.1 Principal Component Analysis (PCA), 568
      10.13.2 Nonlinear Component Analysis (NLCA), 569
      *10.13.3 Independent Component Analysis (ICA), 570
  10.14 Low-Dimensional Representations and Multidimensional Scaling (MDS), 573
      10.14.1 Self-Organizing Feature Maps, 576
      10.14.2 Clustering and Dimensionality Reduction, 580
  Summary, 581
  Bibliographical and Historical Remarks, 582
  Problems, 583
  Computer exercises, 593
  Bibliography, 598

A MATHEMATICAL FOUNDATIONS, 601
  A.1 Notation, 601
  A.2 Linear Algebra, 604
      A.2.1 Notation and Preliminaries, 604
      A.2.2 Inner Product, 605
      A.2.3 Outer Product, 606
      A.2.4 Derivatives of Matrices, 606
      A.2.5 Determinant and Trace, 608
      A.2.6 Matrix Inversion, 609
      A.2.7 Eigenvectors and Eigenvalues, 609
  A.3 Lagrange Optimization, 610
  A.4 Probability Theory, 611
      A.4.1 Discrete Random Variables, 611
      A.4.2 Expected Values, 611
      A.4.3 Pairs of Discrete Random Variables, 612
      A.4.4 Statistical Independence, 613
      A.4.5 Expected Values of Functions of Two Variables, 613
      A.4.6 Conditional Probability, 614
      A.4.7 The Law of Total Probability and Bayes Rule, 615
      A.4.8 Vector Random Variables, 616
      A.4.9 Expectations, Mean Vectors and Covariance Matrices, 617
      A.4.10 Continuous Random Variables, 618
      A.4.11 Distributions of Sums of Independent Random Variables, 620
      A.4.12 Normal Distributions, 621
  A.5 Gaussian Derivatives and Integrals, 623
      A.5.1 Multivariate Normal Densities, 624
      A.5.2 Bivariate Normal Densities, 626
  A.6 Hypothesis Testing, 628
      A.6.1 Chi-Squared Test, 629
  A.7 Information Theory, 630
      A.7.1 Entropy and Information, 630

      A.7.2 Relative Entropy, 632
      A.7.3 Mutual Information, 632
  A.8 Computational Complexity, 633
  Bibliography, 635
