Understanding Machine Learning: From Theory To Algorithms

Transcription

Understanding Machine Learning: From Theory to Algorithms
© 2014 by Shai Shalev-Shwartz and Shai Ben-David
Published 2014 by Cambridge University Press.
This copy is for personal use only. Not for distribution. Do not post. Please link to:
http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning
Please note: This copy is almost, but not entirely, identical to the printed version of the book. In particular, page numbers are not identical (but section numbers are the same).

Understanding Machine Learning

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and nonexpert readers in statistics, computer science, mathematics, and engineering.

Shai Shalev-Shwartz is an Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel.

Shai Ben-David is a Professor in the School of Computer Science at the University of Waterloo, Canada.

UNDERSTANDING MACHINE LEARNING
From Theory to Algorithms

Shai Shalev-Shwartz
The Hebrew University, Jerusalem

Shai Ben-David
University of Waterloo, Canada

32 Avenue of the Americas, New York, NY 10013-2473, USA

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107057135

© Shai Shalev-Shwartz and Shai Ben-David 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2014
Printed in the United States of America
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data
ISBN 978-1-107-05713-5 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication, and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

Triple-S dedicates the book to triple-M

Preface

The term machine learning refers to the automated detection of meaningful patterns in data. In the past couple of decades it has become a common tool in almost any task that requires information extraction from large data sets. We are surrounded by machine learning based technology: search engines learn how to bring us the best results (while placing profitable ads), anti-spam software learns to filter our email messages, and credit card transactions are secured by software that learns how to detect fraud. Digital cameras learn to detect faces, and intelligent personal assistant applications on smartphones learn to recognize voice commands. Cars are equipped with accident prevention systems that are built using machine learning algorithms. Machine learning is also widely used in scientific applications such as bioinformatics, medicine, and astronomy.

One common feature of all of these applications is that, in contrast to more traditional uses of computers, the complexity of the patterns that need to be detected means a human programmer cannot provide an explicit, fine-detailed specification of how such tasks should be executed. Just as intelligent beings acquire or refine many of their skills by learning from experience (rather than by following explicit instructions), machine learning tools are concerned with endowing programs with the ability to "learn" and adapt.

The first goal of this book is to provide a rigorous, yet easy to follow, introduction to the main concepts underlying machine learning: What is learning? How can a machine learn? How do we quantify the resources needed to learn a given concept? Is learning always possible? Can we know whether the learning process succeeded or failed?

The second goal of this book is to present several key machine learning algorithms. We chose to present algorithms that on the one hand are successfully used in practice and on the other hand give a wide spectrum of different learning techniques. Additionally, we pay specific attention to algorithms appropriate for large scale learning (a.k.a. "Big Data"), since in recent years our world has become increasingly "digitized" and the amount of data available for learning is dramatically increasing. As a result, in many applications data is plentiful and computation time is the main bottleneck. We therefore explicitly quantify both the amount of data and the amount of computation time needed to learn a given concept.

The book is divided into four parts. The first part aims at giving an initial rigorous answer to the fundamental questions of learning. We describe a generalization of Valiant's Probably Approximately Correct (PAC) learning model, which is a first solid answer to the question "what is learning?". We describe the Empirical Risk Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Description Length (MDL) learning rules, which address the question "how can a machine learn?". We quantify the amount of data needed for learning using the ERM, SRM, and MDL rules and show how learning might fail by deriving a "no-free-lunch" theorem. We also discuss how much computation time is required for learning. In the second part of the book we describe various learning algorithms. For some of the algorithms, we first present a more general learning principle, and then show how the algorithm follows the principle. While the first two parts of the book focus on the PAC model, the third part extends the scope by presenting a wider variety of learning models. Finally, the last part of the book is devoted to advanced theory.

We made an attempt to keep the book as self-contained as possible. However, the reader is assumed to be comfortable with basic notions of probability, linear algebra, analysis, and algorithms. The first three parts of the book are intended for first year graduate students in computer science, engineering, mathematics, or statistics. They are also accessible to undergraduate students with an adequate background. The more advanced chapters can be used by researchers intending to gain a deeper theoretical understanding.

Acknowledgements

The book is based on Introduction to Machine Learning courses taught by Shai Shalev-Shwartz at the Hebrew University and by Shai Ben-David at the University of Waterloo. The first draft of the book grew out of the lecture notes for the course that was taught at the Hebrew University by Shai Shalev-Shwartz during 2010–2013. We greatly appreciate the help of Ohad Shamir, who served as a TA for the course in 2010, and of Alon Gonen, who served as a TA for the course in 2011–2013. Ohad and Alon prepared a few lecture notes and many of the exercises. Alon, to whom we are indebted for his help throughout the entire making of the book, has also prepared a solution manual.

We are deeply grateful for the most valuable work of Dana Rubinstein. Dana has scientifically proofread and edited the manuscript, transforming it from lecture-based chapters into fluent and coherent text.

Special thanks to Amit Daniely, who helped us with a careful read of the advanced part of the book and also wrote the advanced chapter on multiclass learnability. We are also grateful to the members of a book reading club in Jerusalem who have carefully read and constructively criticized every line of the manuscript. The members of the reading club are: Maya Alroy, Yossi Arjevani, Aharon Birnbaum, Alon Cohen, Alon Gonen, Roi Livni, Ofer Meshi, Dan Rosenbaum, Dana Rubinstein, Shahar Somin, Alon Vinnikov, and Yoav Wald. We would also like to thank Gal Elidan, Amir Globerson, Nika Haghtalab, Shie Mannor, Amnon Shashua, Nati Srebro, and Ruth Urner for helpful discussions.

Shai Shalev-Shwartz, Jerusalem, Israel
Shai Ben-David, Waterloo, Canada

Contents

Preface

Part I   Foundations

1   Introduction
    1.1   What Is Learning?
    1.2   When Do We Need Machine Learning?
    1.3   Types of Learning
    1.4   Relations to Other Fields
    1.5   How to Read This Book
          1.5.1   Possible Course Plans Based on This Book
    1.6   Notation

2   A Gentle Start
    2.1   A Formal Model – The Statistical Learning Framework
    2.2   Empirical Risk Minimization
          2.2.1   Something May Go Wrong – Overfitting
    2.3   Empirical Risk Minimization with Inductive Bias
          2.3.1   Finite Hypothesis Classes
    2.4   Exercises

3   A Formal Learning Model
    3.1   PAC Learning
    3.2   A More General Learning Model
          3.2.1   Releasing the Realizability Assumption – Agnostic PAC Learning
          3.2.2   The Scope of Learning Problems Modeled
    3.3   Summary
    3.4   Bibliographic Remarks
    3.5   Exercises

4   Learning via Uniform Convergence
    4.1   Uniform Convergence Is Sufficient for Learnability
    4.2   Finite Classes Are Agnostic PAC Learnable
    4.3   Summary
    4.4   Bibliographic Remarks
    4.5   Exercises

5   The Bias-Complexity Tradeoff
    5.1   The No-Free-Lunch Theorem
          5.1.1   No-Free-Lunch and Prior Knowledge
    5.2   Error Decomposition
    5.3   Summary
    5.4   Bibliographic Remarks
    5.5   Exercises

6   The VC-Dimension
    6.1   Infinite-Size Classes Can Be Learnable
    6.2   The VC-Dimension
    6.3   Examples
          6.3.1   Threshold Functions
          6.3.2   Intervals
          6.3.3   Axis Aligned Rectangles
          6.3.4   Finite Classes
          6.3.5   VC-Dimension and the Number of Parameters
    6.4   The Fundamental Theorem of PAC learning
    6.5   Proof of Theorem 6.7
          6.5.1   Sauer's Lemma and the Growth Function
          6.5.2   Uniform Convergence for Classes of Small Effective Size
    6.6   Summary
    6.7   Bibliographic Remarks
    6.8   Exercises

7   Nonuniform Learnability
    7.1   Nonuniform Learnability
          7.1.1   Characterizing Nonuniform Learnability
    7.2   Structural Risk Minimization
    7.3   Minimum Description Length and Occam's Razor
          7.3.1   Occam's Razor
    7.4   Other Notions of Learnability – Consistency
    7.5   Discussing the Different Notions of Learnability
          7.5.1   The No-Free-Lunch Theorem Revisited
    7.6   Summary
    7.7   Bibliographic Remarks
    7.8   Exercises

8   The Runtime of Learning
    8.1   Computational Complexity of Learning
          8.1.1   Formal Definition*
    8.2   Implementing the ERM Rule
          8.2.1   Finite Classes
          8.2.2   Axis Aligned Rectangles
          8.2.3   Boolean Conjunctions
          8.2.4   Learning 3-Term DNF
    8.3   Efficiently Learnable, but Not by a Proper ERM
    8.4   Hardness of Learning*
    8.5   Summary
    8.6   Bibliographic Remarks
    8.7   Exercises

Part II   From Theory to Algorithms

9   Linear Predictors
    9.1   Halfspaces
          9.1.1   Linear Programming for the Class of Halfspaces
          9.1.2   Perceptron for Halfspaces
          9.1.3   The VC Dimension of Halfspaces
    9.2   Linear Regression
          9.2.1   Least Squares
          9.2.2   Linear Regression for Polynomial Regression Tasks
    9.3   Logistic Regression
    9.4   Summary
    9.5   Bibliographic Remarks
    9.6   Exercises

10   Boosting
    10.1   Weak Learnability
           10.1.1   Efficient Implementation of ERM for Decision Stumps
    10.2   AdaBoost
    10.3   Linear Combinations of Base Hypotheses
           10.3.1   The VC-Dimension of L(B, T)
    10.4   AdaBoost for Face Recognition
    10.5   Summary
    10.6   Bibliographic Remarks
    10.7   Exercises

11   Model Selection and Validation
    11.1   Model Selection Using SRM
    11.2   Validation
           11.2.1   Hold Out Set
           11.2.2   Validation for Model Selection
           11.2.3   The Model-Selection Curve
           11.2.4   k-Fold Cross Validation
           11.2.5   Train-Validation-Test Split
    11.3   What to Do If Learning Fails
    11.4   Summary
    11.5   Exercises

12   Convex Learning Problems
    12.1   Convexity, Lipschitzness, and Smoothness
           12.1.1   Convexity
           12.1.2   Lipschitzness
           12.1.3   Smoothness
    12.2   Convex Learning Problems
           12.2.1   Learnability of Convex Learning Problems
           12.2.2   Convex-Lipschitz/Smooth-Bounded Learning Problems
    12.3   Surrogate Loss Functions
    12.4   Summary
    12.5   Bibliographic Remarks
    12.6   Exercises

13   Regularization and Stability
    13.1   Regularized Loss Minimization
           13.1.1   Ridge Regression
    13.2   Stable Rules Do Not Overfit
    13.3   Tikhonov Regularization as a Stabilizer
           13.3.1   Lipschitz Loss
           13.3.2   Smooth and Nonnegative Loss
    13.4   Controlling the Fitting-Stability Tradeoff
    13.5   Summary
    13.6   Bibliographic Remarks
    13.7   Exercises

14   Stochastic Gradient Descent
    14.1   Gradient Descent
           14.1.1   Analysis of GD for Convex-Lipschitz Functions
    14.2   Subgradients
           14.2.1   Calculating Subgradients
           14.2.2   Subgradients of Lipschitz Functions
           14.2.3   Subgradient Descent
    14.3   Stochastic Gradient Descent (SGD)
           14.3.1   Analysis of SGD for Convex-Lipschitz-Bounded Functions
    14.4   Variants
           14.4.1   Adding a Projection Step
           14.4.2   Variable Step Size
           14.4.3   Other Averaging Techniques
           14.4.4   Strongly Convex Functions*
    14.5   Learning with SGD
           14.5.1   SGD for Risk Minimization
           14.5.2   Analyzing SGD for Convex-Smooth Learning Problems
           14.5.3   SGD for Regularized Loss Minimization
    14.6   Summary
    14.7   Bibliographic Remarks
    14.8   Exercises

15   Support Vector Machines
    15.1   Margin and Hard-SVM
           15.1.1   The Homogenous Case
           15.1.2   The Sample Complexity of Hard-SVM
    15.2   Soft-SVM and Norm Regularization
           15.2.1   The Sample Complexity of Soft-SVM
           15.2.2   Margin and Norm-Based Bounds versus Dimension
           15.2.3   The Ramp Loss*
    15.3   Optimality Conditions and "Support Vectors"*
    15.4   Duality*
    15.5   Implementing Soft-SVM Using SGD
    15.6   Summary
    15.7   Bibliographic Remarks
    15.8   Exercises

16   Kernel Methods
    16.1   Embeddings into Feature Spaces
    16.2   The Kernel Trick
           16.2.1   Kernels as a Way to Express Prior Knowledge
           16.2.2   Characterizing Kernel Functions*
    16.3   Implementing Soft-SVM with Kernels
    16.4   Summary
    16.5   Bibliographic Remarks
    16.6   Exercises

17   Multiclass, Ranking, and Complex Prediction Problems
    17.1   One-versus-All and All-Pairs
    17.2   Linear Multiclass Predictors
           17.2.1   How to Construct Ψ
           17.2.2   Cost-Sensitive Classification
           17.2.3   ERM
           17.2.4   Generalized Hinge Loss
           17.2.5   Multiclass SVM and SGD
    17.3   Structured Output Prediction
    17.4   Ranking
           17.4.1   Linear Predictors for Ranking
    17.5   Bipartite Ranking and Multivariate Performance Measures
           17.5.1   Linear Predictors for Bipartite Ranking
    17.6   Summary
    17.7   Bibliographic Remarks
    17.8   Exercises

18   Decision Trees
    18.1   Sample Complexity
    18.2   Decision Tree Algorithms
           18.2.1   Implementations of the Gain Measure
           18.2.2   Pruning
           18.2.3   Threshold-Based Splitting Rules for Real-Valued Features
    18.3   Random Forests
    18.4   Summary
    18.5   Bibliographic Remarks
    18.6   Exercises

19   Nearest Neighbor
    19.1   k Nearest Neighbors
    19.2   Analysis
           19.2.1   A Generalization Bound for the 1-NN Rule
           19.2.2   The "Curse of Dimensionality"
    19.3   Efficient Implementation*
    19.4   Summary
    19.5   Bibliographic Remarks
    19.6   Exercises

20   Neural Networks
    20.1   Feedforward Neural Networks
    20.2   Learning Neural Networks
    20.3   The Expressive Power of Neural Networks
           20.3.1   Geometric Intuition
    20.4   The Sample Complexity of Neural Networks
    20.5   The Runtime of Learning Neural Networks
    20.6   SGD and Backpropagation
    20.7   Summary
    20.8   Bibliographic Remarks
    20.9   Exercises

Part III   Additional Learning Models

21   Online Learning
    21.1   Online Classification in the Realizable Case
           21.1.1   Online Learnability
    21.2   Online Classification in the Unrealizable Case
           21.2.1   Weighted-Majority
    21.3   Online Convex Optimization
    21.4   The Online Perceptron Algorithm
    21.5   Summary
    21.6   Bibliographic Remarks
    21.7   Exercises

22   Clustering
    22.1   Linkage-Based Clustering Algorithms
    22.2   k-Means and Other Cost Minimization Clusterings
           22.2.1   The k-Means Algorithm
    22.3   Spectral Clustering
           22.3.1   Graph Cut
           22.3.2   Graph Laplacian and Relaxed Graph Cuts
           22.3.3   Unnormalized Spectral Clustering
    22.4   Information Bottleneck*
    22.5   A High Level View of Clustering
    22.6   Summary
    22.7   Bibliographic Remarks
    22.8   Exercises

23   Dimensionality Reduction
    23.1   Principal Component Analysis (PCA)
           23.1.1   A More Efficient Solution for the Case d ≫ m
           23.1.2   Implementation and Demonstration
    23.2   Random Projections
    23.3   Compressed Sensing
           23.3.1   Proofs*
    23.4   PCA or Compressed Sensing?
    23.5   Summary
    23.6   Bibliographic Remarks
    23.7   Exercises

24   Generative Models
    24.1   Maximum Likelihood Estimator
           24.1.1   Maximum Likelihood Estimation for Continuous Random Variables
           24.1.2   Maximum Likelihood and Empirical Risk Minimization
           24.1.3   Generalization Analysis
    24.2   Naive Bayes
    24.3   Linear Discriminant Analysis
    24.4   Latent Variables and the EM Algorithm
           24.4.1   EM as an Alternate Maximization Algorithm
           24.4.2   EM for Mixture of Gaussians (Soft k-Means)
    24.5   Bayesian Reasoning
    24.6   Summary
    24.7   Bibliographic Remarks
    24.8   Exercises

25   Feature Selection and Generation
    25.1   Feature Selection
           25.1.1   Filters
           25.1.2   Greedy Selection Approaches
           25.1.3   Sparsity-Inducing Norms
    25.2   Feature Manipulation and Normalization
           25.2.1   Examples of Feature Transformations
    25.3   Feature Learning
           25.3.1   Dictionary Learning Using Auto-Encoders
    25.4   Summary
    25.5   Bibliographic Remarks
    25.6   Exercises

Part IV   Advanced Theory

26   Rademacher Complexities
    26.1   The Rademacher Complexity
           26.1.1   Rademacher Calculus
    26.2   Rademacher Complexity of Linear Classes
    26.3   Generalization Bounds for SVM
    26.4   Generalization Bounds for Predictors with Low ℓ1 Norm
    26.5   Bibliographic Remarks

27   Covering Numbers
    27.1   Covering
           27.1.1   Properties
    27.2   From Covering to Rademacher Complexity via Chaining
    27.3   Bibliographic Remarks

28   Proof of the Fundamental Theorem of Learning Theory
    28.1   The Upper Bound for the Agnostic Case
    28.2   The Lower Bound for the Agnostic Case
           28.2.1   Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²
           28.2.2   Showing That m(ε, 1/8) ≥ 8d/ε²
    28.3   The Upper Bound for the Realizable Case
           28.3.1   From ε-Nets to PAC Learnability

29   Multiclass Learnability
    29.1   The Natarajan Dimension
    29.2   The Multiclass Fundamental Theorem
           29.2.1   On the Proof of Theorem 29.3
    29.3   Calculating the Natarajan Dimension
           29.3.1   One-versus-All Based Classes
           29.3.2   General Multiclass-to-Binary Reductions
           29.3.3   Linear Multiclass Predictors
    29.4   On Good and Bad ERMs
    29.5   Bibliographic Remarks
    29.6   Exercises

30   Compression Bounds
    30.1   Compression Bounds
    30.2   Examples
           30.2.1   Axis Aligned Rectangles
           30.2.2   Halfspaces
           30.2.3   Separating Polynomials
           30.2.4   Separation with Margin
    30.3   Bibliographic Remarks

31   PAC-Bayes
    31.1   PAC-Bayes Bounds
    31.2   Bibliographic Remarks
    31.3   Exercises

Appendix A   Technical Lemmas
Appendix B   Measure Concentration
Appendix C   Linear Algebra

Notes
References
Index

1   Introduction

The subject of this book is automated learning, or, as we will more often call it, Machine Learning (ML). That is, we wish to program computers so that they can "learn" from input available to them.
