Foundations Of Data Science - TTIC

Transcription

Foundations of Data Science

Avrim Blum, John Hopcroft, and Ravindran Kannan

Thursday 27th February, 2020

This material has been published by Cambridge University Press as Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravi Kannan. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivative works. Please do not re-post or mirror, instead link to http://ttic.edu/blum/book.pdf. © Avrim Blum, John Hopcroft, and Ravi Kannan 2020. https://www.cambridge.org/9781108485067

Contents

1 Introduction

2 High-Dimensional Space
  2.1 Introduction
  2.2 The Law of Large Numbers
  2.3 The Geometry of High Dimensions
  2.4 Properties of the Unit Ball
    2.4.1 Volume of the Unit Ball
    2.4.2 Volume Near the Equator
  2.5 Generating Points Uniformly at Random from a Ball
  2.6 Gaussians in High Dimension
  2.7 Random Projection and Johnson-Lindenstrauss Lemma
  2.8 Separating Gaussians
  2.9 Fitting a Spherical Gaussian to Data
  2.10 Bibliographic Notes
  2.11 Exercises

3 Best-Fit Subspaces and Singular Value Decomposition (SVD)
  3.1 Introduction
  3.2 Preliminaries
  3.3 Singular Vectors
  3.4 Singular Value Decomposition (SVD)
  3.5 Best Rank-k Approximations
  3.6 Left Singular Vectors
  3.7 Power Method for Singular Value Decomposition
    3.7.1 A Faster Method
  3.8 Singular Vectors and Eigenvectors
  3.9 Applications of Singular Value Decomposition
    3.9.1 Centering Data
    3.9.2 Principal Component Analysis
    3.9.3 Clustering a Mixture of Spherical Gaussians
    3.9.4 Ranking Documents and Web Pages
    3.9.5 An Illustrative Application of SVD
    3.9.6 An Application of SVD to a Discrete Optimization Problem
  3.10 Bibliographic Notes
  3.11 Exercises

4 Random Walks and Markov Chains
  4.1 Stationary Distribution
  4.2 Markov Chain Monte Carlo
    4.2.1 Metropolis-Hastings Algorithm
    4.2.2 Gibbs Sampling
  4.3 Areas and Volumes
  4.4 Convergence of Random Walks on Undirected Graphs
    4.4.1 Using Normalized Conductance to Prove Convergence
  4.5 Electrical Networks and Random Walks
  4.6 Random Walks on Undirected Graphs with Unit Edge Weights
  4.7 Random Walks in Euclidean Space
  4.8 The Web as a Markov Chain
  4.9 Bibliographic Notes
  4.10 Exercises

5 Machine Learning
  5.1 Introduction
  5.2 The Perceptron Algorithm
  5.3 Kernel Functions and Non-linearly Separable Data
  5.4 Generalizing to New Data
    5.4.1 Overfitting and Uniform Convergence
    5.4.2 Occam’s Razor
    5.4.3 Regularization: Penalizing Complexity
  5.5 VC-Dimension
    5.5.1 Definitions and Key Theorems
    5.5.2 VC-Dimension of Some Set Systems
    5.5.3 Shatter Function for Set Systems of Bounded VC-Dimension
    5.5.4 VC-Dimension of Combinations of Concepts
    5.5.5 The Key Theorem
  5.6 VC-dimension and Machine Learning
  5.7 Other Measures of Complexity
  5.8 Deep Learning
    5.8.1 Generative Adversarial Networks (GANs)
  5.9 Gradient Descent
    5.9.1 Stochastic Gradient Descent
    5.9.2 Regularizer
  5.10 Online Learning
    5.10.1 An Example: Learning Disjunctions
    5.10.2 The Halving Algorithm
    5.10.3 The Perceptron Algorithm
    5.10.4 Inseparable Data and Hinge Loss
    5.10.5 Online to Batch Conversion
    5.10.6 Combining (Sleeping) Expert Advice
  5.11 Boosting
  5.12 Further Current Directions
    5.12.1 Semi-Supervised Learning
    5.12.2 Active Learning
    5.12.3 Multi-Task Learning
  5.13 Bibliographic Notes
  5.14 Exercises

6 Algorithms for Massive Data: Streaming, Sketching, and Sampling
  6.1 Introduction
  6.2 Frequency Moments of Data Streams
    6.2.1 Number of Distinct Elements in a Data Stream
    6.2.2 Number of Occurrences of a Given Element
    6.2.3 Frequent Elements
    6.2.4 The Second Moment
  6.3 Matrix Algorithms using Sampling
    6.3.1 Matrix Multiplication using Sampling
    6.3.2 Implementing Length Squared Sampling in Two Passes
    6.3.3 Sketch of a Large Matrix
  6.4 Sketches of Documents
  6.5 Bibliographic Notes
  6.6 Exercises

7 Clustering
  7.1 Introduction
    7.1.1 Preliminaries
    7.1.2 Two General Assumptions on the Form of Clusters
    7.1.3 Spectral Clustering
  7.2 k-Means Clustering
    7.2.1 A Maximum-Likelihood Motivation
    7.2.2 Structural Properties of the k-Means Objective
    7.2.3 Lloyd’s Algorithm
    7.2.4 Ward’s Algorithm
    7.2.5 k-Means Clustering on the Line
  7.3 k-Center Clustering
  7.4 Finding Low-Error Clusterings
  7.5 Spectral Clustering
    7.5.1 Why Project?
    7.5.2 The Algorithm
    7.5.3 Means Separated by Ω(1) Standard Deviations
    7.5.4 Laplacians
    7.5.5 Local spectral clustering
  7.6 Approximation Stability
    7.6.1 The Conceptual Idea
    7.6.2 Making this Formal
    7.6.3 Algorithm and Analysis
  7.7 High-Density Clusters
    7.7.1 Single Linkage
    7.7.2 Robust Linkage
  7.8 Kernel Methods
  7.9 Recursive Clustering Based on Sparse Cuts
  7.10 Dense Submatrices and Communities
  7.11 Community Finding and Graph Partitioning
  7.12 Spectral Clustering Applied to Social Networks
  7.13 Bibliographic Notes
  7.14 Exercises

8 Random Graphs
  8.1 The G(n, p) Model
    8.1.1 Degree Distribution
    8.1.2 Existence of Triangles in G(n, d/n)
  8.2 Phase Transitions
  8.3 Giant Component
    8.3.1 Existence of a Giant Component
    8.3.2 No Other Large Components
    8.3.3 The Case of p < 1/n
  8.4 Cycles and Full Connectivity
    8.4.1 Emergence of Cycles
    8.4.2 Full Connectivity
    8.4.3 Threshold for O(ln n) Diameter
  8.5 Phase Transitions for Increasing Properties
  8.6 Branching Processes
  8.7 CNF-SAT
    8.7.1 SAT-solvers in practice
    8.7.2 Phase Transitions for CNF-SAT
  8.8 Non-uniform Models of Random Graphs
    8.8.1 Giant Component in Graphs with Given Degree Distribution
  8.9 Growth Models
    8.9.1 Growth Model Without Preferential Attachment
    8.9.2 Growth Model With Preferential Attachment
  8.10 Small World Graphs
  8.11 Bibliographic Notes
  8.12 Exercises

9 Topic Models, Non-negative Matrix Factorization, Hidden Markov Models, and Graphical Models
  9.1 Topic Models
  9.2 An Idealized Model
  9.3 Non-negative Matrix Factorization - NMF
  9.4 NMF with Anchor Terms
  9.5 Hard and Soft Clustering
  9.6 The Latent Dirichlet Allocation Model for Topic Modeling
  9.7 The Dominant Admixture Model
  9.8 Formal Assumptions
  9.9 Finding the Term-Topic Matrix
  9.10 Hidden Markov Models
  9.11 Graphical Models and Belief Propagation
  9.12 Bayesian or Belief Networks
  9.13 Markov Random Fields
  9.14 Factor Graphs
  9.15 Tree Algorithms
  9.16 Message Passing in General Graphs
    9.16.1 Graphs with a Single Cycle
    9.16.2 Belief Update in Networks with a Single Loop
    9.16.3 Maximum Weight Matching
  9.17 Warning Propagation
  9.18 Correlation Between Variables
  9.19 Bibliographic Notes
  9.20 Exercises

10 Other Topics
  10.1 Ranking and Social Choice
    10.1.1 Randomization
    10.1.2 Examples
  10.2 Compressed Sensing and Sparse Vectors
    10.2.1 Unique Reconstruction of a Sparse Vector
    10.2.2 Efficiently Finding the Unique Sparse Solution
  10.3 Applications
    10.3.1 Biological
    10.3.2 Low Rank Matrices
  10.4 An Uncertainty Principle
    10.4.1 Sparse Vector in Some Coordinate Basis
    10.4.2 A Representation Cannot be Sparse in Both Time and Frequency Domains
  10.5 Gradient
  10.6 Linear Programming
    10.6.1 The Ellipsoid Algorithm
  10.7 Integer Optimization
  10.8 Semi-Definite Programming
  10.9 Bibliographic Notes
  10.10 Exercises

11 Wavelets
  11.1 Dilation
  11.2 The Haar Wavelet
  11.3 Wavelet Systems
  11.4 Solving the Dilation Equation
  11.5 Conditions on the Dilation Equation
  11.6 Derivation of the Wavelets from the Scaling Function
  11.7 Sufficient Conditions for the Wavelets to be Orthogonal
  11.8 Expressing a Function in Terms of Wavelets
  11.9 Designing a Wavelet System
  11.10 Applications
  11.11 Bibliographic Notes
  11.12 Exercises

12 Appendix
  12.1 Definitions and Notation
    12.1.1 Integers
    12.1.2 Substructures
    12.1.3 Asymptotic Notation
  12.2 Useful Relations
  12.3 Useful Inequalities
  12.4 Probability
    12.4.1 Sample Space, Events, and Independence
    12.4.2 Linearity of Expectation
    12.4.3 Union Bound
    12.4.4 Indicator Variables
    12.4.5 Variance
    12.4.6 Variance of the Sum of Independent Random Variables
    12.4.7 Median
    12.4.8 The Central Limit Theorem
    12.4.9 Probability Distributions
    12.4.10 Bayes Rule and Estimators
  12.5 Bounds on Tail Probability
    12.5.1 Chernoff Bounds
    12.5.2 More General Tail Bounds
  12.6 Applications of the Tail Bound
  12.7 Eigenvalues and Eigenvectors
    12.7.1 Symmetric Matrices
    12.7.2 Relationship between SVD and Eigen Decomposition
    12.7.3 Extremal Properties of Eigenvalues
    12.7.4 Eigenvalues of the Sum of Two Symmetric Matrices
    12.7.5 Norms
    12.7.6 Important Norms and Their Properties
    12.7.7 Additional Linear Algebra
    12.7.8 Distance Between Subspaces
    12.7.9 Positive Semidefinite Matrix
  12.8 Generating Functions
    12.8.1 Generating Functions for Sequences Defined by Recurrence Relationships
    12.8.2 The Exponential Generating Function and the Moment Generating Function
  12.9 Miscellaneous
    12.9.1 Lagrange Multipliers
    12.9.2 Finite Fields
    12.9.3 Application of Mean Value Theorem
  12.10 Exercises

Index

1 Introduction

Computer science as an academic discipline began in the 1960s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on a wealth of applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect, and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory.

While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory we expect to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years. One of the major changes is an increase in emphasis on probability, statistics, and numerical methods.

Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.

Modern data in diverse fields such as information processing, search, and machine learning is often advantageously represented as vectors with a large number of components. The vector representation is not just a book-keeping device to store many fields of a record. Indeed, the two salient aspects of vectors, geometric (length, dot products, orthogonality, etc.) and linear algebraic (independence, rank, singular values, etc.), turn out to be relevant and useful. Chapters 2 and 3 lay the foundations of geometry and linear algebra respectively. More specifically, our intuition from two or three dimensional space can be surprisingly off the mark when it comes to high dimensions. Chapter 2 works out the fundamentals needed to understand the differences. The emphasis of the chapter, as well as the book in general, is to get across the intellectual ideas and the mathematical foundations rather than focus on particular applications, some of which are briefly described. Chapter 3 focuses on singular value decomposition (SVD), a central tool for dealing with matrix data. We give a from-first-principles description of the mathematics and algorithms for SVD. Applications of singular value decomposition include principal component analysis, a widely used technique which we touch upon, as well as modern applications to statistical mixtures of probability densities, discrete optimization, etc., which are described in more detail.

Exploring large structures like the web or the space of configurations of a large system with deterministic methods can be prohibitively expensive. Random walks (also called Markov chains) often turn out to be more efficient as well as illuminative. The stationary distributions of such walks are important for applications ranging from web search to the simulation of physical systems. The underlying mathematical theory of such random walks, as well as connections to electrical networks, forms the core of Chapter 4 on Markov chains.

One of the surprises of computer science over the last two decades is that some domain-independent methods have been immensely successful in tackling problems from diverse areas. Machine learning is a striking example. Chapter 5 describes the foundations of machine learning, both algorithms for optimizing over given training examples and the theory for understanding when such optimization can be expected to lead to good performance on new, unseen data. This includes important measures such as the Vapnik-Chervonenkis dimension, important algorithms such as the Perceptron Algorithm, stochastic gradient descent, boosting, and deep learning, and important notions such as regularization and overfitting.

The field of algorithms has traditionally assumed that the input data to a problem is presented in random access memory, which the algorithm can repeatedly access. This is not feasible for problems involving enormous amounts of data. The streaming model and other models have been formulated to reflect this. In this setting, sampling plays a crucial role and, indeed, we have to sample on the fly. In Chapter 6 we study how to draw good samples efficiently and how to estimate statistical and linear algebra quantities with such samples.

While Chapter 5 focuses on supervised learning, where one learns from labeled training data, the problem of unsupervised learning, or learning from unlabeled data, is equally important. A central topic in unsupervised learning is clustering, discussed in Chapter 7. Clustering refers to the problem of partitioning data into groups of similar objects. After describing some of the basic methods for clustering, such as the k-means algorithm, Chapter 7 focuses on modern developments in understanding these, as well as newer algorithms and general frameworks for analyzing different kinds of clustering problems.

Central to our understanding of large structures, like the web and social networks, is building models to capture essential properties of these structures. The simplest model is that of a random graph formulated by Erdős and Rényi, which we study in detail in Chapter 8, proving that certain global phenomena, like a giant connected component, arise in such structures with only local choices. We also describe other models of random graphs.

Chapter 9 focuses on linear-algebraic problems of making sense from data, in particular topic modeling and non-negative matrix factorization. In addition to discussing well-known models, we also describe some current research on models and algorithms with provable guarantees on learning error and time. This is followed by graphical models and belief propagation.

Chapter 10 discusses ranking and social choice as well as problems of sparse representations such as compressed sensing. Additionally, Chapter 10 includes a brief discussion of linear programming and semidefinite programming. Wavelets, which are an important method for representing signals across a wide range of applications, are discussed in Chapter 11 along with some of their fundamental mathematical properties. The appendix includes a range of background material.

A word about notation in the book. To help the student, we have adopted certain notations and, with a few exceptions, adhered to them. We use lower case letters for scalar variables and functions, bold face lower case for vectors, and upper case letters for matrices. Lower case letters near the beginning of the alphabet tend to be constants; letters in the middle of the alphabet, such as i, j, and k, are indices in summations; n and m are used for integer sizes; and x, y, and z for variables. If A is a matrix, its elements are a_ij and its rows are a_i. If a_i is a vector, its coordinates are a_ij. Where the literature traditionally uses a symbol for a quantity, we also used that symbol, even if it meant abandoning our convention. If we have a set of points in some vector space and work with a subspace, we use n for the number of points, d for the dimension of the space, and k for the dimension of the subspace.

The term “almost surely” means with probability tending to one. We use ln n for the natural logarithm and log n for the base two logarithm; if we want base ten, we will use log_10. To simplify notation and to make it easier to read, we use E^2(1-x) for (E(1-x))^2 and E(1-x)^2 for E((1-x)^2). When we say “randomly select” some number of points from a given probability distribution, independence is always assumed unless otherwise stated.
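As a quick numerical check of this convention (an illustration added here, not from the book; the uniform distribution is an arbitrary choice), the following sketch estimates both quantities for x uniform on [0, 1] and shows that they differ:

```python
import numpy as np

# Illustration of the notation convention (assumption: x ~ Uniform[0,1],
# chosen only so the answers are easy to verify by hand):
#   E^2(1-x)  denotes (E[1-x])^2  -> (1/2)^2 = 1/4
#   E(1-x)^2  denotes E[(1-x)^2]  -> 1/3
rng = np.random.default_rng(0)
x = rng.random(1_000_000)

e_squared = np.mean(1 - x) ** 2       # (E[1-x])^2, about 0.25
e_of_square = np.mean((1 - x) ** 2)   # E[(1-x)^2], about 0.333

print(e_squared, e_of_square)
```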

2 High-Dimensional Space

2.1 Introduction

High dimensional data has become very important. However, high dimensional space is very different from the two and three dimensional spaces we are familiar with. Generate n points at random in d-dimensions where each coordinate is a zero mean, unit variance Gaussian. For sufficiently large d, with high probability the distances between all pairs of points will be essentially the same. Also, the volume of the unit ball in d-dimensions, the set of all points x such that |x| ≤ 1, goes to zero as the dimension goes to infinity. The volume of a high dimensional unit ball is concentrated near its surface and is also concentrated at its equator. These properties have important consequences which we will consider.

2.2 The Law of Large Numbers

If one generates random points in d-dimensional space using a Gaussian to generate coordinates, the distance between all pairs of points will be essentially the same when d is large. The reason is that the square of the distance between two points y and z,

    |y - z|^2 = sum_{i=1}^{d} (y_i - z_i)^2,

can be viewed as the sum of d independent samples of a random variable x that is the squared difference of two Gaussians. In particular, we are summing independent samples x_i = (y_i - z_i)^2 of a random variable x of bounded variance. In such a case, a general bound known as the Law of Large Numbers states that with high probability, the average of the samples will be close to the expectation of the random variable. This in turn implies that with high probability, the sum is close to the sum of the expectations.
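As a minimal computational sketch of this phenomenon (added for illustration, not part of the book; the values of n and d below are arbitrary), one can draw points with independent N(0,1) coordinates and compare the spread of pairwise distances in low and high dimension. Since each coordinate contributes (y_i - z_i)^2 with expectation 2, every pairwise distance concentrates around sqrt(2d) when d is large:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distance_range(n, d):
    """Draw n points in d dimensions with i.i.d. N(0,1) coordinates and
    return (min, mean, max) over all pairwise Euclidean distances."""
    points = rng.standard_normal((n, d))
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    return min(dists), float(np.mean(dists)), max(dists)

# Low dimension: distances vary noticeably relative to their mean.
print(pairwise_distance_range(n=20, d=3))
# High dimension: min, mean, and max are all close to sqrt(2*d) ≈ 447.
print(pairwise_distance_range(n=20, d=100_000))
```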
