165B Machine Learning Linear Models - UC Santa Barbara

Transcription

165B Machine Learning
Linear Models
Lei Li (leili@cs)
UCSB
Acknowledgement: Slides borrowed from Bhiksha Raj's 11485 and Mu Li & Alex Smola's 157 courses on Deep Learning, with modification

Recap
• Neural networks began as computational models of the brain
• Neural network models are connectionist machines
– They comprise networks of neural units
• Neural networks can model Boolean functions
– McCulloch and Pitts model: neurons as Boolean threshold units
– Hebb's learning rule: neurons that fire together wire together
– Rosenblatt's perceptron: a variant of the McCulloch and Pitts neuron with a provably convergent learning rule
‣ But individual perceptrons are limited in their capacity (Minsky and Papert)
– Multi-layer perceptrons can model arbitrarily complex Boolean functions

A model for Boolean functions
[figure: a threshold network over inputs X, Y, A computing an output Z, with weights and thresholds shown on the edges]

Neural Network
• A network is a function
– Given an input, it computes the function layerwise to predict an output
‣ More generally, given one or more inputs, it predicts one or more outputs
• Given a labeled dataset {(xn, yn)}, how to train a model that maps from x → y?
• Idea: build a complex model out of massive numbers of simple basic units

What is Deep Learning?
"Deep learning is a particular kind of machine learning that achieves great power and flexibility by representing the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones."
– Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning, 2016

What is Machine Learning?
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
– [Tom Mitchell, Machine Learning, 1997]

How to Build a Machine Learning System
• Task T:
– What is the input and output?
• Experience E:
– What is the training data? How to get it easily?
• Performance measure P:
– How to measure success?
• Model:
– What is the computational architecture?
• Training:
– How to improve with experience?
– What is the loss?

Task T
• To find a function f: x → y
– Classification: label y is categorical
– Regression: label y is continuous numerical
• Example:
– Image classification
‣ Input space: x ∈ ℝ^{h×h×3}; an image is h × h pixels (RGB), so it is a tensor
‣ Output space: y ∈ {1, …, 10} in CIFAR-10, or {1, …, 1000} in ImageNet
– Text-to-image generation
‣ Input: x is a sentence in V^L, where V is the vocabulary and L is the length
‣ Output: y ∈ ℝ^{h×h×3}

Neural Networks that Map Input to Output
[figure: example mappings, e.g. speech recognition, translation (Chinese "我很高兴" → "I am very happy"), a voice command ("Hi, Siri, please turn on the light"), and image classification ("Cat")]

Experience E
• Supervised learning: pairs of (x, y) are given
• Unsupervised learning: only x are given, but not y
• Semi-supervised learning: both paired data and raw data
• Self-supervised learning:
– use raw data but construct supervision signals from the data itself
– e.g. predict neighboring pixel values for an image
– e.g. predict neighboring words for a sentence

How Is Experience Collected?
• Offline/batch learning:
– All data are available at training time
– At inference time: fix the model and predict
• Online learning:
– Experience data is collected one example (or one mini-batch) at a time (can be either labeled or unlabeled)
– Incrementally train and update the model, and make predictions on the fly with the current, changing model
– e.g. predicting ad clicks on a search engine
• Reinforcement learning:
– A system (agent) interacts with an environment (or other agents) by taking actions
– Experience data (reward) is collected from the environment
– The system learns to maximize the total accumulated reward
– e.g. training a system to play chess

Learning with Various Numbers of Tasks
• Multi-task learning
– one system/model learns multiple tasks simultaneously, with shared or separate experience and with different performance measures
– e.g. training a model that can detect human faces and cat faces at the same time
• Pre-training & fine-tuning
– Pre-training stage: a system is trained on one task, usually with very large, easily available data
– Fine-tuning stage: it is trained on another task of interest, with different (often smaller) data
– e.g. training an image classification model on ImageNet, then fine-tuning on an object detection dataset

Machine Translation as a Machine Learning Task
• Input (source)
– discrete sequence in the source language, with vocabulary Vs
• Output (target)
– discrete sequence in the target language, with vocabulary Vt
• Experience E
– Supervised: parallel corpus, e.g. English-Chinese parallel pairs
– Unsupervised: monolingual corpus, e.g. learning MT with only Tamil text and English text, but no English-Tamil pairs
– Semi-supervised: both
• Number of languages involved
– Bilingual versus multilingual MT
– Note: it can be multilingual parallel data, or multilingual monolingual data
• Measure P
– Human evaluation metric, or automatic metric (e.g. BLEU); see previous lecture

Story so far
• Machine learning is the study of machines that can improve their performance with more experience

Linear Models

House Buying
• Pick a house, take a tour, and read facts
• Estimate its price, bid
[figure: listing → predicted price]

House Price Prediction
• Very important: that's real money!
[figure: historical price estimation over years (in $100K), showing what A paid and what B paid; Redfin overestimated the price, and B believed it]

A Simplified Model
• Assumption 1: the key factors impacting the price are #beds, #baths, and living sqft, denoted by x1, x2, x3
• Assumption 2: the sale price is a weighted sum over the key factors
y = w1 x1 + w2 x2 + w3 x3 + b
• Weights and bias are determined later

Linear Model (Linear Regression)
• Given n-dimensional inputs x = [x1, x2, …, xn]^T
• A linear model has an n-dimensional weight vector and a bias: w = [w1, w2, …, wn]^T, b
• The output is a weighted sum of the inputs:
y = w1 x1 + w2 x2 + … + wn xn + b
• Vectorized version:
y = ⟨w, x⟩ + b
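
A minimal PyTorch sketch of this computation (sizes and random values below are illustrative, not from the slides); it checks that the elementwise sum and the vectorized inner product agree:

import torch

n = 3                      # number of input features (illustrative)
x = torch.randn(n)         # input vector
w = torch.randn(n)         # weight vector
b = torch.randn(1)         # bias

# elementwise form: y = w1*x1 + ... + wn*xn + b
y_elementwise = (w * x).sum() + b

# vectorized form: y = <w, x> + b
y_vectorized = torch.dot(w, x) + b

assert torch.allclose(y_elementwise, y_vectorized)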

Linear Model as a Single-layer Neural Network
[diagram: inputs x1, x2, x3 each connected to a single output unit o]

Measure Estimation Quality
• Compare the true value vs the estimated value
– e.g. real sale price vs estimated house price
• Let y be the true value and ŷ the estimated value; we can compute the loss
ℓ(y, ŷ) = (y − ŷ)²
• It is called the squared loss
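
As a quick illustration with assumed toy values (not from the slides), the squared loss in PyTorch:

import torch

y_true = torch.tensor(3.0)    # true sale price, toy value
y_hat = torch.tensor(2.5)     # estimated price, toy value

loss = (y_true - y_hat) ** 2  # squared loss (y - y_hat)^2
print(loss)                   # tensor(0.2500)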

Training Data
• Collect multiple data points to fit the parameters
– e.g. houses sold in the last 6 months
• This is called the training data
• The more the better
• Assume n examples:
X = [x1, x2, …, xn]^T
y = [y1, y2, …, yn]^T
D = {⟨xi, yi⟩}

Training Objective
• Training loss:
ℓ(X, y, w, b) = (1/n) Σ_{i=1}^{n} (yi − ⟨xi, w⟩ − b)² = (1/n) ‖y − Xw − b‖²
• Minimize the loss to learn the parameters:
w*, b* = arg min_{w,b} ℓ(X, y, w, b)
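
A small sketch, assuming illustrative random data, that checks the per-example sum against the vectorized form of the training loss:

import torch

n, d = 100, 3          # number of examples and features (illustrative)
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d)
b = torch.randn(1)

# per-example form: (1/n) * sum_i (y_i - <x_i, w> - b)^2
loss_sum = sum((y[i] - X[i] @ w - b) ** 2 for i in range(n)) / n

# vectorized form: (1/n) * ||y - Xw - b||^2
loss_vec = torch.sum((y - X @ w - b) ** 2) / n

assert torch.allclose(loss_sum, loss_vec)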

Norm
• A "distance" metric
• ℓ1 norm:
‖x‖_1 = |x1| + |x2| + …
• ℓ2 norm:
‖x‖_2 = (x1² + x2² + …)^{1/2}
• ℓp norm:
‖x‖_p = (|x1|^p + |x2|^p + …)^{1/p}
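
A quick numerical check of these norms against torch.linalg.norm, on an assumed toy vector:

import torch

x = torch.tensor([3.0, -4.0])

l1 = x.abs().sum()                      # ||x||_1 = |3| + |-4| = 7
l2 = (x ** 2).sum().sqrt()              # ||x||_2 = sqrt(9 + 16) = 5
p = 3.0
lp = (x.abs() ** p).sum() ** (1.0 / p)  # ||x||_p

assert torch.allclose(l1, torch.linalg.norm(x, ord=1))
assert torch.allclose(l2, torch.linalg.norm(x, ord=2))
assert torch.allclose(lp, torch.linalg.norm(x, ord=p))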

Closed-form Solution
• Add the bias into the weights:
X ← [X, 1], w ← [w; b]
ℓ(X, y, w) = (1/n) ‖y − Xw‖²
• The gradient is
∂ℓ(X, y, w)/∂w = −(2/n) (y − Xw)^T X
• The loss is convex, so the optimal solution satisfies
∂ℓ(X, y, w)/∂w = 0
⇔ (2/n) (y − Xw)^T X = 0
⇔ w* = (X^T X)^{−1} X^T y
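
A sketch of the closed-form solution on assumed random data; the lstsq comparison is an extra numerical-stability check, not part of the slides:

import torch

m, n = 50, 3           # examples and features (illustrative)
X = torch.randn(m, n)
y = torch.randn(m)

# absorb the bias: X <- [X, 1]
X1 = torch.cat([X, torch.ones(m, 1)], dim=1)

# normal equations: w* = (X^T X)^{-1} X^T y
w_star = torch.linalg.inv(X1.T @ X1) @ X1.T @ y

# a least-squares solver is numerically preferable in practice
w_lstsq = torch.linalg.lstsq(X1, y.unsqueeze(-1)).solution.squeeze(-1)
assert torch.allclose(w_star, w_lstsq, atol=1e-4)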

Matrix Calculus

Gradients
• Generalize derivatives into vectors: ∂y/∂x is defined for all combinations of scalar/vector x and scalar/vector y
– scalar y, scalar x → scalar
– scalar y, vector x → row vector
– vector y, scalar x → column vector
– vector y, vector x → matrix

Gradients of Vector Functions
• x = [x1, x2, …, xn]^T
∂y/∂x = [∂y/∂x1, ∂y/∂x2, …, ∂y/∂xn]
• Example: y = x1² + 2x2²
∂y/∂x = [2x1, 4x2]
• At (x1, x2) = (1, 1), the gradient points in direction (2, 4), perpendicular to the contour lines
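
The same gradient can be checked with autograd; this small sketch evaluates it at the point (1, 1) from the slide:

import torch

x = torch.tensor([1.0, 1.0], requires_grad=True)
y = x[0] ** 2 + 2 * x[1] ** 2

y.backward()
print(x.grad)  # tensor([2., 4.]) -- i.e. [2*x1, 4*x2] at (1, 1)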

Examples
(scalar y, vector x; a is not a function of x; 0 and 1 are vectors)

y         ∂y/∂x
a         0^T
au        a ∂u/∂x
sum(x)    1^T
‖x‖²      2x^T
u + v     ∂u/∂x + ∂v/∂x
uv        (∂u/∂x) v + u (∂v/∂x)
⟨u, v⟩    u^T ∂v/∂x + v^T ∂u/∂x

Gradients of Vector Functions ∂y/∂x
• y = [y1, y2, …, ym]^T
∂y/∂x = [∂y1/∂x, ∂y2/∂x, …, ∂ym/∂x]^T
• ∂y/∂x (scalar y, vector x) is a row vector, while ∂y/∂x (vector y, scalar x) is a column vector
• This is called numerator-layout notation; the reversed version is called denominator-layout notation

• For x ∈ ℝ^n and y ∈ ℝ^m, the gradient ∂y/∂x ∈ ℝ^{m×n}:

∂y/∂x = [ ∂y1/∂x
          ∂y2/∂x
          ⋮
          ∂ym/∂x ]
      = [ ∂y1/∂x1  ∂y1/∂x2  …  ∂y1/∂xn
          ∂y2/∂x1  ∂y2/∂x2  …  ∂y2/∂xn
          ⋮
          ∂ym/∂x1  ∂ym/∂x2  …  ∂ym/∂xn ]

i.e. row i is the gradient of yi with respect to x

Examples
(x ∈ ℝ^n, y ∈ ℝ^m, ∂y/∂x ∈ ℝ^{m×n}; a, a and A are not functions of x; 0 and I are matrices)

y        ∂y/∂x
a        0
x        I
Ax       A
x^T A    A^T
au       a ∂u/∂x
Au       A ∂u/∂x
u + v    ∂u/∂x + ∂v/∂x
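
One of these table entries, ∂(Ax)/∂x = A, can be verified numerically with torch.autograd.functional.jacobian (sizes are illustrative):

import torch
from torch.autograd.functional import jacobian

m, n = 4, 3            # illustrative sizes
A = torch.randn(m, n)
x = torch.randn(n)

J = jacobian(lambda v: A @ v, x)  # Jacobian of y = Ax with respect to x
assert J.shape == (m, n)
assert torch.allclose(J, A)       # matches the table entry: d(Ax)/dx = A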

Generalize to Matrices

              x (1,)         x (n, 1)         X (n, k)
y (1,)        ∂y/∂x (1,)     ∂y/∂x (1, n)     ∂y/∂X (k, n)
y (m, 1)      ∂y/∂x (m, 1)   ∂y/∂x (m, n)     ∂y/∂X (m, k, n)
Y (m, l)      ∂Y/∂x (m, l)   ∂Y/∂x (m, l, n)  ∂Y/∂X (m, l, k, n)

Generalize to Vectors (Chain Rule)
• y = f(u), u = g(x) ⇒ ∂y/∂x = (∂y/∂u)(∂u/∂x)
• Shapes:
(1, n) = (1,) × (1, n)    [scalar y, scalar u, vector x]
(1, n) = (1, k) × (k, n)  [scalar y, vector u, vector x]
(m, n) = (m, k) × (k, n)  [vector y, vector u, vector x]

Example 1
• Assume x, w ∈ ℝ^n, y ∈ ℝ. Compute ∂z/∂w for
z = (⟨x, w⟩ − y)²
• Decompose: a = ⟨x, w⟩, b = a − y, z = b²
• Apply the chain rule ∂y/∂x = (∂y/∂u)(∂u/∂x):
∂z/∂w = (∂z/∂b)(∂b/∂a)(∂a/∂w)
      = 2b · 1 · x^T
      = 2(⟨x, w⟩ − y) x^T
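
A sketch verifying this result with autograd, on assumed random values:

import torch

n = 3
x = torch.randn(n)
y = torch.randn(())
w = torch.randn(n, requires_grad=True)

z = (torch.dot(x, w) - y) ** 2
z.backward()

with torch.no_grad():
    analytic = 2 * (torch.dot(x, w) - y) * x  # 2(<x, w> - y) x^T
assert torch.allclose(w.grad, analytic)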

Solving the Linear Model
• ŵ = arg min_w ‖Xw − y‖²
• Assume X ∈ ℝ^{m×n}, w ∈ ℝ^n, y ∈ ℝ^m. Compute ∂z/∂w = 0 for
z = ‖Xw − y‖²
• Decompose: a = Xw, b = a − y, z = ‖b‖²
• Apply the chain rule:
∂z/∂w = (∂z/∂b)(∂b/∂a)(∂a/∂w) = 2b^T · I · X = 2(Xw − y)^T X
• Let 2(Xw − y)^T X = 0; then
w* = (X^T X)^{−1} X^T y
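
The matrix version can be checked the same way; this sketch also confirms the gradient vanishes at the closed-form optimum (data is illustrative):

import torch

m, n = 50, 3
X = torch.randn(m, n)
y = torch.randn(m)
w = torch.randn(n, requires_grad=True)

z = torch.sum((X @ w - y) ** 2)  # z = ||Xw - y||^2
z.backward()

with torch.no_grad():
    analytic = 2 * (X @ w - y) @ X          # 2(Xw - y)^T X
    assert torch.allclose(w.grad, analytic, atol=1e-4)

    # the gradient vanishes at the closed-form optimum
    w_star = torch.linalg.inv(X.T @ X) @ X.T @ y
    print((2 * (X @ w_star - y) @ X).abs().max())  # ~0 up to rounding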

Optimality Condition
• How to find arg min_θ f(θ)?
• The optimal θ* satisfies ∂f/∂θ = 0

More about matrix calculus
• The Matrix Cookbook: http://www2.imm.dtu.dk/pubdb/edoc/imm3274.pdf

Quiz: slides/156052

Linear model in PyTorch

import torch

class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        # a single fully connected layer computes y = Wx + b
        self.linear = torch.nn.Linear(inputSize, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out
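
A usage sketch of this class; the synthetic data, learning rate, and epoch count below are illustrative assumptions, not from the slides:

import torch

# synthetic data: y = 2x + 1 plus noise (illustrative)
x_train = torch.randn(100, 1)
y_train = 2 * x_train + 1 + 0.1 * torch.randn(100, 1)

model = linearRegression(inputSize=1, outputSize=1)
criterion = torch.nn.MSELoss()  # mean squared loss, as above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

print(model.linear.weight.item(), model.linear.bias.item())  # ~2.0 and ~1.0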

Recap
• Machine learning is the study of machines that can improve their performance with more experience
• Linear regression model
– Output is linearly dependent on the input variables
– Minimize squared loss

Next Up
• Classification: logistic regression
• Multilayer perceptron
• More on neural networks as universal approximators
– And the issue of depth in networks
– How to train neural networks from data
