Lecture 4: Neural Networks and Backpropagation

Transcription

Lecture 4: Neural Networks and Backpropagation
Fei-Fei Li & Justin Johnson & Serena Yeung, April 11, 2019

Administrative: Assignment 1. Assignment 1 is due Wednesday April 17, 11:59pm. If using Google Cloud, you don't need GPUs for this assignment! We will distribute Google Cloud coupons by this weekend.

Administrative: Alternate Midterm Time. If you need to request an alternate midterm time, see Piazza for the form and fill it out by 4/25 (two weeks from today).

Administrative: Project Proposal. The project proposal is due 4/24.

Administrative: Discussion Section. Discussion section tomorrow (1:20pm in Gates B03): how to pick a project / how to read a paper.

Where we are: a linear score function, the SVM loss (or softmax), and total loss = data loss + regularization. How do we find the best W?

Finding the best W: optimize with Gradient Descent. (Landscape image is CC0 1.0 public domain; walking man image is CC0 1.0 public domain.)

Gradient descent.
Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(
In practice: derive the analytic gradient, then check your implementation with the numerical gradient.
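A minimal sketch of such a numerical gradient check (illustrative code, not the course's; it uses a centered difference for each coordinate):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # Centered-difference numerical gradient of a scalar function f at W.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h; fp = f(W)     # f evaluated at W + h in this coordinate
        W[idx] = old - h; fm = f(W)     # f evaluated at W - h
        W[idx] = old                    # restore the original value
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Usage: compare against an analytic gradient, e.g. for f(W) = sum(W**2)
W = np.random.randn(3, 4)
analytic = 2 * W
numeric = numerical_gradient(lambda W: np.sum(W ** 2), W)
print(np.max(np.abs(analytic - numeric)))   # should be tiny
```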

Problem: linear classifiers are not very powerful.
Visual viewpoint: linear classifiers learn only one template per class.
Geometric viewpoint: linear classifiers can only draw linear decision boundaries.

One solution: feature transformation, e.g. f(x, y) = (r(x, y), θ(x, y)). Transform the data with a cleverly chosen feature transform f, then apply a linear classifier. Examples of image features: Color Histogram, Histogram of Oriented Gradients (HoG).

Image features vs ConvNets: extract fixed features f and train only the classifier on top, versus train the whole ConvNet; in both cases the output is 10 numbers giving scores for classes. Krizhevsky, Sutskever, and Hinton, "ImageNet classification with deep convolutional neural networks", NIPS 2012. Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.

Today: Neural Networks

Neural networks: without the brain stuff

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well.)

"Neural Network" is a very broad term; these are more accurately called "fully-connected networks" or sometimes "multi-layer perceptrons" (MLP).

In the example on the slide, the input x has 3072 entries, W1 maps it to a hidden vector h with 100 entries, and W2 maps h to 10 class scores s.

The function max(0, x) is called the activation function.

Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again!
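Without an activation function, stacking linear layers collapses back into a single linear map; using the 2-layer form above as an illustration:

\[
f = W_2 (W_1 x) = (W_2 W_1)\,x = W' x, \qquad W' = W_2 W_1,
\]

so the composite function can still only produce linear score functions and linear decision boundaries.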

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU.

ReLU is a good default choice for most problems.

Neural networks: Architectures. A "2-layer Neural Net" (or "1-hidden-layer Neural Net") and a "3-layer Neural Net" (or "2-hidden-layer Neural Net"); their layers are "fully-connected" layers.

Example feed-forward computation of a neural network.

Full implementation of training a 2-layer Neural Network needs about 20 lines:
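A minimal sketch of such a training loop, assuming a sigmoid hidden layer, an L2 loss, random data, and an illustrative learning rate; the hyperparameters and variable names are assumptions, not necessarily the slide's exact code:

```python
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)        # random toy data
w1, w2 = randn(D_in, H), randn(H, D_out)      # two layers of weights

for t in range(2000):
    h = 1 / (1 + np.exp(-x.dot(w1)))          # hidden layer (sigmoid)
    y_pred = h.dot(w2)                        # scores
    loss = np.square(y_pred - y).sum()        # L2 loss

    # Backward pass: chain rule, one stage at a time
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))   # sigmoid local gradient h*(1-h)

    # Gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```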

Setting the number of layers and their sizes: more neurons = more capacity.

Do not use the size of the neural network as a regularizer. Use stronger regularization instead. (Web demo with ConvNetJS: convnetjs/demo/classify2d.html)

This image by Fotis Bobolas is licensed under CC-BY 2.0.

A biological neuron: impulses are carried toward the cell body by the dendrites, and carried away from the cell body along the axon to the presynaptic terminals. This image by Felipe Perucho is licensed under CC-BY 3.0. In the artificial analogue, the firing rate is modeled with a sigmoid activation function.

Biological neurons: complex connectivity patterns. Neurons in a neural network: organized into regular layers for computational efficiency. But neural networks with random connections can work too! Xie et al, "Exploring Randomly Wired Neural Networks for Image Recognition", arXiv 2019. This image is CC0 Public Domain.

Be very careful with your brain analogies! Biological neurons: there are many different types; dendrites can perform complex non-linear computations; synapses are not a single weight but a complex non-linear dynamical system; a rate code may not be adequate. [Dendritic Computation. London and Hausser]

Problem: How to compute gradients? We now have a nonlinear score function, the SVM loss on the predictions, and regularization; the total loss is data loss + regularization. If we can compute ∂L/∂W1 and ∂L/∂W2, then we can learn W1 and W2.

(Bad) Idea: Derive the gradients on paper.
Problem: Very tedious: lots of matrix calculus, need lots of paper.
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch.
Problem: Not feasible for very complex models!

Better Idea: Computational graphs + Backpropagation. In the graph for a linear classifier, x and W feed a multiply node that produces the scores s; s goes through the hinge loss, W also goes through the regularizer R, and the two terms are added to give the total loss L.

Convolutional network (AlexNet): the computational graph runs from the input image and weights to the loss. Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Neural Turing Machine: an even larger computational graph from input to loss. Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradients of the output with respect to each input, ∂f/∂x, ∂f/∂y, ∂f/∂z.

Working backward through the graph, each step applies the chain rule:

[downstream gradient] = [upstream gradient] × [local gradient]
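A worked version of this example, assuming the commonly used function f(x, y, z) = (x + y) z with intermediate q = x + y (an assumption consistent with the inputs above):

\[
q = x + y = 3, \qquad f = q z = -12,
\]
\[
\frac{\partial f}{\partial z} = q = 3, \qquad
\frac{\partial f}{\partial q} = z = -4,
\]
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = (-4)(1) = -4, \qquad
\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y} = (-4)(1) = -4.
\]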

During the forward pass, a node f computes its output from its inputs. During the backward pass it receives the "upstream gradient" (how its output affects the final loss), multiplies it by its "local gradients" (how each input affects its output), and passes the resulting "downstream gradients" back to its inputs.
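In symbols, for a node z = f(x, y) inside a larger graph with scalar loss L:

\[
\frac{\partial L}{\partial x} = \frac{\partial z}{\partial x}\,\frac{\partial L}{\partial z}, \qquad
\frac{\partial L}{\partial y} = \frac{\partial z}{\partial y}\,\frac{\partial L}{\partial z},
\]

where ∂L/∂z is the upstream gradient and ∂z/∂x, ∂z/∂y are the local gradients.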

Another example: a longer expression whose last few gates compute a sigmoid.

Walking backward gate by gate, each step is [upstream gradient] × [local gradient]. At the add gates near the output, [0.2] × [1] = 0.2 for both inputs (the add gate distributes the gradient). At the first multiply gate, x0 gets [0.2] × [2] = 0.4 and w0 gets [0.2] × [-1] = -0.2.

The computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed! Here the last few gates together compute the sigmoid function, whose local gradient is (1 - σ(x)) σ(x), so they can be collapsed into a single sigmoid gate: [upstream gradient] × [local gradient] = [1.00] × [(1 - 0.73)(0.73)] ≈ 0.2.
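Why the sigmoid local gradient takes that form:

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}
= \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right)
= \bigl(1 - \sigma(x)\bigr)\,\sigma(x).
\]

With the forward-pass output σ = 0.73, the local gradient is (1 - 0.73)(0.73) ≈ 0.197, matching the value above.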

Patterns in gradient flow:

add gate: gradient distributor (passes the upstream gradient unchanged to each input)
mul gate: "swap multiplier" (each input receives the upstream gradient times the other input's value)
copy gate: gradient adder (a value used in several places sums the upstream gradients from all of them)
max gate: gradient router (the upstream gradient is routed to the larger input; the other input gets zero gradient)
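A compact sketch of these four backward rules for scalar inputs (illustrative code, not from the slides):

```python
def add_backward(upstream):
    # gradient distributor: both inputs receive the upstream gradient unchanged
    return upstream, upstream

def mul_backward(x, y, upstream):
    # "swap multiplier": each input gets upstream * (the other input's value)
    return upstream * y, upstream * x

def copy_backward(upstream_a, upstream_b):
    # gradient adder: a value used in two places accumulates both upstream gradients
    return upstream_a + upstream_b

def max_backward(x, y, upstream):
    # gradient router: the larger input receives the upstream gradient, the other gets 0
    return (upstream, 0.0) if x >= y else (0.0, upstream)
```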

Backprop implementation: "flat" code.

Forward pass: compute the output, one primitive operation per line. Backward pass: compute the gradients by retracing those lines in reverse, starting from the base case (the gradient of the output with respect to itself is 1), then the sigmoid gate, the add gates, and finally the multiply gates.
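A minimal sketch of such flat code, assuming the sigmoid example above, f(w, x) = 1 / (1 + exp(-(w0·x0 + w1·x1 + w2))); the input values and variable names are assumptions, chosen to reproduce the 0.73 / 0.2 / 0.4 / -0.2 numbers quoted earlier:

```python
import numpy as np

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed inputs

# Forward pass: compute output, one primitive operation per line
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
f = 1.0 / (1.0 + np.exp(-s3))       # sigmoid, f ≈ 0.73

# Backward pass: compute grads in reverse order
grad_f = 1.0                        # base case: df/df = 1
grad_s3 = grad_f * (1 - f) * f      # sigmoid gate: local gradient (1 - f) * f ≈ 0.2
grad_s2 = grad_s3                   # add gate distributes the gradient
grad_w2 = grad_s3
grad_s0 = grad_s2                   # add gate distributes the gradient
grad_s1 = grad_s2
grad_w0 = grad_s0 * x0              # multiply gate swaps inputs: ≈ -0.2
grad_x0 = grad_s0 * w0              # ≈ 0.4
grad_w1 = grad_s1 * x1
grad_x1 = grad_s1 * w1
```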

"Flat" Backprop: Do this for assignment 1! Stage your forward/backward computation! E.g. for the SVM: keep the intermediate margins.
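A hedged sketch of what staging the SVM loss might look like (vectorized numpy; the function name and shapes are illustrative, not the assignment's exact code):

```python
import numpy as np

def svm_loss_staged(scores, y):
    # scores: [N, C] class scores, y: [N] integer labels
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]        # stage 1: correct-class scores
    margins = np.maximum(0, scores - correct + 1.0)   # stage 2: hinge margins
    margins[np.arange(N), y] = 0
    loss = margins.sum() / N                          # stage 3: average data loss

    # Backward pass, one stage at a time (reverse order)
    dmargins = np.ones_like(margins) / N              # dloss/dmargins
    dscores = (margins > 0).astype(scores.dtype) * dmargins
    dscores[np.arange(N), y] -= dscores.sum(axis=1)   # gradient routed back through `correct`
    return loss, dscores
```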

"Flat" Backprop: Do this for assignment 1! E.g. for a two-layer neural net:
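A hedged sketch of staged forward/backward for a two-layer ReLU network; the ReLU nonlinearity, variable names, and shapes are assumptions, and `dscores` would come from the loss (e.g. the SVM sketch above):

```python
import numpy as np

def two_layer_backward(x, W1, b1, W2, b2, dscores):
    # Forward pass (staged: keep every intermediate)
    h_pre = x @ W1 + b1          # [N, H]
    h = np.maximum(0, h_pre)     # ReLU
    scores = h @ W2 + b2         # [N, C] (shown for completeness)

    # Backward pass (reverse order)
    dW2 = h.T @ dscores
    db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh_pre = dh * (h_pre > 0)    # ReLU routes gradient only where input was positive
    dW1 = x.T @ dh_pre
    db1 = dh_pre.sum(axis=0)
    return dW1, db1, dW2, db2
```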

Backprop implementation: modularized API. A Graph (or Net) object (rough pseudocode).
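A rough sketch echoing that pseudocode (assumed structure, not the course's exact code): the graph runs its gates forward in topological order and backward in reverse order; real code would also route each gate's output gradient to the right inputs, which is omitted here.

```python
class ComputationalGraph:
    def __init__(self, gates):
        self.gates = gates                 # assumed topologically sorted

    def forward(self):
        for gate in self.gates:
            gate.forward()                 # each gate computes its output
        return self.gates[-1].output       # final output (the loss)

    def backward(self):
        for gate in reversed(self.gates):
            gate.backward()                # each gate applies the chain rule
```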

Modularized implementation: forward / backward API. Gate / Node / Function object (actual PyTorch code) for a multiply gate z = x * y, where x, y, z are scalars. The forward pass needs to stash some values for use in backward; the backward pass multiplies the upstream gradient by the local gradients.
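A self-contained sketch of the same idea using torch.autograd.Function (the class here is illustrative; PyTorch's own multiply is implemented in C++):

```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)      # stash values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):
        x, y = ctx.saved_tensors
        # multiply upstream gradient by the local gradients ("swap multiplier")
        return grad_z * y, grad_z * x

# Usage: gradients flow through the custom Function
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)
z = Multiply.apply(x, y)
z.backward()
print(x.grad, y.grad)   # tensor(-4.), tensor(3.)
```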

Example: PyTorch operators.

PyTorch sigmoid layer: Forward (actually defined elsewhere in the source) and Backward. Source: the PyTorch codebase.

So far: backprop with scalars. What about vector-valued functions?

Recap: Vector derivatives.
Scalar to scalar: regular derivative. If x changes by a small amount, how much will y change?
Vector to scalar: the derivative is the gradient. For each element of x, if it changes by a small amount, how much will y change?
Vector to vector: the derivative is the Jacobian. For each element of x, if it changes by a small amount, how much will each element of y change?
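In symbols, using the convention (matching the [Dx × Dz] shapes on the next slides) that the input dimension comes first:

\[
x, y \in \mathbb{R}: \quad \frac{\partial y}{\partial x} \in \mathbb{R}
\]
\[
x \in \mathbb{R}^N,\; y \in \mathbb{R}: \quad \frac{\partial y}{\partial x} \in \mathbb{R}^N, \qquad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}
\]
\[
x \in \mathbb{R}^N,\; y \in \mathbb{R}^M: \quad \frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}, \qquad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}
\]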

Backprop with vectors. The loss L is still a scalar! A node f takes inputs x (dimension Dx) and y (dimension Dy) and produces an output z (dimension Dz). The "upstream gradient" dL/dz has Dz elements: for each element of z, how much does it influence L? The "local gradients" are Jacobian matrices of shapes [Dx × Dz] and [Dy × Dz]. Multiplying each local Jacobian by the upstream gradient gives the "downstream gradients" dL/dx (Dx elements) and dL/dy (Dy elements).
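Written out with the shapes above:

\[
\frac{\partial L}{\partial x} = \frac{\partial z}{\partial x}\,\frac{\partial L}{\partial z}
\quad [D_x] = [D_x \times D_z]\,[D_z],
\qquad
\frac{\partial L}{\partial y} = \frac{\partial z}{\partial y}\,\frac{\partial L}{\partial z}
\quad [D_y] = [D_y \times D_z]\,[D_z].
\]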

Backprop with vectors, example: f(x) = max(0, x) elementwise.

4D input x: [1, -2, 3, -1] → 4D output y: [1, 0, 3, 0].
Upstream gradient, 4D dL/dy: [4, -1, 5, 9].

The Jacobian dy/dx is diagonal, with a 1 wherever the input is positive:
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]

[dy/dx] [dL/dy] gives the 4D dL/dx: [4, 0, 5, 0].

The Jacobian is sparse: the off-diagonal entries are always zero! Never explicitly form the Jacobian; instead use implicit multiplication.
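A small sketch of that implicit multiplication for elementwise ReLU: instead of building the 4×4 Jacobian, mask the upstream gradient directly.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
dL_dy = np.array([4.0, -1.0, 5.0, 9.0])     # upstream gradient

y = np.maximum(0, x)                         # forward: [1, 0, 3, 0]
dL_dx = dL_dy * (x > 0)                      # backward: [4, 0, 5, 0]
print(y, dL_dx)
```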

Backprop with matrices (or tensors). The loss L is still a scalar, and dL/dx always has the same shape as x! Inputs x of shape [Dx × Mx] and y of shape [Dy × My] produce an output z of shape [Dz × Mz]. The "upstream gradient" dL/dz has shape [Dz × Mz]: for each element of z, how much does it influence L? The "local gradients" are now generalized Jacobians of shapes [(Dx × Mx) × (Dz × Mz)] and [(Dy × My) × (Dz × Mz)]: for each element of the input, how much does it influence each element of z?

Backprop with matrices: matrix multiply.

x: [N × D]
[ 2 1 -3 ]
[ -3 4 2 ]
w: [D × M]
[ 3 2 1 -1 ]
[ 2 1 3 2 ]
[ 3 2 1 -2 ]
y: [N × M]
[ 13 9 -2 -6 ]
[ 5 2 17 1 ]
dL/dy: [N × M]
[ 2 3 -3 9 ]
[ -8 1 4 6 ]

Jacobians: dy/dx has shape [(N × D) × (N × M)] and dy/dw has shape [(D × M) × (N × M)]. For a neural net we may have N = 64, D = M = 4096, so each Jacobian takes 256 GB of memory! We must work with them implicitly.

Q: What parts of y are affected by one element of x? A: An element x_{n,d} affects the whole row y_{n,·}.
Q: How much does x_{n,d} affect y_{n,m}? A: By w_{d,m}.

Putting these together: dL/dx = (dL/dy) w^T, with shapes [N × D] = [N × M] [M × D]. By similar logic, dL/dw = x^T (dL/dy), with shapes [D × M] = [D × N] [N × M]. These formulas are easy to remember: they are the only way to make the shapes match up!

Also see the derivation in the course notes (…backprop.pdf).
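A quick numpy sketch checking these shape-matching formulas against a numerical gradient (small random matrices; illustrative code, not the slide's numbers):

```python
import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)
dL_dy = np.random.randn(N, M)               # pretend upstream gradient

dL_dx = dL_dy @ w.T                          # [N, D] = [N, M] [M, D]
dL_dw = x.T @ dL_dy                          # [D, M] = [D, N] [N, M]

# Numerical check of dL/dx[0, 0], using the surrogate scalar
# L = sum(dL_dy * (x @ w)), whose gradient w.r.t. y is exactly dL_dy.
h = 1e-6
xp = x.copy(); xp[0, 0] += h
num = (np.sum(dL_dy * (xp @ w)) - np.sum(dL_dy * (x @ w))) / h
print(dL_dx[0, 0], num)                      # these should closely agree
```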

Summary for today:
(Fully-connected) neural networks are stacks of linear functions and nonlinear activation functions; they have much more representational power than linear classifiers.
Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs / parameters / intermediates.
Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
forward: compute the result of an operation and save any intermediates needed for gradient computation in memory.
backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.

Next Time: Convolutional Networks!

A vectorized example, worked through step by step.

Always check: the gradient with respect to a variable should have the same shape as the variable.
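As an illustration of that shape-checking rule, a small numpy sketch assuming the commonly used vectorized example f(x, W) = ‖W x‖², which may differ from the exact function on the slides; the numbers are illustrative:

```python
import numpy as np

W = np.array([[0.1, 0.5], [-0.3, 0.8]])
x = np.array([0.2, 0.4])

q = W @ x                      # q: shape [2]
f = np.sum(q ** 2)             # scalar

dq = 2.0 * q                   # df/dq, same shape as q
dW = np.outer(dq, x)           # df/dW, same shape as W -> [2, 2]
dx = W.T @ dq                  # df/dx, same shape as x -> [2]
```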

In discussion section: a matrix example.
