CS224N/Ling284 - Stanford University

Transcription

Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning and Richard Socher
Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)

1. Introduction
Assignment 2 is all about making sure you really understand the math of neural networks; then we'll let the software do it!
We'll go through it quickly today, but also look at the readings!
This will be a tough week for some! → Make sure to get help if you need it
Visit office hours Friday/Tuesday
Note: Monday is MLK Day – no office hours, sorry! But we will be on Piazza
Read tutorial materials given in the syllabus

NER: Binary classification for center word being a location
We do supervised training and want a high score if it's a location
$J_t(\theta) = \sigma(s) = \frac{1}{1 + e^{-s}}$
$x = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$
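
As a rough illustration of this window classifier, here is a minimal NumPy sketch of the forward computation; the dimensions, the random parameters, and the choice of tanh as the nonlinearity are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of the window-based scorer (all dimensions and parameters are
# illustrative assumptions): x is the concatenation of the 5 window word vectors,
# s = u^T f(Wx + b), and sigma(s) is the predicted probability of "location".
rng = np.random.default_rng(0)
d, window, hidden = 4, 5, 8                  # assumed sizes
x = rng.normal(size=window * d)              # x = [x_museums; x_in; x_Paris; x_are; x_amazing]
W = rng.normal(size=(hidden, window * d))
b = np.zeros(hidden)
u = rng.normal(size=hidden)

z = W @ x + b
h = np.tanh(z)                               # f = tanh (assumption)
s = u @ h                                    # scalar score
J = 1.0 / (1.0 + np.exp(-s))                 # sigma(s)
print(s, J)
```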

Remember: Stochastic Gradient Descent
Update equation: $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$
$\alpha$ = step size or learning rate
How can we compute $\nabla_\theta J(\theta)$?
1. By hand
2. Algorithmically: the backpropagation algorithm
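
As a small illustration (not from the slides), one SGD step in NumPy, with a made-up gradient standing in for $\nabla_\theta J(\theta)$:

```python
import numpy as np

# One SGD step: theta_new = theta_old - alpha * grad_J.
def sgd_step(theta, grad_J, alpha=0.01):
    return theta - alpha * grad_J

theta = np.array([0.5, -1.0, 2.0])
grad_J = np.array([0.1, -0.2, 0.4])          # stand-in gradient for illustration
print(sgd_step(theta, grad_J))
```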

Lecture Plan
Lecture 4: Gradients by hand and algorithmically
1. Introduction (5 mins)
2. Matrix calculus (40 mins)
3. Backpropagation (35 mins)

Computing Gradients by Hand
Matrix calculus: fully vectorized gradients
"multivariable calculus is just like single-variable calculus if you use matrices"
Much faster and more useful than non-vectorized gradients
But doing a non-vectorized gradient can be good for intuition; watch last week's lecture for an example
Lecture notes and matrix calculus notes cover this material in more detail
You might also review Math 51, which has a new textbook (…1/textbook.html)

Gradients
Given a function with 1 output and 1 input: $f(x) = x^3$
Its gradient (slope) is its derivative: $\frac{df}{dx} = 3x^2$
"How much will the output change if we change the input a bit?"

Gradients
Given a function with 1 output and $n$ inputs: $f(x) = f(x_1, x_2, \dots, x_n)$
Its gradient is a vector of partial derivatives with respect to each input:
$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right]$

Jacobian Matrix: Generalization of the Gradient
Given a function with $m$ outputs and $n$ inputs
Its Jacobian is an $m \times n$ matrix of partial derivatives: $\left(\frac{\partial f}{\partial x}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$

Chain Rule
For one-variable functions: multiply derivatives
For multiple variables at once: multiply Jacobians
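
A small numerical sketch of the multivariable chain rule, assuming the running example $h = f(Wx + b)$ with $f = \tanh$; the product of Jacobians is checked against finite differences:

```python
import numpy as np

# Illustration (assumptions: f = tanh, small random W, b, x): for h = f(z) with z = Wx + b,
# the chain rule says dh/dx = (dh/dz) @ (dz/dx). We check against finite differences.
rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.normal(size=(n, m))
b = rng.normal(size=n)
x = rng.normal(size=m)

def composed(x):
    return np.tanh(W @ x + b)            # h = f(Wx + b)

z = W @ x + b
J_h_z = np.diag(1 - np.tanh(z) ** 2)     # dh/dz: elementwise activation -> diagonal Jacobian
J_z_x = W                                # dz/dx
J_chain = J_h_z @ J_z_x                  # chain rule: multiply Jacobians

# Finite-difference Jacobian for comparison
eps = 1e-6
J_fd = np.stack([(composed(x + eps * e) - composed(x - eps * e)) / (2 * eps)
                 for e in np.eye(m)], axis=1)
print(np.allclose(J_chain, J_fd, atol=1e-6))
```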

Example Jacobian: Elementwise activation function
The function has $n$ outputs and $n$ inputs → an $n$ by $n$ Jacobian
For elementwise $h = f(z)$, the Jacobian is diagonal: $\frac{\partial h}{\partial z} = \mathrm{diag}(f'(z))$

Other Jacobians
$\frac{\partial}{\partial b}(Wx + b) = I$ (identity matrix), $\frac{\partial}{\partial u}(u^T h) = h^T$
Fine print: $h^T$ is the correct Jacobian; later we discuss the "shape convention", under which the answer would be $h$
Compute these at home for practice! Check your answers with the lecture notes

Back to our Neural Net!
$x = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$
Let's find $\frac{\partial s}{\partial b}$
Really, we care about the gradient of the loss, but we will compute the gradient of the score for simplicity

1. Break up equations into simple pieces
$s = u^T h$
$h = f(z)$
$z = Wx + b$
$x$ (input)
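
A minimal NumPy sketch of these pieces (the sizes and the choice of $f = \tanh$ are illustrative assumptions):

```python
import numpy as np

# The same pieces, as code (sizes and f = tanh are assumptions).
rng = np.random.default_rng(2)
n, m = 3, 5
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

z = W @ x + b        # z = Wx + b
h = np.tanh(z)       # h = f(z)
s = u @ h            # s = u^T h
print(s)
```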

2. Apply the chain rule
$\frac{\partial s}{\partial b} = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial b}$

3. Write out the Jacobians
Useful Jacobians from the previous slides: $\frac{\partial s}{\partial h} = u^T$, $\frac{\partial h}{\partial z} = \mathrm{diag}(f'(z))$, $\frac{\partial z}{\partial b} = I$
So $\frac{\partial s}{\partial b} = u^T\,\mathrm{diag}(f'(z))\,I = u^T \circ f'(z)$

Re-using Computation
Suppose we now want to compute $\frac{\partial s}{\partial W}$. Using the chain rule again:
$\frac{\partial s}{\partial W} = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial W}$
The first two terms are the same as for $\frac{\partial s}{\partial b}$! Let's avoid duplicated computation

Re-using Computation
$\frac{\partial s}{\partial W} = \delta\,\frac{\partial z}{\partial W}$, where $\delta = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z} = u^T \circ f'(z)$
$\delta$ is the local error signal

Derivative with respect to Matrix: Output shape
What does $\frac{\partial s}{\partial W}$ look like? 1 output, $n \times m$ inputs: a 1 by $nm$ Jacobian? Inconvenient to work with
Instead we use the shape convention: the shape of the gradient is the shape of the parameters
So $\frac{\partial s}{\partial W}$ is $n$ by $m$:

Derivative with respect to Matrix
Remember that $\delta$ is going to be in our answer
The other term should be $x$, because $z = Wx + b$
Answer is: $\frac{\partial s}{\partial W} = \delta^T x^T$
$\delta$ is the local error signal at $z$; $x$ is the local input signal
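
Putting the pieces together, a hedged NumPy sketch of these hand-derived gradients for $s = u^T f(Wx + b)$; again, the dimensions and $f = \tanh$ are assumptions:

```python
import numpy as np

# Sketch of the hand-derived gradients under the shape convention
# (illustrative sizes, f = tanh assumed).
rng = np.random.default_rng(3)
n, m = 3, 5
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

z = W @ x + b
h = np.tanh(z)
s = u @ h

f_prime = 1 - np.tanh(z) ** 2      # f'(z) for tanh
delta = u * f_prime                # local error signal: delta = u ∘ f'(z), shape (n,)
grad_b = delta                     # ds/db under the shape convention, shape (n,)
grad_W = np.outer(delta, x)        # ds/dW = delta^T x^T, an n x m outer product
print(grad_W.shape)                # (3, 5): same shape as W, per the shape convention
```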

Why the Transposes?
Hacky answer: this makes the dimensions work out! $\delta^T x^T$ is $[n \times 1][1 \times m]$, giving an $n \times m$ result
Useful trick for checking your work!
Full explanation in the lecture notes; the intuition: each input goes to each output, so you get an outer product

Deriving local input gradient in backprop
For this function: $\frac{\partial s}{\partial W} = \delta\,\frac{\partial z}{\partial W} = \delta\,\frac{\partial}{\partial W}(Wx + b)$
Let's consider the derivative of a single weight $W_{ij}$
$W_{ij}$ only contributes to $z_i$: for example, $W_{23}$ is only used to compute $z_2$, not $z_1$
$z_i = W_{i\cdot}\,x + b_i = \sum_{k} W_{ik} x_k + b_i$, so $\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}} \sum_{k} W_{ik} x_k = x_j$
[figure: small network with inputs $x_1, x_2, x_3$ and a bias input 1, hidden units $h_1 = f(z_1)$ and $h_2 = f(z_2)$, weight $W_{23}$, bias $b_2$, and output score $s = u^T h$]

What shape should derivatives be?
$\frac{\partial s}{\partial b} = u^T \circ f'(z)$ is a row vector
But convention says our gradient should be a column vector, because $b$ is a column vector
Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
We expect answers to follow the shape convention
But Jacobian form is useful for computing the answers

What shape should derivatives be? Two options:
1. Use Jacobian form as much as possible, and reshape to follow the shape convention at the end: what we just did, but at the end transpose $\frac{\partial s}{\partial b}$ to make the derivative a column vector, resulting in $\delta^T$
2. Always follow the shape convention: look at dimensions to figure out when to transpose and/or reorder terms

Deriving gradients: Tips
Tip 1: Carefully define your variables and keep track of their dimensionality!
Tip 2: Chain rule! If $y = f(u)$ and $u = g(x)$, i.e., $y = f(g(x))$, then $\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\,\frac{\partial u}{\partial x}$. Keep straight what variables feed into what computations
Tip 3: For the top softmax part of a model: first consider the derivative wrt $f_c$ when $c = y$ (the correct class), then consider the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes)
Tip 4: Work out element-wise partial derivatives if you're getting confused by matrix calculus!
Tip 5: Use the Shape Convention. Note: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer

3. Backpropagation
We've almost shown you backpropagation: it's taking derivatives and using the (generalized, multivariate, or matrix) chain rule
Other trick: we re-use derivatives computed for higher layers in computing derivatives for lower layers to minimize computation

Computation Graphs and Backpropagation
We represent our neural net equations as a graph
Source nodes: inputs; interior nodes: operations; edges pass along the result of the operation
"Forward Propagation": compute values along the edges from the inputs to the output

Backpropagation
Go backwards along edges; pass along gradients

Backpropagation: Single Node
Node receives an "upstream gradient"; the goal is to pass on the correct "downstream gradient"

Backpropagation: Single Node
Each node has a local gradient: the gradient of its output with respect to its input
[downstream gradient] = [upstream gradient] x [local gradient]
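
A tiny illustration, assuming a node that computes tanh:

```python
import numpy as np

# Single node computing h = tanh(z): downstream = upstream * local.
z = 0.5
upstream = 2.0                      # pretend gradient of the loss wrt h, arriving from above
local = 1 - np.tanh(z) ** 2         # dh/dz, the node's local gradient
downstream = upstream * local       # gradient of the loss wrt z, passed further down
print(downstream)
```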

Backpropagation: Single Node
What about nodes with multiple inputs, such as a * node?

Backpropagation: Single Node
Multiple inputs → multiple local gradients: one downstream gradient per input

An Example
$f(x, y, z) = (x + y)\,\max(y, z)$ with $x = 1$, $y = 2$, $z = 0$

Forward prop steps:
$a = x + y = 3$, $b = \max(y, z) = 2$, $f = a \cdot b = 6$

Local gradients:
$\frac{\partial a}{\partial x} = 1$, $\frac{\partial a}{\partial y} = 1$
$\frac{\partial b}{\partial y} = \mathbf{1}[y > z] = 1$, $\frac{\partial b}{\partial z} = \mathbf{1}[z > y] = 0$
$\frac{\partial f}{\partial a} = b = 2$, $\frac{\partial f}{\partial b} = a = 3$

Backward pass (downstream = upstream * local):
$\frac{\partial f}{\partial f} = 1$
$\frac{\partial f}{\partial a} = 1 \cdot 2 = 2$, $\frac{\partial f}{\partial b} = 1 \cdot 3 = 3$
At the $+$ node: $2 \cdot 1 = 2$ toward $x$ and $2 \cdot 1 = 2$ toward $y$
At the $\max$ node: $3 \cdot 1 = 3$ toward $y$ and $3 \cdot 0 = 0$ toward $z$

Gradients sum at outward branches
$y$ feeds into both the $+$ node and the $\max$ node, so its gradients from the two branches add: $\frac{\partial f}{\partial y} = 2 + 3 = 5$

Node Intuitions
+ "distributes" the upstream gradient to each summand
max "routes" the upstream gradient (only the larger input receives it)
* "switches" the upstream gradient (scales it by the other input's value)
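
For concreteness, the worked example above computed explicitly in plain Python, under the assumption (consistent with the numbers shown) that the graph computes $f(x, y, z) = (x + y)\,\max(y, z)$:

```python
# Forward and backward pass for f(x, y, z) = (x + y) * max(y, z) at x=1, y=2, z=0.
x, y, z = 1.0, 2.0, 0.0

# Forward
a = x + y                # 3
b = max(y, z)            # 2
f = a * b                # 6

# Backward: downstream = upstream * local at each node
df_df = 1.0
df_da = df_df * b        # * node "switches": gradient wrt a is the other input, b = 2
df_db = df_df * a        # gradient wrt b is a = 3
df_dx = df_da * 1.0      # + node "distributes": 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # gradients sum at branches: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                 # max "routes": 0
print(df_dx, df_dy, df_dz)   # 2.0 5.0 0.0
```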

Efficiency: compute all gradients at once
Incorrect way of doing backprop: first compute $\frac{\partial s}{\partial b}$, then independently compute $\frac{\partial s}{\partial W}$
Duplicated computation!

Efficiency: compute all gradients at once
Correct way: compute all the gradients at once
Analogous to using $\delta$ when we computed gradients by hand

Back-Prop in General Computation Graph
Single scalar output $z$
1. Fprop: visit nodes in topological sort order; compute the value of each node given its predecessors
2. Bprop:
- initialize the output gradient to 1
- visit nodes in reverse order: compute the gradient wrt each node $x$ using the gradients wrt its successors $\{y_1, \dots, y_n\}$: $\frac{\partial z}{\partial x} = \sum_{i} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$
Done correctly, the big O() complexity of fprop and bprop is the same
In general our nets have a regular layer structure, so we can use matrices and Jacobians
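
Below is a minimal sketch (not the lecture's code) of this recipe: nodes are assumed to be listed in topological order, and each op returns its value together with its local gradients; the toy graph reuses the $(x + y) \cdot \max(y, z)$ example.

```python
# Minimal fprop/bprop over a general computation graph (illustrative sketch).
def fprop_bprop(nodes):
    values, grads = {}, {}
    # 1. Fprop: visit nodes in topological order, computing each value from its predecessors
    for name, op, inputs in nodes:
        values[name] = op(*[values[i] for i in inputs])[0]
    # 2. Bprop: initialize the output gradient to 1, then visit nodes in reverse order
    grads[nodes[-1][0]] = 1.0
    for name, op, inputs in reversed(nodes):
        _, locals_ = op(*[values[i] for i in inputs])
        for inp, local in zip(inputs, locals_):
            grads[inp] = grads.get(inp, 0.0) + grads[name] * local  # gradients sum at branches
    return values, grads

# Example graph: f = (x + y) * max(y, z) with x=1, y=2, z=0
def leaf(v):       return lambda: (v, [])
def add(p, q):     return (p + q, [1.0, 1.0])
def maximum(p, q): return (max(p, q), [float(p >= q), float(p < q)])
def mul(p, q):     return (p * q, [q, p])

nodes = [("x", leaf(1.0), []), ("y", leaf(2.0), []), ("z", leaf(0.0), []),
         ("a", add, ["x", "y"]), ("b", maximum, ["y", "z"]), ("f", mul, ["a", "b"])]
values, grads = fprop_bprop(nodes)
print(values["f"], grads["x"], grads["y"], grads["z"])   # 6.0 2.0 5.0 0.0
```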

Automatic Differentiation
The gradient computation can be automatically inferred from the symbolic expression of the fprop
Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative
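
For instance, a short PyTorch sketch of automatic differentiation on the earlier toy example; only the forward expression is written by hand:

```python
import torch

# Autodiff illustration: write the forward expression, then .backward() applies
# the chain rule along the recorded computation graph.
x = torch.tensor([1.0, 2.0, 0.0], requires_grad=True)
f = (x[0] + x[1]) * torch.max(x[1], x[2])   # same toy example as above
f.backward()
print(x.grad)                               # tensor([2., 5., 0.])
```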

Backprop Implementations

Implementation: forward/backward API
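
As a hedged sketch of such a forward/backward API, a hypothetical MultiplyGate that caches its inputs in forward() and reuses them in backward():

```python
class MultiplyGate:
    """Minimal sketch of the forward/backward API: forward() caches the local
    input signals; backward() turns an upstream gradient into downstream ones."""

    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b

    def backward(self, upstream):
        # downstream = upstream * local gradient, one per input
        return upstream * self.b, upstream * self.a

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)           # 6.0
da, db = gate.backward(1.0)            # (2.0, 3.0)
print(out, da, db)
```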

Manual Gradient checking: Numeric Gradient
For small $h$ (≈ 1e-4): $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$
Easy to implement correctly, but approximate and very slow: you have to recompute $f$ for every parameter of the model
Useful for checking your implementation
In the old days when we hand-wrote everything, it was key to do this everywhere; now it's much less needed when throwing together layers
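
A minimal sketch of such a numeric gradient check, applied to a toy function (the function and tolerance are illustrative assumptions):

```python
import numpy as np

# Central-difference gradient check: f'(x) ≈ (f(x+h) - f(x-h)) / (2h).
# Slow in general: one forward pass per parameter per direction.
def numeric_grad(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

u = np.array([0.5, -1.0, 2.0])
f = lambda x: float(u @ np.tanh(x))            # toy f; analytic grad is u * (1 - tanh(x)^2)
x = np.array([0.1, 0.2, 0.3])
analytic = u * (1 - np.tanh(x) ** 2)
print(np.allclose(numeric_grad(f, x), analytic, atol=1e-6))
```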

Summary
We've mastered the core technology of neural nets!
Backpropagation: recursively (and hence efficiently) apply the chain rule along the computation graph
[downstream gradient] = [upstream gradient] x [local gradient]
Forward pass: compute results of operations and save intermediate values
Backward pass: apply the chain rule to compute gradients

Why learn all these details about gradients?
Modern deep learning frameworks compute gradients for you!
But why take a class on compilers or systems when they are implemented for you? Understanding what is going on under the hood is useful!
Backpropagation doesn't always work perfectly; understanding why is crucial for debugging and improving models
See Karpathy article (in syllabus): …rstandbackprop-e2f06eab496b
Example in a future lecture: exploding and vanishing gradients
