Transcription
Backpropagation and Gradients
Agenda
Motivation
Backprop
Tips & Tricks
Matrix calculus primer
Example: 2-layer Neural Network
Motivation
Recall: the optimization objective is to minimize the loss.
Goal: how should we tweak the parameters to decrease the loss slightly?
Plotted on WolframAlpha
Approach #1: Random search
Intuition: the way we tweak the parameters is the direction we step in our optimization.
What if we randomly choose a direction?
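The random-search idea can be sketched in a few lines. This is an illustrative toy, not the lecture's code: the quadratic `loss`, the step scale, and the iteration count are all assumptions made here for demonstration.

```python
import numpy as np

def loss(w):
    # Toy 1-D quadratic loss with its minimum at w = 3 (illustrative assumption)
    return (w - 3.0) ** 2

w = 0.0
rng = np.random.default_rng(0)
for _ in range(1000):
    step = rng.normal(scale=0.1)   # randomly chosen direction and size
    if loss(w + step) < loss(w):   # keep the step only if the loss decreases
        w += step
```

The parameter drifts toward the minimum eventually, but most proposed steps are wasted, which is why we look for the gradient instead.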
Approach #2: Numerical gradient
Intuition: the gradient describes the rate of change of a function with respect to a variable within an infinitesimally small surrounding region.
Finite differences: f'(x) ≈ (f(x + h) − f(x)) / h for small h.
Challenge: how do we compute the gradient independently for each input?
Approach #3: Analytical gradient
Recall: the chain rule.
Assuming we know the structure of the computational graph beforehand.
Intuition: upstream gradient values propagate backwards -- we can reuse them!
What about autograd?
Deep learning frameworks can automatically perform backprop!
Problems related to the underlying gradients might still surface when debugging your models.
See “Yes You Should Understand Backprop” (medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)
Problem Statement
Given a function f with respect to inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.
Backpropagation
An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients.
Backpropagation
1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with the upstream error signal to get the full gradient
Modularity - Simple Example
Compound function
Intermediate variables (forward propagation)
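The slide's concrete compound function is not in the transcription; as an assumed stand-in, here is the classic example f(x, y, z) = (x + y) · z, with one intermediate variable introduced in the forward pass and local gradients combined in the backward pass:

```python
# Inputs (illustrative values)
x, y, z = -2.0, 5.0, -4.0

# Forward propagation: introduce the intermediate variable q
q = x + y          # q = 3
f = q * z          # f = -12

# Backward propagation: local gradients combined with upstream signals
df_dq = z            # local: d(q*z)/dq = z
df_dz = q            # local: d(q*z)/dz = q
df_dx = df_dq * 1.0  # chain rule: df/dx = df/dq * dq/dx, with dq/dx = 1
df_dy = df_dq * 1.0  # likewise for y
```

Each node only needs its local gradient and the upstream signal, which is the modularity the slide is pointing at.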
Modularity - Neural Network Example
Compound function
Intermediate variables (forward propagation)
Intermediate Variables (forward propagation)
Intermediate Gradients (backward propagation)
Chain Rule Behavior
Key chain rule intuition: slopes multiply.
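"Slopes multiply" can be checked numerically. For f(g(x)) with g(x) = x² and f(u) = sin(u), the chain rule gives d/dx sin(x²) = cos(x²) · 2x; the functions here are chosen for illustration only:

```python
import numpy as np

x = 0.5
# Product of the two local slopes, per the chain rule
analytic = np.cos(x ** 2) * 2 * x

# Centered finite-difference check of the same derivative
h = 1e-6
numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)
```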
Circuit Intuition
Matrix Calculus Primer
Scalar-by-Vector
Vector-by-Vector
Matrix Calculus Primer
Scalar-by-Matrix
Vector-by-Matrix
Vector-by-Matrix Gradients
Let
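The statement after "Let" is missing from the transcription. One standard setup (an assumption here, not recovered from the slide) is a linear map y = Wx:

```latex
% Assumed setup: y = Wx with W \in \mathbb{R}^{m \times n}, x \in \mathbb{R}^{n}.
% The vector-by-vector Jacobian is then simply W:
\frac{\partial y}{\partial x} = W, \qquad
\left(\frac{\partial y}{\partial x}\right)_{ij} = \frac{\partial y_i}{\partial x_j} = W_{ij}.
% For the vector-by-matrix case, y_i depends only on row i of W:
\frac{\partial y_i}{\partial W_{jk}} = \delta_{ij}\, x_k .
```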
Backpropagation Shape Rule
When you take gradients against a scalar, the gradient at each intermediate step has the shape of the denominator.
Dimension Balancing
Dimension balancing is the “cheap” but efficient approach to gradient calculations in most practical settings.
Read the gradient computation notes to understand how to derive matrix expressions for gradients from first principles.
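A sketch of dimension balancing on a single linear layer (shapes and the toy loss are illustrative assumptions): for Z = XW with a scalar loss L, the upstream signal dZ = ∂L/∂Z has Z's shape, and the only way to combine X (n×d) and dZ (n×m) into a gradient with W's shape (d×m) is X.T @ dZ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # n=4 examples, d=3 features
W = rng.normal(size=(3, 2))    # maps d=3 -> m=2
Z = X @ W                      # shape (4, 2)
L = np.sum(Z ** 2)             # toy scalar loss

dZ = 2 * Z                     # upstream error, shape (4, 2) -- same as Z
dW = X.T @ dZ                  # shape (3, 2): matches W, as the shape rule demands
dX = dZ @ W.T                  # shape (4, 3): matches X
```

Transposing and ordering the factors so every gradient matches its denominator's shape is exactly the "cheap" balancing trick.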
Activation Function Gradients
An activation function σ is an element-wise function applied to each index of h (scalar-to-scalar).
Officially, the local Jacobian ∂σ(h)/∂h is a diagonal matrix: the diagonal structure represents that σ(h)_i and h_j have no dependence if i ≠ j.
Activation Function Gradients
Element-wise multiplication (Hadamard product) corresponds to a matrix product with a diagonal matrix.
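The equivalence is easy to verify numerically. Using ReLU as the element-wise activation (the vectors here are illustrative), multiplying by the explicit diagonal Jacobian gives the same result as a Hadamard product:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=4)         # pre-activation values
upstream = rng.normal(size=4)  # upstream error signal

relu_grad = (h > 0).astype(float)  # element-wise local gradient of ReLU

via_diagonal = np.diag(relu_grad) @ upstream  # explicit diagonal-matrix product
via_hadamard = relu_grad * upstream           # equivalent Hadamard product
```

In practice the Hadamard form is used, since materializing the diagonal matrix wastes O(n²) memory for n nonzero entries.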
Backprop Menu for Success
1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of error signals
4. Enforce the shape rule on error signals
5. Use dimension balancing when deriving gradients over a linear transformation
As promised: A matrix example.
(Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 4, April 13, 2017)
```python
import numpy as np

# forward prop
z1 = np.dot(X, W1)        # first linear layer
h1 = np.maximum(z1, 0)    # ReLU
y_hat = np.dot(h1, W2)    # second linear layer
L = np.sum(y_hat**2)      # scalar loss

# backward prop
dy_hat = 2.0 * y_hat      # dL/dy_hat
dW2 = h1.T.dot(dy_hat)    # dL/dW2, shape of W2
dh1 = dy_hat.dot(W2.T)    # dL/dh1, shape of h1
dz1 = dh1.copy()
dz1[z1 < 0] = 0           # ReLU gradient: zero wherever z1 < 0
dW1 = X.T.dot(dz1)        # dL/dW1, shape of W1
```
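The example above leaves X, W1, and W2 undefined. A self-contained check with illustrative shapes, comparing the analytic dW2 against a centered finite-difference estimate (the shapes and random seed are assumptions made here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))    # 5 examples, 4 input features
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))

def forward(W2_):
    """Forward pass as a function of W2, returning the loss and intermediates."""
    z1 = X @ W1
    h1 = np.maximum(z1, 0)
    y_hat = h1 @ W2_
    return np.sum(y_hat ** 2), z1, h1, y_hat

L, z1, h1, y_hat = forward(W2)

# Analytic gradient from backprop
dy_hat = 2.0 * y_hat
dW2 = h1.T @ dy_hat

# Numerical gradient via centered finite differences
dW2_num = np.zeros_like(W2)
eps = 1e-5
for i in range(W2.size):
    Wp = W2.copy(); Wp.flat[i] += eps
    Wm = W2.copy(); Wm.flat[i] -= eps
    dW2_num.flat[i] = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
```

Agreement between the two estimates is exactly the kind of gradient check that is useful when debugging backprop by hand.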