Backpropagation And Gradients - Stanford University

Transcription

Backpropagation and Gradients

Agenda
- Motivation
- Backprop Tips & Tricks
- Matrix calculus primer
- Example: 2-layer Neural Network

Motivation
- Recall: the optimization objective is to minimize the loss
- Goal: how should we tweak the parameters to decrease the loss slightly?
- [Figure: loss function plotted on WolframAlpha]

Approach #1: Random search
- Intuition: the way we tweak parameters is the direction we step in our optimization
- What if we randomly choose a direction?

Approach #2: Numerical gradient
- Intuition: the gradient describes the rate of change of a function with respect to a variable, surrounding an infinitesimally small region
- Finite differences: df(x)/dx ≈ (f(x + h) − f(x)) / h for a small h
- Challenge: how do we compute the gradient independent of each input?
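The finite-differences idea can be sketched in a few lines of numpy. This is a minimal, illustrative implementation (the function name numerical_gradient and the step size h are choices made here, not taken from the slides); note that it needs one pair of function evaluations per input coordinate, which is exactly the cost issue the slide points at.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Approximate each partial derivative with a central difference,
    # perturbing one coordinate of x at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: f(x) = sum(x**2), whose exact gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v ** 2), x))   # approx [ 2. -4.  6.]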

Approach #3: Analytical gradient
- Recall: the chain rule
- Assuming we know the structure of the computational graph beforehand
- Intuition: upstream gradient values propagate backwards -- we can reuse them!

What about autograd?
- Deep learning frameworks can automatically perform backprop!
- Problems related to the underlying gradients might still surface when debugging your models
- "Yes You Should Understand Backprop": https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

Problem Statement
Given a function f with respect to inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to the parameters θ.

Backpropagation
An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients.

Backpropagation
1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with the upstream error signal to get the full gradient
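As a small, self-contained illustration of these three steps (the function f(x, y, z) = (x + y) * z and the variable names are chosen here for the example, not taken from the slides):

# 1. Forward prop: identify intermediate functions
x, y, z = -2.0, 5.0, -4.0
q = x + y                  # intermediate variable
f = q * z                  # f(x, y, z) = (x + y) * z = -12

# 2. Local gradients of each intermediate function
dq_dx, dq_dy = 1.0, 1.0    # q = x + y
df_dq, df_dz = z, q        # f = q * z

# 3. Combine local gradients with the upstream error signal
df = 1.0                   # gradient of f with respect to itself
dz = df * df_dz            # = q  =  3.0
dq = df * df_dq            # = z  = -4.0
dx = dq * dq_dx            # = -4.0
dy = dq * dq_dy            # = -4.0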

Modularity - Simple Example
- Compound function
- Intermediate variables (forward propagation)

Modularity - Neural Network Example
- Compound function
- Intermediate variables (forward propagation)

Intermediate Variables (forward propagation)
Intermediate Gradients (backward propagation)

Chain Rule Behavior
Key chain rule intuition: slopes multiply.
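"Slopes multiply" can be checked numerically on any composition of scalar functions. A quick example (the functions below are arbitrary, chosen only for illustration):

import numpy as np

x = 0.7
y = x ** 2                    # inner function
z = np.sin(y)                 # outer function

dz_dy = np.cos(y)             # local slope of the outer function
dy_dx = 2 * x                 # local slope of the inner function
dz_dx = dz_dy * dy_dx         # chain rule: the slopes multiply

# Compare against a finite-difference estimate of dz/dx.
h = 1e-6
print(dz_dx, (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h))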

Circuit Intuition

Matrix Calculus Primer
- Scalar-by-Vector
- Vector-by-Vector

Matrix Calculus Primer
- Scalar-by-Matrix
- Vector-by-Matrix

Vector-by-Matrix Gradients
Let ...
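As a sketch of the standard conventions behind these primer slides (the notation and names below are illustrative, not necessarily the slides' own): for a scalar y and a vector x in R^n, the gradient dy/dx is a vector in R^n; for a vector y in R^m, the derivative dy/dx is the m-by-n Jacobian; and for a scalar y and a matrix W, dy/dW has the same shape as W. One common vector-by-matrix setup is z = Wx, where dz/dx = W and, for a scalar loss L with upstream gradient delta = dL/dz, dL/dW = delta x^T. A small numpy check of that last identity:

import numpy as np

m, n = 3, 4
W = np.random.randn(m, n)
x = np.random.randn(n)
u = np.random.randn(m)            # stand-in upstream gradient dL/dz

z = W @ x                         # forward: z = Wx
L = u @ z                         # simple scalar loss L = u^T z, so dL/dz = u

dW = np.outer(u, x)               # claimed gradient: dL/dW = delta x^T

# Finite-difference check of a single entry of dW.
h = 1e-6
W_perturbed = W.copy()
W_perturbed[1, 2] += h
print(dW[1, 2], (u @ (W_perturbed @ x) - L) / h)   # the two numbers should be close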

Backpropagation Shape Rule
When you take gradients against a scalar, the gradient at each intermediate step has the shape of the denominator.

Dimension Balancing
- Dimension balancing is the "cheap" but efficient approach to gradient calculations in most practical settings
- Read the gradient computation notes to understand how to derive matrix expressions for gradients from first principles
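A concrete sketch of dimension balancing for a linear layer Z = XW with a scalar loss (the shapes N, D, M and the names below are illustrative): given an upstream gradient dZ of shape N x M, the only products whose shapes work out are dW = X^T dZ (shape D x M, matching W) and dX = dZ W^T (shape N x D, matching X).

import numpy as np

N, D, M = 5, 4, 3
X = np.random.randn(N, D)
W = np.random.randn(D, M)

Z = X @ W                        # forward: (N, D) @ (D, M) -> (N, M)
dZ = np.random.randn(N, M)       # upstream gradient, same shape as Z

# Balance dimensions: each gradient must match the shape of the thing
# it is taken with respect to.
dW = X.T @ dZ                    # (D, N) @ (N, M) -> (D, M), same shape as W
dX = dZ @ W.T                    # (N, M) @ (M, D) -> (N, D), same shape as X

assert dW.shape == W.shape and dX.shape == X.shape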

Activation Function Gradients
- The activation function is applied element-wise to each index of h (scalar-to-scalar)
- Officially, its Jacobian with respect to h is a diagonal matrix: the diagonal structure represents that output i and input j have no dependence if i ≠ j

Activation Function Gradients
Element-wise multiplication (Hadamard product) corresponds to a matrix product with a diagonal matrix.
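A quick numpy check of that equivalence, using ReLU as the element-wise function and an invented upstream gradient vector for illustration:

import numpy as np

h = np.array([0.5, -1.0, 2.0])              # pre-activations
upstream = np.array([0.1, 0.2, 0.3])        # upstream gradient dL/d(relu(h))

relu_grad = (h > 0).astype(float)           # element-wise local gradient

via_diag = np.diag(relu_grad) @ upstream    # diagonal Jacobian times upstream
via_hadamard = relu_grad * upstream         # Hadamard (element-wise) product

print(np.allclose(via_diag, via_hadamard))  # True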

Backprop Menu for Success
1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of the error signals
4. Enforce the shape rule on error signals
5. Use matrix balancing when deriving over a linear transformation

As promised: A matrix example.

import numpy as np

# Example shapes and inputs (not on the slide, added so the snippet runs):
# N examples, D input features, H hidden units, M outputs.
N, D, H, M = 4, 5, 6, 3
X = np.random.randn(N, D)
W1 = np.random.randn(D, H)
W2 = np.random.randn(H, M)

# forward prop
z1 = np.dot(X, W1)
h1 = np.maximum(z1, 0)          # ReLU
y_hat = np.dot(h1, W2)
L = np.sum(y_hat ** 2)

# backward prop
dy_hat = 2.0 * y_hat
dW2 = h1.T.dot(dy_hat)
dh1 = dy_hat.dot(W2.T)
dz1 = dh1.copy()
dz1[z1 < 0] = 0                 # ReLU gradient: zero where z1 was negative
dW1 = X.T.dot(dz1)
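A quick way to sanity-check hand-derived gradients like these (not shown on the slides) is to perturb one weight and compare the resulting change in the loss against the corresponding entry of the analytic gradient:

# Finite-difference check of a single entry of dW1 (uses the arrays defined above).
h = 1e-6
W1_perturbed = W1.copy()
W1_perturbed[0, 0] += h
L_perturbed = np.sum(np.dot(np.maximum(np.dot(X, W1_perturbed), 0), W2) ** 2)
print(dW1[0, 0], (L_perturbed - L) / h)     # the two numbers should be close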
