Deep Learning Tutorial - Massachusetts Institute of Technology


Deep Learning Tutorial
Brains, Minds, and Machines Summer Course 2018
TA: Eugenio Piasini & Yen-Ling Kuo

Roadmap
- Supervised Learning with Neural Nets
- Convolutional Neural Networks for Object Recognition
- Recurrent Neural Networks
- Other Deep Learning Models

Supervised Learning with Neural Nets
General references: Hertz, Krogh, Palmer 1991; Goodfellow, Bengio, Courville 2016

Supervised learning: given example input-output pairs (X, Y), learn to predict output Y from input X.
Examples: logistic regression, support vector machines, decision trees, neural networks.

Binary classification: the simple perceptron (McCulloch & Pitts 1943) and the perceptron learning rule (Rosenblatt 1962). g is a nonlinear activation function, in this case a step function.
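A minimal numpy sketch of a simple perceptron trained with the perceptron learning rule, assuming a step activation g; the toy dataset and learning rate are made up for illustration.

```python
# Simple perceptron with the Rosenblatt learning rule (illustrative sketch).
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)  # g: step nonlinearity

# Toy linearly separable data: 2D points, labels in {0, 1} (hypothetical example)
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate

for epoch in range(20):
    for xi, ti in zip(X, y):
        yi = step(w @ xi + b)        # predicted output y = g(w . x + b)
        w += eta * (ti - yi) * xi    # perceptron update rule
        b += eta * (ti - yi)

print(step(X @ w + b))  # should match y once the data are separated
```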

Linear separability: simple perceptrons can only learn to solve linearly separable problems (Minsky and Papert 1969). We can solve more complex problems by composing many units in multiple layers.

Multilayer perceptron (MLP) (“forward propagation”). MLPs are universal function approximators, under some assumptions (Cybenko 1989; Hornik 1989). (Exercise: show that if g is linear, this architecture reduces to a simple perceptron.)
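A minimal sketch of forward propagation through a one-hidden-layer MLP; tanh for g and a linear readout at the output are illustrative choices, not prescribed by the slide.

```python
# Forward propagation for a one-hidden-layer MLP (illustrative sketch).
import numpy as np

def forward(x, W1, b1, W2, b2, g=np.tanh):
    h = g(W1 @ x + b1)   # hidden activations
    y = W2 @ h + b2      # output (linear readout here)
    return y, h

# If g were the identity, y = W2 @ (W1 @ x + b1) + b2 collapses to a single
# linear map, which is the "reduces to a simple perceptron" exercise.
```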

Deep vs shallow
Universality: “shallow” MLPs with one hidden layer can represent any continuous function to arbitrary precision, given a large enough number of units. But:
- No guarantee that the number of required units is reasonably small (expressivity).
- No guarantee that the desired MLP can actually be found with our chosen learning method (learnability).
Two motivations for using deep nets instead (see Goodfellow et al 2016, section 6.4.1):
- Statistical: deep nets are compositional, and naturally well suited to representing hierarchical structures where simpler patterns are composed and reused to form more complex ones recursively. It can be argued that many interesting structures in real-world data are like this.
- Computational: under certain conditions, it can be proved that deep architectures are more expressive than shallow ones, i.e. they can learn more patterns for a given total size of the network.

Backpropagation (Rumelhart, Hinton, Williams 1986)
Problem: compute the gradient of the loss with respect to all the weights.
Key insights: the loss depends on the weights w of a unit only through that unit's activation h, and it depends on a unit's activation h only through the activations of those units that are downstream from h.
The “errors” being backpropagated are the gradients of the loss with respect to the units' activations; these give the gradient of the loss with respect to the weights, which you can then use with your favorite gradient descent method.

Backpropagation - example. (Exercise: derive the gradient with respect to the bias terms b.)
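A minimal numpy sketch of the backward pass for the one-hidden-layer MLP above, with tanh hidden units and squared-error loss (illustrative choices, not necessarily those used on the slide); the bias gradients answer the exercise.

```python
# Manual backpropagation for a one-hidden-layer MLP (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)        # one input/target pair
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
a1 = W1 @ x + b1
h = np.tanh(a1)
y = W2 @ h + b2
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: propagate "errors" from the output toward the input
delta2 = y - t                          # dL/dy
dW2 = np.outer(delta2, h)               # dL/dW2
db2 = delta2                            # dL/db2 (the exercise on the slide)
delta1 = (W2.T @ delta2) * (1 - h**2)   # dL/da1, via downstream units only
dW1 = np.outer(delta1, x)               # dL/dW1
db1 = delta1                            # dL/db1

# These gradients feed your favorite gradient descent method,
# e.g. W2 -= 0.01 * dW2, and so on.
```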

“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence [...] Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said Perceptrons might be fired to the planets as mechanical space explorers.” The New York Times, July 8th, 1958

“The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.” Minsky and Papert 1969 (section 13.2)

Convolutional Neural Networks for Object Recognition
General (excellent!) reference: “Convolutional Networks for Visual Recognition”, Stanford University, http://cs231n.stanford.edu/

Traditional Object Detection/Recognition
Idea: match low-level vision features (e.g. edges, HOG, SIFT, etc.); parts-based models (Lowe 2004).

Learning the features - inspiration from neuroscience
Hubel and Wiesel:
- Topographic organization of connections
- Hierarchical organization of simple/complex cells
(Hubel and Wiesel 1962; Fukushima 1980)

“Canonical” CNN structure (LeCun et al 1998):
INPUT - [[CONV - RELU]*K - POOL?]*L - [FC - RELU]*M - FC
Four basic operations:
1. Convolution
2. Nonlinearity (ReLU)
3. Pooling
4. Fully connected layers
Credit: cs231n.github.io
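A hedged PyTorch sketch of the canonical pattern with K = 2, L = 2, M = 1; the channel sizes and the assumption of 32x32 RGB inputs are made up for illustration.

```python
# "Canonical" CNN: INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU] -> FC
import torch.nn as nn

model = nn.Sequential(
    # block 1: [CONV -> RELU]*2 -> POOL
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # block 2: [CONV -> RELU]*2 -> POOL
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    # [FC -> RELU] -> FC, assuming 32x32 RGB inputs (e.g. CIFAR-10-sized images)
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
```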

2D Convolution. Example: blurring an image by replacing each pixel with an average of its neighbors.
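A minimal numpy sketch of this idea: a 3x3 averaging kernel slid over the image replaces each pixel with the mean of its neighborhood.

```python
# Naive 2D convolution (technically cross-correlation, the usual CNN convention).
import numpy as np

def conv2d(image, kernel):
    K = kernel.shape[0]
    N = image.shape[0]
    out = np.zeros((N - K + 1, N - K + 1))   # "valid" convolution, stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+K, j:j+K] * kernel)
    return out

blur_kernel = np.ones((3, 3)) / 9.0                     # average of neighbors
blurred = conv2d(np.random.rand(32, 32), blur_kernel)   # shape (30, 30)
```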

2D Convolution (figure sequence: sliding a kernel/filter over the input image to produce the output image)

2D Convolution sizing: if N is the input size, K the filter size, and S the stride (the stride is the size of the step you take on the input every time you move by one on the output), then the output size is (N - K)/S + 1.

More on convolution sizing. Example: N = 32, K = 5, S = 1, so (N - K)/S + 1 = 28; a 32x32x3 input convolved with a 5x5x3 filter gives a 28x28x1 feature map, and stacking the maps from several filters gives a 28x28xD output, where D is the number of filters.
Input depth = # of channels in the previous layer (often 3 for the input layer (RGB); can be arbitrary for deeper layers).
Output depth = # of filters (feature maps).
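A small helper restating the sizing formula, checked against the example above.

```python
# Output size of a "valid" convolution with no padding.
def conv_output_size(N, K, S):
    """Input size N, filter size K, stride S."""
    return (N - K) // S + 1

print(conv_output_size(32, 5, 1))  # -> 28, matching the example
```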

Convolve with Different Filters

Convolution (with learned filters):
- Dependencies are local
- Each filter has few parameters to learn
- The same parameters are shared across different locations
(Figure: input, multiple filters, feature maps.)

Fully Connected vs. Locally Connected. Credit: Ranzato's CVPR 2014 tutorial.

Non-linearity: rectified linear function (ReLU), applied per-pixel: output = max(0, input). (Figure: input feature map to output feature map.)

Pooling:
- Reduces the size of the representation in following layers
- Introduces some invariance to small translations
Image credit: http://cs231n.github.io/convolutional-networks/
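A minimal numpy sketch of 2x2 max pooling with stride 2 on a single feature map, in the spirit of the cs231n figure referenced above.

```python
# Max pooling over non-overlapping 2x2 windows.
import numpy as np

def max_pool(fmap, size=2, stride=2):
    H, W = fmap.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max()   # keep only the strongest response
    return out

fmap = np.arange(16).reshape(4, 4)
print(max_pool(fmap))  # small shifts in the input often leave the maxima
                       # unchanged, hence some translation invariance
```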

Learning

Key evolutionary steps:
- Neocognitron (Fukushima 1980): inspired by Hubel and Wiesel; “convolutional” structure, alternating “pooling” layers.
- LeNet (LeCun et al 1998): backpropagation, gradient descent.
- AlexNet (Krizhevsky et al 2012): larger, deeper network (~10^7 parameters), much more data (ImageNet, ~10^6 images), more compute (incl. GPUs), better regularization (Dropout).

Image classification; image retrieval (return the training images with the smallest Euclidean distance to the test image in feature space). But also object detection, image segmentation, captioning.
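A minimal sketch of retrieval by feature distance, assuming CNN feature vectors have already been extracted for the test image and the training images.

```python
# Nearest-neighbor image retrieval in a learned feature space.
import numpy as np

def retrieve(test_feat, train_feats, k=5):
    dists = np.linalg.norm(train_feats - test_feat, axis=1)  # Euclidean distances
    return np.argsort(dists)[:k]   # indices of the k nearest training images
```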

Recurrent Neural Networks

Handling Sequential Information
- Natural language processing: sentences, translations
- Speech / audio: signal processing, speech recognition
- Video: action recognition, captioning
- Sequential decision making / planning
- Time-series data
- Biology / chemistry: protein sequences, molecule structures

Dynamic System / Hidden Markov Model
- Classical form of a dynamic system
- With an external signal x
- Hidden Markov Model

Recurrent Network / RNN: a general form to process a sequence, applying a recurrence formula at each time step: the new state is a function, with parameters W, of the old state and the input at time t (h_t = f_W(h_{t-1}, x_t)). The state is a vector h that summarizes the input up to time t.
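A minimal numpy sketch of the recurrence and of unrolling it over a sequence; tanh of a linear map is a common but illustrative choice for f_W.

```python
# Vanilla RNN step and sequence unrolling (illustrative sketch).
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy, b_h, b_y):
    h = np.tanh(W_hh @ h_prev + W_xh @ x + b_h)  # new state from old state + input
    y = W_hy @ h + b_y                           # prediction at time t
    return h, y

def run_rnn(xs, h0, params):
    # The same parameters W are applied at every time step.
    h, ys = h0, []
    for x in xs:                  # x_1, ..., x_T
        h, y = rnn_step(h, x, *params)
        ys.append(y)
    return h, ys
```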

Processing a Sequence: Unrolling in time. (Figure: the same recurrence applied at each step, producing a prediction at every time step.)

Training: Backpropagation Through Time. (Figure: a per-step loss, e.g. for POS tagging with tags PRP, VBP, DT, NN, summed into a total loss.)

Parameter Sharing Across Time: the parameters are shared across time steps and their derivatives are accumulated. This makes it possible to generalize to sequences of different lengths.
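A hedged PyTorch sketch of this idea: one RNN cell (one set of parameters) is reused at every time step, and backward() accumulates its gradients over the whole sequence; sizes and data are placeholders.

```python
# Backpropagation through time with shared parameters (illustrative sketch).
import torch
import torch.nn as nn

rnn = nn.RNNCell(input_size=8, hidden_size=16)   # same parameters at every t
readout = nn.Linear(16, 4)                       # e.g. 4 POS tags
loss_fn = nn.CrossEntropyLoss()

xs = torch.randn(5, 1, 8)                 # a length-5 sequence (dummy data)
targets = torch.randint(0, 4, (5, 1))     # one dummy tag per time step
h = torch.zeros(1, 16)

total_loss = 0.0
for t in range(5):                        # unroll over the sequence
    h = rnn(xs[t], h)
    total_loss = total_loss + loss_fn(readout(h), targets[t])
total_loss.backward()                     # gradients summed over time steps
```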

Vanishing / Exploding Gradients: unrolling the recurrence means the gradient of the loss involves repeated multiplication by the same factor, so it can grow or shrink quickly. If that factor is greater than 1, the gradient explodes (remedy: clipping gradients); if it is smaller than 1, the gradient vanishes (remedy: introducing memory via LSTMs, GRUs). This causes problems in learning long-term dependencies.
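A minimal sketch of gradient clipping, the standard remedy for exploding gradients mentioned above (vanishing gradients instead motivate the gated architectures on the next slide).

```python
# Rescale a gradient whose norm exceeds a threshold.
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale to the threshold
    return grad
```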

Long Short Term Memory (LSTM): introducing gates to optionally let information flow through. An LSTM cell has three gates to protect and control the cell state: forget the irrelevant parts of the previous state, selectively update the cell state values, and output certain parts of the cell state. Image credit: http://harinisuresh.com/2016/10/09/lstms/
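A minimal numpy sketch of one LSTM cell update, following the standard forget/input/output gate equations; the weight shapes and random initialization are illustrative.

```python
# One LSTM cell step (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev, x] to the four gate pre-activations.
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # candidate cell values
    c = f * c_prev + i * g    # forget irrelevant parts, add selected updates
    h = o * np.tanh(c)        # output certain parts of the cell state
    return h, c

H, D = 16, 8
W = np.random.randn(4 * H, H + D) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, b)
```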

Flexibility of RNNs (examples: translation, POS tagging). Image credit: Andrej Karpathy.

Other Deep Learning Models

Auto-encoder: learning representations; a good representation should keep the information well. (Figure: original input, Encoder, learned representation, Decoder, reconstructed image.) Objective: minimize reconstruction error. [LeCun, 1987]
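A hedged PyTorch sketch of a fully connected autoencoder trained to minimize reconstruction error; layer sizes and the dummy data are placeholders.

```python
# Autoencoder: encoder -> learned representation -> decoder -> reconstruction.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(64, 784)                   # a dummy batch of flattened images
opt.zero_grad()
z = encoder(x)                            # learned representation
x_hat = decoder(z)                        # reconstructed input
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error
loss.backward()
opt.step()
```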

Generative Models: What are the learned representations? One view: latent variables (e.g. color, shape, position, ...) that generate the observed data. Goal of learning a generative model: to recover p(x) from data.
Desirable properties: sampling new data; evaluating the likelihood of data; extracting latent features.
Problem: directly computing p(x) is intractable!
Adapted from the IJCAI 2018 deep generative model tutorial.

Variational Autoencoder (VAE) [Kingma et al., 2013]: a decoder maps latent z ~ p(z) through p(x|z) to data x; an encoder q(z|x) approximates the intractable posterior p(z|x) with a simpler, tractable distribution. Learning objective: reconstruction error plus a term measuring how close q is to p.
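A hedged PyTorch sketch of the VAE objective: a reconstruction term plus a KL term measuring how close q(z|x) is to the prior p(z) = N(0, I). The single-layer encoder/decoder, sizes, and dummy data are placeholders.

```python
# VAE loss = reconstruction error + KL(q(z|x) || p(z)), via the reparameterization trick.
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 20)   # outputs mean and log-variance of q(z|x)
dec = nn.Linear(20, 784)       # parameterizes p(x|z)

x = torch.rand(64, 784)                                   # dummy batch
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # sample z ~ q(z|x)
x_hat = torch.sigmoid(dec(z))

recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl    # minimizing this maximizes the ELBO
```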

Generative Adversarial Network (GAN) [Goodfellow et al., 2014]: an implicit generative model, formulated as a minimax game. The discriminator tries to distinguish real from fake samples; the generator tries to generate fake samples that fool the discriminator.
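A hedged PyTorch sketch of the minimax game: the discriminator is trained to label real samples 1 and fakes 0, while the generator is trained to make the discriminator label its fakes as real. Networks, sizes, and the stand-in data are placeholders.

```python
# Alternating generator / discriminator updates (illustrative sketch).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0           # stand-in "real" data
    fake = G(torch.randn(32, 16))

    # Discriminator step: real -> 1, fake -> 0
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(32, 1))
              + bce(D(fake.detach()), torch.zeros(32, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: make the discriminator label fakes as real
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```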

Thanks & Questions?
Link to the slides: https://goo.gl/pUXdc1
Hands-on session on Monday!
Eugenio Piasini (epiasini@sas.upenn.edu)
Yen-Ling Kuo (ylkuo@mit.edu)
