CS224d Deep NLP Lecture 8: Recurrent Neural Networks

Transcription

CS224d Deep NLP, Lecture 8: Recurrent Neural Networks. Richard Socher, richard@metamind.io

Overview
- Feedback
- Traditional language models
- RNNs
- RNN language models
- Important training problems and tricks
- Vanishing and exploding gradient problems
- RNNs for other sequence tasks
- Bidirectional and deep RNNs

Feedback

Feedback → Super useful → Thanks!
- Explain the intuition behind the math and models more → some today :)
- Give more examples; more toy examples and recap slides can help us understand faster → Some toy examples today. Recap of main concepts next week.
- Consistency issues in dimensionality, row vs. column vectors, etc. → All vectors should be column vectors unless I messed up; please send errata.
- I like the quality of the problem sets and especially the starter code. It would be nice to include ballpark values we should expect → Will add in future psets and on Piazza. We'll also add dimensionality.

Feedback on Project
- Please give a list of proposed projects → Great feedback; I asked research groups at Stanford and will compile a list for next Tuesday.
- We'll move the project proposal deadline to next week Thursday. The extra credit deadline for the dataset and first baseline is at the project milestone.

Language Models
A language model computes a probability for a sequence of words: P(w_1, ..., w_m). Useful for machine translation:
- Word ordering: p(the cat is small) > p(small the is cat)
- Word choice: p(walking home after school) > p(walking house after school)

Traditional Language Models
- Probability is usually conditioned on a window of n previous words.
- An incorrect but necessary Markov assumption!
- To estimate probabilities, compute counts for unigrams and bigrams (conditioning on one or two previous words), as reconstructed below.
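The count-based estimates referenced above are not legible in this transcription; a standard reconstruction, where count(·) denotes corpus counts, is:

P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)

p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}, \qquad p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\mathrm{count}(w_1, w_2)}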

Traditional Language Models
- Performance improves by keeping counts for higher-order n-grams and doing smoothing and so-called backoff (e.g., if the 4-gram is not found, try the 3-gram, etc.).
- There are A LOT of n-grams! → Gigantic RAM requirements!
- Recent state of the art: Scalable Modified Kneser-Ney Language Model Estimation by Heafield et al.: "Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens."

Recurrent Neural Networks!
- RNNs tie the weights at each time step.
- Condition the neural network on all previous words.
- RAM requirement only scales with the number of words.
[Figure: RNN unrolled over two time steps, with inputs x_{t-1}, x_t, hidden states h_{t-1}, h_t connected by the shared weight matrix W, and outputs y_{t-1}, y_t.]

Recurrent Neural Network language model
Given a list of word vectors: x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T.
At a single time step, the hidden state h_t is computed from h_{t-1} and x_t, and the output distribution from h_t (see the reconstruction below).
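The per-time-step equations referenced above are not legible in the transcription; the standard formulation used in this course (a reconstruction, with W^{(hh)}, W^{(hx)}, and W^{(S)} the recurrent, input, and softmax weight matrices) is:

h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]}\right)

\hat{y}_t = \mathrm{softmax}\left(W^{(S)} h_t\right)

\hat{P}\left(x_{t+1} = v_j \mid x_t, \ldots, x_1\right) = \hat{y}_{t,j}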

Recurrent Neural Network language model
Main idea: we use the same set of W weights at all time steps!
Everything else is the same:
- h_0 is some initialization vector for the hidden layer at time step 0.
- x_{[t]} is the column vector of L (the matrix of word vectors) at index [t] at time step t.

Recurrent Neural Network language model
ŷ_t is a probability distribution over the vocabulary.
Same cross-entropy loss function as before, but predicting words instead of classes.

Recurrent Neural Network language model
Evaluation could just be the negative of the average log probability over a dataset of T words. But more common is perplexity: 2^J. Lower is better! (See the reconstruction below.)
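The formulas referenced above are not legible in the transcription; a standard reconstruction, with y_t the one-hot vector for the observed next word, T the number of words, and |V| the vocabulary size, is:

J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}

\mathrm{Perplexity} = 2^{J}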

Training RNNs is hard
- We multiply by the same matrix at each time step during forward prop.
- Ideally, inputs from many time steps ago should still be able to modify the output y.
- Take the gradient for an example RNN with 2 time steps! Insightful!
[Figure: the same unrolled RNN as before, with the shared weight matrix W between hidden states.]

The vanishing/exploding gradient problem
- We multiply by the same matrix at each time step during backprop.
[Figure: the same unrolled RNN, highlighting repeated multiplication by W during backpropagation.]

The vanishing gradient problem: details
- Similar but simpler RNN formulation (see the reconstruction below).
- The total error is the sum of the errors at each time step t.
- Hardcore chain rule application!
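The equations referenced on this slide are not legible in the transcription; a reconstruction consistent with the notation used elsewhere in the lecture is:

h_t = W f(h_{t-1}) + W^{(hx)} x_{[t]}, \qquad \hat{y}_t = W^{(S)} f(h_t)

E = \sum_{t=1}^{T} E_t, \qquad \frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}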

The vanishing gradient problem: details
- Similar to backprop, but a less efficient formulation. Useful for the analysis we'll look at.
- Applying the chain rule (see the reconstruction below), each partial derivative in the resulting product is a Jacobian matrix.
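The chain-rule expansion referenced above is not legible in the transcription; the standard decomposition (as in Bengio et al. 1994 and Pascanu et al. 2013) is:

\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}

\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}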

The vanishing gradient problem: details
- From the previous slide: h_t = W f(h_{t-1}) + W^{(hx)} x_{[t]}, and ∂h_t/∂h_k is a product of the Jacobians ∂h_j/∂h_{j-1}.
- To compute each Jacobian, derive each element of the matrix (see the reconstruction below).
- Check at home that you understand the diag matrix formulation.
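The Jacobian itself is not legible in the transcription; with the transpose convention of Pascanu et al. (2013), it is usually written as:

\frac{\partial h_j}{\partial h_{j-1}} = W^{\top} \,\mathrm{diag}\!\left[f'(h_{j-1})\right]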

The vanishing gradient problem: details
- Analyzing the norms of the Jacobians yields an upper bound (see the reconstruction below), where the β's are defined as upper bounds of the norms.
- The gradient is a product of Jacobian matrices, each associated with a step in the forward computation.
- This product can become very small or very large quickly [Bengio et al., 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradient.
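The bound itself is not legible in the transcription; the standard statement, with β_W and β_h the upper bounds on the norms of W^⊤ and diag[f'(h_{j-1})], is:

\left\lVert \frac{\partial h_j}{\partial h_{j-1}} \right\rVert \le \left\lVert W^{\top} \right\rVert \left\lVert \mathrm{diag}\!\left[f'(h_{j-1})\right] \right\rVert \le \beta_W \beta_h

\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert = \left\lVert \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\rVert \le \left(\beta_W \beta_h\right)^{t-k}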

Why is the vanishing gradient a problem?
- The error at a time step ideally should be able to tell a previous time step, many steps away, to change during backprop.
[Figure: the same unrolled RNN, with the error signal flowing back through many time steps.]

The vanishing gradient problem for language models
- In the case of language modeling or question answering, words from time steps far away are not taken into consideration when training to predict the next word.
- Example: "Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ___"

IPython Notebook with vanishing gradient example
- Example of a simple and clean NNet implementation
- Comparison of sigmoid and ReLU units
- A little bit of vanishing gradient
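The notebook itself is not part of this transcription. Below is a minimal, self-contained Python/NumPy sketch (not the course notebook) that illustrates the same point by tracking the norm of the Jacobian product ∂h_t/∂h_0 for a toy recurrence h_t = f(W h_{t-1}), comparing sigmoid and ReLU units; the dimensions and weight scale are arbitrary choices.

import numpy as np

np.random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

dim, steps = 50, 20
W = np.random.randn(dim, dim) * 0.1   # small random recurrent weights (arbitrary scale)

for name, f, fprime in [
        ("sigmoid", sigmoid, lambda a: sigmoid(a) * (1.0 - sigmoid(a))),
        ("relu",    lambda a: np.maximum(a, 0.0), lambda a: (a > 0).astype(float))]:
    h = np.random.randn(dim)
    J = np.eye(dim)                     # running product of Jacobians dh_t/dh_0
    for _ in range(steps):
        a = W @ h                       # pre-activation (input term omitted for simplicity)
        J = np.diag(fprime(a)) @ W @ J  # dh_t/dh_{t-1} = diag(f'(a_t)) W for h_t = f(W h_{t-1})
        h = f(a)
    print(f"{name}: ||dh_{steps}/dh_0|| = {np.linalg.norm(J):.3e}")

With sigmoid units the elementwise derivative is at most 0.25, so the product tends to shrink quickly; ReLU derivatives are either 0 or 1, which mitigates the shrinkage.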


Trick for exploding gradient: clipping trick
- The solution, first introduced by Mikolov, is to clip gradients to a maximum value. Makes a big difference in RNNs!
- From Pascanu et al., "Scaling down the gradients": a simple mechanism to deal with a sudden increase in the norm of the gradients is to rescale them whenever they go over a threshold (see Algorithm 1). Clipping has been shown to do well in practice.
  Algorithm 1: Pseudo-code for norm clipping the gradients whenever they explode
    ĝ ← ∂E/∂θ
    if ‖ĝ‖ ≥ threshold then
      ĝ ← (threshold / ‖ĝ‖) ĝ
    end if
- The algorithm is very similar to the one proposed by Tomas Mikolov, diverging from the original proposal only to provide a better theoretical foundation (ensuring that we always move in a descent direction with respect to the current mini-batch); in practice both variants behave similarly. The proposed clipping is simple to implement and computationally efficient, but it introduces an additional hyper-parameter, namely the threshold.
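A minimal Python/NumPy sketch of this norm-clipping trick (the threshold value is a hypothetical hyper-parameter, not one prescribed by the lecture):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Norm clipping as in Algorithm 1: rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = (threshold / norm) * grad
    return grad

# Example: a gradient that has "exploded"
g = np.random.randn(1000) * 100.0
print(np.linalg.norm(g), np.linalg.norm(clip_gradient(g)))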

Gradient clipping intuition
[Figure from the paper "On the difficulty of training Recurrent Neural Networks", Pascanu et al. 2013: the error surface of a single hidden unit recurrent network, highlighting the existence of high-curvature walls. Solid lines depict standard trajectories that gradient descent might follow; dashed lines show what happens when the gradient is rescaled to a fixed size whenever its norm is above a threshold.]

For vanishing gradients: Initialization + ReLUs!
- Initialize the W(*)'s to the identity matrix I and use f(z) = rect(z) = max(z, 0).
- → Huge difference!
- The initialization idea was first introduced in Parsing with Compositional Vector Grammars, Socher et al. 2013.
- New experiments with recurrent neural nets two weeks ago (!) in A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, Le et al. 2015.
[Figure 3 from Le et al. 2015: test set accuracy of recurrent methods (LSTM, RNN + Tanh, RNN + ReLUs, IRNN) on the "pixel-by-pixel MNIST" problem over 1,000,000 training iterations. Left: normal MNIST. Right: permuted MNIST. Best hyperparameters (learning rate, gradient clipping, forget-gate bias) are listed in the paper's Table 2.]
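A minimal Python/NumPy sketch of this identity-initialization recipe (IRNN-style, per Le et al. 2015); the input-weight scale is an assumed value, not one from the lecture:

import numpy as np

def init_irnn(hidden_dim, input_dim):
    """Initialize an RNN as in Le et al. 2015: recurrent matrix = identity, ReLU activation."""
    W_hh = np.eye(hidden_dim)                              # recurrent weights start as the identity
    W_hx = np.random.randn(hidden_dim, input_dim) * 0.01   # small random input weights (assumed scale)
    b = np.zeros(hidden_dim)
    return W_hh, W_hx, b

def irnn_step(h_prev, x, W_hh, W_hx, b):
    """One time step with f(z) = rect(z) = max(z, 0)."""
    return np.maximum(W_hh @ h_prev + W_hx @ x + b, 0.0)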

Perplexity Results
- KN5 = count-based language model with Kneser-Ney smoothing and 5-grams.
[Table from the paper Extensions of Recurrent Neural Network Language Model by Mikolov et al. 2011.]

Problem: Softmax is huge and slow
Trick: class-based word prediction
p(w_t | history) = p(c_t | history) p(w_t | c_t) = p(c_t | h_t) p(w_t | c_t)
The more classes, the better the perplexity, but also the worse the speed. (A sketch follows below.)
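A minimal Python/NumPy sketch of the factorized prediction. This is a hypothetical two-level softmax; the parameter names (W_class, W_words, word2class, word2idx) are illustrative, and the within-class distribution is also computed from the hidden state h_t, as is common in Mikolov-style implementations:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_factored_word_prob(h, W_class, W_words, word2class, word2idx, word):
    """p(w_t | h_t) = p(c_t | h_t) * p(w_t | c_t): score the classes, then only the words in one class."""
    c = word2class[word]                               # class of the target word
    p_class = softmax(W_class @ h)[c]                  # small softmax over classes
    p_word = softmax(W_words[c] @ h)[word2idx[word]]   # softmax over the words in class c only
    return p_class * p_word

Only the class softmax and one within-class softmax are evaluated, instead of a single softmax over the entire vocabulary.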

One last implementation trick
- You only need to pass backwards through your sequence once and accumulate all the deltas from each E_t.

Sequence modeling for other tasks
- Classify each word into: NER, entity-level sentiment in context, opinionated expressions.
- Example application and slides from the paper Opinion Mining with Deep Recurrent Nets by Irsoy and Cardie, 2014.

Opinion Mining with Deep Recurrent Nets
Goal: classify each word as a direct subjective expression (DSE) or an expressive subjective expression (ESE).
- DSE: explicit mentions of private states or speech events expressing private states.
- ESE: expressions that indicate sentiment, emotion, etc. without explicitly conveying them.

Example Annotation
In BIO notation (tags are either begin-of-entity (B_X) or continuation-of-entity (I_X)):
The committee, [as usual]_ESE, [has refused to make any statements]_DSE.

Approach: Recurrent Neural Network
- Notation from the paper (so you get used to different notations):
  h_t = f(W x_t + V h_{t-1} + b)
  y_t = g(U h_t + c)
- x represents a token (word) as a vector.
- y represents the output label (B, I, or O); g is a softmax.
- h is the memory, computed from the past memory and the current word. It summarizes the sentence up to that time.
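A minimal Python/NumPy sketch of this forward pass for token labeling. The choices f = tanh and g = softmax, and the parameter shapes, are assumptions for illustration, consistent with the formulation above:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_tag_sequence(xs, W, V, U, b, c, h0):
    """Label each token: h_t = f(W x_t + V h_{t-1} + b), y_t = g(U h_t + c)."""
    h, ys = h0, []
    for x in xs:                         # xs: list of word vectors for one sentence
        h = np.tanh(W @ x + V @ h + b)   # f = tanh (assumed)
        ys.append(softmax(U @ h + c))    # g = softmax over the {B, I, O} tags
    return ys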

Bidirectional RNNs
Problem: for classification you want to incorporate information from words both preceding and following the target word. Ideas?
  \overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{V}\,\overrightarrow{h}_{t-1} + \overrightarrow{b})
  \overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{V}\,\overleftarrow{h}_{t+1} + \overleftarrow{b})
  y_t = g(U[\overrightarrow{h}_t ; \overleftarrow{h}_t] + c)
h = [\overrightarrow{h} ; \overleftarrow{h}] now represents (summarizes) the past and future around a single token. (A sketch of the combination follows below.)
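A minimal Python sketch of the bidirectional combination. The step functions stand in for the forward and backward RNN cells above, each with its own parameters; this is an illustrative helper, not code from the lecture:

import numpy as np

def bidirectional_states(xs, fwd_step, bwd_step, h0_fwd, h0_bwd):
    """Concatenate a left-to-right and a right-to-left hidden state for every token."""
    fwd, bwd = [], []
    h = h0_fwd
    for x in xs:                 # left to right
        h = fwd_step(h, x)
        fwd.append(h)
    h = h0_bwd
    for x in reversed(xs):       # right to left
        h = bwd_step(h, x)
        bwd.append(h)
    bwd.reverse()                # align backward states with token positions
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]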

Going Deep: Deep Bidirectional RNNs
  \overrightarrow{h}_t^{(i)} = f(\overrightarrow{W}^{(i)} h_t^{(i-1)} + \overrightarrow{V}^{(i)}\,\overrightarrow{h}_{t-1}^{(i)} + \overrightarrow{b}^{(i)})
  \overleftarrow{h}_t^{(i)} = f(\overleftarrow{W}^{(i)} h_t^{(i-1)} + \overleftarrow{V}^{(i)}\,\overleftarrow{h}_{t+1}^{(i)} + \overleftarrow{b}^{(i)})
  y_t = g(U[\overrightarrow{h}_t^{(L)} ; \overleftarrow{h}_t^{(L)}] + c)
Each memory layer passes an intermediate sequential representation to the next.

Data
- MPQA 1.2 corpus (Wiebe et al., 2005)
- Consists of 535 news articles (11,111 sentences)
- Manually labeled with DSEs and ESEs at the phrase level
- Evaluation: F1

Results: Deep vs Shallow RNNs
[Figure: proportional and binary F1 for DSE and ESE as a function of the number of layers (1-5); curves labeled 24k and 200k.]

Recap
- Recurrent neural networks are one of the best deep NLP model families.
- Training them is hard because of vanishing and exploding gradient problems.
- They can be extended in many ways and their training improved with many tricks (more to come).
- Next week: the most important and powerful RNN extensions, with LSTMs and GRUs.
