Neural Networks For NLP

Neural Networks for NLP. COMP-599, Nov 30, 2016.

Outline
- Neural networks and deep learning: introduction
- Feedforward neural networks
- word2vec
- Complex neural network architectures
  - Convolutional neural networks
  - Recurrent and recursive neural networks

Reference
Many of the figures and explanations are drawn from this reference:
A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. 2015.
http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Classification Review
y = f(x): a classifier maps an input x to an output label y.
Represent the input x as a list of features, e.g., a document mapped to feature values
x_1, x_2, ..., x_8 = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0

Logistic Regression
Linear regression: y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + b
Intuition: linear regression gives continuous values in (-inf, inf); let's squish the values to be in [0, 1]!
Function that does this: the logistic (sigmoid) function
P(y | x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b)
where Z is a normalizing constant that ensures this is a probability distribution.
(a.k.a. maximum entropy or MaxEnt classifier)
N.B.: Don't be confused by the name: this method is most often used to solve classification problems.
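
As a concrete illustration (not on the original slide), here is a minimal NumPy sketch of the binary case; the names x, a, and b for the feature vector, weights, and bias are illustrative:

    import numpy as np

    def predict_proba(x, a, b):
        """Binary logistic regression: squash the linear score into [0, 1]."""
        score = np.dot(a, x) + b              # a_1*x_1 + ... + a_n*x_n + b
        return 1.0 / (1.0 + np.exp(-score))   # logistic (sigmoid) function

    # Example: predict_proba(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.2, 0.1]), b=0.3)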

Linear Model
Logistic regression, support vector machines, etc. are examples of linear models:
P(y | x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b)
The exponent is a linear combination of feature weights and values.
Linear models cannot learn complex, non-linear functions from input features to output labels (without adding features),
e.g., "starts with a capital AND not at the beginning of a sentence -> proper noun".

(Artificial) Neural Networks
A kind of learning model which automatically learns non-linear functions from input to output.
Biologically inspired metaphor:
- A network of computational units called neurons
- Each neuron takes scalar inputs and produces a scalar output, very much like a logistic regression model:
  Neuron(x) = g(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b)
As a whole, the network can theoretically compute any computable function, given enough neurons. (These notions can be formalized.)

Responsible For:
- AlphaGo (Google, 2016): beat Go champion Lee Sedol in a series of 5 matches, 4-1
- Atari game-playing bot (Google, 2015)
- State of the art in: speech recognition, machine translation, object detection, and many other NLP tasks

Feedforward Neural Networks
All connections flow forward (no loops); each layer of hidden units is fully connected to the next.
Figure from Goldberg (2015)

Inference in a FF Neural Network
Perform computations forwards through the graph:
h1 = g1(x W1 + b1)
h2 = g2(h1 W2 + b2)
y = h2 W3
Note that we are now representing each layer as a vector, combining all of the weights in a layer across the units into a weight matrix.
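
A minimal NumPy sketch of this forward pass (not from the slides), assuming tanh for the non-linearities g1 and g2, which the slide leaves unspecified; the weight and bias names are illustrative:

    import numpy as np

    def feedforward(x, W1, b1, W2, b2, W3, g1=np.tanh, g2=np.tanh):
        """Forward pass: h1 = g1(x W1 + b1); h2 = g2(h1 W2 + b2); y = h2 W3."""
        h1 = g1(x @ W1 + b1)    # first hidden layer
        h2 = g2(h1 @ W2 + b2)   # second hidden layer
        return h2 @ W3          # output layer (no non-linearity here)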

Activation Function
In one unit: a linear combination of inputs and weight values, followed by a non-linearity:
h1 = g1(x W1 + b1)
Popular choices:
- Sigmoid function (just like logistic regression!)
- tanh function
- Rectifier/ramp function: g(x) = max(0, x)
Why do we need the non-linearity?
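
(Without a non-linearity, a stack of layers collapses into a single linear model.) For reference, a small NumPy sketch of the three activation functions named above:

    import numpy as np

    def sigmoid(x):
        """Sigmoid: squashes values into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        """Rectifier/ramp: g(x) = max(0, x), applied elementwise."""
        return np.maximum(0.0, x)

    # tanh is available directly as np.tanh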

Softmax Layer
In NLP, we often care about discrete outcomes, e.g., words, POS tags, topic labels.
The output layer can be constructed such that the output values sum to one:
Let x = x_1 ... x_k
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
Interpretation: unit x_i represents the probability that the outcome is i.
Essentially, the last layer is like a multinomial logistic regression.
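
A small NumPy sketch of this softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition on the slide:

    import numpy as np

    def softmax(x):
        """softmax(x_i) = exp(x_i) / sum_j exp(x_j)."""
        e = np.exp(x - np.max(x))   # shift by the max for numerical stability
        return e / e.sum()

    # softmax(np.array([1.0, 2.0, 3.0])) returns a vector that sums to 1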

Loss Function
A neural network is optimized with respect to a loss function, which measures how much error it is making on predictions:
- y: correct, gold-standard distribution over class labels
- y_hat: system-predicted distribution over class labels
- L(y_hat, y): loss function between the two
Popular choice for classification (usually with a softmax output layer), cross entropy:
L_ce(y_hat, y) = -sum_i y_i log(y_hat_i)
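
A direct NumPy sketch of this loss; the small epsilon inside the log is an implementation detail (to avoid log(0)), not part of the definition:

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        """L_ce(y_hat, y) = -sum_i y_i * log(y_hat_i)."""
        return -np.sum(y_true * np.log(y_pred + eps))

    # e.g., cross_entropy(np.array([0., 1., 0.]), np.array([0.2, 0.7, 0.1]))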

Training Neural Networks
Typically done by stochastic gradient descent:
- For one training example, find the gradient of the loss function with respect to the parameters of the network (i.e., the weights of each layer), then "travel along in that direction".
The network has very many parameters!
Efficient algorithm to compute the gradient with respect to all parameters: backpropagation (Rumelhart et al., 1986)
- Boils down to an efficient way to use the chain rule of derivatives to propagate the error signal from the loss function backwards through the network to the inputs.

SGD Overview
Inputs:
- Function computed by the neural network, f(x; theta)
- Training samples {x^k, y^k}
- Loss function L
Repeat for a while:
- Sample a training case (x^k, y^k)
- Compute the loss L(f(x^k; theta), y^k)
- Compute the gradient of L(x^k) with respect to the parameters theta, by backpropagation
- Update theta <- theta - eta * grad L(x^k)
Return theta
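
The same loop as a generic Python sketch; loss_grad is a hypothetical function standing in for the gradient computed by backpropagation, theta is assumed to be a flat NumPy parameter vector, and the hyperparameter names are illustrative:

    import random

    def sgd(theta, data, loss_grad, lr=0.1, steps=10000):
        """Stochastic gradient descent over (x_k, y_k) training samples."""
        for _ in range(steps):
            x_k, y_k = random.choice(data)      # sample a training case
            grad = loss_grad(theta, x_k, y_k)   # gradient of the loss wrt theta (by backpropagation)
            theta = theta - lr * grad           # theta <- theta - eta * grad
        return theta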

Example: Time-Delay Neural Network
Let's draw a neural network architecture for POS tagging using a feedforward neural network.
We'll construct a context window around each word, and predict the POS tag of that word as the output.
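
One way to build the tagger's input (a sketch under my own assumptions, not necessarily the architecture drawn in class): concatenate the embeddings of the words in a fixed-size window around the target position, zero-padding at sentence boundaries, and feed the result to a feedforward network with a softmax over POS tags.

    import numpy as np

    def window_features(word_vectors, i, width=2):
        """word_vectors: (n_words, dim) array of embeddings for one sentence.
        Returns the concatenated window [i-width, i+width] around position i."""
        dim = word_vectors.shape[1]
        parts = []
        for j in range(i - width, i + width + 1):
            if 0 <= j < len(word_vectors):
                parts.append(word_vectors[j])
            else:
                parts.append(np.zeros(dim))   # pad past the sentence edges
        return np.concatenate(parts)          # input to the feedforward POS tagger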

Unsupervised Learning
Neural networks can be used for unsupervised learning.
Usual approach: the outputs to be predicted are derived from the input.
Popular task: representation learning
- Learn an embedding or representation of some vocabulary of objects using a neural network
- Usually, these embeddings are vectors taken from a hidden layer of the neural network
- These embeddings can then be used to compute similarity or relatedness, or in downstream tasks (remember distributional semantics)

word2vec (Mikolov et al., 2013)
Learn vector representations of words.
Actually two models:
- Continuous bag of words (CBOW): use the context words to predict a target word
- Skip-gram: use the target word to predict its context words
In both cases, the representation associated with the target word is the embedding that is learned.

word2vec Architectures
Figure from Mikolov et al. (2013)

Validation
Word similarity
Analogical reasoning task:
a : b :: c : d. What is d?
New York : New York Times :: Baltimore : ? (Answer: Baltimore Sun)
man : king :: woman : ? (Answer: queen)
Solve by ranking vocabulary items according to the assumption:
v_a - v_b ≈ v_c - v_d, or equivalently v_d ≈ v_b - v_a + v_c
(? ≈ king - man + woman ≈ queen)
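
A brute-force sketch of the ranking step, assuming vectors is a dict mapping each vocabulary word to a NumPy vector; it returns the word whose vector has the highest cosine similarity to v_b - v_a + v_c, excluding the three query words:

    import numpy as np

    def analogy(a, b, c, vectors):
        """Return d such that a : b :: c : d, ranked by similarity to v_b - v_a + v_c."""
        target = vectors[b] - vectors[a] + vectors[c]
        target = target / np.linalg.norm(target)
        best_word, best_sim = None, -np.inf
        for word, v in vectors.items():
            if word in (a, b, c):
                continue
            sim = np.dot(v, target) / np.linalg.norm(v)   # cosine similarity
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # analogy("man", "king", "woman", vectors)   # ideally returns "queen"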

Impact of Hyperparameters
There are many hyperparameters that are important to the performance of word2vec:
- Weighting of the relative importance of context words
- Sampling procedure during training for target and context words
These have a large impact on performance (Levy et al., 2015):
- Applying analogous changes to previous approaches such as Singular Value Decomposition results in similar performance on word similarity tasks.

Use of Word Embeddings in NLP
Can download pretrained word2vec embeddings from the web.
Use word vectors as features in other NLP tasks; now very widespread.
Advantages:
- Vectors are trained from a large amount of general-domain text: a proxy for background knowledge!
- Cheap and easy to try
Disadvantages:
- Doesn't always work (especially if you have a lot of task-specific data)

Other Neural Network Architectures
- Convolutional neural networks
- Recurrent neural networks
- Recursive neural networks

Convolutional Neural Networks (CNNs)
Apply a neural network to a window around elements in a sequence, then pool the results into a single vector.
- Convolution layer: much like the TDNN
- Pooling layer: combines the output vectors of the convolution layer
Figure from Goldberg (2015)
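
A minimal NumPy sketch of one convolution-plus-pooling step over a sequence of word vectors; the filter width, the tanh non-linearity, and the parameter shapes are illustrative choices, not prescribed by the slide:

    import numpy as np

    def conv_and_pool(X, W, b, width=3):
        """X: (n_words, dim) word vectors; W: (width*dim, n_filters); b: (n_filters,).
        Applies the same filters to every window, then max-pools over positions."""
        n, dim = X.shape
        convolved = []
        for i in range(n - width + 1):
            window = X[i:i + width].reshape(-1)         # concatenate the window's vectors
            convolved.append(np.tanh(window @ W + b))   # convolution layer (like the TDNN)
        return np.max(np.stack(convolved), axis=0)      # pooling layer: one value per filter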

Uses of CNNs
Widely used in image processing for object detection.
Intuition: the convolution + pooling is a kind of search through the data for the features that make up some object.
In NLP, CNNs are effective at similar tasks, which require searching for an indicative part of a sentence.
e.g., Kalchbrenner et al. (2014) apply CNNs to:
- Sentiment analysis
- Question type classification

Recurrent Neural Networks
A neural network sequence model:
RNN(s_0, x_{1:n}) = s_{1:n}, y_{1:n}
s_i = R(s_{i-1}, x_i)   # s_i: state vector
y_i = O(s_i)            # y_i: output vector
R and O are parts of the neural network that compute the next state vector and the output vector.
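
The slide leaves R and O abstract; one simple concrete choice (an Elman-style RNN), sketched in NumPy with illustrative parameter names:

    import numpy as np

    def rnn(x_seq, s0, W_s, W_x, b, W_o):
        """s_i = tanh(s_{i-1} W_s + x_i W_x + b); y_i = s_i W_o."""
        s, states, outputs = s0, [], []
        for x in x_seq:
            s = np.tanh(s @ W_s + x @ W_x + b)   # R: next state from previous state and input
            states.append(s)
            outputs.append(s @ W_o)              # O: output from the current state
        return states, outputs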

Long Short-Term Memory Networks
Currently one of the most popular RNN architectures for NLP (Hochreiter and Schmidhuber, 1997).
- Explicitly models a "memory" cell (i.e., a hidden-layer vector), and how it is updated in response to the current input and the previous state of the memory cell.
Visual step-by-step explanation: "Understanding LSTMs" (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Recursive Neural Networks
Similar to recurrent neural networks, but with a (static) tree structure.
What makes it recursive is that the composition function at each step is the same (the weights are tied).
Figure from Socher et al. (2013)

Advantages of Neural Networks
Learn relationships between inputs and outputs:
- Complex features and dependencies between inputs and states over long ranges, with no fixed-horizon assumption (i.e., non-Markovian)
- Reduced need for feature engineering
- More efficient use of input data via weight sharing
Highly flexible, generic architecture:
- Multi-task learning: jointly train a model that solves multiple tasks simultaneously
- Transfer learning: take part of a neural network used for an initial task and use it as the initialization for a second, related task

Challenges of Neural Networks in NLP
Complex models may need a lot of training data.
Many fiddly hyperparameters to tune, with little guidance on how to do so except empirically or through experience:
- Learning rate, number of hidden units, number of hidden layers, how to connect units, non-linearity, loss function, how to sample data, training procedure, etc.
It can be difficult to interpret the output of a system:
- Why did the model predict a certain label? You have to examine the weights in the network.
- Interpretability is important for convincing people to act on the outputs of the model!

NNs for NLP
Neural networks have "taken over" mainstream NLP since 2014; most empirical work at recent conferences uses them in some way.
Lots of interesting open research questions:
- How to use linguistic structure (e.g., word senses, parses, other resources) with NNs, either as input or output?
- When is linguistic feature engineering a good idea, rather than just throwing more data with a simple representation at the NN to learn the features?
- Multi-task and transfer learning for NLP
- Defining and solving new, challenging NLP tasks

References
Mikolov, Sutskever, Chen, Corrado, and Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
Hochreiter and Schmidhuber. Long Short-Term Memory. Neural Computation. 1997.
Kalchbrenner, Grefenstette, and Blunsom. A Convolutional Neural Network for Modelling Sentences. ACL 2014.
Levy, Goldberg, and Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL 2015.
Socher, Perelygin, Wu, Chuang, Manning, Ng, and Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.
