Deep Learning For NLP Part 2 - Stanford University

Transcription

Deep Learning for NLP, Part 2
CS224N
Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher and Yoshua Bengio)

Part 1.3: The Basics - Word Representations

The standard word representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk.
In vector space terms, this is a vector with one 1 and a lot of zeroes:
  [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a "one-hot" representation. Its problem:
  motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]  AND
  hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]  =  0
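To make the problem concrete, here is a minimal numpy sketch (the tiny vocabulary and the one_hot helper are made up for illustration, not part of the lecture) showing that any two distinct one-hot vectors are orthogonal, so "hotel" and "motel" look completely unrelated:

```python
# Minimal sketch: one-hot vectors carry no similarity information.
import numpy as np

vocab = ["the", "cat", "hotel", "motel", "walk"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[index[word]] = 1.0
    return v

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel @ motel)   # 0.0 -- "hotel" and "motel" appear totally unrelated
print(hotel @ hotel)   # 1.0 -- every word is only similar to itself
```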

Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors.
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP.
  ...government debt problems turning into banking crises as has happened in...
  ...saying that Europe needs unified banking regulation to replace the hodgepodge...
  -> These context words will represent "banking"
You can vary whether you use local or large context to get a more syntactic or semantic clustering.

Distributional word vectors
Distributional counts give a vector of the same dimensionality, but denser.
[Figure: a distributional count vector for "bank", with co-occurrence counts such as 719, 171, 152, 183, 138 over context words.]

Two traditional word representations: class-based and soft clustering
Class-based models learn word classes of similar words based on distributional information (~ class HMM):
  Brown clustering (Brown et al. 1992, Liang 2005)
  Exchange clustering (Martin et al. 1998, Clark 2003)
  Example clusters: {Clinton, Jiang, Bush, Wilensky, Suharto, Reagan, ...}, {also, still, already, currently, actually, typically, ...}, {recovery, strength, expansion, freedom, resistance, ...}
Soft clustering models learn, for each cluster/topic, a distribution over words of how likely that word is in each cluster:
  Latent Semantic Analysis (LSA/LSI), random projections
  Latent Dirichlet Allocation (LDA), HMM clustering

Neural word embeddings as a distributed representation
Similar idea: word meaning is represented as a (dense) vector, a point in a (medium-dimensional) vector space.
Neural word embeddings combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012).
Example vector for "linguistics": [0.286, 0.792, 0.177, 0.107, 0.109, 0.542, 0.349, 0.271]

Neural word embeddings visualization
[Figure: 2-D visualization of neural word embeddings.]

Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)

Distributed vs. distributional representations
Distributed: a concept is represented as continuous activation levels in a number of elements. [Contrast: local]
Distributional: meaning is represented by contexts of use. [Contrast: denotational]

Advantages of the neural word embedding approach
Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks.
For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors.
We can build compositional vector representations for longer phrases (next lecture).

Part 1.4: The Basics - Unsupervised word vector learning

The mainstream methods for neural word vector learning in 2015
1. word2vec  https://code.google.com/p/word2vec/
   Code for two different algorithms (Skip-gram with negative sampling - SGNS, and Continuous Bag of Words - CBOW).
   Uses simple, fast bilinear models to learn very good word representations.
   Mikolov, Sutskever, Chen, Corrado, and Dean. NIPS 2013.
2. GloVe  http://nlp.stanford.edu/projects/glove/
   A non-linear matrix factorization: starts with a count matrix and factorizes it with an explicit loss function.
   Sort of similar to SGNS, really, but from Stanford :)
   Pennington, Socher, and Manning. EMNLP 2014.
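As a rough illustration of the SGNS idea (a sketch, not the word2vec code; the random vectors, the sizes, and the sigmoid helper are made up), the loss for one (center, context, negatives) triple pushes the true pair's dot product up and the sampled pairs' dot products down:

```python
# Hedged sketch of the skip-gram-with-negative-sampling loss for one example;
# real word2vec adds subsampling, a unigram^0.75 negative-sampling
# distribution, and training over a large corpus.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss(v_center, u_context, u_negatives):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)"""
    pos = -np.log(sigmoid(u_context @ v_center))
    neg = -np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
    return pos + neg

rng = np.random.default_rng(0)
d = 50                                   # embedding dimension
v_c = rng.normal(size=d) * 0.1           # center word vector
u_o = rng.normal(size=d) * 0.1           # true context word vector
u_k = rng.normal(size=(5, d)) * 0.1      # 5 sampled negative context vectors
print(sgns_loss(v_c, u_o, u_k))
```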

But we won't look at either of those here. Instead: contrastive estimation of word vectors.
A neural network for learning word vectors (Collobert et al. JMLR 2011)
Idea: a word and its context is a positive training sample; a random word in that same context gives a negative training sample:
  positive:  cat chills on a mat
  negative:  cat chills Ohio a mat
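A small sketch of how corrupted windows might be generated (my own illustration, not Collobert et al.'s code; the toy vocabulary is made up): replace the center word of a true window with a random vocabulary word.

```python
# Sketch: build (true window, corrupted window) pairs by swapping the center
# word for a random vocabulary word, as in "cat chills on a mat" vs.
# "cat chills Ohio a mat".
import random

vocab = ["cat", "chills", "on", "a", "mat", "Ohio", "the", "sat"]

def corrupt(window):
    """Replace the center word with a random different vocabulary word."""
    center = len(window) // 2
    corrupted = list(window)
    corrupted[center] = random.choice([w for w in vocab if w != window[center]])
    return corrupted

true_window = ["cat", "chills", "on", "a", "mat"]
print(true_window, "->", corrupt(true_window))
```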

A neural network for learning word vectors
How do we formalize this idea? Ask that
  score(cat chills on a mat)  >  score(cat chills Ohio a mat)
How do we compute the score? With a neural network.
Each word is associated with an n-dimensional vector.

Word embedding matrix
Initialize all word vectors randomly to form a word embedding matrix L of size n x |V|, with one column per vocabulary word (the, cat, mat, ...).
These are the word features we want to learn. Also called a look-up table.
Mathematically, you get a word's vector by multiplying L with a one-hot vector e:  x = L e
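The look-up-table view can be checked directly in numpy (a sketch; the sizes are arbitrary): multiplying L by a one-hot vector just selects a column, which in practice is implemented as indexing rather than a matrix product.

```python
# x = L e is the same as picking column i of L; real implementations index
# directly instead of multiplying by a one-hot vector.
import numpy as np

n, V = 4, 10                      # embedding dimension, vocabulary size
rng = np.random.default_rng(0)
L = rng.normal(size=(n, V))       # word embedding matrix (look-up table)

i = 7                             # index of some word, e.g. "cat"
e = np.zeros(V); e[i] = 1.0       # one-hot vector for that word

assert np.allclose(L @ e, L[:, i])   # matrix product == column lookup
print(L[:, i])
```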

Word vectors as input to a neural network
score(cat chills on a mat)
To describe a phrase, retrieve (via index) the corresponding vectors from L: x_cat, x_chills, x_on, x_a, x_mat.
Then concatenate them to form a 5n-dimensional vector:  x = [x_cat; x_chills; x_on; x_a; x_mat]
How do we then compute score(x)?

Scoring with a single-layer neural network
A single layer is a combination of a linear layer and a nonlinearity:
  z = W x + b,   a = f(z)
The neural activations a can then be used to compute some function, for instance the score we care about:
  score(x) = U^T a
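Putting the last two slides together, a minimal numpy sketch of scoring a 5-word window with a single hidden layer (the sizes and the choice of a logistic f are my assumptions, not the course code):

```python
# score(x) = U^T f(W x + b), where x is the concatenation of the 5 word
# vectors in the window.
import numpy as np

def f(z):                                  # logistic nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

n, window, hidden = 4, 5, 3
rng = np.random.default_rng(0)
L = rng.normal(size=(n, 10)) * 0.1         # embedding matrix, toy vocab of 10
W = rng.normal(size=(hidden, n * window)) * 0.1
b = np.zeros(hidden)
U = rng.normal(size=hidden) * 0.1

def score(word_ids):
    x = np.concatenate([L[:, i] for i in word_ids])  # 5n-dimensional input
    a = f(W @ x + b)                                 # single layer + nonlinearity
    return U @ a                                     # scalar score

print(score([0, 1, 2, 3, 4]))   # e.g. "cat chills on a mat"
```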

Summary: Feed-forward computation
Computing a window's score with a 3-layer neural net:  s = score(cat chills on a mat)
[Figure: the window "cat chills on a mat" feeding into the network.]

Summary: Feed-forward computation
  s  = score(cat chills on a mat)
  sc = score(cat chills Ohio a mat)
Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're sufficiently separated). Minimize
  J = max(0, 1 - s + sc)
This is continuous, so we can perform SGD: look at a few examples, nudge the weights to make J smaller.
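A tiny sketch of that max-margin objective (the margin of 1 follows the formula above; the example scores are made up):

```python
# J = max(0, 1 - s + s_c): zero once the true window beats the corrupt one
# by a margin of 1, so such pairs contribute no gradient.
def margin_loss(s_true, s_corrupt):
    return max(0.0, 1.0 - s_true + s_corrupt)

print(margin_loss(2.0, 0.3))   # 0.0 -> already separated, do nothing
print(margin_loss(0.2, 0.5))   # 1.3 -> nudge weights to raise s and lower s_c
```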

The Backpropagation Algorithm
Backpropagation is a way of computing gradients of expressions efficiently through recursive application of the chain rule.
  Either Rumelhart, Hinton & McClelland 1986, or Werbos 1974, or Linnainmaa 1970.
The derivative of a loss (an objective function) on each variable tells you the sensitivity of the loss to its value.
An individual step of the chain rule is local: it tells you how the sensitivity of the objective function is modulated by that function in the network.
  The input changes at some rate for a variable.
  The function at the node scales that rate.

Training the parameters of the model
If cost J is 0 (the margin is already satisfied), the derivative is 0: do nothing.
Assuming cost J > 0, it is simple to see that we can compute the derivatives of s and sc with respect to all the involved variables: U, W, b, x. Then take differences for J.

Training with Backpropagation
Let's consider the derivative of a single weight W_ij.
This only appears inside a_i. For example: W_23 is only used to compute a_2.
[Figure: a small network with inputs x1, x2, x3 and a bias +1, hidden activations a1 and a2, weight W23, and output score s = U^T a.]

Training with Backpropagation
Derivative of weight W_ij:
  ∂s/∂W_ij = ∂(U^T a)/∂W_ij = U_i ∂a_i/∂W_ij = U_i f'(z_i) ∂z_i/∂W_ij = U_i f'(z_i) x_j

Training with Backpropagation
Derivative of a single weight W_ij:
  ∂s/∂W_ij = δ_i x_j
where δ_i = U_i f'(z_i) is the local error signal and x_j is the local input signal; for logistic f, f'(z) = f(z)(1 - f(z)).

Training with Backpropagation
From a single weight W_ij to the full W: we want all combinations of i = 1, 2 and j = 1, 2, 3.
Solution: the outer product
  ∂s/∂W = δ x^T = [ δ_1 x_1  δ_1 x_2  δ_1 x_3 ;  δ_2 x_1  δ_2 x_2  δ_2 x_3 ]
where δ is the "responsibility" coming from each activation a.
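This can be checked numerically (a sketch assuming the logistic f and the score s = U^T f(Wx + b) from the earlier slides; W, b, U, x are random toy values):

```python
# ds/dW = outer(delta, x) with delta_i = U_i * f'(z_i); compared against a
# finite-difference approximation of the same derivative.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

def score(W):
    return U @ f(W @ x + b)

z = W @ x + b
a = f(z)
delta = U * a * (1 - a)            # local error signal, f'(z) = f(z)(1 - f(z))
dW = np.outer(delta, x)            # ds/dW_ij = delta_i * x_j

eps = 1e-6                         # finite-difference check of dW[1, 2] (i.e. W_23)
Wp = W.copy(); Wp[1, 2] += eps
print(dW[1, 2], (score(Wp) - score(W)) / eps)   # should closely agree
```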

Training with Backpropagation
For the biases b, we get:
  ∂s/∂b_i = δ_i,  i.e.  ∂s/∂b = δ

Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!
Remaining trick: efficiency. We can re-use derivatives computed for higher layers when computing derivatives for lower layers.
Example: the last derivatives of the model, the word vectors in x.

Training with Backpropagation
Take the derivative of the score with respect to a single word vector (for simplicity a 1-dimensional x_j, but the same holds if it were longer).
Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of them, hence:
  ∂s/∂x_j = Σ_i (∂s/∂a_i)(∂a_i/∂x_j) = Σ_i δ_i W_ij   (re-using part of the previous derivative)
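The word-vector gradient re-uses the same δ (a self-contained sketch with the same toy setup as before, not the assignment code):

```python
# ds/dx_j = sum_i delta_i * W_ij, i.e. ds/dx = W^T delta -- every hidden unit
# touches x_j, so the contributions are summed; delta is re-used from before.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

a = f(W @ x + b)
delta = U * a * (1 - a)
dx = W.T @ delta                   # gradient that flows into the word vectors

eps = 1e-6                         # finite-difference check of dx[0]
xp = x.copy(); xp[0] += eps
print(dx[0], (U @ f(W @ xp + b) - U @ f(W @ x + b)) / eps)
```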

Part 1.6: The Basics - Learning word-level classifiers: POS and NER

The Model (Collobert & Weston 2008; Collobert et al. 2011)
Similar to word vector learning, but replaces the single scalar score with a Softmax/Maxent classifier.
Training is again done via backpropagation, which gives an error similar to (but not the same as!) the score in the unsupervised word vector learning model.

The Model - Training
We already know softmax/MaxEnt and how to optimize it.
The interesting twist in deep learning is that the input features are also learned, similar to learning word vectors with a score.
[Figure: two networks over the same inputs - one with a softmax layer S producing class outputs c1, c2, c3, and one producing the single score s.]

Training with Backpropagation: softmax
What is the major benefit of learned word vectors?
The ability to also propagate labeled information into them, via the softmax/maxent classifier and hidden layer:
  P(c | d, λ) = exp(λ^T f(c, d)) / Σ_c' exp(λ^T f(c', d))
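A sketch of that softmax layer (the class count, sizes, and names such as S are my assumptions): the cross-entropy error from the labels flows back through the hidden layer and, crucially, into the word vectors themselves.

```python
# Softmax over classes on top of the hidden layer: P(c | x) proportional to
# exp(S_c . a).  The error (p - y) backpropagates into W, and on into x.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
num_classes, hidden, n_in = 3, 2, 3
S = rng.normal(size=(num_classes, hidden)) * 0.1   # softmax weights
W = rng.normal(size=(hidden, n_in)) * 0.1
b = np.zeros(hidden)
x = rng.normal(size=n_in)                          # (concatenated) word vectors
y = np.array([0.0, 1.0, 0.0])                      # one-hot gold label

a = f(W @ x + b)
logits = S @ a
p = np.exp(logits - logits.max()); p /= p.sum()    # softmax probabilities

d_logits = p - y                                   # dJ/dlogits for cross-entropy
d_a = S.T @ d_logits
delta = d_a * a * (1 - a)
d_x = W.T @ delta                                  # labeled signal reaching the word vectors
print(d_x)
```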

For small supervised data sets, unsupervised pre-training helps a lot

                                                          POS: WSJ (acc.)   NER: CoNLL (F1)
  State-of-the-art*                                            97.24             89.31
  Supervised NN                                                96.37             81.47
  Unsupervised pre-training followed by supervised NN**        97.20             88.87
  + hand-crafted features***                                   97.29             89.59

* Results used in Collobert & Weston (2011). Representative systems: POS (Toutanova et al. 2003), NER (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer – for 7 weeks! – then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

Supervised refinement of the unsupervised word representation helps

                             POS: WSJ (acc.)   NER: CoNLL (F1)
  Supervised NN                  96.37             81.47
  NN with Brown clusters         96.92             87.15
  Fixed embeddings*              97.10             88.87
  C&W 2011**                     97.29             89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide

Part 1.5: The Basics - Backpropagation Training

Back-Prop
Compute the gradient of the example-wise loss with respect to the parameters.
Simply applying the derivative chain rule wisely.
If computing loss(example, parameters) is O(n) computation, then so is computing the gradient.

Simple Chain Rule
If z = f(y) and y = g(x), then  dz/dx = (dz/dy)(dy/dx)

Multiple Paths Chain Rule
If z depends on x through both y1 and y2, then  ∂z/∂x = (∂z/∂y1)(∂y1/∂x) + (∂z/∂y2)(∂y2/∂x)

Multiple Paths Chain Rule - General
  ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x), summing over all intermediate variables y_i on paths from x to z
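A tiny numeric check of the multi-path rule (the function z = x^2 * sin(x) is just an illustration, not from the slides):

```python
# dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx, summed over all paths from x to z,
# here with y1 = x^2 and y2 = sin(x).
import numpy as np

x = 0.7
y1, y2 = x**2, np.sin(x)
z = y1 * y2

analytic = y2 * 2*x + y1 * np.cos(x)     # path through y1 + path through y2
eps = 1e-6
numeric = ((x+eps)**2 * np.sin(x+eps) - z) / eps
print(analytic, numeric)                 # should closely agree
```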

Chain Rule in Flow Graph
Flow graph: any directed acyclic graph.
  node = computation result
  arc = computation dependency
If {y1, ..., yn} are the successors of x, then  ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)

Back-Prop in Multi-Layer Net
  h = sigmoid(V x)
[Figure: a multi-layer network computing h = sigmoid(Vx) on the way to the output.]

Back-Prop in General Flow Graph
Single scalar output z.
1. Fprop: visit nodes in topological-sort order; compute the value of each node given its predecessors.
2. Bprop:
   - initialize the output gradient to 1
   - visit nodes in reverse order: compute the gradient with respect to each node using the gradient with respect to its successors {y1, ..., yn}:
     ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)

Automatic Differentiation
The gradient computation can be automatically inferred from the symbolic expression of the fprop.
Each node type needs to know how to compute its output and how to compute the gradient with respect to its inputs given the gradient with respect to its output.
Easy and fast prototyping. See Theano (Python).
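A toy reverse-mode autodiff sketch in the spirit of these slides (my own minimal illustration, not Theano): each node records how to pass gradients to its parents, and bprop walks the graph in reverse topological order, summing over all paths.

```python
# Minimal reverse-mode automatic differentiation over a flow graph: each node
# knows its forward value and how to send gradients to its predecessors.
import numpy as np

class Node:
    def __init__(self, value, parents=(), backward=None):
        self.value = value
        self.parents = parents            # predecessor nodes
        self.backward = backward          # fn(grad_out) -> grads for parents
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), lambda g: (g, g))

def mul(a, b):
    return Node(a.value * b.value, (a, b), lambda g: (g * b.value, g * a.value))

def sigmoid(a):
    s = 1.0 / (1.0 + np.exp(-a.value))
    return Node(s, (a,), lambda g: (g * s * (1 - s),))

def bprop(output):
    order, seen = [], set()               # reverse topological order
    def topo(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(output)
    output.grad = 1.0                     # dz/dz = 1
    for node in reversed(order):
        if node.backward is not None:
            for parent, g in zip(node.parents, node.backward(node.grad)):
                parent.grad += g          # accumulate over all paths

x, w, b = Node(2.0), Node(-0.5), Node(0.1)
z = sigmoid(add(mul(w, x), b))            # z = sigma(w*x + b)
bprop(z)
print(x.grad, w.grad, b.grad)
```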

Part 1.7: Sharing statistical strength

Sharing Statistical Strength
Besides very fast prediction, the main advantage of deep learning is statistical.
Potential to learn from fewer labeled examples because of sharing of statistical strength:
  Unsupervised pre-training & multi-task learning
  Semi-supervised learning

Semi-Supervised Learning
Hypothesis: P(c | x) can be more accurately computed using shared structure with P(x).
[Figure: purely supervised setting.]

Semi-Supervised Learning
Hypothesis: P(c | x) can be more accurately computed using shared structure with P(x).
[Figure: semi-supervised setting.]

Multi-Task Learning
Generalizing better to new tasks is crucial to approach AI.
Deep architectures learn good intermediate representations that can be shared across tasks.
Good representations make sense for many tasks.
[Figure: task 1 output y1, task 2 output y2, task 3 output y3, all computed from a shared intermediate representation h over the raw input x.]
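A sketch of the architecture in the diagram (the layer sizes and the two-task setup are made up): one shared hidden representation h feeds several task-specific output layers, so gradients from every task update the shared weights.

```python
# Multi-task net: shared intermediate representation h = f(W x), plus one
# small output head per task.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden = 8, 4
W = rng.normal(size=(n_hidden, n_in)) * 0.1               # shared parameters
heads = {"task1": rng.normal(size=(2, n_hidden)) * 0.1,   # task-specific outputs
         "task2": rng.normal(size=(3, n_hidden)) * 0.1}

x = rng.normal(size=n_in)                                 # raw input
h = np.tanh(W @ x)                                        # shared representation
outputs = {task: U @ h for task, U in heads.items()}
print({task: y.shape for task, y in outputs.items()})
```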
