Natural Language Processing With Deep Learning

Transcription

Natural Language Processing with Deep Learning
Christopher Manning and Richard Socher
Lecture 10: (Textual) Question Answering, Architectures, Attention and Transformers

Mid-quarter feedback survey
Thanks to the many of you (!) who have filled it in! If you haven't yet, today is a good time to do it.

Lecture Plan
Lecture 10: (Textual) Question Answering
1. History / The SQuAD dataset (review)
2. The Stanford Attentive Reader model
3. BiDAF
4. Recent, more advanced architectures
5. Open-domain Question Answering: DrQA
6. Attention revisited; motivating transformers; ELMo and BERT preview
7. Training/dev/test data
8. Getting your neural network to train

1. Turn-of-the-Millennium Full NLP QA
[Architecture of the LCC (Harabagiu/Moldovan) QA system, circa 2003]
Complex systems, but they did work fairly well on "factoid" questions.
[Figure: system diagram with Question Processing (question parse, recognition of the expected answer type for NER, keyword extraction), Passage Retrieval over a document index/collection, and separate Factoid, List, and Definition Answer Processing modules (answer extraction via NER and pattern matching, answer justification via alignment, relations, an axiomatic knowledge base and a theorem prover, answer reranking), using an answer type hierarchy from WordNet and the CICERO LITE named entity recognizer.]

Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
Question: Which team won Super Bowl 50?
Passage: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.
100k examples. The answer must be a span in the passage: extractive question answering / reading comprehension.

SQuAD 2.0 No Answer Example
Question: When did Genghis Khan kill Great Khan?
Gold answer: No Answer. Prediction: 1234 [from Microsoft nlnet]

2. Stanford Attentive Reader
[Chen, Bolton & Manning 2016] [Chen, Fisch, Weston & Bordes 2017] DrQA [Chen 2018]
Demonstrated a minimal, highly successful architecture for reading comprehension and question answering. Became known as the Stanford Attentive Reader.

The Stanford Attentive Reader
[Figure: Input is a passage (P) and a question (Q), e.g. "Which team won Super Bowl 50?"; output is an answer (A).]

Stanford Attentive Reader
[Figure: The question Q ("Who did Genghis Khan unite before he began conquering the rest of Eurasia?") and the passage P are each encoded with bidirectional LSTMs, producing passage token representations p_i.]

Stanford Attentive Reader
[Figure: Attention between the question representation and the passage BiLSTM states p_i is used to predict the start token of the answer span, and a second attention predicts the end token.]

SQuAD 1.1 Results (single model, c. Feb 2017), F1:
Logistic regression: 51.0
Fine-Grained Gating (Carnegie Mellon U): 73.3
Match-LSTM (Singapore Management U): 73.7
DCN (Salesforce): 75.9
BiDAF (UW & Allen Institute): 77.3
Multi-Perspective Matching (IBM): 78.7
ReasoNet (MSR Redmond): 79.4
DrQA (Chen et al. 2017): 79.4
r-net (MSR Asia) [Wang et al., ACL 2017]: 79.7
Google Brain / CMU (Feb 2018): 88.0
Human performance: 91.2

Stanford Attentive Reader
[Figure from SLP3, Chapter 23 (Figure 23.8): the question answering system of Chen et al. (2017), considering part of the question "When did Beyoncé release Dangerously in Love?" and the passage starting "Beyoncé's debut album, Dangerously in Love...". GloVe embeddings of the question tokens q_1, q_2, q_3 and passage tokens p_1, p_2, p_3 feed BiLSTMs; question-passage similarity scores and an aligned question embedding (q-align) are used to decide where the answer span starts and ends, producing p_start(i) and p_end(i) for each passage position. The training objective is the likelihood of the correct start and end positions.]

Stanford Attentive Reader (Chen et al., 2018)
The question ("Which team won Super Bowl 50?") is encoded as a weighted sum of its BiLSTM states:
q = \sum_j b_j q_j, where b_j = \frac{\exp(\mathbf{w}^\top q_j)}{\sum_{j'} \exp(\mathbf{w}^\top q_{j'})} for a learned vector \mathbf{w}.
A deep 3-layer BiLSTM is better!

Stanford Attentive Reader
p_i: vector representation of each token in the passage, made from a concatenation of:
- Word embedding (GloVe 300d)
- Linguistic features: POS & NER tags, one-hot encoded
- Term frequency (unigram probability)
- Exact match: whether the word appears in the question (3 binary features: exact, uncased, lemma)
- Aligned question embedding ("car" vs "vehicle"): a soft attention over question word embeddings, where the attention score is computed with \alpha, a simple one-layer FFNN (see the sketch below)
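To make the aligned question embedding and the weighted question representation concrete, here is a minimal PyTorch sketch. This is not the authors' code; the tensor names (p_emb, q_emb, q_states) and the exact hidden size of the alpha network are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedQuestionEmbedding(nn.Module):
    """Soft attention from each passage word to the question words (sketch)."""
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        # alpha: a simple one-layer FFNN applied to both passage and question embeddings
        self.alpha = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())

    def forward(self, p_emb, q_emb):
        # p_emb: (passage_len, embed_dim), q_emb: (question_len, embed_dim)
        scores = self.alpha(p_emb) @ self.alpha(q_emb).T   # (passage_len, question_len)
        a = F.softmax(scores, dim=-1)                      # attention over question words
        return a @ q_emb                                   # (passage_len, embed_dim)

def question_vector(q_states, w):
    # q_states: (question_len, d) BiLSTM states of the question; w: learned (d,) vector
    b = F.softmax(q_states @ w, dim=0)                     # b_j proportional to exp(w . q_j)
    return b @ q_states                                    # weighted sum q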

What do these neural models do? (Chen, Bolton, Manning, 2016)
[Figure: bar chart of correctness (%) comparing a neural network reader against a categorical feature classifier, broken down by example category (Easy, Partial, Hard/Error) with the proportion of examples in each category.]

3. BiDAF: Bi-Directional Attention Flow for Machine Comprehension
(Seo, Kembhavi, Farhadi, Hajishirzi, ICLR 2017)

BiDAF – roughly the CS224N DFP (default final project) baseline
There have been variants of and improvements to the BiDAF architecture over the years, but the central idea is the Attention Flow layer.
Idea: attention should flow both ways, from the context to the question and from the question to the context.
Make a similarity matrix (with w of dimension 6d), then compute Context-to-Question (C2Q) attention: which query words are most relevant to each context word.

BiDAF Attention Flow
Idea: attention should flow both ways, from the context to the question and from the question to the context.
Question-to-Context (Q2C) attention: the weighted sum of the most important words in the context with respect to the query (with a slight asymmetry through the max).
For each passage position, the output of the BiDAF layer combines the context representation with the C2Q and Q2C attention vectors (see the sketch below).
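A minimal sketch of the attention flow layer, assuming the standard BiDAF formulation (similarity S_ij = w_sim^T [c_i; q_j; c_i * q_j], C2Q attention over question words, Q2C attention via a max over each similarity row, and an output concatenation per position). Shapes and names are illustrative, not the authors' code.

import torch
import torch.nn.functional as F

def bidaf_attention(C, Q, w_sim):
    # C: (N, 2d) context BiLSTM states, Q: (M, 2d) question BiLSTM states
    # w_sim: (6d,) learned similarity weight vector
    N, M = C.size(0), Q.size(0)
    Ce = C.unsqueeze(1).expand(N, M, -1)               # (N, M, 2d)
    Qe = Q.unsqueeze(0).expand(N, M, -1)               # (N, M, 2d)
    S = torch.cat([Ce, Qe, Ce * Qe], dim=-1) @ w_sim   # (N, M) similarity matrix

    # Context-to-Question: for each context word, attend over the question words
    alpha = F.softmax(S, dim=1)                        # (N, M)
    A = alpha @ Q                                      # (N, 2d) attended question vectors a_i

    # Question-to-Context: attend over context words via the max over question words
    beta = F.softmax(S.max(dim=1).values, dim=0)       # (N,)
    b = (beta.unsqueeze(0) @ C).squeeze(0)             # (2d,) single attended context vector

    # Output per position: [c_i; a_i; c_i * a_i; c_i * b]
    return torch.cat([C, A, C * A, C * b.expand_as(C)], dim=-1)   # (N, 8d)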

BiDAF
There is then a "modelling" layer: another deep (2-layer) BiLSTM over the passage.
And answer span selection is more complex:
- Start: pass the output of the BiDAF and modelling layers, concatenated, to a dense FF layer and then a softmax
- End: put the output of the modelling layer M through another BiLSTM to give M2, then concatenate that with the BiDAF layer output and again put it through a dense FF layer and a softmax
Editorial: seems very complex, but it does seem like you should do a bit more than the Stanford Attentive Reader, e.g., conditioning the end also on the start. (A sketch of the span prediction follows.)
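One way the start/end prediction could look in code. This is a hedged sketch of the description above; the exact layer sizes and the bidaf_out / M names are assumptions, not the BiDAF reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPredictor(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.start_ff = nn.Linear(8 * d + 2 * d, 1)   # BiDAF output (8d) + modelling layer M (2d)
        self.end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
        self.end_ff = nn.Linear(8 * d + 2 * d, 1)

    def forward(self, bidaf_out, M):
        # bidaf_out: (N, 8d), M: (N, 2d) modelling-layer BiLSTM output
        p_start = F.log_softmax(self.start_ff(torch.cat([bidaf_out, M], -1)).squeeze(-1), dim=0)
        M2, _ = self.end_lstm(M.unsqueeze(0))          # another BiLSTM over M gives M2
        p_end = F.log_softmax(self.end_ff(torch.cat([bidaf_out, M2.squeeze(0)], -1)).squeeze(-1), dim=0)
        return p_start, p_end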

4. Recent, more advanced architectures
Most of the question answering work in 2016–2018 employed progressively more complex architectures with a multitude of variants of attention, often yielding good task gains.

Dynamic Coattention Networks for Question Answering (Caiming Xiong, Victor Zhong, Richard Socher, ICLR 2017)
Flaw of simpler models: questions have input-independent representations, but interdependence is needed for a comprehensive QA model.
[Figure 1: overview of the DCN. A document encoder and a question encoder feed a coattention encoder, and a dynamic pointer decoder produces the answer span. Example: for the question "What plants create most electric power?" over a passage about steam turbine plants, the decoder outputs start index 49, end index 51: "steam turbine plants".]

Coattention Encoder
[Figure 2: Coattention encoder. The document and question are encoded with bi-LSTMs into D and Q; the affinity matrix L is not shown, and instead the normalized attention weights A^D and A^Q are shown directly. Summaries Q A^D of the question are computed in light of each word of the document and, similar to Cui et al. (2016), summaries C^Q A^D of the previous attention contexts are computed as well.]

Coattention layer
The coattention layer again provides two-way attention between the context and the question.
However, coattention involves a second-level attention computation: attending over representations that are themselves attention outputs.
We use the C2Q attention distributions \alpha_i to take weighted sums of the Q2C attention outputs b_j. This gives us second-level attention outputs s_i (see the sketch below).
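A sketch of just this second-level step, assuming the C2Q distributions alpha (one row per context position) and the Q2C outputs b_j (one per question position) have already been computed as in the earlier attention sketch.

import torch

def second_level_attention(alpha, B):
    # alpha: (N, M) C2Q attention distributions; row i is alpha_i over question positions
    # B: (M, d) Q2C attention outputs b_j, one per question position
    # s_i = sum_j alpha_{i,j} * b_j  -> second-level attention outputs
    return alpha @ B   # (N, d)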

Coattention: Results on the SQuAD Competition
Since a document-question pair may have several ground-truth answers, the EM and F1 for a document-question pair are taken to be the maximum value across all ground-truth answers; the overall metric is then computed by averaging over all document-question pairs. The official SQuAD evaluation is hosted on CodaLab. The training and development sets are publicly available while the test set is withheld.
[Table 1: Leaderboard performance at the time of writing (Nov 4, 2016): Dev/Test EM and F1 for ensemble and single-model systems, including the DCN, Microsoft Research Asia, Allen Institute, Singapore Management University, Google NYC, Carnegie Mellon University, Dynamic Chunk Reader (Yu et al., 2016), Match-LSTM (Wang & Jiang, 2016), the logistic regression baseline (Rajpurkar et al., 2016), and human performance (Rajpurkar et al., 2016). Results are as of the time of the ICLR submission.]

FusionNet (Huang, Zhu, Shen, Chen 2017)
Attention functions:
MLP (additive) form: S_{ij} = s^\top \tanh(W_1 c_i + W_2 q_j). Space: O(mnk), with W of size k x d.
Bilinear (product) form:
S_{ij} = c_i^\top W q_j
S_{ij} = c_i^\top U^\top V q_j = (U c_i)^\top (V q_j)   (1. smaller space; space: O((m+n)k))
S_{ij} = c_i^\top W^\top D W q_j
S_{ij} = \mathrm{ReLU}(c_i^\top W^\top)\, D\, \mathrm{ReLU}(W q_j)   (2. non-linearity)
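The similarity forms above, written out as a sketch. The dimension names m, n, d, k follow the slide; the matrices here are random placeholders rather than learned parameters.

import torch

m, n, d, k = 30, 10, 100, 20          # context length, question length, hidden dim, low rank
C, Q = torch.randn(m, d), torch.randn(n, d)

# MLP (additive) form: S_ij = s^T tanh(W1 c_i + W2 q_j)   -- O(mnk) intermediate space
W1, W2, s = torch.randn(k, d), torch.randn(k, d), torch.randn(k)
S_mlp = torch.tanh((C @ W1.T).unsqueeze(1) + (Q @ W2.T).unsqueeze(0)) @ s   # (m, n)

# Bilinear (product) forms
W = torch.randn(d, d)
S_bilinear = C @ W @ Q.T                                    # S_ij = c_i^T W q_j

U, V = torch.randn(k, d), torch.randn(k, d)
S_lowrank = (C @ U.T) @ (V @ Q.T)                           # S_ij = (U c_i)^T (V q_j), smaller space

Wk = torch.randn(k, d)
D = torch.diag(torch.randn(k))
S_sym = (C @ Wk.T) @ D @ (Wk @ Q.T)                         # S_ij = c_i^T W^T D W q_j

S_relu = torch.relu(C @ Wk.T) @ D @ torch.relu(Wk @ Q.T)    # add a non-linearity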

FusionNet tries to combine many forms of attention.

Multi-level inter-attention
After multi-level inter-attention, use an RNN, self-attention, and another RNN to obtain the final representation of the context: {u_i^C}.

Recent, more advanced architectures
Most of the question answering work in 2016–2018 employed progressively more complex architectures with a multitude of variants of attention, often yielding good task gains.

SQuAD limitations
SQuAD has a number of key limitations:
- Only span-based answers (no yes/no, counting, implicit why)
- Questions were constructed looking at the passages, so they are not genuine information needs
- Generally greater lexical and syntactic matching between questions and answer spans than you get IRL
- Barely any multi-fact/sentence inference beyond coreference
Nevertheless, it is a well-targeted, well-structured, clean dataset:
- It has been the most used and competed-on QA dataset
- It has also been a useful starting point for building systems in industry (though in-domain data always really helps!)
- And we're using it (SQuAD 2.0)

5. Open-domain Question Answering
DrQA (Chen et al., ACL 2017) https://arxiv.org/abs/1704.00051
[Figure: Q: "How many of Warsaw's inhabitants spoke Polish in 1933?" goes through a Document Retriever and then a Document Reader, which returns the answer 833,500.]

Document Retriever
Traditional tf.idf inverted index plus efficient bigram hashing.
For 70–86% of questions, the answer segment appears in the top 5 articles.
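Not DrQA's actual implementation, but a minimal sketch of the same idea (hashed unigram+bigram tf-idf retrieval) using scikit-learn; the docs list and question here are placeholders.

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel

docs = ["Warsaw is the capital of Poland ...", "Super Bowl 50 was an American football game ..."]
question = "How many of Warsaw's inhabitants spoke Polish in 1933?"

# Hash uni+bigrams into a fixed-size space (the "efficient bigram hash"), then tf-idf weight them
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**20, alternate_sign=False)
tfidf = TfidfTransformer()
doc_vecs = tfidf.fit_transform(hasher.transform(docs))
q_vec = tfidf.transform(hasher.transform([question]))

# Rank documents by inner-product similarity and keep the top 5
scores = linear_kernel(q_vec, doc_vecs).ravel()
top5 = scores.argsort()[::-1][:5]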

DrQA Demo

General questions
Combined with Web search, DrQA can answer 57.5% of trivia questions correctly.
Q: The Dodecanese Campaign of WWII that was an attempt by the Allied forces to capture islands in the Aegean Sea was the inspiration for which acclaimed 1961 commando film?
A: The Guns of Navarone
Q: American Callan Pinckney's eponymously named system became a best-selling (1980s-2000s) book/video franchise in what genre?
A: Fitness

6. LSTMs, attention, and transformers intro

SQuAD v1.1 leaderboard, 2019-02-07

Gated Recurrent Units, again
Intuitively, what happens with RNNs?
1. Measure the influence of the past on the future:
\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \cdot \frac{\partial g}{\partial h_{t+n}} \cdot \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}
2. How does the perturbation at t affect p(x_{t+n} \mid x_{<t+n})?

Gated Recurrent Units: LSTM & GRU
The signal and error must propagate through all the intermediate nodes. Perhaps we can create shortcut connections.

Gated Recurrent Unit
Perhaps we can create adaptive shortcut connections: let the net prune unnecessary connections adaptively.
Candidate update: \tilde h_t = \tanh(W[x_t] + U(r_t \odot h_{t-1}) + b)
Reset gate: r_t = \sigma(W_r[x_t] + U_r h_{t-1} + b_r)
Update gate: u_t = \sigma(W_u[x_t] + U_u h_{t-1} + b_u)
(\odot: element-wise multiplication)

Gated Recurrent Unit
A tanh-RNN has registers h, and its execution is:
1. Read the whole register h
2. Update the whole register: h \leftarrow \tanh(W[x] + U h + b)

Gated Recurrent Unit
A GRU has registers h, and its execution is:
1. Select a readable subset r
2. Read the subset: r \odot h
3. Select a writable subset u
4. Update the subset: h \leftarrow u \odot \tilde h + (1 - u) \odot h
Gated recurrent units are much more realistic for computation!

Gated Recurrent Units: LSTM & GRU
The two most widely used gated recurrent units: GRU and LSTM.

Gated Recurrent Unit [Cho et al., EMNLP 2014; Chung, Gulcehre, Cho, Bengio, DLUFL 2014]:
h_t = u_t \odot \tilde h_t + (1 - u_t) \odot h_{t-1}
\tilde h_t = \tanh(W[x_t] + U(r_t \odot h_{t-1}) + b)
u_t = \sigma(W_u[x_t] + U_u h_{t-1} + b_u)
r_t = \sigma(W_r[x_t] + U_r h_{t-1} + b_r)

Long Short-Term Memory [Hochreiter & Schmidhuber, NC 1997; Gers, Thesis 2001]:
h_t = o_t \odot \tanh(c_t)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t
\tilde c_t = \tanh(W_c[x_t] + U_c h_{t-1} + b_c)
o_t = \sigma(W_o[x_t] + U_o h_{t-1} + b_o)
i_t = \sigma(W_i[x_t] + U_i h_{t-1} + b_i)
f_t = \sigma(W_f[x_t] + U_f h_{t-1} + b_f)
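The GRU equations above, written directly as a single-step sketch. The weight shapes are the obvious ones implied by the equations and are not tied to any particular library's parameter layout.

import torch

def gru_step(x_t, h_prev, params):
    # params: dict of weight matrices W*, U* and biases b* matching the equations above
    W, U, b = params["W"], params["U"], params["b"]
    Wu, Uu, bu = params["Wu"], params["Uu"], params["bu"]
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]

    u = torch.sigmoid(Wu @ x_t + Uu @ h_prev + bu)         # update gate
    r = torch.sigmoid(Wr @ x_t + Ur @ h_prev + br)         # reset gate
    h_tilde = torch.tanh(W @ x_t + U @ (r * h_prev) + b)   # candidate update
    return u * h_tilde + (1 - u) * h_prev                  # h_t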

Attention Mechanism
Started in computer vision! [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012]
Became famous in NMT / neural language modeling.
[Figure: while generating "Je suis étudiant", the decoder attends over a pool of source states for "I am a student".]
A second solution: random access memory
- Retrieve past info as needed (but usually average)
- Usually do content-similarity based addressing
- Other things like positional addressing are occasionally tried

ELMo and BERT preview
Contextual word representations, using language model-like objectives.
ELMo (Peters et al., 2018); BERT (Devlin et al., 2018). The transformer architecture (Vaswani et al., 2017) used in BERT is sort of attention on steroids.
Look at SDNet as an example of how to use BERT as a submodule: https://arxiv.org/abs/1812.03593

The Motivation for Transformers
- We want parallelization, but RNNs are inherently sequential
- Despite LSTMs, RNNs generally need an attention mechanism to deal with long-range dependencies, since otherwise the path length between states grows with distance
- But if attention gives us access to any state, maybe we can just use attention and don't need the RNN?
- And then NLP can have deep models and solve our vision envy

Transformer (Vaswani et al. 2017)
"Attention is all you need" https://arxiv.org/pdf/1706.03762.pdf
- A non-recurrent sequence (or sequence-to-sequence) model
- A deep model with a sequence of attention-based transformer blocks
- Depth allows a certain amount of lateral information transfer in understanding sentences, in slightly unclear ways
- The final cost/error function is standard cross-entropy error on top of a softmax classifier
- Initially built for NMT
[Figure: encoder and decoder stacks of transformer blocks (12x in the figure) feeding a softmax output layer.]

Transformer block
Each block has two "sublayers":
1. Multi-head attention
2. A 2-layer feed-forward NNet (with ReLU)
Each of these two steps also has:
- A residual (short-circuit) connection
- LayerNorm (scale to mean 0, variance 1; Ba et al. 2016)
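A sketch of one such block using PyTorch's built-in multi-head attention, mirroring the two-sublayer structure described above. The hyperparameters are illustrative, and details like dropout and attention masking are omitted.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)              # sublayer 1: multi-head self-attention
        x = self.norm1(x + a)                  # residual connection + LayerNorm
        x = self.norm2(x + self.ff(x))         # sublayer 2: feed-forward NNet, residual + LayerNorm
        return x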

Multi-head (self) attention
With simple self-attention there is only one way for a word to interact with the others.
Solution: multi-head attention.
Map the input x^{(l)} into h = 12 many lower-dimensional spaces via W_j matrices, then apply attention in each space, concatenate the outputs, and pipe them through a linear layer:
head_j = Attention(x_i^{(l)} W_j^Q, x^{(l)} W_j^K, x^{(l)} W_j^V)
MultiHead(x_i^{(l)}) = Concat(head_1, ..., head_h) W^{(l)}
So attention is like a bilinear form: x_i^{(l)\top} (W_j^Q (W_j^K)^\top) x^{(l)}
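Spelling out the per-head projections from the equations above as a from-scratch sketch; d_model = 768 and h = 12 are chosen to match the h = 12 heads on the slide, not any particular released model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        # W^Q, W^K, W^V for all heads at once, plus the output projection
        self.Wq, self.Wk, self.Wv = (nn.Linear(d_model, d_model) for _ in range(3))
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (seq_len, d_model)
        n = x.size(0)
        # project, then split into h lower-dimensional heads of size d_k
        q, k, v = (W(x).view(n, self.h, self.d_k).transpose(0, 1) for W in (self.Wq, self.Wk, self.Wv))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5       # (h, n, n)
        heads = F.softmax(scores, dim=-1) @ v                    # attention within each head
        return self.Wo(heads.transpose(0, 1).reshape(n, -1))     # concat heads, then linear layer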

Encoder Input
The actual word representations are word pieces (byte pair encoding): a topic of next week.
A positional encoding is also added, so that the same word at different locations has a different overall representation (see the sketch below).
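The Transformer paper's sinusoidal positional encoding, as a sketch (BERT instead learns position embeddings); max_len and d_model here are arbitrary.

import torch

def sinusoidal_positional_encoding(max_len=512, d_model=768):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)    # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)             # even dimensions
    angles = pos / (10000 ** (i / d_model))                        # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe    # added to the word-piece embeddings before the first block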

BERT: Devlin, Chang, Lee, Toutanova (2018)
BERT (Bidirectional Encoder Representations from Transformers): pre-training of deep bidirectional transformers for language understanding, which is then fine-tuned for a particular task.
Pre-training uses a cloze task formulation where 15% of words are masked out and predicted:
the man went to the [MASK] to buy a [MASK] of milk  (targets: store, gallon)
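A toy sketch of the 15% masking idea. Real BERT additionally replaces some selected tokens with random words or leaves them unchanged; that refinement is omitted here, and the token list is just an example.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly mask tokens for a cloze-style pre-training objective (sketch)."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok          # the model must predict the original token here
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the man went to the store to buy a gallon of milk".split())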

Transformer (Vaswani et al. 2017), BERT (Devlin et al. 2018)
[Figure: a stack of 12 transformer layers over the input tokens "[CLS] Judiciary Committee [MASK] Report"; each position has query, key, and value vectors (Q_0/K_0/V_0 through Q_4/K_4/V_4) and produces hidden states h_{0,0} through h_{0,4}.]

7. Pots of data
- Many publicly available datasets are released with a train/dev/test structure. We're all on the honor system to do test-set runs only when development is complete.
- Splits like this presuppose a fairly large dataset.
- If there is no dev set or you want a separate tune set, then you create one by splitting the training data, though you have to weigh its size/usefulness against the reduction in train-set size.
- Having a fixed test set ensures that all systems are assessed against the same gold data. This is generally good, but it is problematic where the test set turns out to have unusual properties that distort progress on the task.

Training models and pots of data
- When training, models overfit to what you are training on: the model correctly describes what happened to occur in the particular data you trained on, but the patterns are not general enough to be likely to apply to new data.
- The way to monitor and avoid problematic overfitting is to use independent validation and test sets.

Training models and pots of data
- You build (estimate/train) a model on a training set.
- Often, you then set further hyperparameters on another, independent set of data, the tuning set. The tuning set is the training set for the hyperparameters!
- You measure progress as you go on a dev set (development test set or validation set). If you do that a lot, you overfit to the dev set, so it can be good to have a second dev set, the dev2 set.
- Only at the end do you evaluate and present final numbers on a test set. Use the final test set extremely few times, ideally only once.

Training models and pots of data
- The train, tune, dev, and test sets need to be completely distinct.
- It is invalid to test on material you have trained on: you will get a falsely good performance. We usually overfit on train.
- You need an independent tuning set: the hyperparameters won't be set right if the tune set is the same as train.
- If you keep running on the same evaluation set, you begin to overfit to that evaluation set. Effectively you are "training" on the evaluation set: you are learning things that do and don't work on that particular eval set and using the info.
- To get a valid measure of system performance you need another untrained-on, independent test set: hence dev2 and the final test set.

8. Getting your neural network to train
- Start with a positive attitude! Neural networks want to learn! If the network isn't learning, you're doing something to prevent it from learning successfully.
- Realize the grim reality: there are lots of things that can cause neural nets not to learn at all, or to not learn very well. Finding and fixing them ("debugging and tuning") can often take more time than implementing your model.
- It's hard to work out what these things are, but experience, experimental care, and rules of thumb help!

Models are sensitive to learning rates
From Andrej Karpathy, CS231n course notes

Models are sensitive to initialization
From Michael Nielsen, http://neuralnetworksanddeeplearning.com/chap3.html

Training a gated RNN
1. Use an LSTM or GRU: it makes your life so much simpler!
2. Initialize recurrent matrices to be orthogonal
3. Initialize other matrices with a sensible (small!) scale
4. Initialize the forget gate bias to 1: default to remembering
5. Use adaptive learning rate algorithms: Adam, AdaDelta, ...
6. Clip the norm of the gradient: 1–5 seems to be a reasonable threshold when used together with Adam or AdaDelta
7. Either only dropout vertically or look into using Bayesian Dropout (Gal and Ghahramani; not natively in PyTorch)
8. Be patient! Optimization takes time
A minimal code sketch of several of these tips follows.
[Saxe et al., ICLR 2014; Kingma & Ba, ICLR 2015; Zeiler, arXiv 2012; Pascanu et al., ICML 2013]
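A minimal PyTorch sketch combining several of these tips (orthogonal recurrent init, forget-gate bias of 1, Adam, gradient-norm clipping). The model and data are placeholders, and the bias slicing assumes PyTorch's LSTM gate ordering (input, forget, cell, output).

import torch
import torch.nn as nn

hidden = 200
lstm = nn.LSTM(input_size=100, hidden_size=hidden, num_layers=1)

# Tips 2-4: orthogonal recurrent matrices, sensibly scaled input matrices, forget-gate bias = 1
for name, p in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(p)
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(p)
    elif "bias" in name:
        nn.init.zeros_(p)
        if "bias_ih" in name:
            p.data[hidden:2 * hidden] = 1.0   # forget-gate slice in PyTorch's gate ordering

# Tips 5-6: adaptive learning rate plus gradient-norm clipping
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

x = torch.randn(35, 16, 100)                  # (seq_len, batch, input_size) placeholder batch
out, _ = lstm(x)
loss = out.pow(2).mean()                      # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)
optimizer.step()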

Experimental strategy
- Work incrementally!
- Start with a very simple model and get it to work! It's hard to fix a complex but broken model.
- Add bells and whistles one by one and get the model working with each of them (or abandon them).
- Initially run on a tiny amount of data: you will see bugs much more easily on a tiny dataset. Something like 4–8 examples is good, and often synthetic data is useful for this.
- Make sure you can get 100% on this data; otherwise your model is definitely either not powerful enough or it is broken.

Experimental strategy
- Run your model on a large dataset: it should still score close to 100% on the training data after optimization. Otherwise, you probably want to consider a more powerful model. Overfitting to training data is not something to be scared of when doing deep learning: these models are usually good at generalizing because of the way distributed representations share statistical strength, regardless of overfitting to training data.
- But, still, you now want good generalization performance: regularize your model until it doesn't overfit on dev data. Strategies like L2 regularization can be useful, but normally generous dropout is the secret to success.

Details matter!
- Look at your data, collect summary statistics
- Look at your model's outputs, do error analysis
- Tuning hyperparameters is really important to almost all of the successes of NNets

Good luck with your projects!
