Pre-Training For Generation - Harvard NLP


Pretraining for Generation
Alexander Rush (Zack Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann)
HarvardNLP / Cornell Tech

Overview: Motivation, Current and Classical Approaches, Models, Experiments, Challenges

Summarization

Source: London, England (reuters) – Harry Potter star Daniel Radcliffe gains access to a reported 20 million fortune as he turns 18 on monday, but he insists the money won’t cast a spell on him. Daniel Radcliffe as harry potter in “Harry Potter and the Order of the Phoenix” to the disappointment of gossip columnists around the world , the young actor says he has no plans to fritter his cash away on fast cars , drink and celebrity parties . “ i do n’t plan to be one of those people who , as soon as they turn 18 , suddenly buy themselves a massive sports car collection

Summary: Harry Potter star Daniel Radcliffe gets 20m fortune as he turns 18 monday. Young actor says he has no plans to fritter his fortune away.

Common Summarization Mistakes

Mammoth wave of snow darkens the sky over Everest base camp. Appearing like a white mushroom cloud roaring, they scurry as their tents flap like feathers in the wind. Cursing and breathing heavily, they wait until the pounding is over.

Gehrmann et al. 2018

Problem: How can we learn the general properties of long-form language (discourse, reference, etc.) from a specific NLG dataset (summarization, data-to-text, image captioning, dialogue, etc.)?

Motivation: Long-Form Generation (Lambada)

They tuned, discussed for a moment, then struck up a lively jig. Everyone joined in, turning the courtyard into an even more chaotic scene, people now dancing in circles, swinging and spinning in circles, everyone making up their own dance steps. I felt my feet tapping, my body wanting to move. Aside from writing, I ’ve always loved dancing

Paperno et al. 2016

Lambada: Specialized Structure

LSTM: 21.8
Hoang et al. (2018): 59.2

Specialized attention-based model with a kitchen sink of entity-tracking features and multi-task learning.

GPT-2: Impact of Model Scale

LSTM: 21.8
Hoang et al. (2018): 59.2
GPT-2 117M: 45.9
GPT-2 345M: 55.5
GPT-2 762M: 60.1
GPT-2 1542M: 63.2

Radford et al. 2019

This Talk: Conditional Generation with Pretraining

Practical question: how can we use language models to improve the quality of conditional generation tasks?

Peters et al. 2018, Devlin et al. 2018, Radford et al. 2018

Overview: Motivation, Current and Classical Approaches, Models, Experiments, Challenges

Notation: Conditional Generation
- Pretrained NN module
- Randomly initialized NN module
- Conditioning object
- Generated text

Notation: Using a Pretrained Language Model
(Diagram: Pretrained Model, Conditional Model, Reverse Model)

Approach 0: Backtranslation
- Incorporate additional data to approximate the joint by heuristic alternating projection.
- Dominant approach in NMT.
- Does not require any pretraining.
(Diagram: Conditional Model, Reverse Model)
Sennrich et al. 2015
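As a rough illustration of the alternating scheme, the sketch below synthesizes pseudo-parallel pairs with a reverse model and retrains the forward conditional model on them. The callables `reverse_generate` and `train_conditional` are hypothetical stand-ins, not code from the talk.

```python
# Minimal sketch of one back-translation round, assuming target-only text and a
# reverse (target -> source) generator are available. All names are illustrative.
from typing import Callable, List, Tuple

def backtranslate(targets: List[str],
                  reverse_generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Synthesize (source, target) pairs from target-only data."""
    return [(reverse_generate(y), y) for y in targets]

def training_round(parallel: List[Tuple[str, str]],
                   targets: List[str],
                   reverse_generate: Callable[[str], str],
                   train_conditional: Callable[[List[Tuple[str, str]]], None]) -> None:
    # Alternating projection: fix the reverse model, synthesize sources, then
    # fit the forward conditional model on real plus synthetic pairs.
    synthetic = backtranslate(targets, reverse_generate)
    train_conditional(parallel + synthetic)
```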

Backtranslation: Challenges
- Requires a reverse model for the input modality.
- Requires access to the pretraining dataset.
- Computationally wasteful.
(Diagram: Conditional Model, Reverse Model)

Approach 1: Noisy Channel / Bayes’ Rule
- Dominant approach in statistical machine translation.
- Does not require a conditional model.
(Diagram: Pretrained Model, Reverse Model)
Yu et al. 2017
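For reference, the noisy-channel factorization scores a candidate output y with a reverse (channel) model while the pretrained language model acts as the prior:

```latex
\hat{y} \;=\; \arg\max_{y} \; p(y \mid x)
       \;=\; \arg\max_{y} \; p(x \mid y)\, p(y)
```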

Neural Noisy Channel
- Construct the model to facilitate approximate inference.
Yu et al. 2017

Noisy Channel: Challenges
- Requires a generative model for the input modality.
- Challenging MAP inference problem when using a deep model.
- Distributions are often uncalibrated.
(Diagram: Pretrained Model, Reverse Model)
Yu et al. 2017

Approach 2: Simple Fusion
- Assume access to the logit representation (pre-softmax).
- Learn to smooth between the conditional model and the pretrained model.
- Several other variants: cold fusion, shallow fusion, deep fusion.
(Diagram: Fused Softmax over Conditional Model and Pretrained Model)
Gulcehre et al. 2015, Stahlberg et al. 2018
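A minimal sketch of one fused-softmax variant, assuming both models expose pre-softmax logits over a shared vocabulary; this illustrates the general idea rather than the exact formulation from any of the cited papers.

```python
# Sketch of a fused softmax: combine a frozen pretrained LM with the trainable
# conditional model at the output layer. Shapes are [batch, vocab].
import torch
import torch.nn.functional as F

def fused_log_probs(cond_logits: torch.Tensor,
                    lm_logits: torch.Tensor) -> torch.Tensor:
    # Add the frozen LM's log-probabilities to the conditional logits and
    # renormalize; detach() keeps the pretrained LM from being updated.
    lm_log_probs = F.log_softmax(lm_logits, dim=-1).detach()
    return F.log_softmax(cond_logits + lm_log_probs, dim=-1)
```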

Fusion: Challenges
- Conditional model has no access to pretraining.
- Conditional model must relearn aspects of language generation already learned in the pretrained model.
(Diagram: Fused Softmax over Conditional Model and Pretrained Model)
Gulcehre et al. 2015, Stahlberg et al. 2018

Approach 3: Representation Learning / Pretraining
- Utilize the variable-length representation from the model (“embeddings”).
- Dominant approach in NLU applications (BERT/ELMo).
(Diagram: Conditional Model, Pretrained Model)
Ramachandran et al. 2017, Edunov et al. 2019

Representation Learning: Challenges
- Empirically less effective than simpler fusion approaches.
- Little success (even with word embeddings) for conditional generation tasks.
(Diagram: Conditional Model, Pretrained Model)
Ramachandran et al. 2017, Edunov et al. 2019

Lessons: Pretraining for Generation
- Simple fusion-based approaches seem most robust.
- Approaches requiring reverse models seem intractable.
- Backtranslation likely infeasible for generation.
- Deep pretraining seems to be the most interesting, but ...
Edunov et al. 2019

Approach 4: Zero-Shot Generation
- Fake conditioning by prepending the source with a special control word (“TL;DR”).
- Produces surprisingly good outputs for a simple trick.
(Diagram: Pretrained Model with “TL;DR” control word)
Radford et al. 2019
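A minimal sketch of the trick with the publicly released GPT-2 via the Hugging Face transformers library (an assumption of this example, not the setup in the talk): the article is followed by the control string "TL;DR:" and the model simply continues.

```python
# Zero-shot "TL;DR" summarization sketch with an off-the-shelf GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "London, England (Reuters) -- Harry Potter star Daniel Radcliffe ..."
prompt = article + "\nTL;DR:"  # the control word fakes a summarization condition

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_k=40)
# Print only the continuation, i.e. the "summary".
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```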

Zero-Shot: Challenges
- Only works with textual inputs.
- Requires a combinatorial search to find the source.
- Seed word is problem specific.
(Diagram: Pretrained Model with “TL;DR” control word)
Radford et al. 2019

Overview: Motivation, Current and Classical Approaches, Models, Experiments, Challenges

Pretraining Models

Consider three different approaches to deep pretraining:
- Representation Learning: Repr-Transformer
- Combination through Context-Attn
- Pseudo-Self Attention

These differ in their usage of the source data.

Assumption: Self-Attention Models
(Diagram: a pretrained self-attention model as the Pretrained Model, extended into a transformer model as the Conditional Model)

Representation Learning: Repr-Transformer
- Utilize pretraining to provide contextual embeddings to a conditional transformer.
- Transformer used as a “conditional head” on the pretrained LM.
(Layer norm and residual connections omitted)
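A minimal sketch of this arrangement, under the assumption that the pretrained LM returns contextual hidden states for the target prefix and a small randomly initialized decoder head attends to the source encoding on top of them; module names here are illustrative, not the talk's implementation.

```python
# Repr-Transformer sketch: the pretrained LM supplies contextual embeddings,
# a randomly initialized transformer decoder acts as the conditional head.
import torch.nn as nn

class ReprTransformerHead(nn.Module):
    def __init__(self, pretrained_lm: nn.Module, d_model: int,
                 n_layers: int = 2, n_heads: int = 8, vocab_size: int = 50257):
        super().__init__()
        self.pretrained_lm = pretrained_lm  # assumed to return hidden states
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.TransformerDecoder(layer, n_layers)  # randomly initialized
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_prefix_ids, src_encoding, tgt_mask=None):
        tgt_repr = self.pretrained_lm(tgt_prefix_ids)        # contextual embeddings
        h = self.head(tgt_repr, src_encoding, tgt_mask=tgt_mask)
        return self.out(h)                                   # next-token logits
```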

Intuition

Context-Attn
- Assume that the pretrained model has the same form as the head.
- Can initialize the conditional transformer with the self-attention and feed-forward layers.
(Layer norm and residual connections omitted)
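A rough sketch of one decoder block under this scheme: the self-attention and feed-forward sub-layers are copied from the pretrained block (the attribute names below are assumptions), while a new context attention over the source is inserted with random initialization; layer norm is omitted to mirror the slide.

```python
# Context-Attn sketch: reuse pretrained self-attention and feed-forward weights,
# insert a new randomly initialized cross-attention over the source encoding.
import torch.nn as nn

class ContextAttnBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, d_model: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = pretrained_block.self_attn   # pretrained (assumed attr)
        self.ff = pretrained_block.ff                  # pretrained (assumed attr)
        self.context_attn = nn.MultiheadAttention(d_model, n_heads,
                                                  batch_first=True)  # new

    def forward(self, x, src_encoding, attn_mask=None):
        x = x + self.self_attn(x, attn_mask=attn_mask)                # pretrained
        x = x + self.context_attn(x, src_encoding, src_encoding)[0]  # new
        return x + self.ff(x)                                         # pretrained
```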

Intuition

Pseudo-Self Attention
- Train a model to inject conditioning directly into the pretrained network.
- Learn to project the conditioning as additional attention keys.
(Layer norm and residual connections omitted)
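A minimal single-head sketch of the idea: learned projections map the source encoding into the pretrained layer's key/value space and are prepended so the target tokens attend over them with the original attention weights; the value-side projection and the exact shapes are assumptions of this illustration.

```python
# Pseudo-self attention sketch: the source is projected into additional
# key/value slots of an otherwise ordinary (pretrained) self-attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoSelfAttention(nn.Module):
    def __init__(self, d_model: int, d_src: int):
        super().__init__()
        # Pretrained projections (loaded from the LM in practice).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # New, randomly initialized projections for the conditioning.
        self.src_k = nn.Linear(d_src, d_model)
        self.src_v = nn.Linear(d_src, d_model)

    def forward(self, x: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # x: [batch, tgt_len, d_model], src: [batch, src_len, d_src]
        q = self.q_proj(x)
        k = torch.cat([self.src_k(src), self.k_proj(x)], dim=1)
        v = torch.cat([self.src_v(src), self.v_proj(x)], dim=1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # A causal mask over the target positions is still required in practice.
        return F.softmax(scores, dim=-1) @ v
```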

How do the methods differ?
Key idea: train models to preserve as much of the original weight structure as possible.

Overview: Motivation, Current and Classical Approaches, Models, Experiments, Challenges

Conditional Generation Tasks
- Task 1: Class-Conditional Generation
- Task 2: Document Summarization
- Task 3: Story Generation
- Task 4: Image Paragraph Captioning

Metrics:
- Perplexity (general quality of the language)
- Task-specific quality

Deep Pretraining for Adaptation: Three Approaches
- Repr-Trans
- Context-Attn
- Pseudo-Self

Task 1: Class-Conditional Generation (IMDB)

Positive movie review?

When I saw the preview of this film, I thought it was going to be a horrible movie. I was wrong. The film has some of the funniest and most escapist scenes I’ve seen in a long time. The acting is superb. The story is decent, but the direction and editing may have been a bit harsh at times.

10 million training tokens (tgt)

Task 2: Document Summarization (CNN/DM)London, England (reuters) – Harry Potter star DanielRadcliffe gains access to a reported 20 million fortune ashe turns 18 on monday, but he insists the money won’tcast a spell on him. Daniel Radcliffe as harry potter in“Harry Potter and the Order of the Phoenix” to thedisappointment of gossip columnists around the world ,the young actor says he has no plans to fritter his cashaway on fast cars , drink and celebrity parties . “ i do n’tplan to be one of those people who , as soon as they turn18 , suddenly buy themselves a massive sports carcollection Harry Potter star Daniel Radcliffe gets 20mfortune as he turns 18 monday. Youngactor says he has no plans to fritter his fortuneaway. 30 million training tokens (tgt)

Task 2b: TL;DR Summarization

Source: not necessarily my lucky day , but some kids this is how it went xxxxxxxxxxxxx was sitting out on the dock at a local lake with a friend sharing some beers . little boy aged 2-3 yrs old walks up with a wooden stick and starts poking at the water . it was windy out and the dock was moving , and sure enough the kid leans over just enough to topple head first into the water . i had already pulled my phone out and wallet out just in case i was to accidentally fall in so i went straight over and hopped in . saw his little hand reaching up and tossed him straight back onto the dock . walked him to his dad who didn ’ t speak any english and was very confused why i had his son soaking wet . left later that day and saw the kid back on the dock ! it blew my mind .

TL;DR: saved a 2 year old from drowning at a lake because i was drinking beers with a friend .

First-place system uses pretrained conditional generation.

Task 3: Story Generation (WritingPrompts)

Prompt: A portal to a fantasy-like land opens in the middle of New York City and exiles start coming through.

Story: Tannen blinked . Nothingness greeted him ; he was still dreaming of the massive portal before him . How long had it been ? Would it be . ? How long had it been since he saw it ? That was impossible , and yet , how did it end ? .

300 million training tokens (tgt)

Fan et al. 2018

Task 4: Image Paragraph Captioning

(All results use cross-entropy training. Reinforcement learning approaches perform better on this task.)

1 million training tokens (tgt)

Adapting in Low-Data Settings

Pretraining (1.8K): I fell in love with this film in 1985. It’s a quintessential short film that explores the everyday lives of the human condition. The main character of the movie is a man named Donald (Husband George). He buys a home and captures a great deal of information about the businessmen who live and work in his neighborhood.

No Pretraining (1.8K): “Set’s that I liked this movie. I have seen I remember the original movie is one of the music that it is great movie. I’ve seen this film and one of the whole movie is like this movie. It is so bad, I watched the top of this movie. i would see the movie was bad, I have seen it. This movie, it’s a TV main movie is about the plot, relaxing. I

Bigger Models?
- All experiments run with the smallest available GPT-2 (117M).
- A bigger model was recently released at 345M.

Concurrent Work
- Large-Scale Transfer Learning for Natural Language Generation (Golovanov et al., 2019): uses roughly the same model for dialogue tasks.

Overview: Motivation, Current and Classical Approaches, Models, Experiments, Future Challenges

Open Questions

(Figure: generation tasks arranged along an axis from more source-determined, low conditional entropy, to more abstractive, high conditional entropy)

- Pseudo-Self approach is well suited for open-ended conditional generation.
- Application to low conditional entropy tasks?

Conclusions
- Pseudo-self attention for general conditional generation with pretrained LMs.
- Strong automatic and human evaluation results across diverse long-form conditional generation tasks.
- Application to low conditional entropy tasks?
- Connection with source-side pretraining?
