CS11-711 Advanced NLP Prompting

Transcription

CS11-711 Advanced NLP: Prompting (Encoder-Decoder Pre-training)
Graham Neubig
Site: https://phontron.com/class/anlp2021/
Most slides by Pengfei Liu

Recommended Reading

Four Paradigms of NLP Technical Development
- Feature Engineering
- Architecture Engineering
- Objective Engineering
- Prompt Engineering

Feature Engineering
- Paradigm: Fully Supervised Learning (Non-neural Network)
- Time Period: Most popular through 2015
- Characteristics:
  - Non-neural machine learning models mainly used
  - Require manually defined feature extraction
- Representative Work:
  - Manual features → linear or kernelized support vector machine (SVM)
  - Manual features → conditional random fields (CRF)

Architecture Engineering
- Paradigm: Fully Supervised Learning (Neural Networks)
- Time Period: About 2013-2018
- Characteristics:
  - Rely on neural networks
  - Do not need to manually define features, but must modify the network structure (e.g., LSTM vs. CNN)
  - Sometimes used pre-training of LMs, but often only for shallow features such as embeddings
- Representative Work:
  - CNN for Text Classification

Objective Engineering
- Paradigm: Pre-train, Fine-tune
- Time Period: 2017-Now
- Characteristics:
  - Pre-trained LMs (PLMs) used as initialization of the full model, for both shallow and deep features
  - Less work on architecture design, but engineering of objective functions
- Typical Work:
  - BERT → Fine-tuning

Prompt Engineering
- Paradigm: Pre-train, Prompt, Predict
- Date: 2019-Now
- Characteristics:
  - NLP tasks are modeled entirely by relying on LMs
  - The tasks of shallow and deep feature extraction, and prediction of the data, are all given to the LM
  - Engineering of prompts is required
- Representative Work:
  - GPT-3

What is Prompting?
Encouraging a pre-trained model to make particular predictions by providing a "prompt" specifying the task to be done.

What is the general workflow of Prompting?
- Prompt Addition
- Answer Prediction
- Answer-Label Mapping

Prompt Addition
Prompt Addition: Given input x, we transform it into prompt x' through two steps:
- Define a template with two slots: one for the input [x], and one for the answer [z]
- Fill in the input slot [x]

Example: Sentiment Classification
- Input: x = "I love this movie"
- Template: [x] Overall, it was a [z] movie.
- Prompting: x' = "I love this movie. Overall it was a [z] movie."
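As a concrete illustration (not from the slides), a minimal sketch of prompt addition as simple string templating; the TEMPLATE constant and add_prompt helper are hypothetical names:

    # Minimal sketch of prompt addition: a template with an input slot [x]
    # and an answer slot [z]; only the input slot is filled at this stage.
    TEMPLATE = "[x] Overall, it was a [z] movie."

    def add_prompt(x: str) -> str:
        """Transform raw input x into prompt x' by filling the [x] slot."""
        return TEMPLATE.replace("[x]", x)

    print(add_prompt("I love this movie."))
    # -> I love this movie. Overall, it was a [z] movie.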

Answer Prediction
Answer Prediction: Given a prompt, predict the answer [z] by filling in [z].

Example
- Input: x = "I love this movie"
- Template: [x] Overall, it was a [z] movie.
- Prompting: x' = "I love this movie. Overall it was a [z] movie."
- Predicting: x' = "I love this movie. Overall it was a fantastic movie."
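One way to fill [z] automatically, sketched under the assumption that a masked LM such as BERT is used via the Hugging Face fill-mask pipeline (the slides do not prescribe a specific toolkit):

    from transformers import pipeline

    # Replace the answer slot [z] with BERT's [MASK] token and let the
    # masked LM rank candidate fillers.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    prompt = "I love this movie. Overall it was a [z] movie.".replace("[z]", "[MASK]")

    for cand in fill(prompt, top_k=3):
        print(cand["token_str"], round(cand["score"], 3))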

Mapping
Mapping: Given an answer, map it into a class label.

Example
- Input: x = "I love this movie"
- Template: [x] Overall, it was a [z] movie.
- Prompting: x' = "I love this movie. Overall it was a [z] movie."
- Predicting: x' = "I love this movie. Overall it was a fantastic movie."
- Mapping: fantastic → Positive
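A minimal sketch of the mapping step, assuming a hand-written answer-to-label dictionary; the particular answer words are illustrative:

    # Map predicted answer words back to class labels with a hand-written
    # verbalizer; unknown answers fall through to a sentinel value.
    ANSWER_TO_LABEL = {
        "fantastic": "Positive", "great": "Positive", "interesting": "Positive",
        "boring": "Negative", "terrible": "Negative", "bad": "Negative",
    }

    def map_answer(z: str) -> str:
        return ANSWER_TO_LABEL.get(z.lower(), "Unknown")

    print(map_answer("fantastic"))  # -> Positive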

Types of Prompts
- Prompt: I love this movie. Overall it was a [z] movie
- Filled Prompt: I love this movie. Overall it was a boring movie
- Answered Prompt: I love this movie. Overall it was a fantastic movie
- Prefix Prompt: I love this movie. Overall this movie is [z]
- Cloze Prompt: I love this movie. Overall it was a [z] movie

Design Considerations for Prompting
- Pre-trained Model Choice
- Prompt Template Engineering
- Answer Engineering
- Expanding the Paradigm
- Prompt-based Training Strategies

Pre-trained Language Models: Popular Frameworks
- Left-to-Right LM
- Masked LM
- Prefix LM
- Encoder-Decoder

Left-to-right Language Model
- Characteristics:
  - First proposed by Markov (1913)
  - Count-based → neural network-based
  - Particularly suitable for very large-scale LMs
- Examples: GPT-1, GPT-2, GPT-3
- Roles in Prompting Methods:
  - The earliest architecture chosen for prompting
  - Usually equipped with a prefix prompt, and the parameters of the PLM are fixed
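A minimal sketch of prefix prompting with a fixed left-to-right LM; GPT-2 stands in here for the far larger GPT-3, and the pipeline usage is an assumption of this sketch:

    from transformers import pipeline

    # A left-to-right LM greedily continues a prefix prompt; no parameters
    # of the pre-trained model are updated.
    generate = pipeline("text-generation", model="gpt2")
    prompt = "I love this movie. Overall this movie is"
    out = generate(prompt, max_new_tokens=3, do_sample=False)
    print(out[0]["generated_text"])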

Masked Language Model
- Characteristics:
  - Unidirectional → bidirectional prediction
  - Suitable for NLU tasks
- Examples: BERT, ERNIE
- Roles in Prompting Methods:
  - Usually combined with cloze prompts
  - Suitable for NLU tasks, which should be reformulated into cloze tasks

Prefix Language Model
- Characteristics:
  - A combination of masked & left-to-right
  - Uses one Transformer with two different mask mechanisms to handle text X and Y separately
  - Corruption operations can be introduced when encoding X
- Examples: UniLM 1/2, ERNIE-M

Encoder-Decoder
- Characteristics:
  - A denoising auto-encoder
  - Uses two Transformers and two different mask mechanisms to handle text X and Y separately
  - Corruption operations can be introduced when encoding X
- Examples: BART, T5

Encoder-Decoder Pre-training Methods
Representative Methods:
- MASS
- BART (mBART)
- UniLM
- T5

MASS (Song et al.)
- Model: Transformer-based encoder-decoder
- Objective: only predict masked spans
- Data: WebText

BART (Lewis et al.)
- Framework: different corruption operations applied to the input
- Model: Transformer-based encoder-decoder
- Objective: reconstruct the (corrupted) original sentences
- Data: similar to RoBERTa (160GB): BookCorpus, CC-News, WebText, Stories

mBART (Liu et al.)
- Model: Transformer-based multilingual denoising auto-encoder
- Objective: reconstruct the (corrupted) original sentences
- Data: CC25 corpus (25 languages)

UniLM (Dong et al.)
- Model: prefix LM, left-to-right LM, masked LM
- Objective: three types of LM objectives with shared parameters
- Data: English Wikipedia and BookCorpus

T5 (Raffel et al.)
- Model: left-to-right LM, prefix LM, encoder-decoder
- Objective: systematically explore different objectives
- Data: C4 (750GB), Wikipedia, RealNews, WebText

Application of Prefix LM/Encoder-Decoders in Prompting
- Conditional Text Generation
  - Translation
  - Text Summarization
- Generation-like Tasks
  - Information Extraction
  - Question Answering

Design Considerations for Prompting: Prompt Template Engineering

Traditional Formulation vs. Prompt Formulation
Traditional:
- Input: x = "I love this movie"
- Predicting: y = Positive
Prompting:
- Input: x = "I love this movie"
- Template: [x] Overall, it was a [z] movie.
- Prompting: x' = "I love this movie. Overall it was a [z] movie."
- Predicting: x' = "I love this movie. Overall it was a fantastic movie."
- Mapping (answer → label): fantastic → Positive

This raises the question: how do we define a suitable prompt template?

Prompt Template Engineering
- How to define the shape of a prompt template?
- How to search for appropriate prompt templates?

Prompt Shape
- Cloze Prompt: a prompt with a slot [z] to fill in the middle of the text
  e.g., I love this movie. Overall it was a [z] movie
- Prefix Prompt: a prompt where the input text comes entirely before slot [z]
  e.g., I love this movie. Overall this movie is [z]

Design of Prompt Templates
- Hand-crafted: configure the manual template based on the characteristics of the task
- Automated search:
  - Search in discrete space
  - Search in continuous space

Representative Methods for Prompt Search
- Prompt Mining
- Prompt Paraphrasing
- Gradient-based Search
- Prompt/Prefix Tuning

Prompt Mining (Jiang et al. 2019)
Mine prompts given a set of questions/answers:
- Middle-word: "Barack Obama was born in Hawaii." → [X] was born in [Y].
- Dependency-based: "The capital of France is Paris." → capital of [X] is [Y].

Prompt Paraphrasing (Jiang et al. 2019)
Paraphrase an existing prompt to get other candidates, e.g., back-translation with beam search:
[X] shares a border with [Y]. → (en-de model) → (de-en model) →
- [X] has a common border with [Y].
- [X] adjoins [Y].
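A hedged sketch of back-translation paraphrasing; the Helsinki-NLP MarianMT checkpoints are one convenient choice, not necessarily what Jiang et al. used:

    from transformers import pipeline

    # Round-trip translate a seed prompt (en -> de -> en); beam search with
    # multiple return sequences yields paraphrase candidates.
    en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    seed = "[X] shares a border with [Y]."
    german = en_de(seed)[0]["translation_text"]
    for cand in de_en(german, num_beams=5, num_return_sequences=3):
        print(cand["translation_text"])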

Gradient-based Search (Shin et al. 2020)
Automatically optimize arbitrary prompts based on existing words.

Prefix/Prompt Tuning (Li and Liang 2021, Lester et al. 2021)
Optimize the embeddings of a prompt, instead of the words. "Prompt Tuning" optimizes only the embedding layer; "Prefix Tuning" optimizes the prefix of all layers.
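A minimal sketch of prompt tuning in the sense of Lester et al., assuming GPT-2 as the frozen backbone; only the soft prompt matrix receives gradients (prefix tuning would instead learn prefix activations at every layer):

    import torch
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    # Prompt tuning sketch: freeze the LM, learn k "virtual token" embeddings
    # that are prepended to the input embeddings.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    for p in model.parameters():
        p.requires_grad = False

    k, d = 20, model.config.n_embd
    soft_prompt = nn.Parameter(torch.randn(k, d) * 0.02)  # the only trained weights

    def loss_with_soft_prompt(input_ids, labels):
        tok_emb = model.transformer.wte(input_ids)              # (B, T, d)
        B = input_ids.size(0)
        emb = torch.cat([soft_prompt.expand(B, -1, -1), tok_emb], dim=1)
        pad = torch.full((B, k), -100, dtype=labels.dtype)      # no loss on prompt
        return model(inputs_embeds=emb, labels=torch.cat([pad, labels], dim=1)).loss

    # train with e.g. torch.optim.Adam([soft_prompt], lr=1e-3)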

Design Considerations for Prompting: Answer Engineering

Answer Engineering
Why do we need answer engineering? We have reformulated the task! We should also re-define the "ground truth labels".


Traditional Formulation vs. Prompt Formulation
- Label Space (Y): Positive, Negative
- Answer Space (Z): Interesting, Fantastic, Happy (→ Positive); Boring, 1-star (→ Negative)

Answer Engineering: Definition
Answer engineering aims to search for an answer space and a map to the original output Y that results in an effective predictive model.

Design of Prompt Answers
- How to define the shape of an answer?
- How to search for appropriate answers?

Answer Shape
- Token: answers can be one or more tokens in the pre-trained language model vocabulary
- Chunk: answers can be chunks of words made up of more than one token; usually used with cloze prompts
- Sentence: answers can be a sentence of arbitrary length; usually used with prefix prompts


Answer Search
- Hand-crafted:
  - Infinite answer space
  - Finite answer space
- Automated search:
  - Discrete space
  - Continuous space

Discrete Search Space
- Answer Paraphrasing: start with an initial answer space, then use paraphrasing to expand it
- Prune-then-Search: generate an initial pruned answer space of several plausible answers, then search over this pruned space to select a final set of answers (see the sketch below)
- Label Decomposition: decompose each relation label into its constituent words and use them as an answer, e.g., city of death → {person, city, death}
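A simplified sketch of the prune step in prune-then-search, assuming a masked LM proposes the initial plausible answers; a real system would then search this pruned space against labeled data:

    from transformers import pipeline

    # Prune: let a masked LM propose plausible answers for the slot; a real
    # system would then search this small space for the subset of answers
    # that works best on held-out labeled data.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    prompt = "I love this movie. Overall it was a [MASK] movie."
    pruned_answer_space = [c["token_str"] for c in fill(prompt, top_k=10)]
    print(pruned_answer_space)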

Continuous Search Space
Core idea: assign a virtual token to each class label and optimize the token embedding for each label.
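A minimal sketch of this idea in the spirit of WARP-style methods (named here as an assumption; the slide does not cite a specific method): one trainable vector per class, scored against the hidden state at the answer slot:

    import torch
    import torch.nn as nn

    # One trainable vector per class label, scored against the hidden state
    # at the answer slot of a frozen masked LM; only label_emb is optimized.
    hidden_size, num_classes = 768, 2  # e.g., BERT-base, {Positive, Negative}
    label_emb = nn.Parameter(torch.randn(num_classes, hidden_size) * 0.02)

    def classify(mask_hidden):            # (B, hidden_size) at the [MASK] position
        return mask_hidden @ label_emb.T  # (B, num_classes) logits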

Design Considerations for Prompting: Expanding the Paradigm

Multi-Prompt Learning
From a single prompt to multiple prompts:
- Prompt Ensembling
- Prompt Augmentation
- Prompt Composition
- Prompt Decomposition
- Prompt Sharing

Prompt Ensembling
Definition: using multiple unanswered prompts for an input at inference time to make predictions.
Advantages:
- Utilize complementary advantages
- Alleviate the cost of prompt engineering
- Stabilize performance on downstream tasks

Prompt Ensembling: Typical Methods
- Uniform averaging
- Weighted averaging
- Majority voting
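A minimal sketch of uniform averaging across cloze templates, assuming a BERT fill-mask pipeline; the templates and verbalizer words are illustrative, not from the slides:

    from transformers import pipeline

    # Uniform averaging: score the same verbalizer word under several cloze
    # templates and average the probabilities per label.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    templates = [
        "{x} Overall, it was a [MASK] movie.",
        "{x} All in all, the movie was [MASK].",
    ]
    verbalizer = {"Positive": "great", "Negative": "terrible"}

    def ensemble_scores(x):
        scores = {label: 0.0 for label in verbalizer}
        for t in templates:
            for label, word in verbalizer.items():
                out = fill(t.format(x=x), targets=word)
                scores[label] += out[0]["score"] / len(templates)
        return scores

    print(ensemble_scores("I love this movie."))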

Prompt Augmentation
Definition: help the model answer the prompt currently being answered by providing additional answered prompts.
Advantage: makes use of the small amount of annotated information available.
Core steps:
- Selection of answered prompts
- Ordering of answered prompts
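A minimal sketch of prompt augmentation as in-context demonstrations with GPT-2 (an assumption; the slides name no model); both which demos are chosen and their order affect results:

    from transformers import pipeline

    # Prepend answered prompts (demonstrations) so the LM can pattern-match;
    # demo selection and ordering both matter in practice.
    demos = [
        "Review: A waste of two hours. Overall it was a terrible movie.",
        "Review: An instant classic. Overall it was a fantastic movie.",
    ]
    query = "Review: I love this movie. Overall it was a"
    generate = pipeline("text-generation", model="gpt2")
    prompt = "\n".join(demos + [query])
    print(generate(prompt, max_new_tokens=2, do_sample=False)[0]["generated_text"])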

Design Considerations for Prompting: Prompt-based Training Strategies

Prompt-based Training Strategies
- Data perspective: how many training samples are used?
- Parameter perspective: whether/how are parameters updated?

Prompt-based Training: Data Perspective
- Zero-shot: without any explicit training of the LM for the downstream task
- Few-shot: few training samples (e.g., 1-100) of the downstream task
- Full-data: lots of training samples (e.g., 10K) of the downstream task

Prompt-based Training: Parameter Perspective

Strategy               | LM Params Tuned | Additional Prompt Params | Prompt Params Tuned | Example
Promptless Fine-tuning | Yes             | N/A                      | N/A                 | BERT
Fixed-LM Prompt Tuning | No              | Yes                      | Yes                 | Prefix Tuning
Fixed-prompt LM Tuning | Yes             | No                       | N/A                 | PET
Prompt+LM Fine-tuning  | Yes             | Yes                      | Yes                 | PADA

Too many, difficult to select?
- If you have a huge pre-trained language model (e.g., GPT-3) or only a few training samples: Tuning-free Prompting or Fixed-LM Prompt Tuning
- If you have lots of training samples: Promptless Fine-tuning, Fixed-prompt LM Tuning, or Prompt+LM Fine-tuning

Questions?
