Introduction To Machine Learning For NLP I

Transcription

Introduction to Machine Learning for NLP I
Benjamin Roth, Nina Poerner, Marina Speranskaya
CIS LMU München

Outline

1 This Course
2 Why Machine Learning?
3 Machine Learning Definition
  - Data (Experience)
  - Tasks
  - Performance Measures
4 Deep Learning
5 Linear Regression: Overview and Cost Function
6 Summary

Course Overview

Foundations of machine learning
- loss functions
- linear regression
- logistic regression
- gradient-based optimization
- neural networks and backpropagation
Deep learning tools in Python
- Numpy
- Pytorch
- Keras
- (some) Tensorflow?
Architectures for NLP
- CNNs, RNNs, Self-Attention (Transformer)
Applications
- Word Embeddings
- Sentiment Analysis
- Relation extraction
- Practical project (NLP related, optional)

Lecture Times, Tutorials

Course homepage: dl-nlp.github.io
This is where exercise sheets and lecture slides are posted.
9-11 is supposed to be the lecture slot, and 11-12 the tutorial slot, but we will not stick to that allocation.
We will sometimes have longer Q&A-style/interactive "tutorial" sessions, sometimes more lectures (see next slide).
Tutor: Marina Speranskaya
- will discuss exercise sheets in the tutorials
- will help you with the project

Plan

10/16: 9-11: Overview / ML Intro I; 11-12: ML II; Ex. sheet: Reading: Linear algebra
10/23: 9-11: Linear algebra Q&A / ML II; 11-12: Numpy; Ex. sheet: Reading: Probability
10/30: 9-11: Probability Q&A / ML III; 11-12: Pytorch; Ex. sheet: Numpy
11/6: 9-11: Pytorch Intro; 11-12: Numpy Q&A; Ex. sheet: Pytorch
11/13: 9-11: Word2Vec; 11-12: –; Ex. sheet: Pytorch/Word2Vec
11/20: 9-11: RNNs, Pytorch Q&A; 11-12: Word2Vec Q&A; Ex. sheet: Reading: LSTM/GRU
11/27: 9-11: LSTM discussion; 11-12: Keras; Ex. sheet: Keras/Tagging
12/4: 9-11: CNN; 11-12: Attention / BERT; Ex. sheet: Keras/CNN
12/11: 9-11: Attention / BERT; 11-12: Keras/Tagging Q&A; Ex. sheet: –
12/18: 9-11: Project announcement; 11-12: Keras/CNN Q&A; Ex. sheet: –
1/8: 9-11: Exam; 11-12: –; Ex. sheet: –
1/15: 9-11: Regularization; 11-12: Help with projects; Ex. sheet: –
1/22: 9-11: Hyperparameters; 11-12: Help with projects; Ex. sheet: –
1/29: 9-11: Project Q&A; 11-12: Projects Q&A; Ex. sheet: –
2/5: 9-11: Project presentations; 11-12: presentations; Ex. sheet: –

Formalities

This class is graded by a written exam (Klausur) in the week after Christmas.
Additional bonus points can be earned by:
- Exercise sheets (before Christmas)
- Project and presentation (after Christmas)
If you got more than 50% of the possible bonus points, they count for up to 10% of the exam.
Formula:
g_final = min(M, g_exam + (M/10) * max(0, 2 * (g_bonus - 0.5)))
g_bonus = (1/3) * g_project + (2/3) * g_exercises
where M is the maximum possible number of points.
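The weighting of project vs. exercise points inside g_bonus is reconstructed from the slide and may differ; a minimal Python sketch under that assumption:

```python
def final_grade(g_exam, g_project, g_exercises, M):
    """Reconstructed bonus rule (the 1/3 vs. 2/3 split is an assumption)."""
    g_bonus = (1/3) * g_project + (2/3) * g_exercises   # fractions in [0, 1]
    bonus = (M / 10) * max(0.0, 2 * (g_bonus - 0.5))     # only counts above 50%
    return min(M, g_exam + bonus)

# Example: 90 exam points out of M = 100, 80% of all bonus points earned
print(final_grade(g_exam=90, g_project=0.8, g_exercises=0.8, M=100))  # -> 96.0
```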

Work load

6 ECTS, 14 weeks: average work load of about 13 hrs / week (3 in class, 10 at home)
- In the first weeks, spend enough time to read and prepare so that you are not lost later.
- From the beginning of November to Christmas: programming assignments. Coding takes time and can be frustrating (but rewarding)!

Exam

Will cover material from the lectures and reading assignments up to Christmas.
So even if you do not hand in the reading assignments, it is a good idea to read them.
Mostly conceptual questions, no code (no need to learn pytorch function names by heart!).

Exercise sheets

Optional (bonus points!)
Exercise sheets 1, 2 and 5 are reading assignments with questions.
The other sheets are programming exercises.
Format: jupyter notebooks
All exercise sheets contribute equally.

Project

Optional (bonus points!)
Project topic & data will be distributed before Christmas.
You should work in groups of 2 or 3.
All groups work on the same data.
You must hand in code.
Grading scheme TBA: probably a combination of results (good performance on the test set), code quality and creativity.
Optional project presentation in the last week before Easter (may give bonus points, TBA).

Good project code ...
- shows that you master the techniques taught in the lectures and exercises
- shows that you can make your own decisions: e.g. adapt the model / task / training data etc. if necessary
- is well-structured and easy to understand (telling variable names, meaningful modularization; avoid code duplication and dead code)
- is correct
- is within the scope of this lecture (time-wise it should not exceed 4 x 10 h)

A good project presentation ...
- is short (10 min. per team)
- is targeted at your fellow students, who do not know the details beforehand
- contains interesting stuff: unexpected observations? conclusions / recommendations? did you deviate from some common practice?
- demonstrates that all team members worked together on the project

Outline: 2 Why Machine Learning?

Machine Learning for NLP

What are the alternatives?
Advantages and disadvantages?
- Accuracy
- Coverage
- Resources required (data, expertise, human labour)
- Reliability/Robustness
- Explainability
Example of a hand-written, rule-based alternative (a small grammar fragment):
S → NP VP
VP → V NP
NP → Det NN

Outline: 3 Machine Learning Definition

A Definition

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell 1997)

A Definition

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell 1997)
Learning: attaining the ability to perform a task.
A set of examples ("experience") represents a more general task.
Examples are described by features: sets of numerical properties that can be represented as vectors x ∈ R^n.

Outline: 3 Machine Learning Definition / Data (Experience)

Data

"A computer program is said to learn from experience E [...], if its performance [...] improves with experience E."
Dataset: a collection of examples
Design matrix X ∈ R^(n x m)
- n: number of examples
- m: number of features
- Example: X_ij = count of feature j (e.g. a stem form) in document i, or intensity of the j-th pixel in image i
Unsupervised learning:
- Model X, or find interesting properties of X.
- Example: clustering (find groups of similar images/documents)
- Training data: only X.
Supervised learning:
- Predict specific additional properties from X.
- E.g. sentiment classification: predict the sentiment (1-5) of an Amazon review.
- Training data: label vector y ∈ R^n together with X.
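As a minimal illustration (not from the slides; the documents and ratings are made up), building a bag-of-words design matrix X and a label vector y with numpy:

```python
import numpy as np

docs = ["good great movie", "bad awful movie", "great acting"]
labels = [5, 1, 4]                    # e.g. star ratings (only needed for supervised learning)

vocab = sorted({w for d in docs for w in d.split()})   # one feature per word
X = np.zeros((len(docs), len(vocab)))                  # design matrix, n x m
for i, d in enumerate(docs):
    for w in d.split():
        X[i, vocab.index(w)] += 1     # X[i, j] = count of feature j in document i

y = np.array(labels, dtype=float)     # label vector y
print(vocab)
print(X)
```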

Outline: 3 Machine Learning Definition / Tasks

Machine Learning Tasks

"A computer program is said to learn [...] with respect to some class of tasks T [...] if its performance at tasks in T [...] improves [...]"
Types of tasks:
- Classification
- Regression
- Structured prediction
- Anomaly detection
- Synthesis and sampling
- Imputation of missing values
- Denoising
- Clustering
- Reinforcement learning
- ...

Task: Classification

Which of k classes does an example belong to?
f : R^n → {1, ..., k}
Typical example: categorize image patches
- Feature vector: color intensities for each pixel; derived features.
- Output categories: a predefined set of labels
Typical example: spam classification
- Feature vector: high-dimensional, sparse vector. Each dimension indicates the occurrence of a particular word, or other email-specific information.
- Output categories: "spam" vs. "ham"
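A minimal sketch of such a classifier (the vocabulary and weights are made-up toy values, not a trained model):

```python
import numpy as np

vocab = ["viagra", "free", "meeting", "project"]
w = np.array([[ 2.0,  1.5, -1.0, -1.5],        # scores for class "spam"
              [-2.0, -1.5,  1.0,  1.5]])       # scores for class "ham"

def classify(email):
    x = np.array([email.split().count(t) for t in vocab], dtype=float)  # sparse BoW features
    scores = w @ x                              # one score per class
    return ["spam", "ham"][int(np.argmax(scores))]

print(classify("free viagra free"))             # -> spam
print(classify("project meeting tomorrow"))     # -> ham
```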

Task: Classification

EMNLP 2017: Given a person name in a sentence that contains keywords related to police ("officer", "police", ...) and to killing ("killed", "shot"), was the person a civilian killed by police?

Task: Regression

Predict a numerical value given some input.
f : R^n → R
Typical examples:
- Predict the risk of an insurance customer.
- Predict the value of a stock.

Task: Regression

ACL 2017: Given a response in a multi-turn dialogue, predict a value (on a scale from 1 to 5) indicating how natural the response is.

Task: Structured Prediction

Predict a multi-valued output with special inter-dependencies and constraints.
Typical examples:
- Part-of-speech tagging
- Syntactic parsing
- Machine translation
Often involves search and problem-specific algorithms.

Task: Structured Prediction

ACL 2017: Jointly find all relations of interest in a sentence by tagging arguments and combining them.

Task: Reinforcement Learning

In reinforcement learning, the model (also called an agent) needs to select a series of actions, but only observes the outcome (reward) at the end.
The goal is to predict actions that will maximize the outcome.
EMNLP 2017: The computer negotiates with humans in natural language in order to maximize its points in a game.

Task: Anomaly Detection

Detect atypical items or events.
Common approach: estimate a density and identify items that have low probability.
Examples:
- Quality assurance
- Detection of criminal activity
Often, items categorized as outliers are sent to humans for further scrutiny.
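A minimal sketch of the density-based approach, assuming a simple one-dimensional Gaussian density estimate (real systems use richer density models):

```python
import numpy as np

x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 25.0])   # the last item is atypical
mu, sigma = x.mean(), x.std()

# Gaussian density estimate; items with low probability are flagged as outliers
density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
outliers = x[density < 0.01]
print(outliers)        # flagged for further (human) scrutiny
```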

Task: Anomaly Detection

ACL 2017: Schizophrenia patients can be detected by their non-standard use of metaphors and more extreme sentiment expressions.

Supervised and Unsupervised Learning

Unsupervised learning: learn interesting properties, such as the probability distribution p(x).
Supervised learning: learn a mapping from x to y, typically by estimating p(y | x).
Supervised learning in an unsupervised way:
p(y | x) = p(x, y) / Σ_y' p(x, y')
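A tiny numerical illustration of this identity, with an assumed joint distribution over two feature values and two labels:

```python
import numpy as np

# assumed joint probabilities p(x, y): rows index x, columns index y
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

# p(y | x) = p(x, y) / sum over y' of p(x, y')
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
print(p_y_given_x)     # each row sums to 1
```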

Outline: 3 Machine Learning Definition / Performance Measures

Performance Measures

"A computer program is said to learn [...] with respect to some [...] performance measure P, if its performance [...] as measured by P, improves [...]"
A quantitative measure of algorithm performance.
Task-specific.

Discrete vs. Continuous Loss Functions

Discrete loss functions:
- Accuracy (how many samples were correctly labeled?)
- Error rate (1 - accuracy)
- Precision / Recall (accuracy may be inappropriate for skewed label distributions, where the relevant category is rare)
- F1-score = 2 · Prec · Rec / (Prec + Rec)
Discrete loss functions cannot indicate how wrong a decision is.
They are not differentiable (hard to optimize).
Often, algorithms are optimized using a continuous loss (e.g. hinge loss) and evaluated using another loss (e.g. F1-score).
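A minimal sketch of these discrete measures for a made-up binary prediction (positive class = 1):

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])

accuracy = (y_true == y_pred).mean()
tp = ((y_pred == 1) & (y_true == 1)).sum()
precision = tp / (y_pred == 1).sum()
recall = tp / (y_true == 1).sum()
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)   # 0.75, 0.667, 0.667, 0.667
```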

Examples for Continuous Loss Functions

Squared error (regression): (y - f(x))^2
Hinge loss (classification): max(0, 1 - f(x) · y), assuming y ∈ {-1, 1}.
These loss functions are differentiable, so we can use them for gradient descent (more on that later).
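A minimal sketch of both losses as Python functions (toy inputs; y ∈ {-1, 1} for the hinge loss, as on the slide):

```python
def squared_error(y, y_hat):
    return (y - y_hat) ** 2            # regression loss

def hinge_loss(y, f_x):
    return max(0.0, 1.0 - f_x * y)     # classification loss, y in {-1, +1}

print(squared_error(3.0, 2.5))         # 0.25
print(hinge_loss(+1, 0.3))             # 0.7  (correct side, but inside the margin)
print(hinge_loss(-1, 0.3))             # 1.3  (wrong side of the decision boundary)
```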

Outline: 4 Deep Learning

Deep Learning

Learn complex functions that are (recursively) composed of simpler functions.
Many parameters have to be estimated.
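A minimal sketch of this composition idea with made-up toy values: two simple functions (an affine map followed by a ReLU nonlinearity) are composed into a small two-layer network:

```python
import numpy as np

def layer(W, b):
    return lambda h: np.maximum(0, W @ h + b)     # simple function: affine map + ReLU

np.random.seed(0)
f1 = layer(np.random.randn(4, 3), np.zeros(4))    # parameters that would have to be estimated
f2 = layer(np.random.randn(2, 4), np.zeros(2))

x = np.array([1.0, -0.5, 2.0])
print(f2(f1(x)))                                  # the composed function f2(f1(x))
```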

Deep Learning

Main advantage: feature learning
- Models learn to capture the most essential properties of the data (according to some performance measure) as intermediate representations.
- No need to hand-craft feature extraction algorithms.

Neural Networks

The first training methods for deep nonlinear NNs appeared in the 1960s (Ivakhnenko and others).
Increasing interest in NN technology (again) for roughly the last ten years ("Neural Network Renaissance"): orders of magnitude more data and faster computers now.
Many successes:
- Image recognition and captioning
- Speech recognition
- NLP and machine translation
- Game playing (AlphaGo)
- ...

Machine Learning

Deep learning builds on general machine learning concepts:
argmin_{θ ∈ H} Σ_{i=1}^m L(f(x_i; θ), y_i)
Fitting the data vs. generalizing from it.
[Figure: three scatter plots over a feature axis, illustrating fitting the data vs. generalizing from it]

Outline: 5 Linear Regression: Overview and Cost Function

Linear Regression

For one instance:
- Input: vector x ∈ R^n
- Output: scalar y ∈ R (actual output: y; predicted output: ŷ)
Linear function:
ŷ = w^T x = Σ_{j=1}^n w_j x_j

Linear Regression

Linear function:
ŷ = w^T x = Σ_{j=1}^n w_j x_j
Parameter vector w ∈ R^n
Weight w_j decides whether the value of feature x_j increases or decreases the prediction ŷ.

Linear Regression

For the whole data set:
- Use the matrix X and the vector y to stack instances on top of each other.
- Typically the first column contains all 1s for the intercept (bias, shift) term.

X = [ 1  x_12  x_13  ...  x_1n ]        y = [ y_1 ]
    [ 1  x_22  x_23  ...  x_2n ]            [ y_2 ]
    [ ...                      ]            [ ... ]
    [ 1  x_m2  x_m3  ...  x_mn ]            [ y_m ]

For the entire data set, the predictions are stacked on top of each other:
ŷ = Xw
Estimate the parameters using X^(train) and y^(train).
Make high-level decisions (which features, ...) using X^(dev) and y^(dev).
Evaluate the resulting model using X^(test) and y^(test).
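A minimal numpy sketch (toy feature values) of prepending the all-ones intercept column and computing all predictions at once as ŷ = Xw:

```python
import numpy as np

X_raw = np.array([[0.2, 1.5],        # assumed toy feature values, one row per instance
                  [1.0, 0.3],
                  [0.7, 0.9]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # prepend the all-ones intercept column
w = np.array([0.5, 2.0, -1.0])                          # (intercept, w_2, w_3)

y_hat = X @ w          # predictions for the whole data set: y_hat = Xw
print(y_hat)
```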

Simple Example: Housing Prices

Predict property prices (in 1K Euros) from just one feature: square feet of the property.

X = [ 1   450 ]        y = [  730 ]
    [ 1   900 ]            [ 1300 ]
    [ 1  1350 ]            [ 1700 ]

The prediction is:

ŷ = Xw = [ w_1 +  450 · w_2 ]
         [ w_1 +  900 · w_2 ]
         [ w_1 + 1350 · w_2 ]

w_1 will contain costs incurred in any property acquisition.
w_2 will contain the remaining average price per square foot.
The optimal parameters for the above case are:

w ≈ [ 273.3 ]        ŷ ≈ [  759.1 ]
    [  1.08 ]            [ 1245.1 ]
                         [ 1731.1 ]
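The parameters on the slide can be reproduced (up to rounding) with an ordinary least-squares solve; a minimal numpy sketch, where np.linalg.lstsq is just one possible way to obtain them:

```python
import numpy as np

X = np.array([[1.,  450.],
              [1.,  900.],
              [1., 1350.]])
y = np.array([730., 1300., 1700.])

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of (w_1, w_2)
print(w)          # roughly [273.3, 1.08]
print(X @ w)      # predictions y_hat = Xw, close to the slide's values
```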

Linear Regression: Mean Squared Error

The mean squared error of a training (or test) data set is the average of the squared differences between the predictions and the labels of all m instances:
MSE^(train) = (1/m) Σ_{i=1}^m (ŷ_i^(train) - y_i^(train))^2
In matrix notation:
MSE^(train) = (1/m) ||ŷ^(train) - y^(train)||_2^2 = (1/m) ||X^(train) w - y^(train)||_2^2
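A minimal numpy check (continuing the housing example) that the element-wise form and the matrix form of the MSE give the same value:

```python
import numpy as np

X = np.array([[1., 450.], [1., 900.], [1., 1350.]])
y = np.array([730., 1300., 1700.])
w = np.array([273.3, 1.08])

y_hat = X @ w
mse_elementwise = np.mean((y_hat - y) ** 2)
mse_matrix = np.sum((X @ w - y) ** 2) / len(y)     # (1/m) * ||Xw - y||^2
print(mse_elementwise, mse_matrix)                  # identical values
```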

Outline: 6 Summary

Summary

Machine learning definition:
- Data
- Task
- Cost function
Machine learning tasks:
- Classification
- Regression
- ...
Deep learning:
- many successes in recent years
- feature learning instead of feature engineering
- builds on general machine learning concepts
Linear regression:
- output depends linearly on the input
- cost function: mean squared error
Next up: estimating the parameters
