Natural Language Processing - University of California, Berkeley

Transcription

Natural Language Processing. Info 159/259. Lecture 4: Text classification 3 (Jan 30, 2020). David Bamman, UC Berkeley.

Generative vs. Discriminative models. Generative models specify a joint distribution over the labels and the data; with this you could generate new data: P(X, Y) = P(Y) P(X | Y). Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes: P(Y | X).

Generative models. With generative models (e.g., Naive Bayes), we ultimately also care about P(Y | X), but we get there by modeling more:

P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_{y′ ∈ 𝒴} P(Y = y′) P(X = x | Y = y′)

Discriminative models focus on modeling P(Y | X), and only P(Y | X), directly.

Logistic regression

P(y = 1 | x, β) = 1 / (1 + exp(-Σ_{i=1}^F x_i β_i))

output space: Y = {0, 1}
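
A minimal numpy sketch of this probability (the sigmoid of the dot product of features and weights); the example feature values and weights below are the "not"/"bad"/"movie" numbers that appear later in the lecture:

    import numpy as np

    def p_y_given_x(x, beta):
        # P(y = 1 | x, beta) = 1 / (1 + exp(-sum_i x_i * beta_i))
        return 1.0 / (1.0 + np.exp(-np.dot(x, beta)))

    x = np.array([1.0, 1.0, 0.0])       # features: not=1, bad=1, movie=0
    beta = np.array([-0.5, -1.7, 0.3])  # one weight per feature
    print(p_y_given_x(x, beta))         # probability that y = 1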

Features. As a discriminative classifier, logistic regression doesn't assume features are independent the way Naive Bayes does. Its power partly comes from the ability to create richly expressive features without the burden of independence. We can represent text through features that are not just the identities of individual words, but any feature that is scoped over the entirety of the input. Example features: contains "like"; has a word that shows up in a positive sentiment dictionary; review begins with "I like"; at least 5 mentions of positive affectual verbs (like, love, etc.).

Logistic regression. We want to find the value of β that leads to the highest value of the log likelihood:

ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β)

L2 regularization. We can change the function we're trying to optimize by adding a penalty for values of β that are large:

ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β) - η Σ_{j=1}^F β_j²

We want the first term (the log likelihood) to be high, but the penalty term to be small. This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0. η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).

L1 regularization

ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β) - η Σ_{j=1}^F |β_j|

Again we want the first term to be high and the penalty term to be small. L1 regularization encourages coefficients to be exactly 0. η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
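
A minimal sketch of both penalized objectives; log_lik here is an assumed helper computing the unregularized log likelihood above, and eta is whatever penalty strength you tune on development data:

    import numpy as np

    def l2_objective(beta, log_lik, eta):
        # log likelihood minus an L2 penalty (sum of squared coefficients)
        return log_lik(beta) - eta * np.sum(beta ** 2)

    def l1_objective(beta, log_lik, eta):
        # log likelihood minus an L1 penalty (sum of absolute values);
        # pushes coefficients toward exactly 0
        return log_lik(beta) - eta * np.sum(np.abs(beta))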

With some L2 regularization: exp(β) represents the factor by which the odds P(y | x, β) / (1 - P(y | x, β)) change with a 1-unit increase in x. Features with the largest exp(β):

exp(β)  feature
2.17    Eddie Murphy
1.98    Tom Cruise
1.70    Tyler Perry
1.70    Michael Douglas
1.66    Robert Redford
1.66    Julia Roberts
1.64    Dance
1.63    Schwarzenegger
1.63    Lee Tergesen
1.62    Cher


History of NLP. Foundational insights, 1940s/1950s. Two camps (symbolic/stochastic), 1957-1970. Four paradigms (stochastic, logic-based, NLU, discourse modeling), 1970-1983. Empiricism and FSM (1983-1993). Field comes together (1994-1999). Machine learning (2000-today). Neural networks (~2014-today). [J&M 2008, ch. 1]

[Figure: frequency of "word embedding" in NLP papers by year (2012-2016) in ACL, EMNLP, NAACL, TACL, EACL, CoNLL, and CL; data from the ACL Anthology Network]

Neural networks in NLP. Language modeling [Mikolov et al. 2010]. Text classification [Kim 2014; Iyyer et al. 2015]. Syntactic parsing [Chen and Manning 2014; Dyer et al. 2015; Andor et al. 2016]. CCG supertagging [Lewis and Steedman 2014]. Machine translation [Cho et al. 2014; Sutskever et al. 2014]. Dialogue agents [Sordoni et al. 2015; Vinyals and Le 2015; Ji et al. 2016]. (For an overview, see Goldberg 2017, section 1.3.1.)

Neural networks. From discrete, high-dimensional representations of inputs (one-hot vectors) to low-dimensional "distributed" representations. From static representations to contextual representations, where representations of words are sensitive to local context. Non-linear interactions of input features. Multiple layers to capture hierarchical structure.

Neural network libraries

Logistic regression

P(ŷ = 1) = 1 / (1 + exp(-Σ_{i=1}^F x_i β_i))

feature  x   β
not      1   -0.5
bad      1   -1.7
movie    0    0.3

SGD. Calculate the derivative of some loss function with respect to parameters we can change, and update accordingly to make predictions on training data a little less wrong next time.
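
A minimal sketch of one such update for logistic regression, assuming a single training pair (x, y) and an illustrative learning rate lr; for this model the gradient of the log likelihood with respect to β is (y - ŷ) x:

    import numpy as np

    def sgd_step(beta, x, y, lr=0.1):
        # predict, then nudge the weights so this example is a little less wrong next time
        y_hat = 1.0 / (1.0 + np.exp(-np.dot(x, beta)))
        return beta + lr * (y - y_hat) * x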

Logistic regression, drawn as a network: each input x_i is connected directly to the output y by a weight β_i.

P(ŷ = 1) = 1 / (1 + exp(-Σ_{i=1}^F x_i β_i))

feature  x   β
not      1   -0.5
bad      1   -1.7
movie    0    0.3

Feedforward neural network. Input and output are mediated by at least one hidden layer. [Figure: inputs x_1, x_2, x_3 connected to hidden nodes h_1, h_2, which connect to the output y]

*For simplicity, we're leaving out the bias term, but assume most layers have them.


[Figure: inputs x_1, x_2, x_3, weights W, hidden nodes h_1, h_2, weights V, output y]

h_j = f(Σ_{i=1}^F x_i W_{i,j})

The hidden nodes are completely determined by the input and the weights.

For the first hidden node:

h_1 = f(Σ_{i=1}^F x_i W_{i,1})

Activation functions. Sigmoid:

σ(z) = 1 / (1 + exp(-z))

[Plot of σ(z) for z from -10 to 10; output ranges from 0 to 1]

Logistic regression

P(ŷ = 1) = 1 / (1 + exp(-Σ_{i=1}^F x_i β_i)) = σ(Σ_{i=1}^F x_i β_i)

We can think about logistic regression as a neural network with no hidden layers.

Activation functions. tanh:

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))

[Plot of tanh(z) for z from -10 to 10; output ranges from -1 to 1]

Activation functions. ReLU:

ReLU(z) = max(0, z)

[Plot of ReLU(z) for z from -10 to 10]

[Table of activation functions and their derivatives; Goldberg, p. 46] ReLU and tanh are both used extensively in modern systems. Sigmoid is useful for the final layer to scale output between 0 and 1, but is not often used in intermediate layers.
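
A minimal numpy sketch of these three activation functions:

    import numpy as np

    def sigmoid(z):
        # squashes any real value into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # squashes any real value into (-1, 1); also available directly as np.tanh
        return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

    def relu(z):
        # 0 for negative inputs, identity for positive inputs
        return np.maximum(0.0, z)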

[Figure: inputs x_1, x_2, x_3, weights W, hidden nodes h_1, h_2, weights V, output y]

h_1 = σ(Σ_{i=1}^F x_i W_{i,1})
h_2 = σ(Σ_{i=1}^F x_i W_{i,2})
ŷ = σ(V_1 h_1 + V_2 h_2)

Substituting, we can express ŷ as a function only of the input x and the weights W and V:

ŷ = σ(V_1 σ(Σ_i x_i W_{i,1}) + V_2 σ(Σ_i x_i W_{i,2}))
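
A minimal sketch of this forward pass for a network with 3 inputs and 2 hidden nodes; the random weight values are illustrative, not from the lecture:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W, V):
        h = sigmoid(x @ W)      # hidden layer: h_j = sigmoid(sum_i x_i W[i, j])
        return sigmoid(h @ V)   # output: y_hat = sigmoid(sum_j V_j h_j)

    x = np.array([1.0, 1.0, 0.0])     # 3 input features
    W = np.random.randn(3, 2) * 0.1   # input-to-hidden weights (illustrative)
    V = np.random.randn(2) * 0.1      # hidden-to-output weights (illustrative)
    print(forward(x, W, V))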

ŷ = σ(V_1 σ(Σ_i x_i W_{i,1}) + V_2 σ(Σ_i x_i W_{i,2}))

(the two inner terms are h_1 and h_2). This is hairy, but differentiable. Backpropagation: given training samples of (x, y) pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.

Neural networks are a series of functions chained together:

x → xW → σ(xW) → σ(xW)V → σ(σ(xW)V)

The loss is another function chained on top:

log(σ(σ(xW)V))

Chain rule. Let's take the likelihood for a single training example with label y = 1; we want this value to be as high as possible:

∂/∂V log(σ(σ(xW)V))

Writing h = σ(xW), the chain rule breaks this derivative into three factors:

∂ log(σ(hV)) / ∂V = [∂ log(σ(hV)) / ∂σ(hV)] × [∂σ(hV) / ∂(hV)] × [∂(hV) / ∂V] = A × B × C

Chain rule.

A = ∂ log(σ(hV)) / ∂σ(hV) = 1 / σ(hV)
B = ∂σ(hV) / ∂(hV) = σ(hV)(1 - σ(hV))
C = ∂(hV) / ∂V = h

A × B × C = (1 - σ(hV)) h = (1 - ŷ) h
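
A minimal sketch that checks this analytic gradient against a numerical one for a single example with label y = 1; all input and weight values are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 1.0, 0.0])
    W = np.array([[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]])
    V = np.array([0.7, -0.2])

    h = sigmoid(x @ W)       # hidden activations
    y_hat = sigmoid(h @ V)   # predicted probability that y = 1

    grad_V = (1.0 - y_hat) * h   # analytic gradient of log P(y = 1) with respect to V

    eps = 1e-6                   # numerical check on the first element of V
    V_bumped = V.copy(); V_bumped[0] += eps
    numeric = (np.log(sigmoid(h @ V_bumped)) - np.log(y_hat)) / eps
    print(grad_V[0], numeric)    # the two values should agree closely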

Neural networks. Tremendous flexibility in design choices (exchange feature engineering for model engineering). Articulate the model structure and use the chain rule to derive parameter updates.

Neural network structures. Output one real value. [Figure: x_1, x_2, x_3 → h_1, h_2 → a single output node y]

Neural network structures. Multiclass: output 3 values, with only one 1 in the training data. [Figure: x_1, x_2, x_3 → h_1, h_2 → three output nodes, exactly one of which has target 1]

Neural network structures. Output 3 values, with possibly several 1s in the training data (multilabel). [Figure: x_1, x_2, x_3 → h_1, h_2 → three output nodes, several of which can have target 1]
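
A minimal sketch of these three output structures on top of the same hidden layer h; the softmax for the multiclass case is a standard choice rather than something stated on the slide, and all values are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # turns scores into a probability distribution that sums to 1
        e = np.exp(z - np.max(z))
        return e / e.sum()

    h = np.array([0.3, 0.8])         # hidden layer (illustrative)
    V_single = np.random.randn(2)    # weights for one output
    V_three = np.random.randn(2, 3)  # weights for three outputs

    y_single = h @ V_single              # a single real value (wrap in sigmoid for a probability)
    y_multiclass = softmax(h @ V_three)  # three values summing to 1; exactly one correct class
    y_multilabel = sigmoid(h @ V_three)  # three independent values in (0, 1)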

Regularization. Increasing the number of parameters increases the possibility of overfitting to the training data.

Regularization. L2 regularization: penalize W and V for being too large. Dropout: when training on an (x, y) pair, randomly remove some nodes and weights. Early stopping: stop backpropagation before the training error gets too small.
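
A minimal sketch of a dropout mask applied to a hidden layer during training; the drop probability and activation values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p_drop=0.5):
        # zero out each hidden unit independently with probability p_drop (training time only)
        mask = rng.random(h.shape) >= p_drop
        return h * mask

    h = np.array([0.3, 0.8, 0.1, 0.9])
    print(dropout(h))   # some activations randomly zeroed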

Deeper networks. [Figure: inputs x_1, x_2, x_3 → first hidden layer (weights W_1) → second hidden layer (weights W_2) → output y (weights V)]

Densely connected layer: every input is connected to every hidden node through the weight matrix W.

h = σ(xW)

[Figure: inputs x_1 ... x_7 fully connected to hidden nodes h_1, h_2]

Convolutional networks. With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.

2D convolution [illustration omitted]

1D convolution with kernel K = [⅓, ⅓, ⅓]: each output value is the average of a window of three adjacent input values (a moving average). [Worked numerical example omitted]
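
A minimal numpy sketch of this moving-average convolution; the input sequence is made up for illustration:

    import numpy as np

    x = np.array([4.0, 0.0, 1.0, 3.0, 1.0, -1.0, 2.0])  # illustrative input sequence
    K = np.array([1/3, 1/3, 1/3])                        # averaging kernel

    # mode="valid" applies the kernel only where it fully overlaps the input,
    # so each output is the average of 3 consecutive input values
    print(np.convolve(x, K, mode="valid"))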

Convolutional networks. For the input "I hated it I really hated it" (tokens x_1 ... x_7), a filter of width 3 applied with a stride of 2 gives:

h_1 = f(I, hated, it)
h_2 = f(it, I, really)
h_3 = f(really, hated, it)

h_1 = σ(x_1 W_1 + x_2 W_2 + x_3 W_3)
h_2 = σ(x_3 W_1 + x_4 W_2 + x_5 W_3)
h_3 = σ(x_5 W_1 + x_6 W_2 + x_7 W_3)

Indicator vector. Every token is a V-dimensional vector (V = the size of the vocabulary) with a single 1 identifying the word. We'll get to distributed representations of words on 2/13. [Figure: one-hot vector over the vocabulary]

[Figure: indicator vectors for the tokens multiplied by the filter weights W_1, W_2, W_3]

h_1 = σ(x_1 W_1 + x_2 W_2 + x_3 W_3)
h_2 = σ(x_3 W_1 + x_4 W_2 + x_5 W_3)
h_3 = σ(x_5 W_1 + x_6 W_2 + x_7 W_3)

For indicator vectors, we're just adding these numbers together:

h_1 = σ(W_{1, x_1^id} + W_{2, x_2^id} + W_{3, x_3^id})

(where x_n^id specifies the location of the 1 in the vector, i.e., the vocabulary id)
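
A minimal sketch of that equivalence: multiplying a one-hot indicator vector by a weight vector just looks up one entry. The vocabulary size and weight values are illustrative:

    import numpy as np

    V = 5                        # illustrative vocabulary size
    W1 = np.random.randn(V)      # position-1 filter weights, one per vocabulary item

    word_id = 2                          # vocabulary id of the token in position 1
    x1 = np.zeros(V); x1[word_id] = 1.0  # its indicator (one-hot) vector

    print(np.dot(x1, W1), W1[word_id])   # full dot product vs. direct lookup: same number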

For dense input vectors (e.g., embeddings), we take the full dot product:

h_1 = σ(x_1 · W_1 + x_2 · W_2 + x_3 · W_3)

[Figure: dense embedding vectors multiplied by the filter weights]

Pooling. Down-samples a layer by selecting a single point from some set. Max-pooling selects the largest value. [Figure: max-pooling over groups of values]

Global pooling. Down-samples a layer by selecting a single point from some set. Max-pooling over time (global max pooling) selects the largest value over an entire sequence. Very common for NLP problems. [Figure: global max pooling over a sequence]
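
A minimal sketch of global max pooling over time; the matrix of per-position filter activations is illustrative:

    import numpy as np

    # rows = positions in the sequence, columns = filters (illustrative values)
    H = np.array([[0.7, 0.1],
                  [0.3, 0.9],
                  [0.2, 0.5]])

    # one number per filter: the largest activation anywhere in the sequence
    print(H.max(axis=0))   # -> [0.7 0.9]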

Convolutional networks. This defines one filter: a convolution over the input followed by max pooling, producing a single value. [Figure: x_1 ... x_7 → convolution → max pooling]

We can specify multiple filters; each filter is a separate set of parameters to be learned. [Figure: four filters W_a, W_b, W_c, W_d applied to the input x, so that h_1 = σ(xW) ∈ ℝ⁴]

Convolutional networks. With max pooling, we select a single number for each filter over all tokens (e.g., with 100 filters, the output of the max pooling stage is a 100-dimensional vector). If we specify multiple filters, we can also scope each filter over different window sizes.
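
A minimal end-to-end numpy sketch of this pipeline: filters of width 3 slide over token embeddings (stride 1 here for simplicity), and global max pooling keeps one number per filter. The embedding values, filter count, and dimensions are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n_tokens, emb_dim, n_filters, width = 7, 4, 100, 3
    X = np.random.randn(n_tokens, emb_dim)          # one embedding per token (illustrative)
    W = np.random.randn(n_filters, width, emb_dim)  # one width-by-emb_dim filter per output

    # convolution: apply every filter to every window of `width` consecutive tokens
    windows = np.stack([X[i:i + width] for i in range(n_tokens - width + 1)])
    conv = sigmoid(np.einsum("twd,fwd->tf", windows, W))   # shape (n_windows, n_filters)

    # global max pooling: one value per filter over the whole sequence
    features = conv.max(axis=0)
    print(features.shape)   # (100,): a 100-dimensional representation of the text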

Zhang and Wallace 2016, "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

CNN as important ngram detector. Higher-order ngrams are much more informative than unigrams alone (e.g., "i don't like this movie" vs. ["I", "don't", "like", "this", "movie"]). We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features. For scale: there are 1,074,921 unique ngrams (orders 1-4) in the Cornell movie review dataset.

259 project proposal, due 2/18. A final project involving 1 to 3 students on a topic in natural language processing -- either focusing on core NLP methods or using NLP in support of an empirical research question. Proposal (2 pages): outline the work you're going to undertake; motivate its rationale as an interesting question worth asking; assess its potential to contribute new knowledge by situating it within related literature in the scientific community (cite 5 relevant sources); and say who is on the team and what each of your responsibilities are (everyone gets the same grade). Feel free to come by my office hours and discuss your ideas!

Tuesday: Read Hovy and Spruit (2016) beforehand and come prepared to discuss!
