Natural Language Processing - Stony Brook University

Transcription

CSE 634 Data Mining Concepts and Techniques
Professor: Anita Wasilewska
Natural Language Processing (from a past presentation)

References:
SBU CS Graduate Course: CSE 628 - Introduction to NLP (Professor Niranjan Balasubramanian)
Intro to Sentiment Analysis - https://www.youtube.com/watch?v=YYQNpjvvLE&t=490s
http://verbs.colorado.edu/
http://www.nltk.org/book/ch05.html
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian
https://web.stanford.edu/
courses/inf2a/slides/2007_inf2a_L13_slides.pdf
http://cl.indiana.edu/
http://www.eng.utah.edu/~cs5340/slides/viterbi.4.pdf
Coursera Course on Introduction to Natural Language Processing by Prof. Dragomir Radev

Overview:

Introduction to NLP

Introduction
Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages. In particular, it is concerned with how to program computers to fruitfully process large amounts of natural language data. Challenges in NLP frequently involve speech recognition, natural-language understanding, and natural-language generation. NLP is characterized as a hard problem in computer science, as human language is rarely precise or plainly spoken.

Introduction
What is Natural Language Processing? It is a field of computer science and computational linguistics. Let's take a look at a few interesting challenges. Understanding semantics requires applying your knowledge of the physical world: context is everything in NLP. Here are three examples.


Introduction
Human Understanding:
The sofa didn't fit through the door because it was too narrow.
The sofa didn't fit through the door because it was too wide.
(In the first sentence "it" refers to the door; in the second, to the sofa.)
Why do you think we are able to answer this but the computer wasn't? Watson demo: uemix.net/

How would you interpret this one?

Introduction
Human Understanding: Fountain water is not drinkable.
Computer Understanding: The fountain is not engaged in drinking water. The fountain is not going to drink the water.
Understand semantics: apply your knowledge of the physical world. Why do you think you were able to answer this but the computer wasn't?

Challenges:
- Variability
- Ambiguity
- Meaning is context dependent
- Requires background knowledge
Ref: CSE 628 - Introduction to NLP (Professor Niranjan Balasubramanian)
Image from: commons.wikimedia.org

Introduction
How does the communication context affect meaning?
What are the meanings of words, phrases, etc.?
How do words form phrases, and phrases sentences?
How do morphemes, i.e., sub-word units, form words?
How do phonemes, i.e., sound units, form pronunciations?
How are the speech sounds generated and perceived?

Some NLP applications
1. Spelling and grammar correction/detection (e.g., MS Word, Grammarly)
2. Machine translation (e.g., Google Translate, Bing Translate)
3. Opinion mining (e.g., extracting the sentiment of a demographic from blogs and social media)
4. Speech recognition and synthesis (e.g., Siri, Google Assistant, Amazon)

NLP Toolkits
Found around the web!
- Stanford NLP Pipeline (Java)
- spaCy (Python)
- NLTK (Python)
- Factorie and Mallet (Scala/Java)
- Apache OpenNLP (Java)
- GATE (Java)
Ref: CSE 628 - Introduction to NLP (Prof. Niranjan Balasubramanian)

Language Modelling

One-Hot Vectors
Machine learning algorithms work with numeric values, while NLP applications generate data in the form of words and sentences. One way to convert words into numeric values is one-hot vectors: we take all the words present in the dictionary and make vectors such that one index represents the word and all the rest are zeros.

How do we convert words to numbers? One-Hot Vectors
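To make this concrete, here is a minimal Python sketch of one-hot encoding, assuming a toy five-word vocabulary (not from the slides):

```python
# A minimal one-hot encoding sketch over a made-up toy vocabulary.
vocabulary = ["john", "likes", "movies", "mary", "too"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("movies"))  # [0, 0, 1, 0, 0]
```

Note that the dot product of any two distinct one-hot vectors is zero, which previews the similarity problem discussed on the next slide.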

Problem with One-Hot Vectors
- Machine learning algorithms using one-hot vectors are computationally expensive
- They do not consider the similarity between words

Bag-of-Words
Another approach to solve this problem is Bag of Words. In this approach, we take a document and find the frequencies of occurrence of words in it, and then these frequencies are fed into the machine learning algorithm.

Bag-of-Words
A bag of words is a collection of all the words that are present in the document along with their frequencies.
"John likes to watch movies. Mary likes movies too."
We then use the frequencies as features (values of attributes) in our machine learning algorithms.
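As a sketch, the bag-of-words counts for the slide's example sentence can be computed with Python's collections.Counter:

```python
from collections import Counter

# Bag-of-words counts for the example sentence from the slide.
text = "John likes to watch movies. Mary likes movies too."
tokens = text.lower().replace(".", "").split()  # crude tokenization
bag = Counter(tokens)
print(bag)
# Counter({'likes': 2, 'movies': 2, 'john': 1, 'to': 1,
#          'watch': 1, 'mary': 1, 'too': 1})
```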

Problem with Bag-of-Words
- Too simplistic
- Ignores the context of the word
- Loses the ordering of the words
For example, "My name is John" is the same as "Is my name John?"

Word2Vec Model
To remedy these problems, engineers at Google came up with the Word2Vec model. In this approach we represent words as vectors. If two vectors are similar, their dot product is high; as they move away from each other, their dot product decreases, until they are perpendicular to each other and their dot product is zero.

Word to Vectors
We represent every word in the form of a vector, as shown in the example below:
"CAT": (0.1, 0.5, 0.2)
"DOG": (0.4, 0.1, 0.5)
Ref: https://www.youtube.com/watch?v=YYQNpjvvLE&t=490s
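A quick check with the toy vectors above, using numpy purely for illustration:

```python
import numpy as np

cat = np.array([0.1, 0.5, 0.2])
dog = np.array([0.4, 0.1, 0.5])

# Dot product as a similarity score; higher means more similar.
print(np.dot(cat, dog))  # 0.19
# Cosine similarity normalizes away vector length.
print(np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog)))  # about 0.535
```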

Word2Vec - Creation
Find the number of occurrences where the words DOG and CAT occur together.
Ref: https://www.youtube.com/watch?v=YYQNpjvvLE&t=490s

Word2Vec - Creation
Let "CAT" = u and "DOG" = v. Find values of u and v such that u^T v is approximately equal to 5, the number of times the two words occur together:
u^T v = u_1 v_1 + u_2 v_2 + u_3 v_3 ≈ 5
Ref: https://www.tensorflow.org/tutorials/word2vec

Word2Vec - Another Way
Another way to visualize this problem is through matrix multiplication. We put the vectors for all the words in a matrix X and take its transpose X'. When we multiply X with X', we should approximately get the co-occurrence matrix M.
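Below is a minimal sketch (not from the slides) of learning such an X by gradient descent with numpy; the 3x3 co-occurrence matrix M and the 2-dimensional word vectors are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric 3-word co-occurrence counts, standing in for M.
M = np.array([[10.0, 5.0, 2.0],
              [5.0,  8.0, 1.0],
              [2.0,  1.0, 6.0]])

X = rng.normal(scale=0.1, size=(3, 2))  # one 2-dimensional vector per word

learning_rate = 0.001
for _ in range(5000):
    error = X @ X.T - M        # residual of the current approximation
    grad = 4 * error @ X       # gradient of ||X X^T - M||_F^2 w.r.t. X
    X -= learning_rate * grad

print(np.round(X @ X.T, 1))   # roughly recovers M, up to the rank-2 limit
```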

Word2Vec - Last Remarks
1. Instead of using the raw frequency of two words occurring together in the matrix M, we take the logarithm of the frequency. This helps with very common words like "the", "a", "and", etc.
2. The biggest problem with Word2Vec is that it cannot handle new or out-of-vocabulary (OOV) words. If the model has not encountered a word before, it has no idea how to interpret it or how to build a vector for it, and one is forced to use a random vector.

POS Tagging

POS Tagging
The process of classifying words into their parts of speech and labeling them accordingly. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Words from the same part of speech tend to behave in a similar way.
References:
http://www.nltk.org/book/ch05.html
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian

Parts of Speech in English
There are several POS tagsets. Most modern language processing of English uses the 45-tag Penn Treebank tagset (Marcus et al., 1993), as shown in the table. Other tagsets: the Brown corpus (87 POS tags) and the C5 tagset (61 POS tags).
References:
https://web.stanford.edu/~jurafsky/slp3/10.pdf
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian

POS Tagging
Input: a sequence of tokens (e.g., a tweet)
Output: an assignment of POS tags to each word in the input
There/EX are/VBP 70/CD children/NNS there/RB
References:
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian
https://web.stanford.edu/~jurafsky/slp3/10.pdf
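For instance, NLTK (mentioned earlier among the toolkits) ships a pre-trained tagger; a minimal usage sketch, assuming the relevant models have been downloaded:

```python
import nltk

# One-time model downloads may be needed:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("There are 70 children there")
print(nltk.pos_tag(tokens))
# Expected (tags may vary slightly with the tagger version):
# [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'),
#  ('children', 'NNS'), ('there', 'RB')]
```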

Benefits of POS tagging
- Succinctly characterizes the context of a word
- Helps in recognizing similarities and differences between words
- Text-to-speech applications: e.g., the pronunciation of "lead"
References:
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian
courses/inf2a/slides/2007_inf2a_L13_slides.pdf

Challenges
The same word can have different POS tags when used in different contexts:
- "book that flight": verb
- "hand me that book": noun
One needs to understand the meaning of the sentence before assigning POS tags: difficult. Unknown/new words cannot be looked up, so the tag has to be guessed.
References:
Part-of-Speech Tagging: CSE 628, Niranjan Balasubramanian
https://web.stanford.edu/~jurafsky/slp3/10.pdf
http://cl.indiana.edu/~md7/13/545/slides/06-pos/06-pos.pdf

Rule-based POS Tagging
Depends on a dictionary that provides the possible POS tags for each word; rules can also be learned from training data. Ambiguity can be removed using manually developed rules. Example rule:
if the preceding word is ART: disambiguate {NOUN, VERB} as NOUN
References:
http://www.eng.utah.edu/~cs5340/slides/viterbi.4.pdf
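A minimal sketch of this rule in Python, with a hypothetical tag dictionary:

```python
# Hypothetical dictionary mapping words to their possible tags.
POSSIBLE_TAGS = {"the": {"ART"}, "book": {"NOUN", "VERB"}, "flight": {"NOUN"}}

def disambiguate(prev_tag, candidates):
    # The slide's rule: after an article, read {NOUN, VERB} as NOUN.
    if prev_tag == "ART" and candidates == {"NOUN", "VERB"}:
        return "NOUN"
    return next(iter(candidates))  # no applicable rule: pick any candidate

print(disambiguate("ART", POSSIBLE_TAGS["book"]))  # NOUN
```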

Statistical POS Tagging
Involves selecting the most likely sequence of tags T_1 ... T_n for the words w_1 ... w_n. We need to calculate P(T_1 ... T_n | w_1 ... w_n). Using Bayes' rule, this is equal to:
P(T_1 ... T_n | w_1 ... w_n) = P(T_1 ... T_n) P(w_1 ... w_n | T_1 ... T_n) / P(w_1 ... w_n)
Calculating this probability directly requires a lot of data, so we approximate it by assuming independence, using part-of-speech tag bigrams and lexical generation probabilities.
References:
http://www.eng.utah.edu/~cs5340/slides/viterbi.4.pdf
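The standard way to find the most likely tag sequence under these bigram and lexical-generation approximations is the Viterbi algorithm (as in the reference above). Here is a minimal sketch with made-up toy probabilities; real systems estimate them from a tagged corpus:

```python
def viterbi(words, tags, trans, emit, start):
    """Most likely tag sequence under a bigram HMM (toy, unsmoothed)."""
    # best[i][t]: probability of the best sequence for words[:i+1] ending in t
    best = [{t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev, p = max(((s, best[-1][s] * trans.get((s, t), 0.0)) for s in tags),
                          key=lambda x: x[1])
            scores[t] = p * emit.get((t, word), 0.0)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))

# Toy parameters, purely illustrative.
tags = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
         ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2}
emit = {("NOUN", "time"): 0.9, ("VERB", "time"): 0.1,
        ("NOUN", "flies"): 0.4, ("VERB", "flies"): 0.6}
print(viterbi(["time", "flies"], tags, trans, emit, start))  # ['NOUN', 'VERB']
```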

Parsing

Parsing Programming Languages

Parsing Human Language
Rather different from parsing computer languages:
- No types for words
- No brackets around phrases
- Ambiguity: words, parses
- Implied information

Syntactic Ambiguity
- PP attachment: "I saw the man with the telescope"
- Gaps: "Mary likes Physics but hates Chemistry"
- Coordination scope: "Small boys and girls are playing"
- Gerund vs. adjective: "Frightening kids can cause trouble"

The Parsing Problem
Parsing means associating tree structures with a sentence, given a grammar (often a context-free grammar):
- There may be exactly one such tree structure
- There may be many such structures
- There may be none
Grammars (e.g., CFGs) are declarative: they do not specify how the parse tree will be constructed.

Constituency Parsing

Dependency Parsing

Applications of Parsing
Constituency parsing:
- Grammar checking: "I want to return this shoes"
- Machine translation: e.g., word order, SVO vs. SOV
Dependency parsing:
- Question answering: "How many people in sales make 40K or more per year?"
- Information extraction: "Breaking Bad takes place in New Mexico."

Probabilistic CFG
Some trees (derivations or parses) are more likely than others, because some rules are more frequent than others. We want:
argmax_Tree Pr(Tree | Sentence)

Probabilistic CFG: CFG Probabilities

What are some ways to parse, given a CFG?
- Top-down parsing
- Bottom-up parsing
- Dynamic programming:
  - CYK (or CKY) parsing [bottom-up]
  - Earley algorithm [top-down]

CYK Parsing
The worst-case complexity is Θ(n³ · |G|).
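A minimal CYK recognizer sketch for a toy grammar in Chomsky Normal Form (the grammar and sentence are illustrative, not from the slides):

```python
from itertools import product

# Toy CNF grammar: binary rules (A -> B C) and lexical rules (A -> word).
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cyk(words):
    """Return True if the toy grammar derives the sentence from S."""
    n = len(words)
    # table[i][j]: set of non-terminals deriving words[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL.get(word, set()))
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length - 1                # span end
            for k in range(i, j):             # split point
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return "S" in table[0][n - 1]

print(cyk("the dog saw the cat".split()))  # True
```

The three nested loops over span length, start, and split point give the n³ factor; checking the grammar rules for each split contributes the |G| factor.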

Sentiment Analysis in Facebook and its application to e-learning
Authors: Alvaro Ortigosa, Jose M. Martin, Rosa M. Carro
Department of Computer Science, Universidad Autonoma de Madrid, Madrid, Spain
Published in: Computers in Human Behavior, Volume 31, February 2014, Pages 527-541

Sentiment Analysis
Streams of text: customer reviews, social media posts, tweets, etc. The goal is to determine how people feel about a service or product, i.e., to identify the online mood (positive, negative, or indifferent), known as polarity. Examples:
"I love it" - positive
"It is a terrible movie" - negative

Sentiment Analysis
Accuracy is influenced by the context in which words are used. Example: "You must read the book" - positive or negative? The position of words in the text is also interesting to consider. Example: "This book is addictive, it can be read in one sitting, but I have to admit that it is rubbish." The presence of figures of speech such as irony and sarcasm also matters.

Objective
- To extract information about users' sentiment polarity
- To detect significant emotional changes
How? SentBuk, a Facebook application that retrieves messages written by users and classifies them according to their polarity.
Approach? Lexicon-based.

Extracting users' sentiment polarity
Raw text → sentences (list of strings) → tokenized sentences → chunked sentences

Extracting users' sentiment polarity
- Preprocessing: convert all words to lowercase
- Segmentation: the message is divided into sentences
- Tokenization I: tokens are extracted from each sentence, using just spaces
- Emoticon detection: the classifier searches for all emoticons from text files
- Tokenization II: all punctuation marks are considered as separators
- Interjection detection: interjections are intensified by repeating letters; a regular expression is used to detect them, e.g., "haha" vs. "hahahahahah" (see the sketch below)
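A minimal sketch of the emoticon and interjection detection steps; the emoticon lexicon and the regular expression are illustrative stand-ins, since the paper's actual resources are not reproduced here:

```python
import re

# Made-up emoticon lexicon standing in for SentBuk's text files.
EMOTICONS = {":)": 1, ":(": -1, ":D": 1}

def detect_emoticons(message):
    """Return the emoticons found in the message with their scores."""
    return [(e, score) for e, score in EMOTICONS.items() if e in message]

# Detect repeated-letter interjections such as "haha" / "hahahahahah".
INTERJECTION = re.compile(r"\b(?:ha)+h?\b", re.IGNORECASE)

def detect_interjections(message):
    return INTERJECTION.findall(message)

print(detect_emoticons("great day :)"))               # [(':)', 1)]
print(detect_interjections("hahahahahah so funny"))   # ['hahahahahah']
```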

Extracting users' sentiment polarity
- Token score assignment: +1 if the token transmits a positive sentiment, 0 for neutral, and -1 for negative
- Syntactical analysis: POS tagging is applied to discriminate words that do not reflect any sentiment (e.g., articles), and negations are detected (e.g., "do not like")
- Polarity calculation: the tokens that are susceptible to conveying sentiment according to their grammatical category are taken; the sum of their scores is divided by the number of candidate tokens to produce the score
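A minimal sketch of the polarity calculation described above; the lexicon, the set of sentiment-bearing tags, and the scoring are illustrative assumptions, not the paper's exact implementation:

```python
# Made-up sentiment lexicon and tag filter, for illustration only.
LEXICON = {"love": 1, "great": 1, "terrible": -1, "rubbish": -1}
SENTIMENT_TAGS = {"JJ", "VB", "NN", "RB"}  # tags allowed to carry sentiment

def polarity(tagged_tokens):
    """Average score over candidate tokens; tagged_tokens is (word, tag) pairs."""
    candidates = [w for w, tag in tagged_tokens if tag in SENTIMENT_TAGS]
    if not candidates:
        return 0.0
    scores = [LEXICON.get(w, 0) for w in candidates]
    return sum(scores) / len(candidates)

print(polarity([("it", "PRP"), ("is", "VBZ"), ("terrible", "JJ")]))  # -1.0
```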

Sentiment change detection
Collect other data related to users' actions that could give clues:
- Number of messages written (posts)
- Number of comments on messages
- Number of likes given to messages and comments
Find patterns over time: 1 day, 3-4 days, during the weekends. For example, if a user usually writes two or three messages per week and one week writes twenty messages, this may be a sign that something different is happening to him/her.

Results & Conclusion
Some messages classified as negative included irony or teasing and should not have been considered negative. Many messages classified as positive were greetings, which had a high influence on the results. The focus was therefore changed to messages users wrote on their own walls (posts).

Applications to e-learning
- Gather accurate sentiment/opinions from reviews of courses and professors
- Students can receive personalized advice about which educational activity to carry out next
- Motivational actions intended to encourage the students
