CS 4032 Natural Language Processing


CS 4032
Natural Language Processing
Dr Budditha Hettige
Department of Computer Engineering
Faculty of Computing
General Sir John Kotelawala Defence University

Course details
Course Code: CS 40322
Course Title: NATURAL LANGUAGE PROCESSING
Course Type: Elective
Credits: 02
Theory Hours
Budditha Hettige (http://budditha.wordpress.com)

Course details
Assignment (30%)
Final Examination (70%)
References
  – Russell and Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2003
  – D. Jurafsky, J. H. Martin, Speech and Language Processing
  – S. Bird, E. Klein, Natural Language Processing with Python
  – Natural Language Toolkit, http://www.nltk.org/
Materials
  – ocessing/

Contents
Introduction
Speech Processing
  – Text-to-Speech
  – Speech Recognition
Words
  – Morphology
  – Part-of-Speech Tagging
  – Morphological Processing
Syntax
  – Word Classes
  – Context-Free Grammars
  – Parsing
  – Language and Complexity
Semantics
  – Representing Meaning
  – Semantic Analysis
  – Lexical Semantics, Word Sense Disambiguation and Information Retrieval
Pragmatics
  – Discourse, Dialogue and Conversational Agents
NLP Applications
  – Machine Translation System


Where does it fit in the CS taxonomy?
[Diagram: CS taxonomy tree; legible labels include Computers, Databases, Analysis, Semantics, Parsing]

What is Natural Language Processing?
Natural Language Processing (NLP) is the computational treatment of natural (human) languages
  – Natural Language Understanding
  – Natural Language Generation

What is Natural Language Processing?
Natural Language Processing
  – Process information contained in natural language text.
Also known as
  – Computational Linguistics (CL)
  – Human Language Technology (HLT)
  – Natural Language Engineering (NLE)
Can machines understand human language?

Why Study NLP?
A hallmark of human intelligence.
Text is the largest repository of human knowledge and is growing quickly.
  – emails, news articles, web pages, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, ...
Are we reading any faster than before?

Why are language technologies needed?
Many companies would make a lot of money if they could use computer programmes that understood text or speech. Just imagine if a computer could be used for:
  – Answering the phone, and replying to a question
  – Understanding the text on a Web page to decide who it might be of interest to
  – Translating a daily newspaper from Japanese to English (an attempt is made to do this already)
  – Understanding text in journals / books and building an expert system based on that understanding

Dreams?
NLP Applications
  – "Show me Star Trek." (Talk to your TV set)
  – Will my computer talk to me like another human?
  – Will the search engine get me exactly what I am looking for?
  – Can my PC read the whole newspaper and tell me the important news only?
  – Can my palmtop translate what that Japanese lady is telling me?
  – Can my PC do my English homework?

NLP Applications
Question answering
  – Who is the first Taiwanese president?
Text Categorization/Routing
  – e.g., customer e-mails.
Text Mining
  – Find everything that interacts with BRCA1.
Machine Translation
Language Teaching/Learning
  – Usage checking
Spelling correction
  – Is that just dictionary lookup?

Application areas
Text-to-Speech & Speech recognition
Natural Language Dialogue Interfaces to Databases
Information Retrieval
Information Extraction
Document Classification
Document Image Analysis
Automatic Summarization
Text Proofreading – Spelling & Grammar
Machine Translation
Story understanding systems
Plagiarism detection
Can you think of anything else?

Relevant Scientific Conferences
Association for Computational Linguistics (ACL)
North American Chapter of the Association for Computational Linguistics (NAACL)
International Conference on Computational Linguistics (COLING)
Empirical Methods in Natural Language Processing (EMNLP)
Conference on Computational Natural Language Learning (CoNLL)
International Association for Machine Translation (IAMT)

Early days
How to measure the intelligence of a machine?
Turing test – Alan Turing (1950)
  – A machine can be accepted as intelligent if it can fool a judge into believing that it is human in a tele-typing exercise.
ELIZA by Weizenbaum (1966)
  – Pretends to be a psychiatrist and converses with a user about his problems.
  – Uses keyword pattern matching
  – Many users thought the machine really understood their problem.
  – Many such systems exist now, e.g. Alan, Alice, David
Can such tests be taken as a measure of intelligence?
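Keyword pattern matching of the ELIZA kind can be sketched in a few lines. The rules and replies below are hypothetical examples written for this illustration; they are not taken from Weizenbaum's original script. The sketch shows how far simple pattern substitution goes without any understanding.

```python
import re

# Hypothetical keyword rules: (pattern, reply template).
# The first rule whose pattern matches the user's text wins.
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\b(mother|father|family)\b", re.I), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance: str) -> str:
    """Return a canned reply built from the first matching keyword rule."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured fragment back inside the template.
            return template.format(match.group(1).lower())
    return DEFAULT_REPLY


if __name__ == "__main__":
    for line in ["I am unhappy.", "My mother hates me", "Nice weather"]:
        print(line, "->", respond(line))
```

The program never models meaning: it only reflects fragments of the input, which is why fooling a user is a weak measure of intelligence.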

Early days
SHRDLU
  – Can understand natural language commands.
  – Developed by Terry Winograd at the MIT AI Lab (1968–70) using Lisp.
  – Works on a "Blocks World", a simulated environment in which blocks like coloured cubes, cylinders and pyramids can be moved around, placed over each other, etc.
  – Understands a bit of anaphora.
  – Memory to store history.
  – Successful demonstration of AI.

The problem
When people see text, they understand its meaning
When computers see text, they get only character strings (and perhaps HTML tags)
We'd like computer agents to see meanings and be able to intelligently process text
These desires have led to many proposals for structured, semantically marked up formats
But often human beings still resolutely make use of text in human languages

Knowledge of language needed
Phonetics and Phonology – The study of linguistic sounds.
Morphology – The study of the meaningful components of words.
Syntax – The study of the structural relationships between words.
Semantics – The study of meaning.
Pragmatics – The study of how language is used to accomplish goals.
Discourse – The study of linguistic units larger than a single utterance.

Why is NLP difficult?
Computers are not brains
  – There is evidence that much of language understanding is built into the human brain
Computers do not socialize
  – Much of language is about communicating with people
Key problems:
  – Representation of meaning
  – Language presupposes knowledge about the world
  – Language only reflects the surface of meaning
  – Language presupposes communication between people

Why is NLP difficult?
The hidden structure of language is highly ambiguous
Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)

Hidden Structure
English plural pronunciation
  – Toy + s → toyz; add z
  – Book + s → books; add s
  – Church + s → churchiz; add iz
  – Box + s → boxiz; add iz
  – Sheep + s → sheep; add nothing
What about new words?
  – Bach + 's → boxs; why not boxiz?
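The plural rule above depends on the final sound of the word, not on its spelling. A minimal sketch of that rule follows, assuming informal sound labels ("ch", "k", "oy" and so on) chosen for this illustration rather than a real phonetic alphabet; the label "x" is an assumed symbol for the final sound of "Bach".

```python
# Informal sound classes (assumptions for this sketch, not IPA).
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}   # hissing sounds: add "iz"
VOICELESS = {"p", "t", "k", "f", "th", "x"}     # "x": final sound of "Bach"
IRREGULAR = {"sheep"}                           # plural adds nothing


def plural_ending(word: str, final_sound: str) -> str:
    """Return the spoken plural ending for a word given its final sound."""
    if word.lower() in IRREGULAR:
        return ""
    if final_sound in SIBILANTS:
        return "iz"
    if final_sound in VOICELESS:
        return "s"
    return "z"  # vowels and voiced consonants


if __name__ == "__main__":
    examples = [("toy", "oy"), ("book", "k"), ("church", "ch"),
                ("box", "s"), ("sheep", "p"), ("Bach", "x")]
    for word, sound in examples:
        print(word, "+", repr(plural_ending(word, sound)))
```

"Box" is spelled with an x but ends in the sound s, so it takes "iz", while "Bach" ends in a non-sibilant voiceless sound and takes "s". Speakers apply this to new words without ever being taught the rule.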

Language subtleties
Adjective order and placement
  – A big black dog
  – A big black scary dog
  – A big scary dog
  – A scary big dog
  – A black big dog
Antonyms
  – Which sizes go together?
      – Big and little
      – Big and small
      – Large and small
      – Large and little

World Knowledge is subtle
He arrived at the lecture.
He chuckled at the lecture.
He arrived drunk.
He chuckled drunk.
He chuckled his way through the lecture.
He arrived his way through the lecture.

Words are ambiguous (have multiple meanings)
I know that.
I know that block.
I know that blocks the sun.
I know that block blocks the sun.

Challenges in NLP: Ambiguity
Words or phrases can often be understood in multiple ways.
  – Teacher Strikes Idle Kids
  – Killer Sentenced to Die for Second Time in 10 Years
  – They denied the petition for his release that was signed by over 10,000 people.
  – child abuse expert / child computer expert
  – Who does Mary love? (three-way ambiguous)

Where are the ambiguities?

Challenges in NLP: Variations
Syntactic Variations
  – I was surprised that Kim lost
  – It surprised me that Kim lost
  – That Kim lost surprised me.
The same meaning can be expressed in different ways
  – Who wrote "The Language Instinct"?
  – Steven Pinker, an MIT professor and author of "The Language Instinct",

Parsing
Analyze the structure of a sentence
[Figure: parse tree for "The student put the book on the table", with nodes S, NP, VP, PP, D, N, V]

Syntactic Variations contd.
[Figure: two parse trees for "Teacher strikes idle kids". In one, the words are tagged N V A N ("strikes" is the verb); in the other, N N V N ("idle" is the verb).]
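The two readings of "Teacher strikes idle kids" can be counted mechanically. The sketch below uses a CKY-style chart over a tiny hand-written grammar; the grammar and lexicon are assumptions made for this one headline, not a general grammar of English.

```python
from collections import defaultdict

# Toy lexicon: each word may belong to several categories (assumed for this example).
LEXICON = {
    "teacher": {"N", "NP"},
    "strikes": {"N", "V"},
    "idle": {"A", "V"},
    "kids": {"N", "NP"},
}

# Binary rules (parent, left child, right child), in Chomsky normal form.
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("NP", "N", "N"),   # noun compound: "teacher strikes"
    ("NP", "A", "N"),   # adjective + noun: "idle kids"
]


def count_parses(words, start="S"):
    """Count distinct parse trees rooted in `start` using a CKY chart."""
    n = len(words)
    # chart[(i, j)][symbol] = number of ways `symbol` derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON.get(word.lower(), ()):
            chart[(i, i + 1)][tag] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for parent, left, right in RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]


if __name__ == "__main__":
    print(count_parses("Teacher strikes idle kids".split()))
```

With this grammar the chart finds two sentence-level parses, one for each reading of the headline. A parser can enumerate the structures, but choosing between them needs knowledge beyond syntax.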

How can a machine understand these differences?
  – Get the cat with the gloves.

Natural Languages vs. Computer Languages
Ambiguity is the primary difference between natural and computer languages.
Formal programming languages are designed to be unambiguous, i.e. they can be defined by a grammar that produces a unique parse for each sentence in the language.
Programming languages are also designed for efficient (deterministic) parsing, i.e. they are deterministic context-free languages (DCFLs).
  – A sentence in a DCFL can be parsed in O(n) time, where n is the length of the string.

Natural Language Tasks
Processing natural language text involves many syntactic, semantic and pragmatic tasks in addition to other problems.
Tasks can be divided into
  – Syntactic Tasks
  – Semantic Tasks
  – Pragmatics/Discourse Tasks
  – Other Tasks

Syntactic tasks: Word Segmentation
Breaking a string of characters (graphemes) into a sequence of words.
In some written languages (e.g. Chinese) words are not separated by spaces.
Even in English, characters other than white-space can be used to separate words [e.g. , ; . - : ( ) ]
Examples from English URLs:
  – jumptheshark.com → jump the shark .com
  – myspace.com/pluckerswingbar → myspace .com pluckers wing bar
                                → myspace .com plucker swing bar
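A dictionary-based segmenter makes the URL ambiguity concrete. The sketch below enumerates every way to split a string into known words; the vocabulary is a tiny assumed word list chosen to cover the slide's examples.

```python
from functools import lru_cache

# Assumed toy vocabulary; a real system would use a large lexicon.
VOCAB = {"jump", "the", "shark", "pluckers", "plucker", "wing", "swing", "bar"}


def segmentations(text: str):
    """Return all ways to split `text` into a sequence of vocabulary words."""

    @lru_cache(maxsize=None)
    def split_from(start: int):
        if start == len(text):
            return [[]]  # one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in VOCAB:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


if __name__ == "__main__":
    print(segmentations("jumptheshark"))
    print(segmentations("pluckerswingbar"))
```

The first string has a single segmentation, while the second has two, "plucker swing bar" and "pluckers wing bar". The dictionary alone cannot say which one was intended.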

Syntactic tasks: Morphological Analysis
Morphology is the field of linguistics that studies the internal structure of words. (Wikipedia)
A morpheme is the smallest linguistic unit that has semantic meaning (Wikipedia)
  – e.g. "carry", "pre", "ed", "ly", "s"
Morphological analysis is the task of segmenting a word into its morphemes:
  – carried → carry + ed (past tense)
  – independently → in + (depend + ent) + ly
  – Googlers → (Google + er) + s (plural)
  – unlockable → un + (lock + able)? (un + lock) + able?

Syntactic tasks: Part Of Speech (POS) Tagging
Annotate each word in a sentence with a part-of-speech.
  I    ate  the  spaghetti  with  meatballs.
  Pro  V    Det  N          Prep  N

  John  saw  the  saw  and  decided  to    take  it   to    the  table.
  PN    V    Det  N    Con  V        Part  V     Pro  Prep  Det  N
Useful for subsequent syntactic parsing and word sense disambiguation.

Syntactic tasks: Phrase Chunking
Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  – [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  – [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ]
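Chunking is often done on top of POS tags. The following is a minimal sketch of a hand-written chunker, assuming the input is already tagged with the tag names used on the POS slide (Pro, V, Det, N, Prep) plus an assumed "Adj" tag; the pattern "optional determiner, adjectives, one or more nouns" is a simplification.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat NP / VP / PP chunks.

    Anything that fits no pattern is emitted as an "O" (outside) chunk.
    """
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "Pro":
            chunks.append(("NP", [word]))
            i += 1
        elif tag in ("Det", "Adj", "N"):
            j = i
            if tagged[j][1] == "Det":          # optional determiner
                j += 1
            while j < len(tagged) and tagged[j][1] == "Adj":
                j += 1                          # any number of adjectives
            k = j
            while k < len(tagged) and tagged[k][1] == "N":
                k += 1                          # one or more nouns
            if k > j:
                chunks.append(("NP", [w for w, _ in tagged[i:k]]))
                i = k
            else:                               # no noun followed: not an NP
                chunks.append(("O", [word]))
                i += 1
        elif tag == "V":
            chunks.append(("VP", [word]))
            i += 1
        elif tag == "Prep":
            chunks.append(("PP", [word]))
            i += 1
        else:
            chunks.append(("O", [word]))
            i += 1
    return chunks


if __name__ == "__main__":
    sentence = [("I", "Pro"), ("ate", "V"), ("the", "Det"),
                ("spaghetti", "N"), ("with", "Prep"), ("meatballs", "N")]
    for label, words in chunk(sentence):
        print(f"[{label} {' '.join(words)}]", end=" ")
    print()
```

Because the chunks are flat and non-recursive, a single left-to-right pass is enough. Full syntactic parsing needs more than that.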

Syntactic tasks: Syntactic Parsing
Produce the correct syntactic parse tree for a sentence.

Semantic Tasks: Word Sense Disambiguation (WSD)
Words in natural language usually have a fair number of different possible meanings.
  – Ellen has a strong interest in computational linguistics.
  – Ellen pays a large amount of interest on her credit card.
For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.
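One classic baseline for WSD is a simplified Lesk approach: pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below assumes a tiny hand-written sense inventory for "interest"; the sense names and glosses are illustrative and not copied from any dictionary.

```python
import re

# Hand-written toy sense inventory (assumption for this sketch).
# The first sense listed acts as the default when no gloss overlaps.
SENSES = {
    "interest": {
        "curiosity": "a sense of concern with and curiosity about someone or something",
        "finance": "a fixed charge for borrowing money, usually a percentage of the amount borrowed",
    }
}
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "with",
             "about", "or", "her", "has"}


def content_words(text: str) -> set:
    """Lower-case the text and keep alphabetic tokens that are not stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def simplified_lesk(word: str, sentence: str) -> str:
    """Return the sense name whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    senses = SENSES[word]
    # max() keeps the first sense on ties, so the first sense is the fallback.
    return max(senses, key=lambda name: len(context & content_words(senses[name])))


if __name__ == "__main__":
    print(simplified_lesk("interest", "Ellen has a strong interest in computational linguistics."))
    print(simplified_lesk("interest", "Ellen pays a large amount of interest on her credit card."))
```

For the second sentence the word "amount" overlaps with the finance gloss. The first sentence overlaps with neither gloss and falls back to the first listed sense, which shows how brittle plain word overlap is.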

Semantic Tasks: Semantic Role Labeling (SRL)
For each clause, determine the semantic role played by each noun phrase that is an argument to the verb.
Roles: agent, patient, source, destination, instrument
  – John (agent) drove Mary (patient) from Austin (source) to Dallas (destination) in his Toyota Prius (instrument).
  – The hammer broke the window.
Also referred to as "case role analysis," "thematic analysis," and "shallow semantic parsing"

Semantic Tasks: Semantic Parsing
A semantic parser maps a natural-language sentence to a complete, detailed semantic representation (logical form).
For many applications, the desired output is immediately executable by another program.
Example: Mapping an English database query to Prolog:
  How many cities are there in the US?
  answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))

Pragmatics/Discourse Tasks: Anaphora Resolution/Co-Reference
Determine which phrases in a document refer to the same underlying entity.
  – John put the carrot on the plate and ate it.
  – Bush started the war in Iraq. But the president needed the consent of Congress.
Some cases require difficult reasoning.
  – Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."

Pragmatics/Discourse Tasks: Ellipsis Resolution
Frequently words and phrases are omitted from sentences when they can be inferred from context.
  "Wise men talk because they have something to say; fools, because they have to say something." (Plato)
  "Wise men talk because they have something to say; fools talk because they have to say something." (Plato)

Other Tasks: Information Extraction (IE)
Identify phrases in language that refer to specific types of entities and relations in text.
Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  Entity types: people, organizations, places
  – Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.
Relation extraction identifies specific relations between entities.
  – Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.

Other Tasks: Question Answering
Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).
  – When was Barack Obama born? (factoid)
      August 4, 1961
  – Who was president when Barack Obama was born?
      John F. Kennedy
  – How many presidents have there been since Barack Obama was born?
      9

Text Summarization
Produce a short summary of a longer document or article.
  – Article: With a split decision in the final two primaries and a flurry of superdelegate endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him the first African American to head a major-party ticket. Before a chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely outcome to the Democratic race with a nod to the marathon that was ending and to what will be another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee.
  – Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.

Machine Translation (MT)
Translate a sentence from one natural language to another.
  – Hasta la vista, bebé → Until we see each other again, baby.

Assignment 1
Find an NLP tool or application and demonstrate how it works, including a 5-minute presentation and system demonstration.
  – What is it?
  – Technology
  – Features
  – How it works
  – What we can do with it

Applications

Applications
What uses of the computer involve language?
What language use is involved?
What are the main problems?
How successful are they?

Speech applications
Speech recognition (Speech-to-text)
  – Uses
      – As a general interface to any text-based application
      – Text dictation
Speech understanding
  – Not the same: computer must understand intention, not necessarily exact words
  – Uses
      – As a general interface to any application where meaning is important rather than text
      – As part of speech translation
Difficulties
  – Separating speech from background noise
  – Filtering of performance errors (disfluencies)
  – Recognizing individual sound distinctions (similar phonemes)
  – Variability in human speech
  – Ambiguity in language (homophones)

Speech applications
Voice recognition
  – Not really a linguistic issue
  – But shares some of the techniques and problems
Text-to-speech (Speech synthesis)
  – Uses:
      – Computer can speak to you
      – Useful where user cannot look at (or see) screen
  – Difficulties
      – Homograph disambiguation
      – Prosody determination (pitch, loudness, rhythm)
      – Naturalness (pauses, disfluencies?)

Word processing
Check and correct spelling, grammar and style
Types of spelling errors
  – Non-existent words
      – Easy to identify
      – But suggested correction not always appropriate
  – Accidental homographs
Deliberate 'errors'
  – Foreign words
  – Proper names, neologisms
  – Illustrations of spelling errors!
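The two error types above behave very differently for a simple checker. A minimal sketch follows, assuming a five-word dictionary and plain Levenshtein edit distance; real spell checkers use much larger lexicons and also weigh word frequency and context.

```python
# Assumed toy dictionary for this sketch.
WORDS = {"spelling", "spewing", "from", "form", "grammar"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def suggest(word: str, max_distance: int = 1):
    """Return dictionary words close to `word`, or [] if it is a known word."""
    if word in WORDS:
        return []  # known word: nothing is flagged, even if it is the wrong word
    return sorted(w for w in WORDS if edit_distance(word, w) <= max_distance)


if __name__ == "__main__":
    print(suggest("speling"))   # non-existent word: easy to detect
    print(suggest("form"))      # real word typed in place of "from": not detected
```

"speling" is flagged, but the checker offers both "spelling" and "spewing" and cannot tell which is appropriate. "form" typed instead of "from" passes silently, which is the accidental-homograph problem.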

Better word processing
Spell checking for homonyms
Grammar checking
Tuned to the user
  – You can (already) add your own auto-corrections
  – Non-native users ('Interference checking')
  – Dyslexics and other special needs users
Intelligent word processing
  – Find/replace that knows about morphology, syntax

Text prediction
Speed up word processing
Facilitate text dictation
At lexical level, already seen in SMS
More sophisticated, might be based on corpus of previously seen texts
Especially useful in repeated tasks
  – Translation memory
  – Authoring memory
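Prediction based on a corpus of previously seen texts can be sketched with bigram counts: remember which word followed which, then suggest the most frequent followers. The three-sentence corpus below is invented for this illustration.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count, for every word, which words followed it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev_word, next_word in zip(tokens, tokens[1:]):
            model[prev_word][next_word] += 1
    return model


def predict(model, prev_word: str, k: int = 3):
    """Return up to k most frequent continuations of `prev_word`."""
    followers = model.get(prev_word.lower(), Counter())
    return [word for word, _ in followers.most_common(k)]


if __name__ == "__main__":
    # Invented example texts standing in for a user's previous messages.
    corpus = [
        "see you at the meeting",
        "see you at the office",
        "see you at the meeting tomorrow",
    ]
    model = train(corpus)
    print(predict(model, "the"))
    print(predict(model, "see"))
```

The more repetitive the task, the better such counts predict, which is why translation memory and authoring memory are natural uses.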

Dialogue systems
Computer enters a dialogue with user
  – Usually specific cooperative task-oriented dialogue
  – Often over the phone
  – Examples?
Usually speech-driven, but text also appropriate
Modern application is automatic transaction processing
Limited domain may simplify language aspect
Domain 'model' will play a big part
Simplest case: choose closest match from (hidden) menu of expected answers
More realistic versions involve significant problems
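The "simplest case" above, choosing the closest match from a hidden menu, can be sketched with the standard-library `difflib` module. The menu entries and action codes are hypothetical, and the 0.5 cutoff is an arbitrary assumption.

```python
import difflib

# Hypothetical hidden menu: expected user answers mapped to system actions.
MENU = {
    "check balance": "BALANCE",
    "transfer money": "TRANSFER",
    "speak to an agent": "AGENT",
}


def interpret(utterance: str, cutoff: float = 0.5):
    """Map the utterance to the closest menu entry, or None if nothing is close."""
    matches = difflib.get_close_matches(
        utterance.lower().strip(), list(MENU), n=1, cutoff=cutoff
    )
    return MENU[matches[0]] if matches else None


if __name__ == "__main__":
    for text in ["check my balance", "transfer some money", "xyzzy"]:
        print(text, "->", interpret(text))
```

String similarity works only while the user stays close to the expected wording. Paraphrases, pronouns and follow-up questions are where the significant problems begin.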

Dialogue systems
Apart from speech recognition and synthesis issues, NL components include
Topic tracking
Anaphora resolution
  – Use of pronouns, ellipsis
Reply generation
  – Cooperative responses
  – Appropriate use of anaphora

(also known as) Conversation machines
Another old AI goal (cf. Turing test)
Also (amazingly) for amusement
Mainly speech, but also text based
Early famous approaches include ELIZA, which showed what you could do by cheating
Modern versions have a lot of NLP, especially discourse modelling, and focus on the language generation component

QA systems
NL interface to knowledge database
Handling queries in a natural way
Must understand the domain
Even if typed, dialogue must be natural
Handling of anaphora, e.g.
  When is the next flight to Sydney?      6.50
  And the one after?                      7.50
  What about Melbourne then?              7.20
  OK I'll take the last one.

IR systems
Like QA systems, but the aim is to retrieve information from textual sources that contain the info, rather than from a structured database
Two aspects
  – Understanding the query (cf. Google, Ask Jeeves)
  – Processing text to find the answer
Named Entity Recognition


Named entity recognition
Typical textual sources involve names (people, places, corporations), dates, amounts, etc.
NER seeks to identify these strings and label them
Clues are often linguistic
Also involves recognizing synonyms, and processing anaphora

Automatic summarization
Renewed interest since mid 1990s, probably due to growth of WWW
Different types of summary
  – indicative vs. informative
  – abstract vs. extract
  – generic vs. query-oriented
  – background vs. just-the-news
  – single-document vs. multi-document

Automatic summarization
topic identification
  – stereotypical text structure
  – cue words
  – high-frequency indicator phrases
  – intratext connectivity
  – discourse structure centrality
topic fusion
  – concept generalization
  – semantic association
summary generation
  – sentence planning to achieve information compaction
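Of the cues listed above, word frequency is the easiest to try. The following is a minimal sketch of an extractive summarizer that scores each sentence by the average corpus frequency of its content words. The stopword list and the length normalization are assumptions of this sketch, and it covers only the topic identification stage, not fusion or generation.

```python
import re
from collections import Counter

# Small assumed stopword list for this sketch.
STOPWORDS = {"the", "is", "for", "was", "a", "of", "and", "to", "in"}


def content_tokens(text: str):
    """Lower-case alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, n_sentences: int = 1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_tokens(text))

    def score(sentence: str) -> float:
        tokens = content_tokens(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


if __name__ == "__main__":
    doc = ("Translation needs parallel text. "
           "Parallel text is scarce for many languages. "
           "The weather was nice.")
    print(summarize(doc, 1))
    print(summarize(doc, 2))
```

This produces an extract, not an abstract: sentences are copied unchanged, so the compaction described under summary generation does not happen.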

Text mining
Discovery by computer of new, previously unknown information, by automatically extracting information from different written resources (typically Internet)
Cf. data mining (e.g. using consumer purchasing patterns to predict which products to place close together on shelves), but based on textual information
Big application area is biosciences

Text mining
preprocessing of document collections (text categorization, term extraction)
storage of the intermediate representations
techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.)
visualization of the results

Story understanding
An old AI application
Involves
  – Inference
  – Ability to paraphrase (to demonstrate understanding)
Requires access to real-world knowledge
Often coded in "scripts" and "frames"

Machine Translation
Oldest non-numerical application of computers
Involves processing of source-language as in other applications, plus
  – Choice of target-language words and structures
  – Generation of appropriate target-language strings
Main difficulty is source-language analysis and/or cross-lingual transfer, which implies varying levels of "understanding", depending on similarities between the two languages
MT is not the same as tools for translators, but there is some overlap

Machine Translation
First approaches perhaps most intuitive: look up words and then do local rearrangement
"Second generation" took linguistic approach: grammars, rule systems, elements of AI
Recent (since 1990) trend to use empirical (statistical) approach based on large corpora of parallel text
  – Use existing translations to "learn" translation models, either a priori (Statistical MT, cf. machine learning) or on the fly (Example-based MT, cf. case-based reasoning)
  – Convergence of empirical and rationalist (rule-based) approaches: learn models based on treebanks or similar.

Language teaching
CALL (computer-assisted language learning)
Grammar checking but linked to models of
  – The topic
  – The learner
  – The teaching strategy
Grammars (etc.) can be used to create language learning exercises and drills

Assistive computing
Interfaces for disabled
Many devices involve language issues, e.g.
  – Text simplification or summarization for users with low literacy (partially sighted, dyslexic, non-native speaker, illiterate, etc.)
  – Text completion (predictive or retrospective)
Works on basis of probabilities or previous examples

Conclusion
Many different applications
But also many common elements
  – Basic tools (lexicons, grammars)
  – Ambiguity resolution
  – Need for (but impossibility of having) real-world knowledge
Humans are really very good at language
  – Can understand noisy or incomplete messages
  – Good at guessing and inferring

Question
Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Discuss the NLP techniques used by Sophia.

Next
Introduction
Speech Processing
  – Text-to-Speech
  – Speech Recognition
Words
  – Morphology
  – Part-of-Speech Tagging
  – Morphological Processing
Syntax
  – Word Classes
  – Context-Free Grammars
  – Parsing
Semantics
  – Representing Meaning
  – Semantic Analysis
  – Lexical Semantics, Word Sense Disambiguation and Information Retrieval
Pragmatics
  – Discourse, Dialogue and Conversational Agents
NLP Applications
  – Machine Translation System
