Text Mining Webinar - KNIME

Transcription

Text Mining Webinar
The Textprocessing Extension
Rosaria Silipo and Kilian Thiel

Agenda
- Text Mining Goals and Usage
- Enrichment & Preprocessing
- Data Types & Structures
- Visualization
- Topic Detection
- Sentiment Analysis
KNIME Copyright 2013, KNIME Text Mining Webinar

Install Textprocessing Extension
KNIME: www.knime.org
Install the Textprocessing Extension under KNIME Labs.

Examples
Example workflows are available on the KNIME public server.

Text Mining Workflow
- Create New Document: read or create a new Document
- Enrichment: add tags (POS, sentiment, cities, domain-specific terms, ...)
- Transformation: BoW
- Preprocessing: filtering, stemming, ...
- Frequencies: TF abs, TF rel, IDF, ...
- Visualization: Tag Cloud and Document Properties Visualization
- Data Mining: classification for Topic Detection

1 - Create a Document

New Data Types
Document: encapsulates text, author, title, source, category, and type

From a Folder
The output is a list of Documents.

From PUBMED
The output is a list of Documents.
Destination Folder MUST be EMPTY!

Strings to Document

From RSS Feeds
http://feeds.nytimes.com/nyt/rss/World
Palladian Nodes

The Data Set
Reviews of restaurants in Berlin from TripAdvisor
Self-downloaded with RSS Feeder

2 - Enrichment

New Data Types
Document: encapsulates text, author, title, source, category, and type
Term: encapsulates a term

Enrichment (Tagging)
Enrichment nodes (mostly) change the granularity of terms.
- Multiword detection, named entity recognition, part of speech definition, ...
- To each detected entity (term) a tag is added, specifying its type.
- To avoid intersection of granularity the last node dominates.

Tagger Conflict Resolution
In case of intersections of granularity the last node overwrites.
Example: "The gene interleukin 6 interacts ..."
1. POS tagger: "The\DT gene\NN interleukin\NN 6\CD interacts\VBZ ..."
2. NE tagger: "The\DT gene\NN interleukin 6\GENE interacts\VBZ ..."
POS Tagger: adds POS tags to terms.
NE Tagger: adds NE tags to terms and overrides other conflicting tags (overwrite!).
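The "last node overwrites" rule can be illustrated with a small Python sketch. This is not KNIME's implementation; the `Term` tuple layout and the `apply_tagger` and `render` helpers are hypothetical names chosen for illustration. A term is modeled as a token span plus a tag, and a later tagger replaces every earlier term whose span it overlaps.

```python
from typing import List, Tuple

# Hypothetical term model: (first token index, end index exclusive, tag).
Term = Tuple[int, int, str]


def apply_tagger(existing: List[Term], new_terms: List[Term]) -> List[Term]:
    """Merge the output of a later tagger; on span overlap the later term wins."""
    def overlaps(a: Term, b: Term) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    # Keep only earlier terms that do not intersect any new term.
    kept = [t for t in existing if not any(overlaps(t, n) for n in new_terms)]
    return sorted(kept + new_terms)


def render(tokens: List[str], terms: List[Term]) -> str:
    """Print terms in the slide's notation: words, a backslash, then the tag."""
    return " ".join(" ".join(tokens[s:e]) + "\\" + tag for s, e, tag in terms)


tokens = ["The", "gene", "interleukin", "6", "interacts"]
pos_terms = [(0, 1, "DT"), (1, 2, "NN"), (2, 3, "NN"), (3, 4, "CD"), (4, 5, "VBZ")]
ne_terms = [(2, 4, "GENE")]  # "interleukin 6" as one multiword term

merged = apply_tagger(pos_terms, ne_terms)
print(render(tokens, merged))
```

With the slide's example, the two single-word terms "interleukin" and "6" are replaced by the multiword term tagged GENE, while the other POS tags survive.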

Tagging
Settings labels from the tagger node dialogs:
- Language, POS Model
- No settings
- Abner Model
- Named Entity: no settings
- Named Entity Tags can be set as unmodifiable
- Dictionary Column, Tag Type, Tag Value
- Dictionary Column, Tag Type, Tag Value, Matching Strategy

Unmodifiable Named Entity Tags
- Named Entity Tags attached through enrichment nodes can be set as unmodifiable.
- Unmodifiable Tags are not affected by any preprocessing nodes (stemming, filtering, etc.).

Workflow
Strings to Document, POS Tagger

3 - Transformation

Data Types and Features
- String
- Document
- Tags
- Meta Info
- Sentence
- Term
- Molecule Structure
- Document Vector

Parsing / Tokenization
Parser nodes parse documents by applying standard tokenization via the OpenNLP tokenizer.
Each token is a term consisting of a single word.
Tags are applied to terms.

Annotations label a section of the document (like "Abstract", ...).
Terms consist of tokens and tags.

The Bag of Words
No settings required.
Each Term is extracted with Tags (NN, VB, ...).
List (Bag) of Terms (Words) identified in Document.
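As a rough illustration of what a bag of words looks like as a table, the Python sketch below emits one row per document and distinct term. It is only a sketch under stated assumptions: it tokenizes with a simple regular expression instead of the OpenNLP tokenizer, it carries no tags, and `bag_of_words` is a hypothetical helper name, not a KNIME API.

```python
import re
from typing import Dict, List, Tuple


def bag_of_words(docs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (document id, term) rows, one per distinct term in each document."""
    rows: List[Tuple[str, str]] = []
    for doc_id, text in docs.items():
        seen = set()
        # Assumption: a term is a run of word characters (no tags, no multiwords).
        for token in re.findall(r"\w+", text):
            if token not in seen:
                seen.add(token)
                rows.append((doc_id, token))
    return rows


docs = {"d1": "good food, good service", "d2": "bad food"}
for row in bag_of_words(docs):
    print(row)
```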

Data and Sentence Extractor

Conversions
Molecular Structure

Workflow

4 - Preprocessing

Preprocessing Mode
Normal
- Faster
- Preprocesses only terms of the term column
- Documents are not changed
Deep
- Slower
- Preprocesses terms of the term column
- Terms in documents are changed as well
- Unchanged documents can be appended

Filtering by Tags and Terms
- Biomedical named entity tags
- French Treebank (POS) tags
- Any tags
- Numbers, words with fewer than N chars, modifiable terms
- POS, chemical, and pharma tags
- Remove punctuation from document
- RegEx terms
- STTS tags
- Standard Named Entity Filter tags
- Stop word terms

Converting and Replacing
- Term case conversion
- Term replacement with dictionary word
- Introducing hyphenation (Liang's algorithm)
- RegEx-based term replacer
- Groups rows by term value

Stemming
Example: "convert", "converting" -> "convert[]"
- Kuhlen Stemmer (English only)
- Porter (English only)
- Snowball (English, German, French, ...)
Stemmed term replaces original term!

Workflow

5 - Frequencies

Frequency Measures
- TF: Term Frequency
  - TF absolute: # occurrences of term t
  - TF relative: # occurrences of term t / # terms
- IDF: Inverse Document Frequency = log(1 + # docs / # docs with term t)
- ICF: Inverse Category Frequency = log(1 + # cat. / # cat. with term t)
- IDF * TF
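The measures on this slide can be written out in a few lines of Python. This is a minimal sketch, not the KNIME nodes themselves: it reads the slide's IDF as log(1 + # docs / # docs with term t), uses the natural logarithm as an assumption, and the function names `tf_abs`, `tf_rel`, and `idf` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_abs(doc: List[str]) -> Dict[str, int]:
    """Absolute term frequency: number of occurrences of each term."""
    return dict(Counter(doc))


def tf_rel(doc: List[str]) -> Dict[str, float]:
    """Relative term frequency: occurrences of a term / number of terms."""
    total = len(doc)
    return {term: count / total for term, count in Counter(doc).items()}


def idf(docs: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency: log(1 + # docs / # docs containing the term)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}


docs = [["good", "food", "good"], ["bad", "food"]]
idf_values = idf(docs)
# IDF * TF for the first document, using the relative term frequency.
tf_idf_d1 = {term: tf * idf_values[term] for term, tf in tf_rel(docs[0]).items()}
print(tf_idf_d1)
```

ICF follows the same pattern with categories in place of documents.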

Frequency Filter
- Frequency Column
- Min/max thresholds: values between min threshold and max threshold
- Top K Rows: K
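The two filter options named on the slide, a min/max threshold and the top K rows, can be sketched as follows. The helper names and the inclusive bounds are assumptions made for illustration; the actual node may treat boundaries differently.

```python
from typing import List, Tuple

Row = Tuple[str, float]  # (term, frequency value)


def filter_by_threshold(rows: List[Row], min_value: float, max_value: float) -> List[Row]:
    """Keep rows whose frequency lies between the thresholds (inclusive, by assumption)."""
    return [row for row in rows if min_value <= row[1] <= max_value]


def top_k(rows: List[Row], k: int) -> List[Row]:
    """Keep the k rows with the highest frequency value."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


rows = [("food", 0.9), ("good", 0.5), ("the", 0.05), ("burger", 0.3)]
print(filter_by_threshold(rows, 0.1, 0.6))
print(top_k(rows, 2))
```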

Workflow

6 - Visualization

Document Viewer
- Double-click opens the document.
- Right-clicking a word lists search engines.
- Search engines are listed in the KNIME Textprocessing Search Engine preferences.

Document Viewer
Previous and next document

Tag Cloud
Adj, verbs and nouns as same word

Asian Restaurants

German Food Restaurants

Fast Food Restaurants

Hiliting in Tag Clouds
Interactive Table
vietnamis

Workflow

7 - Topic Classification

Document Vector
Document Vector: Documents represented in the terms space
Bitvector or frequency measure
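A document vector can be sketched in Python as one numeric vector per document, with one position per term in the vocabulary. This is an illustrative sketch, not the Document Vector node: the row layout `(document id, term, frequency)` and the `document_vectors` helper are hypothetical.

```python
from typing import Dict, List, Tuple

BowRow = Tuple[str, str, float]  # (document id, term, frequency measure)


def document_vectors(
    rows: List[BowRow], as_bits: bool = True
) -> Tuple[List[str], Dict[str, List[float]]]:
    """Represent each document in term space, as a bit vector or with frequencies."""
    vocab = sorted({term for _, term, _ in rows})
    index = {term: i for i, term in enumerate(vocab)}
    vectors: Dict[str, List[float]] = {}
    for doc_id, term, freq in rows:
        vec = vectors.setdefault(doc_id, [0.0] * len(vocab))
        # 1.0 marks presence in the bit vector; otherwise store the frequency.
        vec[index[term]] = 1.0 if as_bits else float(freq)
    return vocab, vectors


rows = [("d1", "good", 2), ("d1", "food", 1), ("d2", "bad", 1), ("d2", "food", 1)]
vocab, bit_vectors = document_vectors(rows, as_bits=True)
_, freq_vectors = document_vectors(rows, as_bits=False)
print(vocab)
print(bit_vectors)
print(freq_vectors)
```

Swapping the roles of documents and terms in the same construction gives a term represented in document space.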

Term Vector
Term Vector: Terms represented in the documents space
Bitvector or frequency measure

Topic Detection Goal
Possible Topics:
- Asian Restaurants
- German Food
- Fast Food
Pre-labeled data set available!
Target Topic is in Document Category.

Topic Detection
After the Document Vector Transformation, topic detection becomes just another data analytics problem.
80% for training set, 20% for test set.
Target: Category

Classification Sub-Workflow
- Prepare input data as word in Document 1/0
- Extract Category
- Training/Testing
- Wrong docs
- X-Validation

Problem 1
Sometimes a pre-labeled data set is not available.
1. Use a clustering technique.
2. Find a similar pre-labeled data set that you can adapt to the current problem.

Problem 2
The vector generated by the Document Vector node can be high dimensional.
To reduce the input space dimensionality you can:
- Filter words by frequency
- Detect keywords and only use the most important ones

Keywords Extractor Nodes
- From: "Keyword extraction from a single document using word co-occurrence statistical information" by Y. Matsuo and M. Ishizuka.
- From: "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor" by Yukio Ohsawa.
- Max. # keywords per doc

8 - Sentiment Analysis

Sentiment Corpus
- MPQA Corpus with negative and positive words
- Tag Words according to Corpus with a Dictionary Tagger Node

Dictionary Tagger Node
- List of Words from Corpus
- Tag Type to attach to Words in document and in Corpus List
- Tag Value to attach to Words in document and in Corpus List
- NE: Named Entities
- Each Tag Type has a list of possible Tag Values

Sentiment By Document
- PERSON Tag for POSITIVE Words
- TIME Tag for NEGATIVE Words
- ATTITUDE = +/-1 for POSITIVE or NEGATIVE Words
- MEAN(ATTITUDE) on all words for each document
- Bin1: negative documents
- Bin2, Bin3: neutral documents
- Bin4: positive documents
- Adjust Bin boundaries in Auto-Binner node for more accurate results
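The scoring scheme on this slide can be sketched in Python. Several details are assumptions rather than what the workflow does: words without a sentiment tag are skipped when averaging, a document with no tagged words scores 0, and the bin boundaries of -0.5 and 0.5 (four equal-width bins over [-1, 1], with the two middle bins merged as neutral) are hypothetical. The PERSON and TIME tag names come from the slide; the function names are illustrative only.

```python
from statistics import mean
from typing import List, Optional


def document_attitude(word_tags: List[Optional[str]]) -> float:
    """Mean ATTITUDE of a document: +1 per PERSON (positive), -1 per TIME (negative)."""
    scores = [1 if tag == "PERSON" else -1 for tag in word_tags if tag in ("PERSON", "TIME")]
    # Assumption: a document without any tagged word counts as neutral (0.0).
    return float(mean(scores)) if scores else 0.0


def sentiment_bin(score: float, lower: float = -0.5, upper: float = 0.5) -> str:
    """Map a mean attitude to negative (Bin1), neutral (Bin2, Bin3), or positive (Bin4)."""
    if score < lower:
        return "negative"
    if score <= upper:
        return "neutral"
    return "positive"


review = ["PERSON", None, "PERSON", "TIME", None]  # tags of the words in one review
score = document_attitude(review)
print(score, sentiment_bin(score))
```

Moving `lower` and `upper` corresponds to adjusting the bin boundaries, which the slide suggests for more accurate results.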

Sentiment by Category
- Tag Cloud on all words in all documents for a given category
- Words in tag clouds are colored by sentiment
- Color by NE value
- New column NE with values: PERSON, TIME, neutral (when missing value)
Categories shown: Fast Food, Asian, German Cuisine

Tag Cloud by Category
Asian, Fast Food, German

Sentiment by Word
- Find Most Positive/Most Negative Document
- Build Tag Cloud with words colored by sentiment
Sorting Ascending, Sorting Descending

Tag Cloud of Worst/Best Doc
Most Positive, Most Negative

Improvements
- Add polarity change for negations
- Add polarity changes for enhancements
- Improve positive/negative dictionary
- Improve bin distribution (skewness towards positive)
- Improve list of stop words
- Remove names of burgers?

Thank you
education@knime.com

30.10.2013 · KNIME Text Mining Webinar