Transcription
Text Mining Webinar
The Textprocessing Extension
Rosaria Silipo and Kilian Thiel

Agenda
– Text Mining Goals and Usage
– Enrichment & Preprocessing
– Data Types & Structures
– Visualization
– Topic Detection
– Sentiment Analysis
KNIME Copyright 2013 · KNIME Text Mining Webinar

Install Textprocessing Extension
KNIME: www.knime.org
Install the Textprocessing Extension under KNIME Labs.

Examples
Example workflows are available on the KNIME public server.

Text Mining Workflow
– Read or Create a new Document
– Enrichment: Add Tags (POS, Sentiment, cities, domain specific terms, ...)
– Transformation: BoW
– Preprocessing: Filtering, Stemming, ...
– Frequencies: TF abs, TF rel, IDF, ...
– Visualization: Tag Cloud and Document Properties Visualization
– Data Mining: Classification for Topic Detection

1 - Create a Document

New Data Types
Document: encapsulates text, author, title, source, category, and type.

From a Folder
The output is a list of Documents.

From PUBMED
The output is a list of Documents.
The destination folder MUST be EMPTY!

Strings to Document

From RSS Feeds
http://feeds.nytimes.com/nyt/rss/World
Palladian Nodes

The Data Set
Reviews of restaurants in Berlin from TripAdvisor.
Self-downloaded with RSS Feeder.

2 - Enrichment

New Data Types
Document: encapsulates text, author, title, source, category, and type.
Term: encapsulates a term.

Enrichment (Tagging)
Enrichment nodes (mostly) change the granularity of terms.
– Multiword detection, named entity recognition, part of speech definition, ...
– To each detected entity (term) a tag is added, specifying its type.
– To avoid intersection of granularity the last node dominates.

Tagger Conflict Resolution
In case of intersections of granularity the last node overwrites.
Example: “The gene interleukin 6 interacts ...”
1. POS tagger: “The\DT gene\NN interleukin\NN 6\CD interacts\VBZ”
2. NE tagger: “The\DT gene\NN interleukin 6\GENE interacts\VBZ”
POS Tagger: adds POS tags to terms.
NE Tagger: adds NE tags to terms and overrides other conflicting tags (Overwrite!).
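
The overwrite rule from the interleukin example can be sketched in plain Python. This is only an illustration, not KNIME code: `Term` and `apply_ne_tagger` are hypothetical names, and the sketch assumes each POS-tagged term holds a single word.

```python
from dataclasses import dataclass


@dataclass
class Term:
    words: tuple  # one or more words forming the term
    tag: str      # tag set by the most recent tagger


def apply_ne_tagger(terms, entity_words, tag):
    """Merge consecutive terms that spell out entity_words into one term.

    The merged term carries only the new tag, so earlier tags on the
    merged words are lost ("the last node overwrites").
    """
    n = len(entity_words)
    target = tuple(entity_words)
    result, i = [], 0
    while i < len(terms):
        window = tuple(w for t in terms[i:i + n] for w in t.words)
        if window == target:
            result.append(Term(window, tag))
            i += n
        else:
            result.append(terms[i])
            i += 1
    return result


pos_tagged = [
    Term(("The",), "DT"),
    Term(("gene",), "NN"),
    Term(("interleukin",), "NN"),
    Term(("6",), "CD"),
    Term(("interacts",), "VBZ"),
]
ne_tagged = apply_ne_tagger(pos_tagged, ["interleukin", "6"], "GENE")
```

After the second pass, "interleukin" and "6" form one term tagged GENE, and their NN and CD tags are gone; the other terms keep their POS tags.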

Tagging
Node settings shown on the slide:
– Language, POS Model
– No settings
– Abner Model, Named Entity
– No settings
– Named Entity Tags can be set as unmodifiable
– Dictionary Column, Tag Type, Tag Value
– Dictionary Column, Tag Type, Tag Value, Matching Strategy

Unmodifiable Named Entity Tags
Named Entity Tags attached through enrichment nodes can be set as unmodifiable.
Unmodifiable Tags are not affected by any preprocessing nodes (stemming, filtering, etc.).

Workflow
Strings to Document, POS Tagger

3 - Transformation

Data Types and Features
– String
– Document
– Tags
– Meta Info
– Sentence
– Term
– Molecule Structure
– Document Vector

Parsing / Tokenization
Parser nodes parse documents by applying standard tokenization via the OpenNLP tokenizer.
Each token is a term consisting of a single word.
Tags are applied to terms.

Annotations label a section part in the document (like “Abstract”, ...).
Terms consist of tokens and tags.

The Bag of Words
No settings required.
Each Term is extracted with Tags (NN, VB, ...).
List (Bag) of Terms (Words) identified in Document.
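
As a rough illustration of what a bag of words contains, here is a plain Python sketch. It is not the KNIME node: `bag_of_words` is a hypothetical helper, and it assumes one output row per distinct tagged term per document.

```python
def bag_of_words(documents):
    """Return one (term, tag, document title) row per distinct tagged term."""
    rows = []
    for title, tagged_terms in documents.items():
        seen = set()
        for term, tag in tagged_terms:
            if (term, tag) not in seen:
                seen.add((term, tag))
                rows.append((term, tag, title))
    return rows


docs = {"review 1": [("pizza", "NN"), ("tastes", "VBZ"), ("pizza", "NN")]}
rows = bag_of_words(docs)
```

The second occurrence of "pizza" adds no new row; how often a term occurs is left to the frequency nodes.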

Data and Sentence Extractor

Conversions
Molecular Structure

Workflow

4 - Preprocessing

Preprocessing Mode
Normal
– Faster
– Preprocesses only terms of the term column
– Documents are not changed
Deep
– Slower
– Preprocesses terms of the term column
– Terms in documents are changed as well
– Unchanged documents can be appended

Filtering by Tags and Terms
– biomedical named entity tags
– French Treebank (POS) tags
– Any tags
– numbers, words < N chars, modifiable terms
– POS, chemical, and pharma tags
– Remove Punctuation from document
– RegEx terms
– STTS tags
– Standard Named Entity Filter tags
– Stop word terms

Converting and Replacing
– Term case conversion
– Term replacement with dictionary word
– Introducing Hyphenation (Liang’s algorithm)
– RegEx based term replacer
– Groups rows by term value

Stemming
Example: “convert” and “converting” both become “convert[]”.
– Kuhlen Stemmer (English only)
– Porter (English only)
– Snowball (English, German, French, ...)
Stemmed term replaces original term!

Workflow

5 - Frequencies

Frequency Measures
TF (Term Frequency)
– TF absolute = # occurrences of term t
– TF relative = # occurrences of term t / # terms
IDF (Inverse Document Frequency) = log(1 + # docs / # docs with term t)
ICF (Inverse Category Frequency) = log(1 + # cat. / # cat. with term t)
IDF * TF
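
The measures above can be tried out with a small Python sketch. It assumes the IDF formula reads log(1 + # docs / # docs with term t) and uses the natural logarithm; the log base and the function names are assumptions for illustration, not taken from the KNIME nodes.

```python
import math
from collections import Counter


def tf_absolute(term, doc_terms):
    """Number of occurrences of term in one document."""
    return Counter(doc_terms)[term]


def tf_relative(term, doc_terms):
    """Occurrences of term divided by the number of terms in the document."""
    return tf_absolute(term, doc_terms) / len(doc_terms)


def idf(term, corpus):
    """log(1 + # docs / # docs with term); natural log is an assumption."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(1 + len(corpus) / docs_with_term)


corpus = [
    ["sushi", "rice", "sushi"],
    ["burger", "fries"],
    ["sushi", "burger"],
]
tf_abs = tf_absolute("sushi", corpus[0])   # counted in the first document
tf_rel = tf_relative("sushi", corpus[0])
idf_sushi = idf("sushi", corpus)
tf_idf = tf_rel * idf_sushi
```

ICF follows the same pattern with categories in place of documents.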

Frequency Filter
– Frequency Column
– Min max thresholds: values between min threshold and max threshold
– Top K Rows: K
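
The two filter options can be sketched in Python as follows. The helper names are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
def filter_by_threshold(rows, min_value, max_value):
    """Keep rows whose frequency lies between the two thresholds (inclusive)."""
    return [row for row in rows if min_value <= row[1] <= max_value]


def filter_top_k(rows, k):
    """Keep the k rows with the highest frequency."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


# (term, frequency) rows with made-up values
rows = [("sushi", 0.40), ("the", 0.90), ("rice", 0.10), ("burger", 0.25)]
in_range = filter_by_threshold(rows, 0.2, 0.5)
top_two = filter_top_k(rows, 2)
```

The threshold filter drops both very rare and very frequent terms, while the top K filter keeps only the highest ranked rows.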

Workflow

6 - Visualization

Document Viewer
Right-click on a word: list of search engines.
Search engines are listed in KNIME - Textprocessing Search Engine Preferences.
Double-click opens the document.

Document Viewer
Previous and next document

Tag Cloud
Adj, verbs and nouns as same word

Asian Restaurants

German Food Restaurants

Fast Food Restaurants

Hiliting in Tag Clouds
Interactive Table
vietnamis

Workflow

7 - Topic Classification

Document Vector
Document Vector: Documents represented in the terms space.
Bitvector or frequency measure.
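
A document vector can be sketched in Python like this. `document_vectors` is a hypothetical helper, not the KNIME node; it assumes each document comes as a term-to-frequency mapping and orders the vocabulary alphabetically.

```python
def document_vectors(bags, use_bits=True):
    """Turn {title: {term: frequency}} into one vector per document.

    Each position corresponds to one term of the shared vocabulary.
    With use_bits=True the entry is 1/0 (term present or not),
    otherwise it is the stored frequency (0 if the term is absent).
    """
    vocabulary = sorted({term for bag in bags.values() for term in bag})
    vectors = {}
    for title, bag in bags.items():
        if use_bits:
            vectors[title] = [1 if term in bag else 0 for term in vocabulary]
        else:
            vectors[title] = [bag.get(term, 0) for term in vocabulary]
    return vocabulary, vectors


bags = {
    "doc1": {"sushi": 2, "rice": 1},
    "doc2": {"burger": 1, "rice": 3},
}
vocabulary, bit_vectors = document_vectors(bags)
_, freq_vectors = document_vectors(bags, use_bits=False)
```

Swapping the roles of documents and terms in the same table gives the term vector: each term described by the documents it occurs in.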

Term Vector
Term Vector: Terms represented in the documents space.
Bitvector or frequency measure.

Topic Detection Goal
Possible Topics:
- Asian Restaurants
- German Food
- Fast Food
Pre-labeled data set available!
Target Topic is in Document Category.

Topic Detection
After the Document Vector Transformation, topic detection becomes just another data analytics problem.
80% for training set, 20% for test set.
Target Category
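
The 80/20 partitioning can be sketched in Python as below. The slide does not say how rows are sampled, so the random shuffle with a fixed seed is an assumption, and `split_train_test` is a hypothetical helper.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and cut them into a training and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


categories = ["Asian Restaurants", "German Food", "Fast Food"]
# (document vector, target category) rows with made-up vectors
rows = [([i % 2, (i + 1) % 2], categories[i % 3]) for i in range(10)]
train, test = split_train_test(rows)
```

Any classifier can then be trained on the vectors in `train` with the category as target and evaluated on `test`.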

Classification Sub-Workflow
– Prepare input data as word in Document 1/0
– Extract Category
– Training/Testing
– X-Validation
– Wrong docs

Problem 1
Sometimes a pre-labeled data set is not available.
1. Use a clustering technique.
2. Find a similar pre-labeled data set that you can adapt to the current problem.

Problem 2
The vector generated by the Document Vector node can be high dimensional.
To reduce the input space dimensionality you can:
- Filter words by frequency
- Detect keywords and only use the most important ones.

Keywords Extractor Nodes
From: "Keyword extraction from a single document using word co-occurrence statistical information" by Y. Matsuo and M. Ishizuka.
Max. # keywords per doc
From: "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor" by Yukio Ohsawa.

8 - Sentiment Analysis

Sentiment Corpus
– MPQA Corpus with negative and positive words
– Tag Words according to Corpus with a Dictionary Tagger Node

Dictionary Tagger Node
– List of Words from Corpus
– Tag Type to attach to Words in document and in Corpus List (NE: Named Entities)
– Tag Value to attach to Words in document and in Corpus List
– Each Tag Type has a list of possible Tag Values

Sentiment By Document
– PERSON Tag for POSITIVE Words
– TIME Tag for NEGATIVE Words
– ATTITUDE = +1/-1 for POSITIVE or NEGATIVE Words
– MEAN(ATTITUDE) on all words for each document
– Adjust Bin boundaries in Auto-Binner node for more accurate results
– Bin1 = negative documents
– Bin2, Bin3 = neutral documents
– Bin4 = positive documents
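
The scoring steps above can be sketched in Python. Several points are assumptions: ATTITUDE is taken as +1 for PERSON (positive) and -1 for TIME (negative), untagged words are skipped when averaging, and the four bins are equal-width over the range -1 to 1. The function names are hypothetical.

```python
ATTITUDE = {"PERSON": 1, "TIME": -1}  # PERSON marks positive, TIME negative words
BIN_LABELS = {1: "negative", 2: "neutral", 3: "neutral", 4: "positive"}


def document_score(tags):
    """Mean ATTITUDE over the tagged words of one document (0.0 if none)."""
    values = [ATTITUDE[tag] for tag in tags if tag in ATTITUDE]
    return sum(values) / len(values) if values else 0.0


def bin_number(score):
    """Assign a score in [-1, 1] to one of four equal-width bins."""
    if score < -0.5:
        return 1
    if score < 0:
        return 2
    if score <= 0.5:
        return 3
    return 4


# One tag per word; None stands for a word without sentiment tag.
review = ["PERSON", None, "PERSON", "TIME", "PERSON"]
score = document_score(review)
label = BIN_LABELS[bin_number(score)]
```

Moving the bin boundaries changes how many documents end up as neutral, which is the adjustment the slide recommends.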

Sentiment by Category
– Tag Cloud on all words in all documents for a given category
– Words in tag clouds are colored by sentiment
– New column NE with values: PERSON, TIME, neutral (when missing value)
– Color by NE value
Categories: Fast Food, Asian, German Cuisine

Tag Cloud by Category
Asian, Fast Food, German

Sentiment by Word
– Find Most Positive/Most Negative Document (Sorting Ascending, Sorting Descending)
– Build Tag Cloud with words colored by sentiment

Tag Cloud of Worst/Best Doc
Most Positive, Most Negative

Improvements
– Add polarity change for negations
– Add polarity changes for enhancements
– Improve positive/negative dictionary
– Improve bin distribution (skewness towards positive)
– Improve list of stop words
– Remove Names of Burgers?
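
One possible reading of the first improvement, polarity change for negations, is sketched below in Python. It is a deliberately simple rule, not part of the presented workflow: a negation word flips the polarity of the word that directly follows it. The negation list and the small lexicon are made up for illustration.

```python
NEGATIONS = {"not", "no", "never"}  # illustrative list, not from the slides


def attitudes_with_negation(words, lexicon):
    """Return the attitude of each sentiment word, flipped after a negation.

    Only the word directly after a negation is affected.
    """
    scores = []
    negate = False
    for word in words:
        lower = word.lower()
        if lower in NEGATIONS:
            negate = True
            continue
        if lower in lexicon:
            value = lexicon[lower]
            scores.append(-value if negate else value)
        negate = False
    return scores


lexicon = {"good": 1, "great": 1, "bad": -1}
words = "The food was not good but service was great".split()
scores = attitudes_with_negation(words, lexicon)
```

Here "not good" counts as negative while "great" stays positive.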

Thank you
education@knime.com

30.10.2013 · KNIME Text Mining Webinar