Zurich KNIME Users Text Mining Tripadvisor Data

Transcription

Zurich KNIME Users
Text Mining Tripadvisor Data
Kilian Thiel, KNIME
Copyright 2014 KNIME.com AG

Text Mining with KNIME: Mining Tripadvisor Data
Agenda
The KNIME Textprocessing Extension
– Preliminaries
– Philosophy & Usage
Classification of Tripadvisor Reviews
– Tripadvisor data
– Classification of reviews

ng
Documentation
Examples
Forum
White Papers

Installation
1.)
2.)

Requirements
Requirements to import and run demo workflows
KNIME 2.11.1
Textprocessing (labs)
Distance Matrix (KNIME)
Palladian (Community)

Tips
Settings (knime.ini)
– Set maximum memory for KNIME
– -Xmx3G
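
The memory setting above can be applied by hand in a text editor. As a minimal sketch, the Python helper below rewrites the text of a knime.ini file so that it contains the -Xmx3G line. It assumes the usual Eclipse launcher layout, in which JVM options follow a -vmargs line. The function name set_max_heap and the sample file content are illustrative and not taken from the slides.

```python
def set_max_heap(ini_text, size="3G"):
    """Return knime.ini content with the JVM maximum heap set to `size`."""
    lines = ini_text.splitlines()
    new_flag = f"-Xmx{size}"
    replaced = False
    for i, line in enumerate(lines):
        # Replace an existing maximum-heap option in place.
        if line.strip().startswith("-Xmx"):
            lines[i] = new_flag
            replaced = True
    if not replaced:
        # JVM options must come after -vmargs, so add it if it is missing.
        if "-vmargs" not in [l.strip() for l in lines]:
            lines.append("-vmargs")
        lines.append(new_flag)
    return "\n".join(lines) + "\n"


# Illustrative file content only; a real knime.ini has more entries.
example = "-startup\nplugins/launcher.jar\n-vmargs\n-Xmx1024m\n"
print(set_max_heap(example))
```

KNIME needs to be restarted before a changed knime.ini takes effect.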

Philosophy
Classification
perhaps your name is Rumpelstiltskin[Person] ?

Additional Data Types
Document Cell
– Encapsulates a document
  Title, sentences, terms, words
  Authors, category, source
  Generic meta data (key, value pairs)
Term Cell
– Encapsulates a term
  Words, tags
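
To make the two cell types more concrete, here is a rough Python model of what they encapsulate. This is only an illustration of the fields listed above, not the extension's actual Java classes; the class names, field names and the example title are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus any tags assigned to it."""
    words: tuple
    tags: tuple = ()


@dataclass
class Document:
    """A document: text structure plus descriptive and generic meta data."""
    title: str
    sentences: list                      # each sentence is a list of Term
    authors: list = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: dict = field(default_factory=dict)  # generic key/value pairs

    def terms(self):
        """All terms of the document in reading order."""
        return [term for sentence in self.sentences for term in sentence]


# The example sentence from the Philosophy slide, with one tagged term.
sentence = [
    Term(("perhaps",)),
    Term(("your",)),
    Term(("name",)),
    Term(("is",)),
    Term(("Rumpelstiltskin",), ("Person",)),
]
doc = Document(title="Example", sentences=[sentence])
print(doc.terms()[-1])
```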

Data Table Structures
Document table
– List of documents
Bag of words
– Tuples of documents and terms
Document vectors
– Numerical representations of documents
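
The three table structures can be sketched with plain Python containers. The two documents below are invented, and the 0/1 encoding is just one possible numerical representation, chosen here for brevity.

```python
# Document table: a list of documents (here: id -> raw text, invented data).
documents = {
    "doc1": "great pizza great pasta",
    "doc2": "spicy curry",
}

# Bag of words: one (document, term) tuple per distinct term of a document.
bag_of_words = [
    (doc_id, term)
    for doc_id, text in documents.items()
    for term in sorted(set(text.split()))
]

# Document vectors: one numerical row per document, one column per term.
vocabulary = sorted({term for _, term in bag_of_words})
document_vectors = {
    doc_id: [1 if (doc_id, term) in bag_of_words else 0 for term in vocabulary]
    for doc_id in documents
}

print(vocabulary)
print(document_vectors)
```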

Tripadvisor Data
Rating
Title
Author
Fulltext

Tripadvisor Data
Reviews of Italian and Indian restaurants in Zurich
Indian: 202
Italian: 202

Tripadvisor Data
Goal: Build a classifier to distinguish between Indian and Italian restaurants, based on their reviews.
Review about Italian or Indian restaurant?

1.) Reading
Read/parse textual data

Demo: Reading
Read Tripadvisor data (.table file)
Filter rows with missing restaurant value
Convert strings to documents
Filter all but the document column
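
Outside KNIME, the same reading steps might look roughly like the Python sketch below. The rows are invented, and the column names (including "restaurant" as the column that holds the cuisine) and the dictionary layout of a document are assumptions, since the slides do not show the table schema.

```python
# Invented sample rows standing in for the content of the .table file.
rows = [
    {"rating": 5, "title": "Lovely", "author": "anna",
     "fulltext": "Great pizza and pasta.", "restaurant": "Italian"},
    {"rating": 4, "title": "Tasty", "author": "ben",
     "fulltext": "Very good curry.", "restaurant": "Indian"},
    {"rating": 3, "title": "Hm", "author": "cem",
     "fulltext": "It was fine.", "restaurant": None},
]


def to_document(row):
    """Convert the string columns of one row into a document."""
    return {
        "title": row["title"],
        "authors": [row["author"]],
        "text": row["fulltext"],
        "category": row["restaurant"],  # assumed to become the class later
    }


def read_documents(rows):
    """Drop rows without a restaurant value and keep only the document column."""
    kept = [row for row in rows if row.get("restaurant") not in (None, "")]
    return [{"document": to_document(row)} for row in kept]


table = read_documents(rows)
print(len(table), table[0]["document"]["category"])
```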

2.) Enrichment
Enrich documents with semantic information

Demo: Enrichment / Tagging
Apply POS Tagger node
Use Bag of Words node to inspect tagging result

3.) Preprocessing
Preprocess documents and filter words

Demo: Preprocessing
Filter
– Numbers
– Punctuation marks
– Stop Words
Convert to lower case
Stemming
Keep only nouns, verbs, adjectives
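
The preprocessing chain above can be approximated in a few lines of Python. This is a sketch under several assumptions: the input is already POS-tagged, the tags follow Penn Treebank conventions, the stop word list is a tiny invented sample, and the stemmer is a naive suffix stripper rather than a proper stemming algorithm.

```python
import string

STOP_WORDS = {"the", "a", "and", "is", "was", "were", "very"}  # tiny sample
KEEP_TAGS = {"NN", "NNS", "VB", "VBD", "JJ"}  # some noun/verb/adjective tags


def naive_stem(word):
    """Strip a few common English suffixes (much cruder than a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters to a list of (word, tag) pairs and return stems."""
    result = []
    for word, tag in tagged_terms:
        if word.isdigit():
            continue  # filter numbers
        if all(ch in string.punctuation for ch in word):
            continue  # filter punctuation marks
        lowered = word.lower()  # convert to lower case
        if lowered in STOP_WORDS:
            continue  # filter stop words
        if tag not in KEEP_TAGS:
            continue  # keep only nouns, verbs, adjectives
        result.append(naive_stem(lowered))
    return result


tagged = [("The", "DT"), ("pizzas", "NNS"), ("were", "VBD"), ("amazing", "JJ"),
          (",", ","), ("2", "CD"), ("waiters", "NNS"), ("smiled", "VBD"),
          ("!", ".")]
print(preprocess(tagged))
```

The crude stems ("amaz", "smil") show why the order of the steps matters: the tag filter relies on tags assigned before stemming changes the word forms.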

4.) Transformation
Creation of numerical representation of documents

Demo: Transformation
Transform to bag of words
Compute TF value for terms
Transform to document vectors
Extract category (class) value
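
As a worked illustration of the transformation steps, the Python sketch below computes a relative term frequency (term count divided by the number of terms in the document, one common definition of TF) and builds document vectors with the class kept in a separate list. The sample documents and function names are invented.

```python
from collections import Counter


def term_frequencies(terms):
    """Relative term frequency: count of a term / number of terms in the doc."""
    counts = Counter(terms)
    total = len(terms) or 1  # avoid division by zero for empty documents
    return {term: count / total for term, count in counts.items()}


def to_document_vectors(docs):
    """docs is a list of (terms, category) pairs.

    Returns the vocabulary (column order), one TF vector per document,
    and the extracted category (class) values.
    """
    tfs = [term_frequencies(terms) for terms, _ in docs]
    vocabulary = sorted({term for tf in tfs for term in tf})
    vectors = [[tf.get(term, 0.0) for term in vocabulary] for tf in tfs]
    classes = [category for _, category in docs]
    return vocabulary, vectors, classes


docs = [
    (["pizza", "pasta", "pizza", "good"], "Italian"),
    (["curry", "good"], "Indian"),
]
vocabulary, vectors, classes = to_document_vectors(docs)
print(vocabulary)
print(vectors)
print(classes)
```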

5.) Classification
Training of a model (decision tree) and scoring

Demo: Classification
Append color based on class
Partition data into training and test set
Train decision tree model on training data
Apply decision tree model to test data
Score model, measure accuracy
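
The partition, train, apply and score steps can be sketched in Python as follows. To stay short, the model here is a decision stump (a decision tree with a single split on whether one term occurs), which is a simplification of the decision tree used in the demo. The toy data, the 75/25 split and all function names are assumptions; the coloring step is omitted.

```python
import random
from collections import Counter


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority(labels, default):
    """Most frequent label, or `default` if there are no labels."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def predict(rule, vector):
    """Apply a (term, label_if_present, label_if_absent) rule to one vector."""
    term, if_present, if_absent = rule
    return if_present if vector.get(term, 0) > 0 else if_absent


def accuracy(rule, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(rule, vec) == label for vec, label in rows) / len(rows)


def train_stump(train):
    """Choose the term whose presence/absence best separates the classes."""
    overall = majority([label for _, label in train], None)
    best_rule, best_score = None, -1.0
    for term in sorted({t for vec, _ in train for t in vec}):
        present = [label for vec, label in train if vec.get(term, 0) > 0]
        absent = [label for vec, label in train if vec.get(term, 0) == 0]
        rule = (term, majority(present, overall), majority(absent, overall))
        score = accuracy(rule, train)
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule


# Toy TF vectors (term -> value) with their class; invented data.
rows = ([({"pizza": 0.5, "good": 0.5}, "Italian")] * 4
        + [({"curry": 0.5, "good": 0.5}, "Indian")] * 4)

train, test = partition(rows, train_fraction=0.75, seed=1)
model = train_stump(train)
print(model, accuracy(model, test))
```

Measuring accuracy on the held-out test set rather than on the training data is what makes the score an estimate of how the model behaves on unseen reviews.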

Additional Workflows
Clustering of documents
– Hierarchical clustering
– K-Means clustering
– PCA visualization
Unsupervised topic detection

Thank You
Questions
http://tech.knime.org/forum
Kilian.Thiel@knime.com
Follow us
Twitter: @KNIME
LinkedIn: https://www.linkedin.com/groups?gid=2212172
KNIME Blog: http://www.knime.org/blog
