SF Data Mining Meetup September 22, 2014 - Weebly

Transcription

9/23/2014

Text Analytics Tutorial
SF Data Mining Meetup
September 22, 2014
Kilian Thiel, Rosaria Silipo, Cathy Pearl
KNIME.com AG, Zurich
Copyright 2014 KNIME.com AG

Tool Installation
- Download open source KNIME analytics platform-sdk-download
- Select package for your OS and install
- Open the KNIME application
- In the top menu select “File” or “LOCAL” -> “Install KNIME Extensions”
- Install “KNIME & Extensions” and “KNIME Labs Extensions”

Install KNIME Extensions (incl. Text Processing)

Requirements to import and run Demo Workflows
- KNIME 2.10
- Text Processing Extension from KNIME Labs Extensions
- Distance Matrix from KNIME Extensions

Memory Tip
In file knime.ini set memory to the maximum available, e.g. -Xmx3G

Resources
- The KNIME Website (www.knime.org)
  – LEARNING HUB under RESOURCES (www.knime.org/learninghub)
  – Use Cases and White Papers for example workflows, and FORUM for questions and answers
  – DOCUMENTATION for documentation, FAQ, change-logs, ...
  – LABS for new developments and experimental nodes
  – COMMUNITY for development instructions and third party nodes
- Blog for news, tips and tricks (www.knime.org/blog)
- KNIME TV channel on YouTube
  – Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg
- KNIME on Twitter: @KNIME

Resources
eBooks from the KNIME Press: http://www.knime.org/knimepress
- KNIME Beginner’s Luck
- The KNIME Cookbook
- The KNIME Booklet for SAS Users
Free Beginner’s Guide – use Code “meetupsf14”

Text Processing Steps
1. Import Data
2. Enrichment
3. Pre-processing (Filtering, Stemming, ...)
4. Transformation (BoW, Frequencies, Document Vector)
5. Classification / Clustering

Import Demo Workflows
- Download zip file with demo workflows from meetup site
- Open the KNIME application
- In the top menu, select File -> Import KNIME Workflow...
- Enable option “Select Archive File”
- Browse to zip file
- Import all workflows and data into KNIME

Import Demo Workflows

Demo Workflows
- 0-TripAdvisorCrawling: importing data from web
- 1-Reading: Importing data from text, word, pdf, Twitter, XML, ...
- 2-Enrichment POS: String to Document and Word Tagging in Document
- 3-Preprocessing: Filtering and Stemming
- 4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector
Other workflows for multi-words, clustering, topic extraction, and reporting.

Demo: The KNIME Workbench

Text Processing Category

Demo: TripAdvisor Restaurant Data Set (SF)

Demo: TripAdvisor Data (SF Restaurants)
Reviews about Italian and Chinese restaurants in San Francisco
- Chinese: 272
- Italian: 268

Demo: Goal of this Tutorial
Goal: Build a classifier to distinguish between Chinese and Italian restaurants, based on the reviews.
Italian or Chinese Restaurant?

Demo: Final Workflow

1.) Reading
Read/Parse textual data

Demo: Reading
- Read TripAdvisor data (.table file)
- Filter rows with missing restaurant value
- Convert strings to documents
- Filter all but the document column
- Examples of other possible formats to import
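
Outside KNIME, the reading steps can be sketched in plain Python. This is only an illustration under assumptions: the row layout, the field names `restaurant`, `cuisine` and `review`, the sample rows, and the `Document` class are all made up here and do not reflect the actual TripAdvisor .table file or KNIME's document type.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a KNIME document cell."""
    title: str
    text: str
    category: str


def rows_to_documents(rows):
    """Drop rows with a missing restaurant value and keep only documents."""
    docs = []
    for i, row in enumerate(rows):
        if not row.get("restaurant"):  # missing or empty value: filter the row out
            continue
        docs.append(Document(title=f"review-{i}", text=row["review"], category=row["cuisine"]))
    return docs


# Made-up rows for illustration only.
rows = [
    {"restaurant": "Trattoria A", "cuisine": "Italian", "review": "Great pasta and wine."},
    {"restaurant": None, "cuisine": "Chinese", "review": "Row with a missing restaurant."},
    {"restaurant": "Dumpling B", "cuisine": "Chinese", "review": "Excellent dumplings."},
]
documents = rows_to_documents(rows)
print([d.title for d in documents])
```

Only the list of documents is returned, which mirrors the step of filtering out every column except the document column.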

0.) Web Crawler Workflow
Palladian Extension from: KNIME Community Contributions – Other

Demo: Reading
- Web Crawler Workflow to get data from the Web
- Palladian Community Contributions Extension
  – HtmlParser node
  – XPath node
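
The parse-then-query idea behind the HtmlParser and XPath nodes can be sketched with Python's standard library. The page fragment, the `class` names and the XPath expression below are invented for illustration; a real crawl would first need a parser that repairs HTML into well-formed XML, which this sketch simply assumes has already happened.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment (assumption: markup is already valid XML).
PAGE = """
<html><body>
  <div class="review"><p class="text">Best ravioli in town.</p></div>
  <div class="review"><p class="text">Crispy duck, friendly staff.</p></div>
  <div class="ad"><p class="text">Book a hotel now!</p></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# ElementTree supports a limited XPath subset, including attribute predicates.
reviews = [p.text for p in root.findall(".//div[@class='review']/p[@class='text']")]
print(reviews)
```

The attribute predicate keeps the two review paragraphs and skips the advertisement block.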

2.) Enrichment
Enrich documents with semantic information
This assigns a tag to each word:
- Grammar tags (POS)
- Context dependent tags
- Sentiment tags
- Named Entity tags
- Custom tags

Demo: Enrichment / Tagging
- Apply POS Tagger node
- Use Bag of Words node to inspect tagging result
- Show other possible Taggings
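
Tagging means attaching a label to each word. POS tagging needs a trained model, so the sketch below shows the simplest variant instead: a dictionary lookup that assigns custom sentiment tags. The dictionary entries, the tag names and the function `tag_words` are hypothetical and are not taken from the demo workflows.

```python
import re

# Hypothetical sentiment dictionary; entries and tag names are illustrative only.
SENTIMENT = {
    "delicious": "POSITIVE",
    "friendly": "POSITIVE",
    "bland": "NEGATIVE",
    "rude": "NEGATIVE",
}


def tag_words(text, dictionary):
    """Return (word, tag) pairs; words missing from the dictionary get tag None."""
    words = re.findall(r"[A-Za-z']+", text)
    return [(word, dictionary.get(word.lower())) for word in words]


tagged = tag_words("Delicious noodles but rude service", SENTIMENT)
print(tagged)
```

Untagged words stay in the output with tag `None`, so later steps can still filter on tags without losing the original word order.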

3.) Preprocessing
Preprocess documents and filter words

Demo: Preprocessing
- Filter
  – Numbers
  – Punctuation marks
  – Stop Words
- Convert to lower case
- Stemming (Snowball stemmer because of the many languages associated with it)
- Keep only nouns (NN), verbs (VB), adjectives (JJ)
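
The preprocessing chain can be approximated in a few lines of Python. Everything here is a simplification under assumptions: the stop word list is a tiny made-up sample, `naive_stem` is crude suffix stripping rather than the Snowball stemmer, and the POS-based filter (NN, VB, JJ) is left out because it needs tagged input. Lower-casing is done first only so that the lower-case stop list matches.

```python
import re

# Tiny illustrative stop word list (assumption, not KNIME's built-in list).
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "of", "in", "to"}


def naive_stem(word):
    """Crude suffix stripping; NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = text.lower()                    # convert to lower case
    text = re.sub(r"\d+", " ", text)       # filter numbers
    text = re.sub(r"[^\w\s]", " ", text)   # filter punctuation marks
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # filter stop words
    return [naive_stem(t) for t in tokens]


terms = preprocess("The 2 waiters were serving noodles, and the noodles tasted great!")
print(terms)
```

The stems are not always dictionary words ("noodles" becomes "noodl"); this is normal for stemmers, since the goal is only to map word variants to one common form.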

4.) Transformation
Creation of numerical representation of documents
- BoW creates the list of words for each document
- TF calculates word frequencies (absolute or relative) in each document

Demo: Transformation
- Transform to bag of words
- Compute TF value for terms
  – $\mathrm{TF}_{\mathrm{rel}}(\text{word}) = n(\text{word}) / N$
  – $\mathrm{IDF}(\text{word}) = \log\bigl(1 + n(\text{docs}) / n(\text{word}, \text{docs})\bigr)$
  – $\mathrm{TF}_{\mathrm{rel}}(\text{word}) \cdot \mathrm{IDF}(\text{word})$ is used often
  – $\mathrm{ICF}(\text{word}) = \log\bigl(1 + n(\text{cat}) / n(\text{word}, \text{cat})\bigr)$
- Sort output data by frequency
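
A small worked example may help with the frequency measures. It reads the slide's formulas as TFrel(word) = n(word)/N, with N the number of terms in the document, and IDF(word) = log(1 + n(docs)/n(word, docs)). The three documents are made up, and the use of the natural logarithm is an assumption.

```python
import math
from collections import Counter

# Made-up, already preprocessed documents (lists of terms).
docs = [
    ["pasta", "wine", "pasta"],
    ["noodl", "tea"],
    ["pasta", "tea", "noodl", "noodl"],
]


def tf_rel(term, doc):
    """Relative term frequency: occurrences of the term divided by document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log(1 + n(docs) / n(word, docs)); assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(1 + len(docs) / containing)


for term in sorted(set(docs[0])):
    tf = tf_rel(term, docs[0])
    print(term, round(tf, 3), round(idf(term, docs), 3), round(tf * idf(term, docs), 3))
```

In the first document "pasta" has the higher relative frequency, while "wine" has the higher IDF because it appears in only one of the three documents.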

4.) Transformation
Creation of numerical representation of documents

Demo: Transformation
- Transform to document vectors
- Extract category (class) value
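
A document vector has one position per vocabulary term. The sketch below builds binary vectors (1 if the term occurs in the document, 0 otherwise) and keeps the category as a separate class column. The labeled term lists are made up, and a binary encoding is only one possible choice; frequencies could be used as vector values instead.

```python
def document_vectors(docs):
    """Map each term list to a 0/1 vector over the sorted vocabulary."""
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = [[1 if term in doc else 0 for term in vocabulary] for doc in docs]
    return vocabulary, vectors


# Made-up (terms, category) pairs standing in for preprocessed reviews.
labeled = [
    (["pasta", "wine"], "Italian"),
    (["noodl", "tea"], "Chinese"),
    (["pasta", "tea"], "Italian"),
]
vocabulary, vectors = document_vectors([terms for terms, _ in labeled])
classes = [category for _, category in labeled]  # extracted class column

print(vocabulary)
for vector, label in zip(vectors, classes):
    print(vector, label)
```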

5.) Classification
Back to classical Data Analytics:
Training of a model (decision tree) and scoring

Demo: Classification
- Append color based on class
- Partition data into training and test set
- Train decision tree model on training data
- Apply decision tree model on test data
- Score model, measure accuracy
- Show cross-validation loop
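
The partition, train, apply and score sequence can be illustrated with a decision stump, that is, a decision tree with a single split. This is a deliberately reduced stand-in and not KNIME's decision tree learner. The binary document vectors, the 75/25 split and the fixed random seed are assumptions, and the toy data is separable by construction, so the resulting accuracy says nothing about the real reviews.

```python
import random


def majority(labels):
    """Most frequent label; ties are broken alphabetically for repeatability."""
    return max(sorted(set(labels)), key=labels.count)


def train_stump(X, y):
    """Pick the single binary feature whose split gives the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        sides = {}
        for value in (0, 1):
            labels = [lab for row, lab in zip(X, y) if row[j] == value]
            sides[value] = majority(labels) if labels else majority(y)
        errors = sum(sides[row[j]] != lab for row, lab in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, j, sides)
    _, feature, sides = best
    return feature, sides


def predict(model, row):
    feature, sides = model
    return sides[row[feature]]


# Made-up binary document vectors over the vocabulary [noodl, pasta, tea, wine].
X = [[0, 1, 0, 1], [0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 1],
     [1, 0, 1, 0], [1, 0, 0, 0], [1, 0, 1, 0], [1, 0, 1, 1]]
y = ["Italian"] * 4 + ["Chinese"] * 4

rng = random.Random(0)            # fixed seed so the partition is repeatable
indices = list(range(len(X)))
rng.shuffle(indices)
split = int(0.75 * len(indices))  # 75% training, 25% test (arbitrary choice)
train, test = indices[:split], indices[split:]

model = train_stump([X[i] for i in train], [y[i] for i in train])
predictions = [predict(model, X[i]) for i in test]
accuracy = sum(p == y[i] for p, i in zip(predictions, test)) / len(test)
print("split feature index:", model[0], "accuracy:", accuracy)
```

A cross-validation loop would repeat the same train and score steps over several different partitions and average the accuracies.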

Additional Workflows
- Multi Word Tagging
  – Detection of frequent Ngrams (Ngram Creator)
  – Creation of dictionary from Ngrams
  – Applying Dictionary Tagger
- Classification with Multi Words
- Clustering of documents
  – hierarchical clustering based on distance matrix
- Topic Extraction
  – Topic Extractor (Parallel LDA)

Thank You
Questions
- http://tech.knime.org/forum
- Rosaria.Silipo@knime.com
Follow us
- Twitter: @KNIME
- LinkedIn: https://www.linkedin.com/groups?gid=2212172
- KNIME Blog: http://www.knime.org/blog
