Big Data and Sentiment Analysis using KNIME: Online Reviews vs. Social Media

Ana Mihanović, Hrvoje Gabelica, Živko Krstić
Poslovna inteligencija d.o.o., Zagreb, Croatia
{ana.mihanovic, hrvoje.gabelica, zivko.krstic}@inteligencija.com

Abstract - Text analytics and sentiment analysis can help an organization derive potentially valuable business insights from text-based content such as word documents, email and postings on social media streams like Facebook, Twitter and LinkedIn. The system described here analyses opinions about various gadgets collected from two different sources and in two different forms: online reviews and Twitter posts (tweets). Sentiment analysis can be applied to online reviews in an easier and more detailed way than to tweets. Namely, online reviews are written in a clear and grammatically more accurate form, while tweets often use internet slang, sarcasm and allegory. This paper explains the methods of data collection and the sentiment analysis process for online reviews and tweets using KNIME, and gives an overview of the differences and analysis possibilities in sentiment analysis for both data sources.

I. INTRODUCTION

Text mining or sentiment analysis [1] is the analysis of data contained in natural language text, and deals with the computation of opinion, sentiment and subjectivity in text. Sentiment analysis refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from text documents. The basic task of sentiment analysis is to determine the polarity of a given text.

Dictionary building and the sentiment analysis process are carried out in KNIME [2], a user-friendly graphical workbench that covers the entire analysis process. KNIME uses six different steps to process texts: reading and parsing documents, named entity recognition, filtering and manipulation, word counting and keyword extraction, transformation, and visualization.
The following workflows and tasks are developed and executed using KNIME:
- Retrieving data from the database
- Dictionary development and implementation
- Review scoring

II. DATA COLLECTION

In gathering online reviews and tweets about gadgets, the focus was set on a few gadget manufacturers such as Apple, Samsung, Nokia, Nexus and LG, and on the webpages that contain online reviews about gadgets. The total number of online reviews and tweets collected during a few hours for the purposes of this paper was:
- 812176 tweets
- 419624 online reviews

The system presented here handles crawling, extracting gadget reviews and storing them for analysis. The collected unstructured text is prepared for text mining and sentiment analysis. The system gives analysis results for every single review or tweet.

Online reviews were crawled from webpages using the Apache Nutch [3] crawler, a highly extensible and scalable open source web crawler which traverses a web site starting from a given set of URLs and follows the links matching a given pattern to a certain depth. Tweets were collected with an in-house developed Java package used for streaming posts from Twitter. Both online reviews and tweets were collected within a few hours. Since tweets are much shorter than online reviews and collecting these posts takes less time than crawling online reviews, the number of tweets is much bigger than the number of crawled online reviews.

Crawled online reviews and tweets are stored in HBase tables on an Apache Hadoop [4] server. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers, and it is popular for developing large-scale data-intensive applications. Hadoop is used as the storage environment, and an HBase table, as a component of Hadoop, is used to store information about gadgets and gadget reviews.
HBase [5] is a non-relational, distributed database written in Java which efficiently holds unstructured data in large tables and supports concurrent, random access. Hadoop and HBase were used because of their main capabilities: storage of large quantities of unstructured, textual data and the ability to read and write data in real time.

III. DATASET DESCRIPTION

The data collected into HBase tables differs slightly between online reviews and tweets.

Online reviews are described with the following attributes and structured in the following way:

- Key - unique key for each gadget review, created before the review was stored in the database
- PID (Product group ID) - name of the gadget, such as Samsung Galaxy s4 or iPhone 5
- Review Date - date when the online review was stored in the HBase table
- Review text - text of the review about a specific gadget
- Lang - language of the review text

Tweets are described with the following attributes and structured in the following way:
- Key - unique key for each tweet, created before the tweet was stored in the database
- userScreenName - username of the Twitter user
- CreationDate - date when the tweet was stored in the HBase table
- text - tweet text about a specific gadget
- keyword - name of the gadget, such as Samsung Galaxy s4 or iPhone 5, analogous to the PID field in the online reviews table
- Lang - language of the tweet

The Key and PID values (keyword value for tweets) allow aggregated analysis on the level of each review or tweet, or on the level of a specific gadget. Date values allow aggregated analysis for a defined period of time.

The described data is loaded into KNIME with the HBase Reader node and processed. In this phase, only online reviews and tweets in English were considered. The language of the text is set to English and all texts with a different language value are filtered out, because an English dictionary applied to reviews and posts written in other languages would not give results.

IV. DICTIONARY BUILDING

A. Online reviews

Online reviews are analyzed on the category level. Seven different categories are introduced for the gadget reviews:
- Accessibility/usage - ease of usage, availability of applications and software
- Value for money - whether the price of the product is reasonable and justified by its quality
- Content/composition - materials that the product is made of, all hardware components, accessories (handset, bluetooth handset, keypad, etc.)
- Quality - quality of the materials the product is made of, quality of hardware components, manufacturer quality
- User experience - experience of the user when buying and using the product (issues, would recommend, wouldn't recommend, would buy again, etc.)
- Look/appearance - physical appearance of the product (color, design, size, weight, etc.)
- Service/Support - seller's attitude and kindness, service center help and support, warranty of the product

Service/support is the category with the smallest number of recognized phrases, and for that reason it is not included in chart visualizations. Redefinition of this category is being considered.

Dictionary building for detailed sentiment analysis implies making an initial list of adjectives and nouns which are normally used when describing a specific product. The initial list is collected by analyzing existing opinions and reviews and manually extracting the words which appear most often in those reviews and opinions, or those words that seem to be most important. The grade scope is defined from 0 to 5 (0 meaning extremely dissatisfied, and 5 meaning extremely satisfied). Null means no rating.

Since terms and phrases extracted from reviews need to have a category and a grade, nouns are the holders of categories, and adjectives are the holders of grades. That means that every noun on the list has to be categorized into one of the seven categories, and every adjective on the list has to be given one grade.
Once the initial list of nouns and adjectives has been collected, the nouns and adjectives are used in regular expressions which bind them together and recognize complete phrases in reviews, consisting of all combinations of adjectives and nouns that are in the list. Regular expressions are composed for the most used grammatical forms in the English language. The regular expressions used for dictionary development are:

- Adjective noun and negations ("not" adjective noun)
  This form is used to detect the simplest phrases of the English language, such as "amazing battery", "bad screen" or "great apps". For negation detection, regular expressions are written in the form "not" adjective noun.
- Noun "to be" adjective and negations (noun "not to be" adjective)
  With these forms of regular expressions, more complicated phrases are detected, such as "battery doesn't last", "screen is great", or "camera is good". To make phrase detection more accurate and successful, all forms of the verb "to be" must be included, including grammatical mistakes which people are likely to make.
- Noun intensifier/downtoner adjective and negations (noun "not to be" intensifier/downtoner adjective)
  Intensifiers or amplifiers are words in English that strengthen the word they modify and include such words as "completely", "totally", "undoubtedly", "absolutely", etc. Downtoners are words that weaken the word they modify and include such words as "kind of", "not so much", "sort of", and so on.

Another important feature is to correctly grade phrases that contain negations and/or modifiers, because these types of words change the total grade and meaning of a phrase. If a phrase contains any of the defined words, the grade is calculated using different "if / else if" rules. These exceptions are handled as follows:

- If the phrase contains a negation, the grade is reversed and replaced by its absolute value (e.g. "camera is great": 4, "camera isn't great": 1).
- If the phrase contains an intensifier, the grade is increased by one (e.g. "good processor": 3, "really good processor": 4). If the phrase contains a negation along with an intensifier, the absolute value of the reversed grade is calculated (e.g. "not really good processor": 1).
- If the phrase contains a downtoner, the grade is decreased by one (e.g. "display is good": 3, "display is merely good": 2). Negations used along with downtoners are rare and are therefore not analyzed.

B. Twitter posts

V. DATA SCORING

Sentiment analysis of online reviews and tweets differs greatly. Online reviews are easier to analyze objectively because of their clearer written form and larger amount of meaningful sentiment, but they have to be analyzed on a more detailed level than tweets.

Tweets often contain internet slang, sarcasm and allegory. It could be said that grading online reviews is easier, but the dictionary is more complicated to make, and that grading tweets is harder, but the dictionary is easier to make.

Online reviews are analyzed on the phrase and category level, giving each phrase a grade for one of the seven categories. Tweets are analyzed on the word level, giving a positive or negative grade for each term. Text analytics of online reviews is accomplished simply with phrase counters and mean calculations, while analytics of tweets is frequency-driven.

A large quantity of reviews or posts is loaded into KNIME to be graded.
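The phrase forms and the negation, intensifier and downtoner rules from the dictionary-building section can be sketched in Python as follows. This is a minimal illustration and not the authors' KNIME workflow: the word lists, the category assignments and the function names are hypothetical, and reading "reversed" as |grade - 5| and clamping grades to the 0 to 5 range are assumptions chosen so that the paper's own examples are reproduced.

```python
import re

# Hypothetical miniature dictionary; the paper's real word lists are not shown.
ADJECTIVE_GRADES = {"great": 4, "good": 3, "bad": 1}    # adjectives hold grades
NOUN_CATEGORIES = {"camera": "Quality",                 # nouns hold categories
                   "processor": "Quality",
                   "display": "Quality"}
INTENSIFIERS = ["really", "completely", "totally", "absolutely"]
DOWNTONERS = ["merely", "kind of", "sort of"]


def alternation(words):
    """Build a regex alternation, longest entries first."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


NOUNS = alternation(NOUN_CATEGORIES)
ADJS = alternation(ADJECTIVE_GRADES)
MODS = alternation(INTENSIFIERS + DOWNTONERS)

PATTERNS = [
    # ("not") (modifier) adjective noun, e.g. "not really good processor"
    re.compile(rf"\b(?P<neg>not\s+)?(?:(?P<mod>{MODS})\s+)?"
               rf"(?P<adj>{ADJS})\s+(?P<noun>{NOUNS})\b", re.IGNORECASE),
    # noun "to be" ("not") (modifier) adjective, e.g. "camera isn't great"
    re.compile(rf"\b(?P<noun>{NOUNS})\s+(?:is|are|was|were)"
               rf"(?P<neg>n['’]t|\s+not)?\s+(?:(?P<mod>{MODS})\s+)?"
               rf"(?P<adj>{ADJS})\b", re.IGNORECASE),
]


def adjust_grade(base, negated=False, modifier=None):
    """Apply the intensifier/downtoner rule first, then the negation rule."""
    grade = base
    if modifier in INTENSIFIERS:
        grade = min(5, grade + 1)      # clamping to 5 is an assumption
    elif modifier in DOWNTONERS:
        grade = max(0, grade - 1)      # clamping to 0 is an assumption
    if negated:
        grade = abs(grade - 5)         # assumed reading of "reversed"
    return grade


def grade_phrases(text):
    """Return (phrase, category, grade) for every recognized phrase."""
    results = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            modifier = m.group("mod").lower() if m.group("mod") else None
            base = ADJECTIVE_GRADES[m.group("adj").lower()]
            category = NOUN_CATEGORIES[m.group("noun").lower()]
            grade = adjust_grade(base, bool(m.group("neg")), modifier)
            results.append((m.group(0), category, grade))
    return results
```

For example, `grade_phrases("camera isn't great")` yields one phrase in the hypothetical "Quality" category with grade 1, matching the example given for negations.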
Two separate workflows are built, one for grading online reviews based on a grade and category, and the other for positive-negative grading.

The dictionary built for sentiment analysis of tweets consists of key words graded only as positive or negative. Scoring or sentiment analysis of tweets is done on the positive-negative level, because tweets don't contain clear phrases such as those that can be found in online reviews. Therefore, tweets cannot be categorized into numerous categories as online reviews can, and they need to be analyzed on the word level, giving each word positive or negative polarity.

A. Online Reviews Grading

The dictionary built for sentiment analysis of social media posts consists of the most used words in those posts. Because of the different structure of online reviews and tweets, the dictionary needs to include jargon words, internet slang, smiley icons, abbreviations, hashtags and similar. These words and symbols are of great importance, since tweets are usually not rich in useful phrases and terms. Examples of such words included in the dictionary are: #loveit, #horror, :-(, :-*, OMG, JK and similar.

Online reviews are processed and graded with the term presence method [7], rather than the term frequency method. The term presence method gives a binary value which simply indicates whether the term or phrase occurs in the text (value 1) or not (value 0). Every term or phrase also has a grade and a category attached, and the binary values of terms are summed on the level of each review, giving term counters and a grade sum for each review.

For this task, the publicly available MPQA subjectivity lexicon for recognizing contextual polarity [6] was used as a starting point, and was expanded with the Twitter-specific words mentioned above. The existing dictionary of approximately 8000 words is expanded to fit the needs of gadget analysis: an initial portion of tweets is collected and separated into single words with Bag of Words processing.
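Word-level positive-negative grading of tweets, as described above, can be illustrated with a short Python sketch. The dictionary entries and their polarities, the whitespace tokenizer and the function name are assumptions made for illustration; the dictionary actually used is the expanded MPQA lexicon, which is not reproduced here.

```python
from collections import Counter

# Illustrative polarity dictionary: +1 is positive, -1 is negative.
# Entries and polarities are assumptions, not the paper's dictionary.
POLARITY = {
    "great": 1, "love": 1, "#loveit": 1, ":-*": 1,
    "terrible": -1, "broken": -1, "#horror": -1, ":-(": -1,
}


def score_tweet(text):
    """Count positive and negative dictionary words in one tweet."""
    counts = Counter(positive=0, negative=0)
    for token in text.lower().split():
        # Keep smileys and hashtags intact; strip trailing punctuation otherwise.
        word = token if token in POLARITY else token.strip(".,!?")
        polarity = POLARITY.get(word)
        if polarity == 1:
            counts["positive"] += 1
        elif polarity == -1:
            counts["negative"] += 1
    return dict(counts)
```

Because the analysis is frequency-driven, the sketch returns raw counts per polarity and leaves any aggregation per gadget or per time period to a later step.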
Unnecessary words such as symbols or web URLs are filtered out, and all useful, social media specific words are graded and added to the dictionary.

The first input of the sentiment analysis workflow developed in KNIME is the online reviews read from the database, and the second, parallel input is the dictionary made of phrases recognized with regular expressions. Phrases are read from the dictionary file and tagged in the review texts using the Dictionary Tagger node.

The sum of grades for every category is divided by the phrase counter for that category, giving the final grade for each online review and for each category. The results were written back to the Hadoop database (Fig. 1), and as a result of Hadoop aggregation, the average grade for every category on the level of a single gadget is calculated.

[Fig. 1: table of graded reviews with the columns ReviewText, ValueForMoney, ContentComposition, Quality, AccessibilityUsage and LookAppearance; the cell contents are not recoverable from the transcription.]
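The term presence scoring and the per-category mean described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the phrase dictionary is hypothetical, plain substring matching stands in for the Dictionary Tagger node, and the function name is not taken from the paper.

```python
from collections import defaultdict

# Hypothetical phrase dictionary: phrase -> (category, grade).
PHRASES = {
    "great camera": ("Quality", 4),
    "bad screen": ("Quality", 1),
    "great apps": ("Accessibility/usage", 4),
}


def grade_review(text):
    """Return the mean grade per category for one review (term presence)."""
    text = text.lower()
    counters = defaultdict(int)   # number of distinct phrases found per category
    sums = defaultdict(int)       # sum of their grades per category
    for phrase, (category, grade) in PHRASES.items():
        present = 1 if phrase in text else 0   # presence, not frequency
        counters[category] += present
        sums[category] += present * grade
    # Categories with no recognized phrase get no grade (no rating).
    return {c: sums[c] / counters[c] for c in counters if counters[c] > 0}
```

A phrase that is repeated within one review still contributes only once, which is the difference between term presence and term frequency.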
