Sentiment Analysis Using KNIME: A Systematic Literature Review Of Big .

Transcription

View metadata, citation and similar papers at core.ac.ukbrought to you byCOREprovided by Loughborough University Institutional RepositorySentiment Analysis using KNIME: aSystematic Literature Review of Big DataLogisticsGary Graham1, Royston Meriton, Patrick HennellyAbstract - Text analytics and sentiment analysiscan help researchers to derive potentiallyvaluable thematic and narrative insights fromtext-based content such as industry reviews,leading OM and OR journal articles andgovernment reports. The classification systemdescribed here analyses the opinions of theperformance of various public and private,manufacturing, medical, service and retailorganizations in integrating big data into theirlogistics. It explains methods of data collectionand the sentiment analysis process forclassifying big data logistics literature usingKNIME. Finally, it then gives an overview of thedifferences and explores future possibilities insentiment analysis for investigating differentindustrial sectors and data sources.1. INTRODUCTIONBig data logistics can be defined as the modellingand analysis of (urban) transport and distributionsystems through large data sets created by GPS,cell phone and transactional data of companyoperations, combined with human generatedactivity (i.e. social media, public transport) [1]. Thedemands and requirements are literally changing ona daily basis with the innovation in technologieswith smart computing and big data. All types oforganization whose logistics operation functions ina big data environment will have to adapt tochanging customer demands. At the same time theywill need to exploit the availability of big datatechnology to improve their process andoperational capabilities.Big data requires firms to have more technicaland technological supports to handle the five V’s ofBig Data and analytics that is “Volume”, “Variety”,Veracity”, “Value” and “Velocity” [2]. However,with the growth of big data there is privacysurveillance and data misuse challenges [3].Organizations also face challenges around quality,comprehensiveness, collection and the analysis ofdata from various sources. Furthermore, big dataalso needs to be robust, accessible, andinterpretable if it is to provide organizations withmeaningful opportunities and solutions.The purpose of this paper therefore is to explorethe risks and challenges of the firm implementing“big data logistics” into their operations. Secondly,to investigate the opportunities that big dataprovides the firm to improve their logisticsperformance. This will be achieved through the textprocessing of 552 records containing industryreviews, leading OM and OR journal articles andgovernment reports. We will analyse the opinionsof the performance of various public and private,manufacturing, medical, service and retailorganizations in integrating big data (analytics) intotheir logistics.II. DATA MINING TOOLSData mining is the process of discoveringinteresting knowledge from large amounts of datastored in databases, data warehouses, or otherinformation repositories. Data mining has manyapplication fields such as marketing, business,science and engineering, economics, games andbioinformatics.Text mining or sentiment analysis [4] is theanalysis of data contained in a natural languagetext, which deals with the computation of opinion,sentiment and subjectivity in text. Sentimentanalysis refers to the use of natural languageprocessing, text analysis and computationallinguistics to identify and extract subjectiveinformation from the text documents. The basictask of sentiment analysis is to determine thepolarity of a given texts.Currently, many data mining and knowledgediscovery tools and software are available for everyone and different usage such as the WaikatoEnvironment for Knowledge Analysis (WEKA) [5]RapidMiner [4] and Orange [6]. These tools andsoftware1 provide a set of methods and algorithmsthat help in the better utilization of data andinformation available to users; including methodsand algorithms for data analysis, cluster analysis,genetic algorithms, nearest neighbour, datavisualization, regression analysis, decision trees,predictive analytics and, text mining.Text mining tools, are not without theirlimitations. They tend to use a relatively simpledictionary approach to identify associations. Thismeans they cannot identify novel or newly namedbig data science phrases and terms. Another1Corresponding author, Leeds University BusinessSchool, Maurice Keyworth Building, University ofLeeds, Leeds, LS2 9JT. T: 44 (0) 113 343 8557, E:g.graham@leeds.ac.uk.1Jovic et al [8] provide a detailed review through acomparative study of diverse collection of data miningtools.

limitation lies in its inability to extract context ormeaning from sentences or terms. Methods that useartificial intelligence (AI), word context or machinelearning (ML) methods could potentially improvethe current term identification system [10]. In spiteof these limitations we suggest methodologicallythat text mining is a useful starting point inbuilding a neutral coding system for theoreticalframing and concept development [11].The tasks of dictionary making and sentimentanalysis are done in this paper by the means ofKNIME [7], which we found to be a user-friendlygraphical workbench capable of entire analysisprocess. KNIME uses six different steps to processtexts: reading and parsing documents, named entityrecognition, filtering and manipulation, wordcounting and keyword extraction, transformationand visualization. The choice of KNIME camebecause of its straightforward graphical userinterface for data pre-processing, its intuitive dataflow user interface and from its powerful datamining elements. However, there are also severallimitations including its high memoryrequirements, the complexity of Lab nodes and itssyncretic analysis of text as there are limitedsemantic and pragmatic analysis functionsavailable.III. KNIME METHODThe KNIME text processing feature was designedand developed to read and process textual data, andtransform it into numerical data (document andterm vectors) in order to apply regular KNIME datamining nodes (for classification and clustering).This feature allows for the parsing of textsavailable in various formats (here we used .csv) asKNIME data cells stored in a data table. It is thenpossible to recognise and tag different kinds ofnamed entities such as with positive and negativesentiment, thus enrichening the documentssemantically.Furthermore, documents can be filtered (e.g. bythe stop word or named entity filters), stemmed bystemmers for various languages pre-processed inmany other ways. Frequencies of words can becomputed, keywords extracted and documents canbe visualised (e.g. tag clouds). To apply regularKNIME nodes to cluster or classify documentsaccording to their sentiment, they can betransformed into numerical vectors.searchable fields and several document types thatpermit the user to easily narrow their searching.Both databases sort the results by parameters suchas; first author, cites, relevance and etc. The RefineResults section in both databases allows the user toquickly limit or exclude results by author, source,year, subject area, document type, institutions,countries, funding agencies and languages. Theresulting documents provide a citation, abstract,and references at a minimum. Results may beprinted, e-mailed, or exported to a citationmanager. The results may also be reorganizedaccording to the needs of the researcher by simplyclicking on the headings of each column. Oursearch of “big data logistics” documents resulted in552 records being retrieved from a ten year periodfrom 2006 to 2016.The described data was then loaded intoKNIME with the File Reader node and processed.In this phase, only records in English languagewere collected. Language of the text is set toEnglish and all texts that have different languagevalues are filtered out, because English dictionaryapplied on reviews and posts written in otherlanguages would not give results.V. DICTIONARY BUILDINGDictionary built for sentiment analysis of thephrase “big data” as it is used with respect to theterm “logistics” was graded only as positive ornegative. Scoring or sentiment analysis of thephrase “big data logistics” is done on the positivenegative level, therefore the phrase was analysedon the word level, giving each word associatedwith it a positive or negative polarity. For instance,efficiency would be scored positive whilst riskswould be scored negatively.For this task, publicly available MPQAsubjectivity lexicon was used as a starting point forrecognizing contextual polarity [7], this wasexpanded with a big data vocabulary built from theauthors previous papers [3]. The existing dictionarycontaining of approximately 8000 words isexpanded to fit the needs for sentiment analysis in away that initial portion of sentences are collected,which are separated into single words with Bag ofWords processing. Unnecessary words such assymbols or web URLs are filtered out, and alluseful, big data specific words are graded andadded to the dictionary. For instance, “veracity”,“value”, “volume”, “variety” and “velocity”.IV. DATA COLLECTIONVI. DATA SCORINGWOS and Scopus are powerful databases whichprovide different searching and browsing options[9]. The search options in both databases are theStandard Basic and Advanced. There are differentThe records were analysed on the word level givinga positive or negative grade for a term connected toeach phrase. Whilst text analytics of documents isusually accomplished simply with phrases counters

and mean calculations, our analytics of isfrequency-driven. Two separate work flows weretherefore built, one for calculating frequency basedon a grade and category, and other one for positivenegative (sentiment) grading.Total score for each word is given bymultiplying TF*IDF value with attitude of a term.Attitude can have one of three values depending onthe word polarity; -1 for word with negativepolarity, 1 for word with positive polarity and 0for neutral words.Final weights, which now represent attitude ofeach document , are grouped on the level ofdocument and binned into three bins to give one ofthree final results for each term; positive, negativeor neutral (Fig. 3).Figure 2 Big data logistics sentimentsA. Big Data Record GradingTF*IDF (Term Frequency*Inverse DocumentFrequency) [7] method assigns non-binary weightsrelated on a number of occurrences of a word.Weighting exploits counts from a backgroundcorpus, which is a large collection of documents;the background corpus serves as indication of howoften a word may be expected to appear in anarbitrary text. TF*IDF calculation determines howrelevant a given word is in a particular document.Besides term frequency 𝑓𝑤, 𝑑 which equals thenumber of times word 𝑤 appears in a document,size of the corpus 𝐷 is also needed. Given adocument collection, a word 𝑤 and an individualdocument 𝑑 𝜖 𝐷, TF*IDF value can be calculated:T𝐹 𝐼𝐷𝐹𝑤 𝑓𝑤, 𝑑 log𝐷𝑓 𝑤𝑑(1)Figure 3 TF-IDF ProcessingVII RESULTSTag clouds were initially used to visualise ourinitial findings. A simple tag cloud presented inFigure 4 gives the most used words in the positive(left hand cloud) and negative used words (righthand cloud).

Figure 4 Tag clouds of positive/negative sentimentThe attitudes towards big data were classified as“positive”, “neutral” and “negative”. Neutralgrades can be avoided, and we accomplished thisby removing grade bins and removing a bin forneutral grade. The positive and negative gradeswere aggregated for all terms associated with bigdata. In Figure 5 it can be seen that sentiments arefar more positive (245) than negative re 6 Most occurring wordsThen using the TF*IDF decision treelearner/predictor approach we tested the accuracyof the classification system (that we had adopted indifferentiating the big data logistics sentiments).Our results are presented in Figure 7.ClassificationAnalyticsFigure 5 Aggregated sentimentsVIII Classification ExperimentIn order to test the validity of the TF*IDFclassification model we ran a prototype experimentwith the ten most common words extracted (i.e.those with the highest TF*IDF scores) (see Figure6 below).UnspecifiederrorsTruePo FalsePo TrueNe FalseNe False No Recall PrecisionSensitivitySpecificityF Measure Accuracy isionSensitivitySpecificityF MeasureAccuracyCohen's 50.97940.37090.87790.09610.096SDSkew Kourtsis4.871 5.9587 35.73221.9759 6.4517 41.84156.708 -5.9538 37.19360.8436 0.6156 -0.81090.2497 3.7281 12.7170.09110.2479 3.7281 12.7170.1116 -6.0956 38.40340.120500Figure 7 Classification accuracyOur model shows a predictive accuracy of 88%in classifying the textual data. We then tested usingthe hierarchical classification function in Knime theability of the classification model to deal with theaddition of features. From Figure 8 we can see byfeature 4 that the model peaked at 100% accuracyand then maintained this level of accuracy asfeatures kept being added to it.

enable more robust and accurate modeldevelopment.For practitioners, the advantages of big data tooperations is identified to be: “optimization”,“intelligence”, “flexible” and “efficiency”. Whilstthe risks and challenges to operations from big dataare identified as: “security”, “culture”,“inefficient”, “waste”.Figure 8 Features accuracySo this initial test prototype of the model seemsto have a high degree of accuracy and validity indealing with sentiment classification. However, thisis only a prototype of the decision model, so morerobust testing will be needed in the future.Specifically, this will provide more stringentMPLA testing for variance.More in-depth analysis and more discretemodelling are clearly needed to assist in theimplementation of big data initiatives [2]. Some ofthe changes that operations and their connectedlogistics face are revolutionary and this requirescareful consideration from both a practical andtheoretical point of view. The description of thenew models and the rich context in which thesenew models are embedded will provide a deepinsight to researchers and practitioners in exploringsimilar opportunities and challenges in their owndomains.VIX CONCLUSIONSIn this paper, we studied the classification ofopinions towards “big data logistics”. Wepresented a novel approach to extracting key wordsand predicting “positive” and “negative”sentiments. We proved the validity of ourapproach by examining different classifiers thatutilized twenty features extracted from the TF*IDFprocessing [7].However this model is only a prototype tohighlight the text processing potential of KNIME.In the future, we intend to build comparisonsbetween a range of industrial and retailing sectors.We see the role of KNIME potentially as animportant mediating step in the framing andbuilding of theoretical frameworks. Furthermore itcould be adopted to build much more grounded andunbiased coding systems of qualitative data.Theoretically, it is evident from our initial textprocessing of the big data literature, that our workconfirms that of Foss Wamba et al., [2] andMehmood et al., [3]. We can confirm there is anevolution taking place from strategic analysis tooperational implementation.Thematic patterns and framework categoriesneed building from the extracted terms. Then,linkages and co-occurrences need exploring toestablish a grounded approach for building theoryfrom KNIME and other data mining tools [4]. Aswell as the positive sentiments that dominate thebig data modelling landscape, theoreticians need tofactor in more negative and risk constructs toReferences1. Blanco, E.E. and Fransoo, J.C., (2013)“Reaching 50 million nanostores: retail distributionin emerging megacities”. TUE WorkingPaper –404. January. Available /wp404.pdf.2. Fosso Wamba, S, Akter, S., Edwards, A.,Chopin, G., and Gnanzou, D. (2015). How “bigdata‟ can make big impact: Findings from asystematic review and a longitudinal case study.International Journal of Production Economics, 34,2, pp. 77-84.3. Mehmood, R., Meriton, R., Graham, G.,Hennelly, P., Kumar, M. (2016) Exploring theinfluence of big data on city transport operations: aMarkovian approach. International Journal ofOperations and Production Management.Forthcoming.4. Hofmann, M., Klinkenberg, R. (2013)RapidMiner: Data Mining Use Cases and BusinessAnalytics Applications, Boca Raton: CRC Press.5. Hall, M., Frank, E., Holmes, G., Pfahringer, B.,Reutemann, P., and Witten, I. H. The WEKA datamining software: an update, SIGKDDExplorations, vol. 11, no. 1, pp. 10–18, 2009.6. Demšar, J., Curk, T., and Erjavec, A. (2013)“Orange: data mining toolbox in Python,” Journalof Machine Learning Research, vol. 14, pp.2349 2353, 2013.7. Berthold, M. R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T. and Meinl, T. (2008) KNIME: TheKonstanz Information Miner, in Data Analysis,Machine Learning and Applications (Studies inClassification, Data Analysis, and Knowledge

Organization), Springer Berlin Heidelberg, pp.319–326, 2008.8. Jović, A., Brkić, K., and Bogunović, N. (2014)An overview of free software tools for general datamining. Information and CommunicationTechnology, Electronics and Microelectronics(MIPRO), 2014 37th International Convention on.IEEE, 2014.Available at:http://bib.irb.hr/datoteka/699127.MIPRO 2014 final.pdf9. Archambault, É., Campbell, D. and Gingras, Y.(2009) Comparing bibliometric statistics obtainedfrom the Web of Science and Scopus. Journal ofthe American Society for Information Science andTechnology 60.7: 1320-1326.10. Lau, K-N., Kam-Hon L., and Ho, Y. (2005)"Text mining for the hotel industry." Cornell Hoteland Restaurant Administration Quarterly, 46.3, :344-362.11. Glaser, B. G., and Strauss, A. L. (2009) Thediscovery of grounded theory: Strategies forqualitative research. Transaction publishers, 2009.

information repositories. Data mining has many application fields such as marketing, business, science and engineering, economics, games and bioinformatics. Text mining or sentiment analysis [4] is the analysis of data contained in a natural language text, which deals with the computation of opinion, sentiment and subjectivity in text. Sentiment