
Text Mining Methodologies with R: An Application to Central Bank Texts*

Jonathan Benchimol,† Sophia Kazinnik‡ and Yossi Saadon§

March 1, 2021

* This paper does not necessarily reflect the views of the Bank of Israel, the Federal Reserve Bank of Richmond or the Federal Reserve System. The present paper serves as the technical appendix of our research paper (Benchimol et al., 2020). We thank Itamar Caspi, Shir Kamenetsky Yadan, Ariel Mansura, Ben Schreiber, and Bar Weinstein for their productive comments.
† Bank of Israel, Jerusalem, Israel. Corresponding author. Email: jonathan.benchimol@boi.org.il
‡ Federal Reserve Bank of Richmond, Richmond, VA, USA.
§ Bank of Israel, Jerusalem, Israel.

Abstract: We review several existing methodologies in text analysis and explain the formal process of text analysis using the open-source software R and relevant packages. We present technical applications of text mining methodologies in a manner accessible to economists.

1 Introduction

A large and growing amount of unstructured data is available nowadays. Most of this information is text-heavy, including articles, blog posts, tweets and more formal documents (generally in Adobe PDF or Microsoft Word formats). This availability presents new opportunities for researchers, as well as new challenges for institutions. In this paper, we review several existing methodologies for analyzing text and describe a formal process of text analytics using the open-source software R. In addition, we discuss potential empirical applications.

This paper is a primer on how to systematically extract quantitative information from unstructured or semi-structured data (texts). Text mining, the quantitative representation of text, has been widely used in disciplines such as political science, media, and security. However, an emerging body of literature has begun to apply it to the analysis of macroeconomic issues, studying central bank

communication and financial stability in particular.[1] This type of text analysis is gaining popularity and is becoming more widespread through the development of technical tools facilitating information retrieval and analysis.[2]

[1] See, for instance, Bholat et al. (2015), Bruno (2017), and Correa et al. (2020).

[2] See, for instance, Lexalytics, IBM Watson AlchemyAPI, Provalis Research Text Analytics Software, SAS Text Miner, Sysomos, Expert System, RapidMiner Text Mining Extension, Clarabridge, Luminoso, Bitext, Etuma, Synapsify, Medallia, Abzooba, General Sentiment, Semantria, Kanjoya, Twinword, VisualText, SIFT, Buzzlogix, Averbis, AYLIEN, Brainspace, OdinText, Loop Cognitive Computing Appliance, ai-one, LingPipe, Megaputer, Taste Analytics, LinguaSys, muText, Textual ETL, Ascribe, STATISTICA Text Miner, MeaningCloud, Oracle Endeca Information Discovery, Basis Technology, Language Computer, NetOwl, DiscoverText, Angoos KnowledgeREADER, Forest Rim's Textual ETL, Pingar, IBM SPSS Text Analytics, OpenText, Smartlogic, Narrative Science Quill, Google Cloud Natural Language API, TheySay, indico, Microsoft Azure Text Analytics API, Datumbox, Relativity Analytics, Oracle Social Cloud, Thomson Reuters Open Calais, Verint Systems, Intellexer, Rocket Text Analytics, SAP HANA Text Analytics, AUTINDEX, Text2data, Saplo, and SYSTRAN, among many others.

An applied approach to text analysis can be described by several sequential steps. Assigning a quantitative measure to this type of data requires a uniform approach to measurement: to quantify and compare texts, they need to be measured uniformly. Roughly, this process can be divided into four steps: data selection, data cleaning, extraction of relevant information, and subsequent analysis of that information.

We briefly describe each step below and demonstrate how it can be executed and implemented using the open-source software R. We use a set of monthly reports published by the Bank of Israel as our data set.

Several applications are possible. An automatic and precise understanding of financial texts could allow for the construction of several financial stability indicators. Central bank publications (interest rate announcements, official reports, etc.) could also be analyzed. A quick and automatic analysis of the sentiment conveyed by these texts would allow for fine-tuning of such publications before making them public. For instance, a spokesperson could use this tool to analyze the orientation of a text, such as an interest rate announcement, before making it public.

The remainder of the paper is organized as follows. Section 2 describes text extraction and Section 3 presents methodologies for cleaning and storing text for text mining. Section 4 presents several data structures used in Section 5, which details methodologies used for text analysis. Section 6 concludes, and the Appendix presents additional results.

2 Text extraction

Once a set of texts is selected, it can be used as an input to the package tm (Feinerer et al., 2008) within the open-source software R. This package can be thought of as a framework for text mining applications within R, including text preprocessing.

This package has a function called Corpus. This function takes a predefined directory which contains the input (a set of documents) and returns the output, which is the set of documents organized in a particular way. In this paper, we refer to this output as a corpus. A corpus here is a framework for storing this set of documents.

We define our corpus through R in the following way. First, we apply a function called file.path, which defines the directory where all of our text documents are stored.[3] In our example, it is the folder that stores all 220 text documents, each corresponding to a separate interest rate decision meeting.

[3] The folder should contain text documents only. If there are other files at that location (e.g., R files), then the Corpus function will include the text of those files as well.

After we define the working directory, we apply the function Corpus from the package tm to all of the files in the working directory. This function formats the set of text documents into a corpus object class as defined internally by the tm package.

    file.path <- file.path("./data/folder")
    corpus <- Corpus(DirSource(file.path))

Now, we have our documents organized as a corpus object.
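As a quick sanity check, the corpus object can be printed to confirm that all documents were read in. A minimal sketch (the exact printed layout depends on the tm version and the corpus class):

    print(corpus)   # reports the number of documents read
    length(corpus)  # 220 in our example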

The content of each document can be accessed and read using the writeLines function. For example:

    writeLines(as.character(corpus[[1]]))

Using this command line, we can access and view the content of document number one within the corpus. This document corresponds to the interest rate discussion held in December 2006, before setting the interest rate for January 2007. Below are the first two sentences of this document:

    Bank of Israel Report to the public of the Bank of Israel’s discussions
    before setting the interest rate for January 2007 The broad-forum
    discussions took place on December 24, 2006, and the narrow forum
    discussions on December 25, 2006, January 2007 General Before the Governor
    makes the monthly interest rate decision, discussions are held at two
    levels. The first discussion takes place in a broad forum, in which the
    relevant background economic conditions are presented, including real and
    monetary developments in Israel’s economy and developments in the global
    economy.

There are other approaches to storing a set of texts in R, for example by using the function data.frame or tibble; however, we concentrate on tm's corpus approach, as it is more intuitive and has a greater number of corresponding functions written explicitly for text analysis.

3 Cleaning and storing text

Once the relevant corpus is defined, we transform it into an appropriate format for further analysis. As mentioned previously, each document can be thought of as a set of tokens. Tokens are sets of words, numbers, punctuation marks, and any other symbols present in the given document. The first step of any text analysis framework is to reduce the dimension of each document by removing useless elements (characters, images, advertisements,[4] etc.).

[4] Removal of images and advertisements is not covered in this paper.

The next necessary step is therefore text cleaning, one of the crucial steps in text analysis. Text cleaning (or text preprocessing) makes an unstructured set of texts uniform across and within documents and eliminates idiosyncratic characters.[5] Text cleaning can be loosely divided into the set of steps shown below.

[5] Specific characters that are not needed to understand the meaning of a text.

The text excerpt presented in Section 2 contains some useful information about the content of the discussion, but also many unnecessary details, such as punctuation marks, dates, and ubiquitous words. Therefore, the first logical step is to remove punctuation and idiosyncratic characters from the set of texts. This includes any strings of characters present in the text, such as punctuation marks, percentage or currency signs, or any other characters that are not words. There are two coercing functions,[6] called content_transformer and toSpace, that, in conjunction, get rid of all pre-specified idiosyncratic characters.

[6] Many programming languages support the conversion of a value into a value of a different data type. This kind of type conversion can be made implicitly or explicitly. Coercion refers to the implicit conversion, which is done automatically. Casting refers to the explicit conversion performed by code instructions.
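The toSpace function is not part of the tm package itself; it is typically defined by the user from content_transformer and the base function gsub. A minimal sketch of such a definition:

    # User-defined transformation: replace every match of a pattern with a space
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))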

The toSpace function takes a predefined punctuation character and converts it into a space, thus erasing it from the text. We use this function inside the tm_map wrapper, which takes our corpus, applies the coercing function, and returns our corpus with the changes already made.

In the example below, toSpace removes the following punctuation characters: "-", ",", ".". This list can be expanded and customized (user-defined) as needed.

    corpus <- tm_map(corpus, toSpace, "-")
    corpus <- tm_map(corpus, toSpace, ",")
    corpus <- tm_map(corpus, toSpace, ".")

The text below shows our original excerpt, with the aforementioned punctuation characters removed:

    The broad forum discussions took place on December 24 2006 and the narrow
    forum discussions on December 25 2006 January 2007 General Before the
    Governor makes the monthly interest rate decision discussions are held at
    two levels The first discussion takes place in a broad forum in which the
    relevant background economic conditions are presented including real and
    monetary developments in Israel’s economy and developments in the global
    economy

Another way to eliminate punctuation marks, or characters, is to apply the removePunctuation function to the corpus. This function removes a set of predefined punctuation characters, but it cannot be customized if the need arises. One can combine both approaches (toSpace and removePunctuation) in order to effectively remove all punctuation and idiosyncratic characters from the text. In addition, any numbers present in the texts of our corpus can be removed with the removeNumbers function, as in the code below:

    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
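The practical difference between the two approaches is that removePunctuation deletes punctuation characters outright, which can glue hyphenated words together, whereas toSpace leaves a space behind. A small illustration on a raw string (the sample text is ours, not from the corpus):

    # removePunctuation strips characters without inserting spaces
    removePunctuation("broad-forum discussions, held at two levels.")
    # [1] "broadforum discussions held at two levels"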

Now, the text below shows our original excerpt, but without any punctuation marks or digits:

    The broad forum discussions took place on December and the narrow forum
    discussions on December January General Before the Governor makes the
    monthly interest rate decision discussions are held at two levels The first
    discussion takes place in a broad forum in which the relevant background
    economic conditions are presented including real and monetary developments
    in Israels economy and developments in the global economy

The current text excerpt conveys the meaning of the meeting a little more clearly, but there is still much unnecessary information. Therefore, the next step is to remove the so-called stop words from the text.

What are stop words? Words such as "the", "a", "and", "they", and many others can be defined as stop words. Stop words usually refer to the most common words in a language, and because they are so common, they carry no specific informational content. Since these terms do not carry any meaning as standalone terms, they are not valuable for our analysis. In addition to a pre-existing list of stop words, ad hoc stop words can be added to the list.

To remove the stop words, we apply a function from the package tm onto our existing corpus as defined above. A coercing function called removeWords erases a given set of stop words from the corpus. There are different lists of stop words available, and we use a standard list of English stop words.

However, before removing the stop words, we need to turn all of the existing words within the text into lowercase. Why? Because converting to lowercase, or case folding, allows for case-insensitive comparison. This is the only way for the function removeWords to identify the words subject to removal.

Therefore, using the package tm and the coercing function tolower, we convert our corpus to lowercase:

    corpus <- tm_map(corpus, tolower)
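With recent versions of tm, base functions such as tolower that return plain character vectors are usually wrapped in content_transformer, so that tm_map preserves the corpus structure; depending on the corpus class, the unwrapped call may emit warnings. A sketch of the safer variant:

    # Wrap the base function so the result remains a valid tm corpus
    corpus <- tm_map(corpus, content_transformer(tolower))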

Below is the example text excerpt following the command mentioned above:

    the broad forum discussions took place on december 24 2006 and the narrow
    forum discussions on december 25 2006 january 2007 general before the
    governor makes the monthly interest rate decision discussions are held at
    two levels the first discussion takes place in a broad forum in which the
    relevant background economic conditions are presented including real and
    monetary developments in israels economy and developments in the global
    economy

We can now remove the stop words from the text:

    corpus <- tm_map(corpus, removeWords, stopwords("english"))

Here, tm_map is a wrapper function that takes the corpus and applies the character processing function removeWords to all of the contents of the corpus (all 220 documents). It returns the modified documents in the same format, a corpus, but with the changes already applied. The following output shows our original text excerpt with the stop words removed:

    broad forum discussions took place december narrow forum discussions
    december january general governor makes monthly interest rate decision
    discussions held two levels first discussion takes place broad forum
    relevant background economic conditions presented including real monetary
    developments israels economy developments global economy

The next and final step is to stem the remaining words. Stemming is the process of turning several words that mean the same thing into one. For example, after stemming, words such as "banking", "banks", and "banked" become "bank". Stemming reduces inflected or derived words to their word stem or root; it can be thought of as word normalization. A stemming algorithm allows us not to count different variations of the same term as separate instances.

Below, we use a coercing function called stemDocument, which stems the words in a text document using Porter's stemming algorithm. This algorithm removes common morphological and inflectional endings from words in English, as described in the previous paragraph.

    corpus <- tm_map(corpus, stemDocument)
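The effect of the stemmer can be checked on individual words. tm delegates stemming to the SnowballC package, so, assuming SnowballC is installed, a quick sketch:

    library(SnowballC)
    # Porter's algorithm maps inflected variants to a common stem
    wordStem(c("banking", "banks", "banked"), language = "english")
    # [1] "bank" "bank" "bank"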

Once we have applied several of these character processing functions to our corpus, we would like to examine it in order to view the results. Overall, as a result of the above procedures, we end up with the following:

    broad forum discuss took place decemb narrow forum discuss decemb januari
    general governor make month interest rate decis discuss held two level
    first discuss take place broad forum relev background econom condit present
    includ real monetari develop israel economi develop global economi

This last text excerpt shows what we end up with once the data cleaning manipulations are done. While this excerpt resembles the original only remotely, we can still figure out reasonably well the subject of the discussion.[7]

[7] Another way to perform the changes discussed in this section is to chain the tm_map calls with the pipe operator %>% (available, e.g., through the dplyr package), as below. Both ways are viable, and we keep the longer exposition within the paper to explain the sequence of steps thoroughly.

    library(dplyr)  # provides the pipe operator %>%
    corpus <- corpus %>% tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
      tm_map(tolower) %>% tm_map(removeWords, stopwords("english")) %>%
      tm_map(stemDocument)

4 Data structures

Once the text cleaning step is done, R allows us to store the results in one of the two following formats, dtm and tidytext. While there may be more ways to store text, these two formats are the most convenient when working with text data in R. We explain each of these formats next.

4.1 Document Term Matrix

A Document Term Matrix (dtm) is a mathematical matrix that describes the frequency of the terms that occur in a collection of documents. Such matrices are widely used in the field of natural language processing. In a dtm, each row corresponds to a specific document in the collection and each column corresponds to a specific term within that document. An example of a dtm is shown in Table 1.

This type of matrix represents the frequency of each unique term in each document of the corpus. In R, our corpus can be mapped into a dtm object class by using the function DocumentTermMatrix from the tm package.

    dtm <- DocumentTermMatrix(corpus)

The goal of mapping the corpus onto a dtm is twofold: first, to represent the topic of each document by the frequency of semantically significant and unique terms, and second, to position the corpus for further data analysis.
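Before any weighting, it is useful to look at the dimensions and content of the resulting matrix. A minimal sketch using standard tm functions (the frequency threshold of 100 is an illustrative choice):

    dim(dtm)                 # number of documents and number of unique terms
    inspect(dtm[1:3, 1:5])   # view a small corner of the matrix
    findFreqTerms(dtm, 100)  # terms appearing at least 100 times in the corpus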

[Table 1: An excerpt of a dtm. Each row corresponds to a document (the May 2008 through November 2008 interest rate discussions), each column to a stemmed term (accord, activ, averag, ...), and each cell to the frequency of term j in document i.]

The value in each cell of this matrix is typically the word frequency of each term in each document. This frequency can be weighted in different ways to emphasize the importance of certain terms and de-emphasize the importance of others. The default weighting scheme within the DocumentTermMatrix function is called Term Frequency (tf). Another common approach to weighting is called Term Frequency - Inverse Document Frequency (tf-idf).

While the tf weighting scheme is defined as the number of times a word appears in the document, tf-idf offsets this count by the frequency of the word across the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Why is the frequency of each term in each document important? A simple counting approach such as term frequency may be inappropriate because it can overstate the importance of a small number of very frequent words. Term frequency is therefore normalized, measuring how frequently a term occurs in a document relative to the document length:

    \mathrm{tf}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}    (1)

A more appropriate way to calculate word frequencies is to employ the tf-idf weighting scheme. It weights the importance of terms in a document based on how frequently they appear across multiple documents. If a term frequently appears in a document, it is important, and it receives a high score. However, if a word appears in many documents, it is not a unique identifier, and it will receive a low score. Eq. (1) shows how words that frequently appear in a single document are scaled up, and Eq. (2) shows how common words that appear

in many documents are scaled down.

    \mathrm{idf}(t) = \ln \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)    (2)

Combining these two properties yields the tf-idf weighting scheme:

    \mathrm{tf\text{-}idf}(t) = \mathrm{tf}(t) \times \mathrm{idf}(t)    (3)

In order to employ this weighting scheme (Eq. 3), we can assign this option within the already familiar DocumentTermMatrix function:

    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

The above-mentioned steps provide us with a suitable numeric matrix under the name dtm. In this matrix, each cell contains a number corresponding to the (possibly weighted) count of times a specific term appears in each document within our corpus. However, most cells in such a dtm will be empty, i.e., zeros, because most terms do not appear in most of the documents.
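Because of this sparsity, it is often convenient to drop the rarest terms before further analysis. The tm package provides removeSparseTerms for this purpose; a sketch, where the 0.99 threshold is an illustrative choice that keeps only terms appearing in at least 1% of the documents:

    # Drop terms that are absent from more than 99% of the documents
    dtm_dense <- removeSparseTerms(dtm, 0.99)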
