
Game of Thrones: Text Analysis of the George R.R. Martin’s book series A Song of Ice and Fire using SAS Text Miner
Brad Gross and Srividhya Naraharirao, Louisiana State University

Introduction
Background
A Song of Ice and Fire
60 million copies sold
45 languages
HBO television show: 18.6 million viewers
*Presentation may contain spoilers

Introduction
Objective
The books have a unique narrative structure: the narrator's point of view switches on a per-chapter basis, and there are dozens of characters, families, and locations.
Can we determine speaker traits based upon text clusters and factor analysis, character qualities based on common words used, and relationship strength based on interactions?
Can we use text analytics with multiple models to attempt to predict which family's point of view the reader is viewing the world from?

Introduction
Tools
SAS Enterprise Miner
  Filter, Data Partition, Metadata, Regression, and Save Data
  Used for filtering observations, data control, predictive modeling, and building data sets
SAS Text Miner
  Text Import, Text Parsing, Text Filter, Text Profile, Text Cluster, Text Topic
  Used for reading in text, filtering terms, including/excluding parts of speech, pattern discovery

Introduction
Corpus
5 Books
337 Chapters
24 Character Perspectives
13 Families
134,840,898 characters of text

Analysis: Process flow
Process Flow

Analysis: Data Preparation
Step 1: Filter / Metadata / Partitioning

Analysis: Data Preparation
Filter
Bar graphs of all imported variables
User-specified selection of which values to keep
Removal of missing values
Minimum Frequency/Number of Levels cutoffs

Analysis: Data Preparation
Metadata
Configure and change metadata
Data Partition
Split data into training and validation sets
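
The partitioning step can be illustrated outside SAS with a small Python sketch. This is not the Data Partition node itself: the 70/30 split, the seed, the `stratified_split` helper, and the chapter records are all assumptions made for illustration. It splits within each family so both sets keep roughly the same family mix.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_fraction=0.7, seed=42):
    """Split rows into (train, validation), splitting within each level of `key`."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[key]].append(row)
    train, validation = [], []
    for level in sorted(by_level):
        group = by_level[level][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation

# Hypothetical chapter records; family labels and counts are illustrative only.
chapters = [{"chapter": i, "family": "Stark" if i % 3 else "Lannister"}
            for i in range(30)]
train, validation = stratified_split(chapters, key="family")
print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible, so later model comparisons are made on the same validation chapters.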

Analysis: Text Mining
Step 2: Text Mining

Analysis: Text Mining
Text Parsing
First step in text mining analysis. Text parsing uses advanced natural language processing to represent documents as collections of terms.
Word stemming
Exclude parts of speech
Determine and exclude entity types
Specified a stop list and a multi-word list of terms to be ignored in the analysis, including 18 castles, 491 people, 22 places, 24 words considered too specific, and 317 multi-word terms.
OUTPUT: 20,000 terms to be considered for further analysis
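
A rough idea of what parsing with a stop list and an ignored multi-word list does can be sketched in Python. This is not how SAS Text Miner parses text: the stop-list entries, the multi-word entry, and the `crude_stem` suffix stripper are hypothetical stand-ins, and part-of-speech and entity handling are left out.

```python
import re

# Hypothetical entries; the real lists held castles, people, places, etc.
STOP_LIST = {"the", "a", "of", "and", "from", "winterfell", "jon"}
IGNORED_MULTI_WORD = {("iron", "throne")}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for real stemming."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def parse(text):
    """Lowercase, tokenize, drop ignored terms, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in IGNORED_MULTI_WORD:
            i += 2  # skip the whole multi-word term
            continue
        token = tokens[i]
        i += 1
        if token not in STOP_LIST:
            terms.append(crude_stem(token))
    return terms

terms = parse("Jon rode from Winterfell and watched the ravens "
              "of the Iron Throne circling")
print(terms)  # ['rode', 'watch', 'raven', 'circl']
```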

Analysis: Text Mining
Text Parsing - Results
Corpus is parsed into terms:
Role
Attribute: Alpha – All letters
Frequency
# of Docs
Term will be kept

Analysis: Text Mining
Text Filtering
Filter out terms that appeared in only one document.
Weight assigned to terms based on 'Inverse Document Frequency'
Terms occurring infrequently are given a higher score
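
The filtering and weighting idea can be sketched in a few lines of Python. The slides do not state the exact formula, so the weight used here, log2(N / df) + 1, is an assumption (one common form of inverse document frequency), and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents, min_docs=2):
    """Return {term: weight} for terms found in at least `min_docs` documents."""
    n_docs = len(documents)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log2(n_docs / df) + 1
            for term, df in doc_freq.items() if df >= min_docs}

# Hypothetical parsed documents (lists of terms).
docs = [
    ["wolf", "snow", "sword"],
    ["wolf", "gold", "sword"],
    ["wolf", "dragon"],
    ["wolf", "dragon", "sea"],
]
weights = idf_weights(docs)
print(weights)
```

Here "wolf" appears in every document and gets the lowest weight (1.0), "sword" and "dragon" appear in two of four and get 2.0, and terms seen in only one document are dropped.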

Analysis: Text Mining
Text Filtering - Results
Terms are given a weight based on the inverse of their frequency of use.
Concept linking is a way to find and display the terms that are highly associated with the selected term in the Terms table. The selected term is surrounded by the terms that correlate the strongest with it.

Analysis: Text Mining
Text Clustering
First step in the knowledge extraction process.
The following steps extract patterns from the data and match observations to the patterns.
The Text Cluster node will discover themes and assign each document to one of these themes.

Analysis: Text Mining
Text Clustering
Two clustering algorithms are available:
The Expectation Maximization algorithm clusters documents with a flat representation
The Hierarchical clustering algorithm groups clusters into a tree hierarchy
Both approaches rely on the singular value decomposition (SVD) to transform the original weighted, term-document frequency matrix into a dense but low-dimensional representation

Analysis: Text Mining
Singular Value Decomposition
Dimensionality reduction technique
SVD resolution - the higher the number, the higher the risk of fitting to noise
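
For reference, the standard truncated SVD behind this step can be written as below. The symbols are assumptions introduced here, not taken from the slides: A is the weighted term-by-document matrix with m terms and n documents, and k is the number of dimensions kept. The "SVD resolution" setting is assumed to influence how large k is.

```latex
A \in \mathbb{R}^{m \times n}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},\;
\sigma_1 \ge \dots \ge \sigma_k \ge 0
```

Each document is then described by k numbers (its row of V_k, scaled by the singular values) instead of one weight per term. Keeping more dimensions retains more of the small singular values, which is where noise tends to live.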

Analysis: Text Mining
Text Clustering - Results

Analysis: Text Mining
Text Clusters Representations
Cluster 3
Cluster 1
Cluster 2

Analysis: Text Mining
Text Topic
The Text Topic node derives themes/concepts from the terms that can be used instead of the terms
Each document is assigned to zero or more of those themes

We see topics clearly separating people/places:
Topic 1 - North
Topic 2 - Cities
Topic 3 - Rural areas
Topic 5 - East

Analysis: Text Mining
Text Profiling
The Text Profile node enables us to profile a target variable based on a set of terms from the documents
Text profiling was used to profile the Game of Thrones characters and to identify similarities and relationship strengths between the characters based on their interactions

Analysis: Text Mining
Text Profiling results – Character relationship strengths

Analysis: Text Mining
Text Profiling results – terms describing each character

Analysis: Prediction
Text Rule Builder
This node provides a text mining predictive modeling solution within SAS Text Miner
It derives a set of classification rules from the terms that are useful in describing and predicting the target variable
E.g., (Term A) & (Term B) (Term C) can be a rule to classify a target variable
The results of the model are highly interpretable
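
To make the idea of term-based rules concrete, here is a minimal Python sketch of applying such rules. It does not reproduce how the Text Rule Builder node learns its rules. The rule format (terms that must be present, terms that must be absent, predicted label), the terms, and the labels are all assumptions for illustration.

```python
# Each rule: (terms that must be present, terms that must be absent, label).
# All rules below are hypothetical.
RULES = [
    ({"direwolf", "north"}, {"gold"}, "Stark"),
    ({"dragon"}, set(), "Targaryen"),
]

def classify(terms, rules=RULES, default="Other"):
    """Return the label of the first rule the document's terms satisfy."""
    present = set(terms)
    for required, excluded, label in rules:
        if required <= present and not (excluded & present):
            return label
    return default

print(classify(["direwolf", "north", "snow"]))   # Stark
print(classify(["direwolf", "north", "gold"]))   # Other
print(classify(["dragon", "sea"]))               # Targaryen
```

Because each prediction traces back to a short list of terms, a reader can see why a chapter was assigned to a family, which is the interpretability the slide refers to.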

Analysis: Prediction
Text Rule Builder results
Training Hitrate: 97.2%
Validation Hitrate: 76.85%
Not a great model for prediction, good for explanation

Analysis: Performance Comparison
Comparison of predictive models
Once the predictor variables were obtained from the text mining process, various models were tested for their accuracy in predicting the character family by chapter:
Logistic Regression
Decision Tree
Neural Networks
Text Rule Builder

Analysis: Performance Comparison
Comparison of predictive models – process flow

Analysis: Performance Comparison
Model Comparison results
Naive Rule: 96 Stark Chapters / 141 Total Chapters = 68%
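
The naive rule is the baseline of always predicting the most common family. The arithmetic can be checked with a short Python sketch. The 96 and 141 come from the slide; lumping the other 45 chapters under a single "Other" label is an assumption made only so the list can be built.

```python
from collections import Counter

def naive_rule_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# 96 Stark chapters out of 141; the rest are lumped together for illustration.
labels = ["Stark"] * 96 + ["Other"] * 45
majority, accuracy = naive_rule_accuracy(labels)
print(majority, f"{accuracy:.0%}")  # Stark 68%
```

Any model in the comparison has to beat this 68% to show that the text-derived predictors add information.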

Analysis: Prediction
Neural Nets vs Logistic Regression – Family Prediction
All predictors show as extremely significant
One predictor shows as significant

Analysis: Prediction
Neural Nets – Character Prediction
Predicting chapter by character
All text allowed
Prediction tends to get confused mainly amongst characters who interact often (Arya and Brienne) and characters who appear less often (Aeron)
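
One simple way to see which characters a model mixes up is to count the (actual, predicted) pairs that disagree. The Python sketch below does that. The label lists are made up for illustration and are not the model's real output, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair where the prediction was wrong."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Hypothetical labels, one per chapter.
actual    = ["Arya", "Arya",    "Brienne", "Aeron", "Arya"]
predicted = ["Arya", "Brienne", "Arya",    "Arya",  "Arya"]

errors = confusion_counts(actual, predicted)
for (a, p), n in errors.most_common():
    print(f"{a} predicted as {p}: {n}")
```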
