R and Data Mining: Examples and Case Studies


Yanchang Zhao
October 20, 2015

© 2012-2015 Yanchang Zhao. Published by Elsevier in December 2012. All rights reserved.

Messages from the Author

Case studies: The case studies are not included in this online version. They are reserved exclusively for a book version published by Elsevier in December 2012.

Latest version: The latest online version is available at the links below. See the websites also for an R Reference Card for Data Mining.
http://www.rdatamining.com
http://www2.rdatamining.com (for readers having no access to the above website)

R code, data and FAQs: R code, data and FAQs are provided at the links below.
http://www.rdatamining.com/books/rdm ples-and-case-studies.html

Chapters/sections to add: topic modelling and stream graph; spatial data analysis; performance evaluation of classification/prediction models (with ROC and AUC); parallel computing and big data. Please let me know if some topics are interesting to you but not yet covered by this book.

Questions and feedback: If you have any questions or comments, or come across any problems with this document or its book version, please feel free to post them to the RDataMining group below or email them to me. Thanks.

Discussion forum: Please join our discussions on R and data mining at the RDataMining group (16,000 members, as of October 2015) on LinkedIn: http://group.rdatamining.com

Twitter: Follow @RDataMining on Twitter (2,200 followers, as of October 2015).

A sister book: See a new edited book titled Data Mining Applications with R at the links below, which features 15 real-world applications on data mining with R.
http://www.rdatamining.com/books/dmar ns-with-r.html

Contents

List of Figures
List of Abbreviations

1 Introduction
  1.1 Data Mining
  1.2 R
    1.2.1 R Basics
    1.2.2 RStudio
  1.3 Datasets
    1.3.1 The Iris Dataset
    1.3.2 The Bodyfat Dataset

2 Data Import and Export
  2.1 Save and Load R Data
  2.2 Import from and Export to .CSV Files
  2.3 Import Data from SAS
  2.4 Import/Export via ODBC
    2.4.1 Read from Databases
    2.4.2 Output to and Input from EXCEL Files
  2.5 Read and Write EXCEL files with package xlsx
  2.6 Further Readings

3 Data Exploration and Visualization
  3.1 Have a Look at Data
  3.2 Explore Individual Variables
  3.3 Explore Multiple Variables
  3.4 More Explorations
  3.5 Save Charts into Files
  3.6 Further Readings

4 Decision Trees and Random Forest
  4.1 Decision Trees with Package party
  4.2 Decision Trees with Package rpart
  4.3 Random Forest

5 Regression
  5.1 Linear Regression
  5.2 Logistic Regression
  5.3 Generalized Linear Regression
  5.4 Non-linear Regression

6 Clustering
  6.1 The k-Means Clustering
  6.2 The k-Medoids Clustering
  6.3 Hierarchical Clustering
  6.4 Density-based Clustering

7 Outlier Detection
  7.1 Univariate Outlier Detection
  7.2 Outlier Detection with LOF
  7.3 Outlier Detection by Clustering
  7.4 Outlier Detection from Time Series
  7.5 Discussions

8 Time Series Analysis and Mining
  8.1 Time Series Data in R
  8.2 Time Series Decomposition
  8.3 Time Series Forecasting
  8.4 Time Series Clustering
    8.4.1 Dynamic Time Warping
    8.4.2 Synthetic Control Chart Time Series Data
    8.4.3 Hierarchical Clustering with Euclidean Distance
    8.4.4 Hierarchical Clustering with DTW Distance
  8.5 Time Series Classification
    8.5.1 Classification with Original Data
    8.5.2 Classification with Extracted Features
    8.5.3 k-NN Classification
  8.6 Discussions
  8.7 Further Readings

9 Association Rules
  9.1 Basics of Association Rules
  9.2 The Titanic Dataset
  9.3 Association Rule Mining
  9.4 Removing Redundancy
  9.5 Interpreting Rules
  9.6 Visualizing Association Rules
  9.7 Further Readings

10 Text Mining
  10.1 Retrieving Text from Twitter
  10.2 Transforming Text
  10.3 Stemming Words
  10.4 Building a Term-Document Matrix
  10.5 Frequent Terms and Associations
  10.6 Word Cloud
  10.7 Clustering Words
  10.8 Clustering Tweets
    10.8.1 Clustering Tweets with the k-means Algorithm
    10.8.2 Clustering Tweets with the k-medoids Algorithm
  10.9 Packages, Further Readings and Discussions

11 Social Network Analysis
  11.1 Network of Terms
  11.2 Network of Tweets
  11.3 Two-Mode Network
  11.4 Discussions and Further Readings

12 Case Study I: Analysis and Forecasting of House Price Indices

13 Case Study II: Customer Response Prediction and Profit Optimization

14 Case Study III: Predictive Modeling of Big Data with Limited Memory

15 Online Resources
  15.1 R Reference Cards
  15.2 R
  15.3 Data Mining
  15.4 Data Mining with R
  15.5 Classification/Prediction with R
  15.6 Time Series Analysis with R
  15.7 Association Rule Mining with R
  15.8 Spatial Data Analysis with R
  15.9 Text Mining with R
  15.10 Social Network Analysis with R
  15.11 Data Cleansing and Transformation with R
  15.12 Big Data and Parallel Computing with R

Bibliography
General Index
Package Index
Function Index
Appendix: Book Promotion - Data Mining Applications with R


List of Figures

1.1 RStudio

3.1 Histogram
3.2 Density
3.3 Pie Chart
3.4 Bar Chart
3.5 Boxplot
3.6 Scatter Plot
3.7 Scatter Plot with Jitter
3.8 Smooth Scatter Plot
3.9 A Matrix of Scatter Plots
3.10 3D Scatter plot
3.11 Heat Map
3.12 Level Plot
3.13 Contour
3.14 3D Surface
3.15 Parallel Coordinates
3.16 Parallel Coordinates with Package lattice
3.17 Scatter Plot with Package ggplot2

4.1 Decision Tree
4.2 Decision Tree (Simple Style)
4.3 Decision Tree with Package rpart
4.4 Selected Decision Tree
4.5 Prediction Result
4.6 Error Rate of Random Forest
4.7 Variable Importance
4.8 Margin of Predictions

5.1 Australian CPIs in Year 2008 to 2010
5.2 Prediction with Linear Regression Model - 1
5.3 A 3D Plot of the Fitted Model
5.4 Prediction of CPIs in 2011 with Linear Regression Model
5.5 Prediction with Generalized Linear Regression Model

6.1 Results of k-Means Clustering
6.2 Clustering with the k-medoids Algorithm - I
6.3 Clustering with the k-medoids Algorithm - II
6.4 Cluster Dendrogram
6.5 Density-based Clustering - I
6.6 Density-based Clustering - II
6.7 Density-based Clustering - III
6.8 Prediction with Clustering Model

7.1 Univariate Outlier Detection with Boxplot
7.2 Outlier Detection - I
7.3 Outlier Detection - II
7.4 Density of outlier factors
7.5 Outliers in a Biplot of First Two Principal Components
7.6 Outliers in a Matrix of Scatter Plots
7.7 Outliers with k-Means Clustering
7.8 Outliers in Time Series Data

8.1 A Time Series of AirPassengers
8.2 Seasonal Component
8.3 Time Series Decomposition
8.4 Time Series Forecast
8.5 Alignment with Dynamic Time Warping
8.6 Six Classes in Synthetic Control Chart Time Series
8.7 Hierarchical Clustering with Euclidean Distance
8.8 Hierarchical Clustering with DTW Distance
8.9 Decision Tree
8.10 Decision Tree with DWT

9.1 A Scatter Plot of Association Rules
9.2 A Balloon Plot of Association Rules
9.3 A Graph of Association Rules
9.4 A Graph of Items
9.5 A Parallel Coordinates Plot of Association Rules

10.1 Frequent Terms
10.2 Word Cloud
10.3 Clustering of Words
10.4 Clusters of Tweets

11.1 A Network of Terms - I
11.2 A Network of Terms - II
11.3 Cohesive Blocks
11.4 Cliques
11.5 Cliques
11.6 Distribution of Degree
11.7 A Network of Tweets - I
11.8 A Network of Tweets - II
11.9 A Network of Tweets - III
11.10 A Two-Mode Network of Terms - I
11.11 A Two-Mode Network of Terms - II

Data mining is the process of discovering interesting knowledge from large amounts of data [Han and Kamber, 2000]. It is an interdisciplinary field with contributions from many areas, such as statistics, machine learning, information retrieval, pattern recognition and bioinformatics. Data mining is widely used in many domains, such as retail, finance, telecommunication and social media. The main .