Chapter 5 Rattle - Maths-people.anu.edu.au


Personal Copy for Private Use

Chapter 5

Rattle

Rattle (the R Analytical Tool To Learn Easily) is a graphical data mining application based on the statistical language R. (The R language is described in more detail in the following chapter, but an understanding of R is not required in order to use Rattle.) It is a low overhead, rapid development, data mining and modelling tool.

Rattle uses the Gnome graphical user interface and runs under GNU/Linux, Macintosh OS/X, and MS/Windows. Rattle provides an intuitive interface that takes you through the basic steps of data mining, as well as illustrating the R code that is used to achieve this.

Whilst the tool itself may be sufficient for all of a user's needs, it also provides a stepping stone to more sophisticated processing and modelling in R itself, for sophisticated and unconstrained data mining.

The user interface for Rattle is designed to flow through the data mining process. This is achieved through the use of a Tab interface, working from left to right: first load some Data, select Variables for exploring and mining, Sample the data into training and test datasets, Explore the data, identify Clusters in the data, build your Models and Evaluate them. We describe the data mining process through this paradigm in the following sections.

But first, we step through the simple process of installing Rattle.

Copyright © 2004-2006 Graham Williams

22nd August 2006

5.1 Installation

Rattle has been packaged as an R package and is available from CRAN, the Comprehensive R Archive Network. The latest version is also available as an R package from Togaware. The source is available from Google Code.

The most frequently raised issue is ensuring that you have the GTK libraries installed for your operating system. This is independent of R itself and is emphasised as a preliminary step:

1. Install GTK libraries: For GNU/Linux these are already installed if you are running GNOME. If you are not running GNOME you may need to install the GTK libraries in your distribution.

   For MS/Windows, install the package from Glade for Windows:

       MS/Windows: run gtk-win32-devel-2.8.10-rc1.exe

If you are new to R, then to get up and running with Rattle there are a few steps:

1. Install R: R is included in many GNU/Linux distributions, such as Debian:

       Debian: wajig install r-recommended

   Versions for MS/Windows can be obtained from the R Project.

       MS/Windows: run R-2.3.1-win32.exe

   Versions for Mac/OSX are also available.

   To check if you have R installed, start up a Terminal and enter the command R (that's just the capital letter R). If the response is that the command is not found, then you probably need to install the R application.

       R

       R : Copyright 2006, The R Foundation for Statistical Computing
       Version 2.3.1 (2006-06-01)
       ISBN 3-900051-07-0

       R is free software and comes with ABSOLUTELY NO WARRANTY.
       You are welcome to redistribute it under certain conditions.
       Type 'license()' or 'licence()' for distribution details.

       Natural language support but running in an English locale

       R is a collaborative project with many contributors.
       Type 'contributors()' for more information and
       'citation()' on how to cite R or R packages in publications.

       Type 'demo()' for some demos, 'help()' for on-line help, or
       'help.start()' for an HTML browser interface to help.
       Type 'q()' to quit R.

       [Previously saved workspace restored]

2. Install RGtk2: This package is available on CRAN and from the RGtk2 website. From R, use:

       R: install.packages("RGtk2")

   On Debian you can simply install the Deb package:

       Debian: wajig install r-cran-gtk2

   To test whether you have RGtk2 installed enter the R command

       R: library(RGtk2)

3. The following additional R packages are suggested if you want the full functionality of Rattle. Type ?install.packages at the R prompt for further help on installing packages.

       R: install.packages(c("bitops", "cba", "combinat", "ellipse",
                             "fBasics", "fpc", "gbm", "gregmisc",
                             "kernlab", "maptree", "randomForest",
                             "RODBC", "ROCR", "rpart", "XML"))

4. Install Rattle: From within R, either install rattle directly from CRAN with:

       R: install.packages("rattle")

   or else download the rattle package for GNU/Linux from http://rattle.togaware.com/src/contrib/rattle_2.1.28.tar.gz or for MS/Windows from http://rattle.togaware.com/src/contrib/rattle_2.1.28.zip. This might be done using the right mouse button menu on the above links, to Save link as. Save the files to your local disk, and then install with, for example:

       R: install.packages("rattle_2.1.28.zip", repos=NULL)

5. Now, after starting R, ask it to load the rattle package into its library:

       R: library(rattle)

   This loads the Rattle functionality (which is also available without running the Rattle GUI). To start the Rattle GUI simply run the command:

       R: rattle()
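Before working through steps 2 to 4 it can help to see which of the packages are already present. The following is a minimal sketch using only base R; the variable names (suggested, installed, missing_pkgs) are our own and are not part of Rattle, and the install call is left commented out so that nothing is changed on your system by accident.

```r
# Suggested packages, copied from the install.packages() call in step 3.
suggested <- c("bitops", "cba", "combinat", "ellipse", "fBasics", "fpc",
               "gbm", "gregmisc", "kernlab", "maptree", "randomForest",
               "RODBC", "ROCR", "rpart", "XML")

# Names of the packages already present in any library on this machine.
installed <- rownames(installed.packages())

# RGtk2 and rattle themselves, plus the suggested packages, that are absent.
missing_pkgs <- setdiff(c("RGtk2", "rattle", suggested), installed)

if (length(missing_pkgs) == 0) {
  cat("All of the listed packages are installed.\n")
} else {
  cat("Still to install:", paste(missing_pkgs, collapse = ", "), "\n")
  # Uncomment to fetch them from CRAN:
  # install.packages(missing_pkgs)
}
```

Whether every one of these packages can still be obtained from your CRAN mirror depends on your version of R, so treat the output as a checklist rather than a guarantee.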

5.2 Introduction

We present the functionality of Rattle through the use of a simple data set, the audit data set, which is supplied as part of the Rattle package (it is also available for download as a CSV file from http://rattle.togaware.com/audit.csv). This is an artificial data set consisting of 2000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded.

The dataset is only 2,000 entities in order to ensure model building is relatively quick, for illustrative purposes. It contains 13 columns, with the first being a unique client identifier.

We proceed through the typical steps of a data mining project, beginning with a data load and selection, then an exploration of the data, and finally, modelling and evaluation.

The data mining process steps through each tab, left to right, performing the corresponding actions. For any tab, the modus operandi is to configure the options available and to then click the Execute button (or F5) to perform the appropriate tasks. It is important to note that the tasks are not performed until the Execute button (or F5 or the Execute menu item under Tools) is clicked.

The Status Bar will indicate when the action is completed. Messages from R (e.g., error messages, although many R error messages are captured by Rattle and displayed in a popup) will appear in the R console from where Rattle was started.

The R Code that is executed underneath will appear in the Log tab. This allows for a review of the R commands that perform the corresponding data mining tasks. The R code snippets can be copied as text from the Log tab and pasted into the R Console from which Rattle is running, to be directly executed. This allows a user to deploy Rattle for basic tasks, yet still bring the full power of R to bear as needed, perhaps by using more command options than are exposed through the Rattle interface.

Rattle is being extensively tested on binary classification problems (with 0/1 or a two level variable as the outcomes for the Target variable). It is less well tested on general classification and regression tasks, but this will follow. Also in the pipeline is support for text mining.

5.3 Startup Rattle

    R
    library(rattle)
    rattle()

The main Rattle window will pop up. You will see a welcome message and a hint about using Rattle. Essentially, you will proceed through the tabs in this interface from left to right. Once you have set up the required information on any one of the tabs, you need to click the Execute button to perform the actions. Take a moment to explore the interface a little. Notice the Help menu and find that the help layout mimics the tab layout.

5.4 Menus and Buttons

5.4.1 Projects

A project is a packaging of a dataset, variable selections, explorations, clusters and models built from the data. Rattle allows projects to be saved for later resumption of the work or for sharing the data mining project with other users.

A project is typically saved to a file with the .rattle extension (although in reality it is just a standard .Rdata file).

At a later time you can load a project into Rattle to restore the data, models, and other displayed information relating to the project, and resume your data mining from that point. You can also share these project files with other Rattle users, which is quite useful for data mining teams.

You can rename the files, keeping the .rattle extension, without impacting the project file itself (that is, the file name has no formal bearing on the contents, so use it to be descriptive), but it is best to avoid spaces and unusual characters!
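Because a project file is just a standard R data file, the save and load round trip can be illustrated with base R alone. This is only a sketch of the file format idea: the objects project_data and project_notes and the file name audit_example.rattle are invented for illustration, and the objects that Rattle itself stores inside a project are not described here.

```r
# Hypothetical objects standing in for a project's data and notes.
project_data  <- data.frame(id = 1:3, adjusted = c(0, 1, 0))
project_notes <- "explored on the Explore tab"

# Save them in the standard R data format, but with a .rattle extension.
project_file <- file.path(tempdir(), "audit_example.rattle")
save(project_data, project_notes, file = project_file)

# Later, or in another session: restore into a fresh environment so that
# nothing in the current workspace is overwritten.
restored <- new.env()
loaded_names <- load(project_file, envir = restored)
print(loaded_names)
```

Loading into a separate environment is a design choice of this sketch; it makes it easy to inspect what a shared file contains before using it.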

5.5 Data Tab

5.5.1 CSV File Option

Rattle can load data from a comma separated value (CSV) file, as might be generated by spreadsheets and databases, including Excel, Gnumeric, SAS/EM, QueryMan, and many other applications. This is a good option for importing your data into Rattle.

From the Data tab click the Filename button and choose audit.csv. Now click the Execute button to load the data set from the audit.csv file. The contents of the window change to give a brief summary of the data set. Notice that we have loaded 2,000 entities, each described by 12 variables. The data type and the first few values for each variable are also displayed. We can start getting an idea of the shape of the data, noting that Adjusted, for example, looks like it might be a categorical variable!
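A comparable load and summary can be tried directly at the R prompt. The sketch below does not use the real audit.csv: the four rows and all column names other than Adjusted are invented stand-ins, so that the example runs without any files. The str function gives a similar view to the one described above, showing each variable's type and first few values.

```r
# A tiny stand-in for audit.csv; the rows and most column names are invented.
csv_text <- "ID,Age,Employment,Adjusted
1001,38,Private,0
1002,35,Private,0
1003,32,Consultant,1
1004,45,Private,0"

# The first row is taken as the header naming the variables.
audit_like <- read.csv(text = csv_text, header = TRUE)

dim(audit_like)              # number of entities and number of variables
str(audit_like)              # type and first few values of each variable
table(audit_like$Adjusted)   # few distinct values suggests a categorical
```

With the real file in the working directory, read.csv("audit.csv") would take the place of the text argument.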

The CSV file is assumed to begin with a header row, listing the names of the variables. The remainder of the file is expected to consist of rows of data that record information about the entities, with fields generally separated by commas recording the values of the variables for this entity.

You can choose the field delimiter through the Separator entry. A comma is the default. To load a .txt file which uses a tab as the field separator enter \\t as the separator. You can also leave the separator empty and any white space will be used as the separator.

Any field that is empty, or that has the value "NA" or else ".", is treated as a missing value, which is represented in R as NA. Support for the "." convention allows the importation of CSV data generated by SAS.

Underneath, the corresponding R code uses the read.csv function to load the data.

5.5.2 ODBC Option

5.5.3 RData File Option

Using the RData File option data can be loaded directly from a native R data file (usually with the .Rdata or .RData extension). Such files may contain multiple datasets (compressed) and you will be given an option to choose just one of the available data sets.

5.5.4 R Dataset Option

Rattle can use a dataset that is already loaded into R (although it will take a copy of it, with memory implications). Only data frames are currently supported, and Rattle will list for you the names of all of the available data frames.

The data frames need to be constructed in the same R session that is running Rattle (i.e., the same R Console in which you loaded the Rattle package). This provides much more flexibility in loading data into Rattle than is provided directly through the actual Rattle interface. For example, you may want to load data from an SQLite database directly, and have this available in Rattle.
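Two of the behaviours described in this section can be sketched in base R. The first part reads tab separated text in which an empty field, NA and the SAS style "." all mark missing values, passing these conventions to read.csv explicitly through its sep and na.strings arguments. The second part lists the data frames in an environment, much as the R Dataset option is described as doing. The data values and the helper list_data_frames are our own inventions, and we are assuming, not quoting, how Rattle arranges these calls internally.

```r
# Tab separated text with three styles of missing value: the SAS style ".",
# the literal NA, and an empty field. The contents are invented.
tsv_text <- paste(
  "ID\tAge\tIncome",
  "1\t38\t.",
  "2\tNA\t52000",
  "3\t\t61000",
  sep = "\n")

d <- read.csv(text = tsv_text, header = TRUE, sep = "\t",
              na.strings = c("", "NA", "."))

# Count the missing values found in each variable.
missing_counts <- colSums(is.na(d))
print(missing_counts)

# Hypothetical helper: names of all data frames in an environment.
list_data_frames <- function(env = globalenv()) {
  nms <- ls(envir = env)
  is_df <- vapply(nms, function(n) is.data.frame(get(n, envir = env)),
                  logical(1))
  nms[is_df]
}

print(list_data_frames())
```

Any data frame created at the R prompt before choosing the R Dataset option, by whatever means, would appear in such a listing.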

5.6 Variables Tab

5.6.1 Roles

The Variables tab is used to identify the role played by each of the variables in the data set. The default role for most variables is that of an Input variable. Generally, these are the variables that will be used to predict the value of a Target variable.

Rattle uses simple heuristics to guess at a Target role for one of the variables. Here we see that Adjusted has been selected as the target variable. In this instance it is correct. The heuristic involves examining the number of distinct values that a variable has, and if it has fewer than 5, then it is considered as a candidate. The candidate list is ordered starting with the last variable (often the last variable is the target), and then proceeding from the first onwards to find the first variable that meets the conditions of looking like a target.

Any numeric variable that has a unique value for each record is automatically identified as an Ident. Any number of variables can be tagged as being an Ident. All Ident variables are ignored when modelling, but are used after scoring a data set, being written to the resulting score file so that the cases that are scored can be identified.

Sometimes not all variables in your data set should be used, or some may not be appropriate for a particular modelling task. For example, the random forest model builder does not handle categorical variables with more than 32 levels, so you may choose to Ignore Accounts. You can change the role of any variable to suit your needs, although you can only have one Target and one Risk.

For any changes you make to the Va
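The Target and Ident heuristics described in this section can be sketched in a few lines of R. This is our own reading of the description, not Rattle's source code: the function names guess_target and guess_idents, the max_levels argument, and the toy data are all hypothetical, and details such as how missing values are counted are assumptions.

```r
# Return the name of the first variable that looks like a target: fewer
# than max_levels distinct values, checking the last column first and then
# the remaining columns from the first onwards. NULL if none qualifies.
guess_target <- function(data, max_levels = 5) {
  n <- ncol(data)
  if (n == 0) return(NULL)
  for (i in c(n, seq_len(n - 1))) {
    if (length(unique(data[[i]])) < max_levels) return(names(data)[i])
  }
  NULL
}

# Numeric variables with a distinct value on every row are Ident candidates.
guess_idents <- function(data) {
  is_ident <- vapply(data, function(x)
    is.numeric(x) && length(unique(x)) == nrow(data), logical(1))
  names(data)[is_ident]
}

# Invented toy data: ID is unique per row, Age repeats one value.
toy <- data.frame(ID       = c(11, 12, 13, 14, 15, 16),
                  Age      = c(38, 35, 38, 45, 60, 74),
                  Gender   = c("F", "M", "M", "F", "M", "F"),
                  Adjusted = c(0, 0, 1, 0, 1, 0))

target <- guess_target(toy)
idents <- guess_idents(toy)
print(target)
print(idents)
```

On the toy data the last column has only two distinct values, so it is chosen without the other columns being examined; dropping that column makes the search fall through to the first variable with fewer than five distinct values.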
