R Programming - Routledge


A CRC PRESS FREEBOOK

R Programming

Introduction

Chapter 1: A First Look, from Dynamic Documents with R and knitr, Second Edition

Chapter 2: Introduction to Parallel Processing in R, from Parallel Computing for Data Science: With Examples in R, C++ and CUDA

Chapter 3: Summarizing and Graphing Data, from Using the R Commander: A Point-and-Click Interface for R

Chapter 4: Getting Started with Reproducible Research, from Reproducible Research with R and RStudio, Second Edition

Save on Essential Books in R Programming

Get up to date with the latest in R Programming. Visit www.crcpress.com to browse our complete collection of books in Statistics. SAVE 20% and receive FREE shipping: simply enter code AZR08 at time of checkout.

Introduction

R is a programming language and software environment for statistical computing and graphics. It is now widely used in academic research, education, and industry. It is constantly growing, with new versions of the core software released regularly and more than 9,000 packages available. It is difficult for the documentation to keep pace with the expansion of the software, which led to the launch of the Chapman & Hall/CRC R Series, a forum for the publication of books covering many aspects of the development and application of R.

The scope of the series is wide, covering three main threads: applications of R to specific disciplines such as biology, epidemiology, genetics, engineering, finance, and the social sciences; using R for the study of topics of statistical methodology, such as linear and mixed modeling, time series, Bayesian methods, and missing data; and the development of R itself, including programming, building packages, and graphics. The books appeal to programmers and developers of R software, as well as applied statisticians and data analysts in many fields. They feature detailed worked examples and R code fully integrated into the text, ensuring their usefulness to researchers, practitioners, and students.

This free ebook presents four chapters on 'R Programming' from books published in the R Series. Each chapter presents a different aspect of programming in R, although it is not intended to be comprehensive. We hope you will find the content useful and that it gives you a snapshot of the quality of the books in the series.

Please note this Free Book does not include references, endnotes, and footnotes. Fully referenced versions of each book can be accessed through crcpress.com.

Chapter 1: A First Look

From Dynamic Documents with R and knitr, Second Edition by Yihui Xie

The book makes writing statistical reports easier by integrating computing directly with reporting. Reports range from homework, projects, exams, books, blogs, and Web pages to virtually any documents related to statistical graphics, computing, and data analysis. The book covers basic applications for beginners while guiding power users in understanding the extensibility of the knitr package. This chapter gives a first complete example that covers basic concepts and what we can do with knitr.

Chapter 2: Introduction to Parallel Processing in R

From Parallel Computing for Data Science: With Examples in R, C++ and CUDA by Norman Matloff

This book is one of the first parallel computing books to concentrate exclusively on data structures, algorithms, software tools, and applications in data science. It covers the classic "n observations, p variables" matrix format, time series, network graph models, and numerous other structures common in data science. Many examples from statistics, data mining, machine learning, pattern recognition, and analytics illustrate the range of issues encountered in parallel programming. Instead of starting with an abstract overview of parallel programming, this chapter starts right to work with a concrete example in R.

Chapter 3: Summarizing and Graphing Data

From Using the R Commander: A Point-and-Click Interface for R by John Fox

The R Commander is a graphical user interface for R, providing access to R statistical software through familiar menus and dialog boxes instead of by typing commands. Both R and the R Commander are freely available for all common computing systems: Windows, Mac OS X, and Linux/Unix. This chapter explains how to use the R Commander to compute simple numerical summaries of data, to construct and analyze contingency tables, and to draw common statistical graphs.

Chapter 4: Getting Started with Reproducible Research

From Reproducible Research with R and RStudio, Second Edition by Christopher Gandrud

This book brings together the skills and tools needed for doing and presenting computational research. Using straightforward examples, the book takes you through an entire reproducible research workflow. This practical workflow enables you to gather and analyze data as well as dynamically present results in print and on the web. This chapter first gives a brief overview of the reproducible research process: a workflow for reproducible research. Then it covers some of the key guidelines that can help make your research more reproducible.

1

Dynamic Documents with R and knitr, Second Edition

Chapter 1: A First Look

The following is excerpted from Dynamic Documents with R and knitr, Second Edition by Yihui Xie. © 2015 Taylor & Francis Group. All rights reserved. To purchase a copy, click here.

The knitr package is a general-purpose literate programming engine: it supports document formats including LaTeX, HTML, and Markdown (see Chapter 5), and programming languages such as R, Python, awk, C++, and shell scripts (Chapter 11). Before we get started, we need to install knitr in R. Then we will introduce the basic concepts with minimal examples. Finally, we will show how to generate reports quickly from pure R scripts, which can be useful for beginners who do not know anything about dynamic documents.

1.1 Setup

Since knitr is an R package, it can be installed from CRAN in the usual way in R. Note that dependencies = TRUE is optional, and will install all packages that are not absolutely necessary but can enhance this package with some useful features. The development version is hosted on GitHub at https://github.com/yihui/knitr, and you can always check out the latest development version, which may not be stable but contains the latest bug fixes and new features. If you have any problems with knitr, the first thing to check is its version.

If you choose LaTeX as the typesetting tool, you may need to install MiKTeX (Windows, http://miktex.org/), MacTeX (Mac OS, http://tug.org/mactex/), or TeX Live (Linux, http://tug.org/texlive/). If you are going to work with HTML or Markdown, nothing else needs to be installed, since the output will be Web pages, which you can view with a Web browser. Once we have knitr installed, we can compile source documents using the function knit().
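The code snippets were lost in this transcription; the following is a minimal sketch of the setup steps just described. The file name your-file.Rnw is a placeholder.

    # install knitr from CRAN, with optional enhancing packages
    install.packages("knitr", dependencies = TRUE)

    # check the installed version when reporting problems
    packageVersion("knitr")

    # compile a source document
    library(knitr)
    knit("your-file.Rnw")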

A *.Rnw file is usually a LaTeX document with R code embedded in it, as we will see in the following section and Chapter 5, in which more types of documents will be introduced.

1.2 Minimal Examples

We use two minimal examples written in LaTeX and Markdown, respectively, to illustrate the structure of dynamic documents. We do not discuss the syntax of LaTeX or Markdown for the time being (see Chapter 5 instead). For the sake of simplicity, the cars dataset in base R is used to build a simple linear regression model. Type ?cars in R to see its detailed documentation. Basically it has two variables, speed and distance.

1.2.1 An Example in LaTeX

Figure 3.1 is a full example of R code embedded in LaTeX; we call this kind of document an Rnw document hereafter, because its filename extension is Rnw by convention.

FIGURE 3.1: The source of a minimal Rnw document; see output in Figure 3.2.

If we save it as a file minimal.Rnw and run knit("minimal.Rnw") in R as described before, knitr will generate a LaTeX output document named minimal.tex. For those who are familiar with LaTeX, you can compile this document to PDF via pdflatex. Figure 3.2 is the PDF document compiled from the Rnw document. What is essential here is how we embedded R code in LaTeX. In an Rnw document, <<>>= marks the beginning of a code chunk, and @ terminates a code chunk (this description is not rigorous but is often easier

to understand); we have four lines of R code between the two markers in this example to draw a scatterplot, fit a linear model, and add a regression line to the scatterplot. The command \Sexpr{} is used to embed inline R code, e.g., \Sexpr{coef(fit)[2]} in this example. We can write chunk options for a code chunk between << and >>=; the chunk options in this example specified the plot size to be 4 by 3 inches (fig.width and fig.height), and that plots should be aligned in the center (fig.align).

In this minimal example, we have most basic elements of a report:

1. title, author, and date
2. model description
3. data and computation
4. graphics

5. numeric results
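Figure 3.1 itself is not reproduced in this transcription. The following is a sketch of what such a minimal Rnw source could look like, reconstructed from the description above; the exact wording and the use of par() as the fourth code line are assumptions.

    \documentclass{article}
    \begin{document}

    <<fig.width=4, fig.height=3, fig.align='center'>>=
    # four lines of R code: margins, scatterplot, model fit, regression line
    par(mar = c(4, 4, 1, 1))
    plot(cars)
    fit <- lm(dist ~ speed, data = cars)
    abline(fit)
    @

    The slope of the regression is \Sexpr{coef(fit)[2]}.

    \end{document}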

All the output is generated dynamically from R. Even if the data has changed, we do not need to redo the report from the ground up; the output will be updated accordingly if we update the data and recompile the report.

1.2.2 An Example in Markdown

LaTeX may look overwhelming to beginners due to the large number of commands. By comparison, Markdown (Gruber, 2004) is a much simpler format. Figure 3.3 is a Markdown example doing the same analysis as the previous example.

FIGURE 3.3: The source of a minimal Rmd document; see output in Figure 3.4.

The ideal output from Markdown is an HTML Web page, as shown in Figure 3.4 (in Mozilla Firefox).

FIGURE 3.4: A minimal example in Markdown with the same analysis as in Figure 3.2, but the output is HTML instead of PDF now.

Similarly, we can see the syntax for R code in a Markdown document: ```{r} opens a code chunk, ``` terminates a chunk, and inline R code can be put inside `r `, where ` is a backtick.
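Figure 3.3 is likewise not reproduced here; a minimal Rmd source along the lines just described would look roughly like this (the heading and surrounding text are assumptions):

    # A Minimal Example

    We build a regression model on the `cars` data.

    ```{r fig.width=4, fig.height=3, fig.align='center'}
    plot(cars)
    fit <- lm(dist ~ speed, data = cars)
    abline(fit)
    ```

    The slope of the regression is `r coef(fit)[2]`.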

A slightly longer example in knitr is a demo named notebook, which is based on Markdown. It shows not only the potential power of Markdown, but also the possibility of building Web applications with knitr. To watch the demo, run the code below.
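The demo call was lost in transcription; it is presumably along these lines:

    library(knitr)
    demo("notebook", package = "knitr")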

Your default Web browser will be launched to show a Web notebook. The source code is in the left panel, and the live results are in the right panel. You are free to experiment with the source code and recompile the notebook.

1.3 Quick Reporting

If a user only has basic knowledge of R but knows nothing about knitr, or does not want to write anything other than an R script, it is also possible to generate a quick report from that R script using the stitch() function.

The basic idea of stitch() is that knitr provides a template of the source document with some default settings, so that the user only needs to feed this template with an R script (as one code chunk); then knitr will compile the template to a report. Currently it has built-in templates for LaTeX, HTML, and Markdown.

1.4 Extracting R Code

For a literate programming document, we can either compile it to a report (run the code), or extract the program code in it. These are called "weaving" and "tangling," respectively. Apparently the function knit() is for weaving, and the corresponding tangling function is purl() in knitr.
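The usage examples were lost in transcription; they are presumably along these lines, with the file names as placeholders:

    library(knitr)

    # weave: turn a plain R script into a quick report
    stitch("your-script.R")

    # tangle: extract all code chunks from a source document
    purl("your-file.Rnw")   # writes your-file.R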

The result of tangling is an R script; in the above examples, the default output will be your-file.R, which consists of all code chunks in the source document. So far we have been introducing the command-line usage of knitr, and it is often tedious to type the commands repeatedly. In the next chapter, we show how a decent editor can help edit and compile the source document with a single mouse click or a keyboard shortcut.

2

Parallel Computing for Data Science: With Examples in R, C++ and CUDA

Chapter 2: Introduction to Parallel Processing in R

The following is excerpted from Parallel Computing for Data Science: With Examples in R, C++ and CUDA by Norman Matloff. © 2015 Taylor & Francis Group. All rights reserved. To purchase a copy, click here.

Instead of starting with an abstract overview of parallel programming, we'll get right to work with a concrete example in R. The abstract overview can wait. But we should place R in proper context first.

2.1 Recurring Theme: The Principle of Pretty Good Parallelism

Most of this book's examples involve the R programming language, an interpreted language. R's core operations tend to have very efficient internal implementations, and thus the language generally can offer good performance if used properly.

2.1.1 Fast Enough

In settings in which you really need to maximize execution speed, you may wish to resort to writing in a compiled language such as C/C++, which we will indeed do occasionally in this book. However, the extra speed that may be attained via the compiled language typically is just not worth the effort. In other words, we have an analogy to the Pretty Good Privacy security system:

The Principle of Pretty Good Parallelism: In many cases just "pretty fast" is quite good enough. The extra speed we might attain by moving from R to C/C++ does not justify the possibly much longer time needed to write, debug, and maintain code at that level.

This of course is the reason for the popularity of the various parallel R packages. They fulfill a desire to code parallel operations yet still stay in R. For example, the Rmpi package provides an R connection to the Message Passing Interface (MPI), a very widely used parallel processing system in which applications are normally written in C/C++ or FORTRAN. Rmpi gives analysts the opportunity to take advantage of MPI while staying within R.

But as an alternative to Rmpi that also uses MPI, R users could write their application code in C/C++, calling MPI functions, and then interface R to the resulting C/C++ function. In doing so, however, they would be foregoing the coding convenience and rich packages available in R. So, most opt for using MPI only via the Rmpi interface, not directly in C/C++.

The aim of this book is to provide a general treatment of parallel processing in data science. The fact that R provides a rich set of powerful, high-level data and statistical operations means that examples in R will be shorter and simpler than they would typically be in other languages. This enables the reader to truly focus on the parallel computation methods themselves, rather than be distracted by having to wade through the details of, say,

intricate nested loops. Not only is this useful from a learning point of view, but also it will make it easy to adapt the code and techniques presented here to other languages, such as Python or Julia.

2.1.2 "R + X"

Indeed, a major current trend in R is what might be called "R + X," where X is some other language or library. R + C, in which one writes one's main code in R but writes a portion needing extra speed in C or C++, has been common since the beginnings of R. These days X might be Python, Julia, Hadoop, H2O, or lots of other things.

2.2 A Note on Machines

Three types of machines will be used for illustration in this book: multicore systems, clusters, and graphics processing units (GPUs). As noted in the preface, I am not targeting the book to those fortunate few who have access to supercomputers (though the methods presented here do apply to such machines). Instead, it is assumed that most readers will have access to more modest systems, say multicore with 4-16 cores, or clusters with nodes numbering in the dozens, or a single GPU that may not be the absolute latest model.

Most of the multicore examples in this book were run on a 32-core system on which I seldom used all the cores (as I was a guest user). The timing experiments usually start with a small number of cores, say 2 or 4. As to clusters, my coverage of "message-passing" software was typically run on the multicore system, though occasionally on a real cluster to demonstrate the effects of overhead.

The GPU examples here were typically run on modest hardware. Again, the same methods as used here do apply to the more formidable systems, such as the behemoth supercomputers with multiple GPUs and so on. Tweaking is typically needed for such systems, but this is beyond the scope of this book.

2.3 Recurring Theme: Hedging One's Bets

As I write this in 2014, we are in an exciting time for parallel processing. Hardware can be purchased for the home that has speeds and degrees of parallelism that were unimaginable for ordinary PC users just a decade ago. There have been tremendous improvements in software as well; R now has a number of approaches to parallelism, as we will see in this book; Python finally threw off its GIL yoke a few years ago; C++11 has built-in parallelism; and so on.

Due to all this commotion, though, we really are in a rather unsettled state as we look forward to the coming years. Will GPUs become more and more mainstream? Will multicore chips routinely contain so many cores that GPUs remain a niche type of hardware? Will accelerator chips like the Intel Xeon Phi overtake GPUs? And will languages keep pace with the advances in hardware?

For this reason, software packages that span more than one type of hardware, and can be used from more than one type of programming language, have great appeal. The Thrust package, the topic of Chapter 7 and some of the later sections, epitomizes this notion. The same Thrust code can run on either multicore or GPU platforms, and since it is C++ based, it is accessible from R or most other languages. In short, Thrust allows us to "hedge our bets" when we develop parallel code.

Message-passing software systems, such as R's snow, Rmpi, and pbdR, have much the same advantage, as they can run on either multicore machines or clusters.

2.4 Extended Example: Mutual Web Outlinks

So, let's look at our promised concrete example.

Suppose we are analyzing Web traffic, and one of our questions concerns how often two websites have links to the same third site. Say we have outlink information for n Web pages. We wish to find the mean number of mutual outlinks per pair of sites, among all pairs.

This computation is actually similar in pattern to those of many statistical methods, such as Kendall's tau and the U-statistic family. The pattern takes the following form: for data consisting of n observations, compute some quantity g for each pair of observations, then sum all those values, as in the pseudocode (i.e., outline) below.
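The pseudocode itself was lost in transcription; based on the description, it runs along these lines (this reconstruction stands in for the "Listing 1.4.1" referenced below):

    sum = 0
    for i = 1, ..., n-1
        for j = i+1, ..., n
            sum = sum + g(obs[i], obs[j])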

With nested loops like this, you'll find in this book that it is generally easier to parallelize the outer loop rather than the inner one. If we have a dual-core machine, for instance, we could assign one core to handle some values of i in the above code and the other core to handle the rest. Ultimately we'll do that here, but let's first take a step back and think about this setting.

2.4.1 Serial Code

Let's first implement this procedure in serial code. Here links is a matrix representing the outlinks of the various sites, with links[i,j] being 1 or 0, according to whether there is an outlink from site i to site j. The code is a straightforward implementation of the pseudocode in Listing 1.4.1 above. How does this code do in terms of performance? Consider a simulation in which we generate random 1s and 0s and call the function.
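The two listings were lost in transcription; here is a sketch of the serial function and the simulation driver, reconstructed from the surrounding description (the function bodies are assumptions consistent with that description):

    # mean number of mutual outlinks over all pairs of sites;
    # links is a 0/1 matrix, with links[i,j] = 1 meaning site i links to site j
    mutoutser <- function(links) {
        nr <- nrow(links)
        nc <- ncol(links)
        tot <- 0
        for (i in 1:(nr - 1)) {
            for (j in (i + 1):nr) {
                for (k in 1:nc)
                    tot <- tot + links[i, k] * links[j, k]
            }
        }
        tot / (nr * (nr - 1) / 2)   # divide by the number of pairs
    }

    # simulation driver: random 0/1 link matrix, timed run
    sim <- function(nr, nc) {
        links <- matrix(sample(0:1, nr * nc, replace = TRUE), nrow = nr)
        print(system.time(mutoutser(links)))
    }

    sim(500, 500)   # 500 sites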

A sample run reports an elapsed time of 106.659 seconds--awful! We're dealing with 500 websites, a tiny number in view of the millions that are out there, and yet it took almost 2 minutes to find the mean mutual outlink value for this small group of sites.

It is well known, though, that explicit for loops are slow in R, and here we have two of them. The first solution to try for loop avoidance is vectorization, meaning to replace a loop with some vector computation. This gives one the speed of the C code that underlies the vector operation, rather than having R reinterpret the loop code line by line, at each iteration.

In the code for mutoutser() above, the inner loops can be rewritten as a matrix product, as we will see below, and that will turn out to eliminate two of our loops.

To see the matrix formulation, suppose we have the following matrix.
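The example matrix itself is not reproduced in this transcription; here is a stand-in 0/1 matrix a, chosen so that it reproduces the numbers 1, 1, and 2 quoted below:

    a =
        0 1 0 0
        1 1 0 1     (row 2)
        0 0 1 1
        1 0 1 0
        1 1 0 0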

Consider the case in which i is 2 and j is 4 in the above pseudocode, Listing 1.4.1. The innermost loop, i.e., the one involving k, computes the sum over k of a[2,k] * a[4,k]. But that is merely the inner product of rows i and j of the matrix! In other words, it's a[2, ] %*% a[4, ].

But there's more. Again consider the case in which i is 2. The same reasoning as above shows that the entire computation for all j and k, i.e., the two innermost loops, can be written as the matrix-vector product

    a[3:5, ] %*% a[2, ]

The matrix on the left is the portion of our original matrix below row 2, and the vector on the right is row 2 itself. Those numbers, 1, 1, and 2, are the results we would get from running the code with i equal to 2 and j equal to 3, 4, and 5. (Check this yourself to get a better grasp of how this works.) So, we can eliminate two loops, as follows.
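The vectorized listing was also lost in transcription; here is a sketch consistent with the description (the function name mutoutser1 is an assumption):

    # vectorized version: the two inner loops become one matrix-vector product
    mutoutser1 <- function(links) {
        nr <- nrow(links)
        tot <- 0
        for (i in 1:(nr - 1)) {
            tmp <- links[(i + 1):nr, , drop = FALSE] %*% links[i, ]
            tot <- tot + sum(tmp)
        }
        tot / (nr * (nr - 1) / 2)
    }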

This actually brings a dramatic improvement. Wonderful! Nevertheless, that is still only for the very small 500-site case; running it for 2000 sites takes over 1.5 minutes, and 2000 is still not very large.

We could further fine-tune our code, but it does seem that parallelizing may be a better option. Let's go that route.

2.4.2 Choice of Parallel Tool

The most popular tools for parallel R are snow, multicore, foreach, and Rmpi. Since the first two of these are now part of the R core in a package named parallel, it is easiest to use one of them for our introductory material in this chapter, rather than having the user install another package at this point. Our set of choices is further narrowed by the fact that multicore runs only on Unix-family (e.g., Linux and Mac) platforms, not Windows. Accordingly, at this early point in the book, we will focus on snow.

2.4.3 Meaning of "snow" in This Book

As noted, an old contributed package for R, snow, was later made part of the R base, in the latter's parallel package (with slight modifications). We will make frequent use of this part of that package, so we need a short name for it. "The portion of parallel adapted from snow" would be anything but short. So, we'll just call it snow.

2.4.4 Introduction to snow

Here is an overview of how snow operates. All four of the popular packages cited above, including snow, typically employ a scatter/gather paradigm: we have multiple instances of R running at the same time, either on several machines in a cluster, or on a multicore machine. We'll refer to one of the instances as the manager, with the rest being workers. The parallel computation then proceeds as follows:

- scatter: The manager breaks the desired computation into chunks, and sends ("scatters") the chunks to the workers.
- chunk computation: The workers then do the computation on each chunk, and send the results back to the manager.
- gather: The manager receives ("gathers") those results, and combines them to solve the original problem.

In our mutual-outlink example here, each chunk would consist of some values of i in the outer for loop in Listing 1.4.1. In other words, each worker would determine the total count of mutual outlinks for this worker's assigned values of i, and then return that count to the manager. The latter would collect these

counts, sum them to form the grand total, and then obtain the average by dividing by the number of node pairs, n(n-1)/2.

2.4.5 Mutual Outlinks Problem, Solution 1

Here's our first cut at the mutual outlinks problem.
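The code listing itself was lost in transcription. The following sketch is reconstructed from the analysis in the next section, which names the functions doichunk(), mutoutpar(), snowsim(), and initcls(); the function bodies are assumptions consistent with that discussion:

    library(parallel)   # contains the snow portion of R's parallel support

    # worker code: count mutual outlinks for this worker's chunk of i values
    doichunk <- function(ichunk) {
        tot <- 0
        nr <- nrow(lnks)   # lnks is sent to the workers via clusterExport()
        for (i in ichunk) {
            tmp <- lnks[(i + 1):nr, , drop = FALSE] %*% lnks[i, ]
            tot <- tot + sum(tmp)
        }
        tot
    }

    # manager code: scatter the i values, gather and combine the counts
    mutoutpar <- function(cls) {
        nr <- nrow(lnks)
        clusterExport(cls, "lnks")    # send the data to the workers
        ichunks <- 1:(nr - 1)         # one i value per chunk, for simplicity
        tots <- clusterApply(cls, ichunks, doichunk)
        Reduce(sum, tots) / (nr * (nr - 1) / 2)
    }

    # simulation driver; lnks is made global via <<-, as clusterExport() requires
    snowsim <- function(nr, nc, cls) {
        lnks <<- matrix(sample(0:1, nr * nc, replace = TRUE), nrow = nr)
        print(system.time(mutoutpar(cls)))
    }

    # set up a snow cluster; nworkers may be a count or a vector of machine names
    initcls <- function(nworkers) makeCluster(nworkers)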

2.4.5.2 Timings

Before explaining how this code works, let's see if it yields a speed improvement. I ran on the same machine used earlier, but in this case with two workers, i.e., on two cores. We did get a speedup, with the run time diminished by almost 14 seconds. Good, but note that the speedup factor is only 94.071/80.348 = 1.17, not the 2.00 one might expect from using two workers. This illustrates that communication and other overhead can indeed be a major factor.

Note the stark discrepancy between user and elapsed time here. Remember, these are times for the manager! The main computation is done by the workers, and their times don't show up here except in elapsed time.

You might wonder whether two cores are enough, since we have a total of three processes--two workers and the manager. But since the manager is idle while the two workers are computing, there would be no benefit in having the manager run on a separate core, even if we had one (which we in a sense do, with hyperthreading, to be explained shortly).

This run was performed on a dual-core machine, hence our using two workers. However, we may be able to do a bit better, as this machine has a hyperthreaded processor. This means that each core is capable, to some degree, of running two programs at once. Thus I tried running with four workers; hyperthreading did yield further improvement, raising our speedup factor to 1.34. Note, though, that there is now an even greater disparity between the 4.00 speedup we might hope to get with four workers and what we actually attained. As noted, these issues will arise

frequently in this book; the sources of overhead will be discussed, and remedies presented.

There is another reason why our speedups above are not so impressive: our code is fundamentally unfair--it makes some workers do more work than others. This is known as a load balancing problem, one of the central issues in the parallel processing field. We'll address this in a refined version in Chapter 3.

2.4.5.3 Analysis of the Code

So, how does all this work? Let's dissect the code.

Even though snow and multicore are now part of R via the parallel package, the package is not automatically loaded. So we need to take care of this first, by placing a line

    library(parallel)

at the top of our source file (if all these functions are in one file), or simply executing the above library() call on the command line. Or, we can insert that line in the functions that make use of snow.

Now, who does what? It's important to understand that most of the lines of code in the above version are executed by the manager. The only code run by the workers will be doichunk(), though of course that is where the main work is done. As will be seen, the manager sends that function (and data) to the workers, who execute the function according to the manager's directions. The basic idea is to break the values of i in the i loop in our earlier serial code, Listing 1.4.1, into chunks, and then have each worker work on its chunk. Our function doichunk() ("do element i in ichunk"),

will be executed for each worker, with ichunk being different for each worker. Our function mutoutpar() wraps the overall process, dividing the i values into chunks and calling doichunk() on each one. It thus parallelizes the outer loop of the serial code.

To get an overview of that function, note that the main actions consist of the following calls to snow and R functions:

- We call snow's clusterExport() to send our data, in this case the lnks matrix, to the workers.
- We call snow's clusterApply() to direct the workers to perform their assigned chunks of work.
- We call R's core function Reduce() as a convenient way to combine the results returned by the workers.

Here are the details. Even before calling mutoutpar(), we set up our snow cluster:

    cls <- makeCluster(nworkers)

This sets up nworkers workers. Remember, each of these workers will be a separate R process (as will be the manager). In this simple form, they will all be running on the same machine, presumably multicore.

Clusters are snow abstractions, not physical entities, though we can set up a snow cluster on a physical cluster of machines. As will be seen in detail later, a cluster is an R object that contains information on the various workers and how to reach them. So, if I run

    cls <- makeCluster(4)

I create a 4-node snow cluster (for 4 workers) and save its information in an R object cls (of class "cluster"), which will be used in my subsequent calls to snow functions.

There is one component in cls for each worker. So after the above call, running

    length(cls)

prints out 4.

We can also run snow on a physical cluster of machines, i.e., several machines connected via a network. Calling the above function initcls() arranges this. In my department, for example, we have student lab machines named pc1, pc2, and so on, so for instance

    cls <- initcls(c("pc1", "pc2"))

would set up a two-node snow run.

In any case, in the above default call to makeCluster(), communication between the manager and the workers is done via network sockets, even if we are on a multicore machine.

Now, let's take a closer look at mutoutpar(), first the call

    clusterExport(cls, "lnks")

This sends our data matrix lnks to all the workers in cls.

An important point to note is that clusterExport() by default requires the transmitted data to be global in the manager's workspace. It is then placed in the global workspace of each worker (without any alternative option offered). To meet this requirement, I made lnks global back when I created this data in snowsim(), using the superassignment operator <<-:

    lnks <<- matrix(sample(0:1, nr * nc, replace = TRUE), nrow = nr)

The use of global variables is rather controversial in the software development world. In my book The Art of R Programming (NSP, 2011), I address some of the objections some programmers have to global variables, and argue that in many cases (especially in R), globals are the best (or least bad) solution.

In any case, here the structure of clusterExport() basically forces us to use globals. For the finicky, there is an option to use an R environment instead of the manager's global workspace. We could change the above call within mutoutpar(), for instance, to

    clusterExport(cls, "lnks", envir = environment())

The R function environment() returns the current environment, meaning the context of the code within mutoutpar(), in which lnks is a local variable. But even then the data would still be global at the workers.

Here are the details of the clusterApply() call. Let's refer to the second argument of clusterApply(), in this case ichunks, as the "work assignment" argument, as it parcels out work to the workers. To keep things simple in this introductory example, we have just a single i value for each "chunk":

    ichunks <- 1:(nr - 1)

(We'll extend this to larger chunks in Section 3.2.1.) Here clusterApply() will treat that ichunks vector as an R list of nr-1
