The R Book - UPC Universitat Politècnica De Catalunya



The R Book
Second Edition
Michael J. Crawley
Imperial College London at Silwood Park, /index.htm
A John Wiley & Sons, Ltd., Publication

This edition first published 2013
© 2013 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Crawley, Michael J.
The R book / Michael J. Crawley. – 2e.
pages cm
Includes bibliographical references and index.
ISBN 978-0-470-97392-9 (hardback)
1. R (Computer program language) 2. Mathematical statistics–Data processing. I. Title.
QA276.45.R3C73 2013
519.50285 5133–dc23
2012027339

A catalogue record for this book is available from the British Library.
ISBN: 978-0-470-97392-9
Set in 10/12pt Times by Aptara Inc., New Delhi, India.

Chapters
Preface
1 Getting Started
2 Essentials of the R Language
3 Data Input
4 Dataframes
5 Graphics
6 Tables
7 Mathematics
8 Classical Tests
9 Statistical Modelling
10 Regression
11 Analysis of Variance
12 Analysis of Covariance
13 Generalized Linear Models
14 Count Data
15 Count Data in Tables
16 Proportion Data
17 Binary Response Variables
18 Generalized Additive Models
19 Mixed-Effects Models
20 Non-Linear Regression
21 Meta-Analysis
22 Bayesian Statistics
23 Tree Models
24 Time Series Analysis
25 Multivariate Statistics
26 Spatial Statistics
27 Survival Analysis
28 Simulation Models
29 Changing the Look of Graphics
References and Further Reading
Index

Detailed Contents
Preface
1 Getting Started
1.1 How to use this book
1.1.1 Beginner in both computing and statistics
1.1.2 Student needing help with project work
1.1.3 Done some R and some statistics, but keen to learn more of both
1.1.4 Done regression and ANOVA, but want to learn more advanced statistical modelling
1.1.5 Experienced in statistics, but a beginner in R
1.1.6 Experienced in computing, but a beginner in R
1.1.7 Familiar with statistics and computing, but need a friendly reference manual
1.2 Installing R
1.3 Running R
1.4 The Comprehensive R Archive Network
1.4.1 Manuals
1.4.2 Frequently asked questions
1.4.3 Contributed documentation
1.5 Getting help in R
1.5.1 Worked examples of functions
1.5.2 Demonstrations of R functions
1.6 Packages in R
1.6.1 Contents of packages
1.6.2 Installing packages
1.7 Command line versus scripts
1.8 Data editor
1.9 Changing the look of the R screen
1.10 Good housekeeping
1.11 Linking to other computer languages
2 Essentials of the R Language
2.1 Calculations
2.1.1 Complex numbers in R
2.1.2 Rounding
2.1.3 Arithmetic
2.1.4 Modulo and integer quotients

2.1.5 Variable names and assignment
2.1.6 Operators
2.1.7 Integers
2.1.8 Factors
2.2 Logical operations
2.2.1 TRUE and T with FALSE and F
2.2.2 Testing for equality with real numbers
2.2.3 Equality of floating point numbers using all.equal
2.2.4 Summarizing differences between objects using all.equal
2.2.5 Evaluation of combinations of TRUE and FALSE
2.2.6 Logical arithmetic
2.3 Generating sequences
2.3.1 Generating repeats
2.3.2 Generating factor levels
2.4 Membership: Testing and coercing in R
2.5 Missing values, infinity and things that are not numbers
2.5.1 Missing values: NA
2.6 Vectors and subscripts
2.6.1 Extracting elements of a vector using subscripts
2.6.2 Classes of vector
2.6.3 Naming elements within vectors
2.6.4 Working with logical subscripts
2.7 Vector functions
2.7.1 Obtaining tables of means using tapply
2.7.2 The aggregate function for grouped summary statistics
2.7.3 Parallel minima and maxima: pmin and pmax
2.7.4 Summary information from vectors by groups
2.7.5 Addresses within vectors
2.7.6 Finding closest values
2.7.7 Sorting, ranking and ordering
2.7.8 Understanding the difference between unique and duplicated
2.7.9 Looking for runs of numbers within vectors
2.7.10 Sets: union, intersect and setdiff
2.8 Matrices and arrays
2.8.1 Matrices
2.8.2 Naming the rows and columns of matrices
2.8.3 Calculations on rows or columns of the matrix
2.8.4 Adding rows and columns to the matrix
2.8.5 The sweep function
2.8.6 Applying functions with apply, sapply and lapply
2.8.7 Using the max.col function
2.8.8 Restructuring a multi-dimensional array using aperm
2.9 Random numbers, sampling and shuffling
2.9.1 The sample function
2.10 Loops and repeats
2.10.1 Creating the binary representation of a number
2.10.2 Loop

2.10.3 The slowness of loops
2.10.4 Do not ‘grow’ data sets by concatenation or recursive function calls
2.10.5 Loops for producing time series
2.11 Lists
2.11.1 Lists and lapply
2.11.2 Manipulating and saving lists
2.12 Text, character strings and pattern matching
2.12.1 Pasting character strings together
2.12.2 Extracting parts of strings
2.12.3 Counting things within strings
2.12.4 Upper- and lower-case text
2.12.5 The match function and relational databases
2.12.6 Pattern matching
2.12.7 Dot . as the ‘anything’ character
2.12.8 Substituting text within character strings
2.12.9 Locations of a pattern within a vector using regexpr
2.12.10 Using %in% and which
2.12.11 More on pattern matching
2.12.12 Perl regular expressions
2.12.13 Stripping patterned text out of complex strings
2.13 Dates and times in R
2.13.1 Reading time data from files
2.13.2 The strptime function
2.13.3 The difftime function
2.13.4 Calculations with dates and times
2.13.5 The difftime and as.difftime functions
2.13.6 Generating sequences of dates
2.13.7 Calculating time differences between the rows of a dataframe
2.13.8 Regression using dates and times
2.13.9 Summary of dates and times in R
2.14 Environments
2.14.1 Using with rather than attach
2.14.2 Using attach in this book
2.15 Writing R functions
2.15.1 Arithmetic mean of a single sample
2.15.2 Median of a single sample
2.15.3 Geometric mean
2.15.4 Harmonic mean
2.15.5 Variance
2.15.6 Degrees of freedom
2.15.7 Variance ratio test
2.15.8 Using variance
2.15.9 Deparsing: A graphics function for error bars
2.15.10 The switch function
2.15.11 The evaluation environment of a function
2.15.12 Scope
2.15.13 Optional

2.15.14 Variable numbers of arguments (...)
2.15.15 Returning values from a function
2.15.16 Anonymous functions
2.15.17 Flexible handling of arguments to functions
2.15.18 Structure of an object: str
2.16 Writing from R to file
2.16.1 Saving your work
2.16.2 Saving history
2.16.3 Saving graphics
2.16.4 Saving data produced within R to disc
2.16.5 Pasting into an Excel spreadsheet
2.16.6 Writing an Excel readable file from R
2.17 Programming tips
3 Data Input
3.1 Data input from the keyboard
3.2 Data input from files
3.2.1 The working directory
3.2.2 Data input using read.table
3.2.3 Common errors when using read.table
3.2.4 Separators and decimal points
3.2.5 Data input directly from the web
3.3 Input from files using scan
3.3.1 Reading a dataframe with scan
3.3.2 Input from more complex file structures using scan
3.4 Reading data from a file using readLines
3.4.1 Input a dataframe using readLines
3.4.2 Reading non-standard files using readLines
3.5 Warnings when you attach the dataframe
3.6 Masking
3.7 Input and output formats
3.8 Checking files from the command line
3.9 Reading dates and times from files
3.10 Built-in data files
3.11 File paths
3.12 Connections
3.13 Reading data from an external database
3.13.1 Creating the DSN for your computer
3.13.2 Setting up R to read from the
4 Dataframes
4.1 Subscripts and indices
4.2 Selecting rows from the dataframe at random
4.3 Sorting dataframes
4.4 Using logical conditions to select rows from the dataframe
4.5 Omitting rows containing missing values, NA
4.5.1 Replacing NAs with zeros
4.6 Using order and !duplicated to eliminate pseudoreplication

4.7 Complex ordering with mixed directions
4.8 A dataframe with row names instead of row numbers
4.9 Creating a dataframe from another kind of object
4.10 Eliminating duplicate rows from a dataframe
4.11 Dates in dataframes
4.12 Using the match function in dataframes
4.13 Merging two dataframes
4.14 Adding margins to a dataframe
4.15 Summarizing the contents of dataframes
5 Graphics
5.1 Plots with two variables
5.2 Plotting with two continuous explanatory variables: Scatterplots
5.2.1 Plotting symbols: pch
5.2.2 Colour for symbols in plots
5.2.3 Adding text to scatterplots
5.2.4 Identifying individuals in scatterplots
5.2.5 Using a third variable to label a scatterplot
5.2.6 Joining the dots
5.2.7 Plotting stepped lines
5.3 Adding other shapes to a plot
5.3.1 Placing items on a plot with the cursor, using the locator function
5.3.2 Drawing more complex shapes with polygon
5.4 Drawing mathematical functions
5.4.1 Adding smooth parametric curves to a scatterplot
5.4.2 Fitting non-parametric curves through a scatterplot
5.5 Shape and size of the graphics window
5.6 Plotting with a categorical explanatory variable
5.6.1 Boxplots with notches to indicate significant differences
5.6.2 Barplots with error bars
5.6.3 Plots for multiple comparisons
5.6.4 Using colour palettes with categorical explanatory variables
5.7 Plots for single samples
5.7.1 Histograms and bar charts
5.7.2 Histograms
5.7.3 Histograms of integers
5.7.4 Overlaying histograms with smooth density functions
5.7.5 Density estimation for continuous variables
5.7.6 Index plots
5.7.7 Time series plots
5.7.8 Pie charts
5.7.9 The stripchart function
5.7.10 A plot to test for normality
5.8 Plots with multiple variables
5.8.1 The pairs function
5.8.2 The coplot function
5.8.3 Interaction

5.9 Special plots
5.9.1 Design plots
5.9.2 Bubble plots
5.9.3 Plots with many identical values
5.10 Saving graphics to file
5.11 Summary
6 Tables
6.1 Tables of counts
6.2 Summary tables
6.3 Expanding a table into a dataframe
6.4 Converting from a dataframe to a table
6.5 Calculating tables of proportions with prop.table
6.6 The scale function
6.7 The expand.grid function
6.8 The model.matrix function
6.9 Comparing table and
7 Mathematics
7.1 Mathematical functions
7.1.1 Logarithmic functions
7.1.2 Trigonometric functions
7.1.3 Power laws
7.1.4 Polynomial functions
7.1.5 Gamma function
7.1.6 Asymptotic functions
7.1.7 Parameter estimation in asymptotic functions
7.1.8 Sigmoid (S-shaped) functions
7.1.9 Biexponential model
7.1.10 Transformations of the response and explanatory variables
7.2 Probability functions
7.3 Continuous probability distributions
7.3.1 Normal distribution
7.3.2 The central limit theorem
7.3.3 Maximum likelihood with the normal distribution
7.3.4 Generating random numbers with exact mean and standard deviation
7.3.5 Comparing data with a normal distribution
7.3.6 Other distributions used in hypothesis testing
7.3.7 The chi-squared distribution
7.3.8 Fisher’s F distribution
7.3.9 Student’s t distribution
7.3.10 The gamma distribution
7.3.11 The exponential distribution
7.3.12 The beta distribution
7.3.13 The Cauchy distribution
7.3.14 The lognormal distribution
7.3.15 The logistic distribution
7.3.16 The log-logistic

7.3.17 The Weibull distribution
7.3.18 Multivariate normal distribution
7.3.19 The uniform distribution
7.3.20 Plotting empirical cumulative distribution functions
7.4 Discrete probability distributions
7.4.1 The Bernoulli distribution
7.4.2 The binomial distribution
7.4.3 The geometric distribution
7.4.4 The hypergeometric distribution
7.4.5 The multinomial distribution
7.4.6 The Poisson distribution
7.4.7 The negative binomial distribution
7.4.8 The Wilcoxon rank-sum statistic
7.5 Matrix algebra
7.5.1 Matrix multiplication
7.5.2 Diagonals of matrices
7.5.3 Determinant
7.5.4 Inverse of a matrix
7.5.5 Eigenvalues and eigenvectors
7.5.6 Matrices in statistical models
7.5.7 Statistical models in matrix notation
7.6 Solving systems of linear equations using matrices
7.7 Calculus
7.7.1 Derivatives
7.7.2 Integrals
7.7.3 Differential equations
8 Classical Tests
8.1 Single samples
8.1.1 Data summary
8.1.2 Plots for testing normality
8.1.3 Testing for normality
8.1.4 An example of single-sample data
8.2 Bootstrap in hypothesis testing
8.3 Skew and kurtosis
8.3.1 Skew
8.3.2 Kurtosis
8.4 Two samples
8.4.1 Comparing two varian
