Statistical Analysis Handbook - StatsRef

Statistical Analysis Handbook
A Comprehensive Handbook of Statistical Concepts, Techniques and Software Tools
2018 Edition
Dr Michael J de Smith

Copyright 2015-2018 All Rights reserved. 2018 Edition. Issue version: 2018-1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the UK Copyright Designs and Patents Act 1998 or with the written permission of the authors. The moral right of the authors has been asserted. Copies of this edition are available in electronic book and web-accessible formats only.

Disclaimer: This publication is designed to offer accurate and authoritative information in regard to the subject matter. It is provided on the understanding that it is not supplied as a form of professional or advisory service. References to software products, datasets or publications are purely made for information purposes and the inclusion or exclusion of any such item does not imply recommendation or otherwise of the product or material in question.

For more details please refer to the Guide’s website: www.statsref.com

ISBN-13:
978-1-912556-06-9 Hardback
978-1-912556-07-6 Paperback
978-1-912556-08-3 eBook

Published by: The Winchelsea Press, Drumlin Security Ltd, Edinburgh

Front inside cover image: Polar bubble plot (MatPlotLib, Python)
Rear inside cover image: Florence Nightingale's polar diagram of causes of mortality, by month (source: Wikipedia)
Cover image: Mandelbrot set fractal

Table of Contents

1 Introduction
  1.1 How to use this Handbook
  1.2 Intended audience and scope
  1.3 Suggested reading
  1.4 Notation and symbology
  1.5 Historical context
  1.6 An applications-led discipline
2 Statistical data
  2.1 The Statistical Method
  2.2 Misuse, Misinterpretation and Bias
  2.3 Sampling and sample size
  2.4 Data preparation and cleaning
  2.5 Missing data and data errors
  2.6 Statistical error
  2.7 Statistics in Medical Research
    2.7.1 Causation
    2.7.2 Conduct and reporting of medical research
3 Statistical concepts
  3.1 Probability theory
    3.1.1 Odds
    3.1.2 Risks
    3.1.3 Frequentist probability theory
    3.1.4 Bayesian probability theory
    3.1.5 Probability distributions
  3.2 Statistical modeling
  3.3 Computational statistics
  3.4 Inference
  3.5 Bias
  3.6 Confounding
  3.7 Hypothesis testing
  3.8 Types of error
  3.9 Statistical significance
  3.10 Confidence intervals
  3.11 Power and robustness
  3.12 Degrees of freedom
  3.13 Non-parametric analysis
4 Descriptive statistics
  4.1 Counts and specific values
  4.2 Measures of central tendency
  4.3 Measures of spread
  4.4 Measures of distribution shape
  4.5 Statistical indices
  4.6 Moments
5 Key functions and expressions
  5.1 Key functions
  5.2 Measures of Complexity and Model selection
  5.3 Matrices
6 Data transformation and standardization
  6.1 Box-Cox and Power transforms
  6.2 Freeman-Tukey (square root and arcsine) transforms
  6.3 Log and Exponential transforms
  6.4 Logit transform
  6.5 Normal transform (z-transform)
7 Data exploration
  7.1 Graphics and visualization
  7.2 Exploratory Data Analysis
8 Randomness and Randomization
  8.1 Random numbers
  8.2 Random permutations
  8.3 Resampling
  8.4 Runs test
  8.5 Random walks
  8.6 Markov processes
  8.7 Monte Carlo methods
    8.7.1 Monte Carlo Integration
    8.7.2 Monte Carlo Markov Chains (MCMC)
9 Correlation and autocorrelation
  9.1 Pearson (Product moment) correlation
  9.2 Rank correlation
  9.3 Canonical correlation
  9.4 Autocorrelation
    9.4.1 Temporal autocorrelation
    9.4.2 Spatial autocorrelation
10 Probability distributions
  10.1 Discrete Distributions
    10.1.1 Binomial distribution
    10.1.2 Hypergeometric distribution
    10.1.3 Multinomial distribution
    10.1.4 Negative Binomial or Pascal and Geometric distribution
    10.1.5 Poisson distribution
    10.1.6 Skellam distribution
    10.1.7 Zipf or Zeta distribution
  10.2 Continuous univariate distributions
    10.2.1 Beta distribution
    10.2.2 Chi-Square distribution
    10.2.3 Cauchy distribution
    10.2.4 Erlang distribution
    10.2.5 Exponential distribution
    10.2.6 F distribution
    10.2.7 Gamma distribution
    10.2.8 Gumbel and extreme value distributions
    10.2.9 Normal distribution
    10.2.10 Pareto distribution
    10.2.11 Student's t-distribution (Fisher's distribution)
    10.2.12 Uniform distribution
    10.2.13 von Mises distribution
    10.2.14 Weibull distribution
  10.3 Multivariate distributions
  10.4 Kernel Density Estimation
11 Estimation and estimators
  11.1 Maximum Likelihood Estimation (MLE)
  11.2 Bayesian estimation
12 Classical tests
  12.1 Goodness of fit [...]
    [...] Lilliefors
  12.2 Z-tests
    12.2.1 Test of a single mean, standard deviation known
    12.2.2 Test of the difference between two means, standard deviations known
    12.2.3 Tests for proportions, p
  12.3 T-tests
    12.3.1 Test of a single mean, standard deviation not known
    12.3.2 Test of the difference between two means, standard deviation not known
    12.3.3 Test of regression coefficients
  12.4 Variance tests
    12.4.1 Chi-square test of a single variance
    12.4.2 F-tests of two variances
    12.4.3 Tests of homogeneity
  12.5 Wilcoxon rank-sum/Mann-Whitney U test
  12.6 Sign test
13 Contingency tables
  13.1 Chi-square contingency table test
  13.2 G contingency table test
  13.3 Fisher's exact test
  13.4 Measures of association
  13.5 McNemar's test
14 Design of experiments
  14.1 Completely randomized designs
  14.2 Randomized block designs
    14.2.1 Latin squares
    14.2.2 Graeco-Latin squares
  14.3 Factorial designs
    14.3.1 Full Factorial designs
    14.3.2 Fractional Factorial designs
    14.3.3 Plackett-Burman designs
  14.4 Regression designs and response surfaces
  14.5 Mixture designs
15 Analysis of variance and covariance
  15.1 ANOVA
    15.1.1 Single factor or one-way ANOVA
    15.1.2 Two factor or two-way and higher-way ANOVA
  15.2 MANOVA
  15.3 ANCOVA
  15.4 Non-Parametric ANOVA
    15.4.1 Kruskal-Wallis ANOVA
    15.4.2 Friedman ANOVA test
    15.4.3 Mood's Median
16 Regression and smoothing
  16.1 Least squares
  16.2 Ridge regression
  16.3 Simple and multiple linear regression
  16.4 Polynomial regression
  16.5 Generalized Linear Models (GLIM)
  16.6 Logistic regression for proportion data
  16.7 Poisson regression for count data
  16.8 Non-linear regression
  16.9 Smoothing and Generalized Additive Models (GAM)
  16.10 Geographically weighted regression (GWR)
  16.11 Spatial series and spatial autoregression
    16.11.1 SAR models
    16.11.2 CAR models
    16.11.3 Spatial filtering models
17 Time series analysis and temporal autoregression
  17.1 Moving averages
  17.2 Trend Analysis
  17.3 ARMA and ARIMA (Box-Jenkins) models
  17.4 Spectral analysis
18 Resources
  18.1 Distribution tables
  18.2 Bibliography
  18.3 Statistical Software
  18.4 Test Datasets and data archives
  18.5 Websites
  18.6 Tests Index
    18.6.1 Tests and confidence intervals for mean values
    18.6.2 Tests for proportions
    18.6.3 Tests and confidence intervals for the spread of datasets
    18.6.4 Tests of randomness
    18.6.5 Tests of fit to a given distribution
    18.6.6 Tests for cross-tabulated count data
  18.7 R Code samples
    18.7.1 Scatter Plot: Inequality
    18.7.2 Latin Square ANOVA
    18.7.3 Log Odds Ratio Plot
    18.7.4 Normal distribution plot
    18.7.5 Bootstrapping

Chapter 1

Introduction

The definition of what is meant by statistics and statistical analysis has changed considerably over the last few decades. Here are two contrasting definitions of what statistics is, from eminent professors in the field, some 60 years apart:

"Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena. In this definition 'natural phenomena' includes all the happenings of the external world, whether human or not." Professor Maurice Kendall, 1943, p2 [MK1]

"Statistics is: the fun of finding patterns in data; the pleasure of making discoveries; the import of deep philosophical questions; the power to shed light on important decisions, and the ability to guide decisions in business, science, government, medicine, industry." Professor David Hand [DH1]

As these two definitions indicate, the discipline of statistics has moved from being grounded firmly in the world of measurement and scientific analysis into the world of exploration, comprehension and decision-making. At the same time its usage has grown enormously, expanding from a relatively small set of specific application areas (such as design of experiments and computation of life insurance premiums) to almost every walk of life. A particular feature of this change is the massive expansion in information (and misinformation) available to all sectors and age-groups in society. Understanding this information, and making well-informed decisions on the basis of such understanding, is the primary function of modern statistical methods.

Our objective in producing this Handbook is to be comprehensive in terms of concepts and techniques (but not necessarily exhaustive), representative and independent in terms of software tools, and above all practical in terms of application and implementation.
However, we believe that it is no longer appropriate to think of a standard, discipline-specific textbook as capable of satisfying every kind of new user need. Accordingly, an innovative feature of our approach here is the range of formats and channels through which we disseminate the material: web, ebook and print. A major advantage of the electronic formats is that the text can be embedded with internal and external hyperlinks (shown underlined). In this Handbook we utilize both forms of link, with external links often referring to a small number of well-established sources, including MacTutor for bibliographic information and a number of other web resources, such as Eric Weisstein's Mathworld and the statistics portal of Wikipedia, that provide additional material on selected topics.

The treatment of topics in this Handbook is relatively informal, in that we do not provide mathematical proofs for much of the material discussed. However, where it is felt particularly useful to clarify how an expression arises, we do provide simple derivations. More generally we adopt the approach of using descriptive explanations and worked examples in order to clarify the usage of different measures and procedures. Frequently, convenient software tools are used for this purpose, notably SPSS/PASW, The R Project, MATLAB and a number of more specialized software tools where appropriate.

Just as all datasets and software packages contain errors, known and unknown, so too do all books and websites, and we expect that there will be errors despite our best efforts to remove these! Some may be genuine errors or misprints, whilst others may reflect our use of specific versions of software packages and their documentation. Inevitably with respect to the latter, new versions of the packages that we have used to illustrate this Handbook will have appeared even before publication, so specific examples, illustrations and comments on scope or restrictions may have been superseded.
In all cases the user should review the documentation provided with the software version they plan to use, check release notes for changes and known bugs, and look at any relevant online services (e.g. user/developer forums and blogs on the web) for additional materials and insights.

The interactive web and PDF versions of this Handbook provide color images and active hyperlinks, and may be accessed via the associated Internet site: www.statsref.com. The contents and sample sections of the PDF version may also be accessed from this site. In both cases the information is regularly updated. The Internet is now well established as society’s principal mode of information exchange, and most aspiring users of statistical methods are accustomed to searching for material that can easily be customized to specific needs. Our objective for such users is to provide an independent, reliable and authoritative first port of call for conceptual, technical, software and applications material that addresses the panoply of new user requirements.

Readers wishing to obtain a more in-depth understanding of the background to many of the topics covered in this Handbook should review the Suggested Reading topic.

References

[DH1] D Hand (2009) President of the Royal Statistical Society (RSS), RSS Conference Presentation, November 2009

[MK1] Kendall M G, Stuart A (1943) The Advanced Theory of Statistics: Volume 1, Distribution Theory. Charles Griffin & Company, London. First published in 1943, revised in 1958 with Stuart

1.1 How to use this Handbook

This Handbook is designed to provide a wide-ranging and comprehensive, though not exhaustive, coverage of statistical concepts and methods. Unlike a Wiki, the Handbook has a more linear flow structure, and in principle can be read from start to finish. In practice many of the topics, particularly some of those described in later parts of the document, will be of interest only to specific users at particular times, but are provided for completeness. Users are recommended to read the initial four topics (Introduction, Statistical Concepts, Statistical Data and Descriptive Statistics) and then select subsequent sections as required.

Navigating around the PDF or web versions of this
