Week 7: More Graphics And SAS/BASE Procedures - University Of New Mexico


Week 7: More graphics and SAS/BASE procedures
SAS Programming, October 2, 2014

Paneled graphics: crime data
You can make paneled graphics using SGPANEL. This allows many subfigures in the same plot. For example, for the crime data, you can plot crime against population for each state.

Paneled graphics
Note that you can specify the dimensions of the array, for example 6x6 or 5x7. SAS does a good job of avoiding white space between figures and of using the same axes for all rows and columns, which saves space.
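Here is a minimal sketch of a paneled plot with explicit grid dimensions. The data set name crime and the variables state, population, and crimerate are hypothetical; substitute the names in your own data.

```sas
/* Assumed names: data set crime with variables state, population, crimerate */
proc sgpanel data=crime;
  /* one panel per state, arranged in a 6x6 grid */
  panelby state / rows=6 columns=6;
  scatter x=population y=crimerate;
run;
```

If there are more states than cells in the grid, SAS continues on additional pages of panels.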

Paneled graphics: UNISCALE option

Paneled graphics: Earthquake example
Here is a 1x3 arrangement, but it stretches the y-axis.

Paneled graphics: Earthquake example
Making a 1x3 arrangement that is not stretched. If you have a better solution, let me know!

Paneled graphics: Earthquake example
You can also have two BY variables for paneled graphics. A panel is drawn for each combination of the BY variables.

Paneled graphics: Earthquake example
Some combinations of the BY variables can be empty. In this case there are 19 nonempty plots.

Paneled graphics: Earthquake example
There are several options for how to present the data. You could use panelby day eventtype instead. You can use an option to skip empty panels. And you can use the layout option to control other aspects of how the panels are arranged. The layout=lattice option makes the rows and columns of the grid correspond to the two BY variables.
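A sketch of the lattice layout with two PANELBY variables. The data set quakes and the variables depth and magnitude are assumed names for illustration; eventtype and day are the variables mentioned above.

```sas
/* Assumed names: data set quakes with variables eventtype, day, depth, magnitude */
proc sgpanel data=quakes;
  /* with layout=lattice the first variable defines the columns
     and the second defines the rows */
  panelby eventtype day / layout=lattice;
  scatter x=depth y=magnitude;
run;
```

Swapping the order of the two variables in the PANELBY statement swaps the roles of rows and columns.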

Scatterplot matrix: Earthquake data
(I used Linux because SAS Studio had a bad connection.)

Scatterplot matrix: SGSCATTER for Earthquake data

SGSCATTER subfigures: Earthquake data
You can also create arrays of plots using plot statements within SGSCATTER, but these take up more space outside the plotting area than SGPANEL.
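The two uses of SGSCATTER might look roughly like this. The data set quakes and its variables are assumed names, not taken from the slides.

```sas
/* Assumed names: data set quakes with variables magnitude, depth, latitude, longitude */

/* scatterplot matrix of all pairs of the listed variables */
proc sgscatter data=quakes;
  matrix magnitude depth latitude longitude;
run;

/* an array of individual plots, one per y*x request */
proc sgscatter data=quakes;
  plot magnitude*depth magnitude*latitude depth*longitude;
run;
```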

3D plots: Earthquake data
There are options for tilting and rotating.

3D plots
There's a lot more that you can do with 3D plots that I haven't explored. You can create grid lines and evaluate a function of two variables at all points on the grid and create a surface plot.
To do something like this with data that isn't evenly spaced, you can use PROC KDE, which gives kernel density estimates for the value of the surface, and use output from this procedure to generate 3D plots.

Some basic SAS procedures
We'll look at some basic SAS procedures useful for examining and summarizing data in a little more depth. In particular, these are PROC UNIVARIATE, PROC MEANS, and PROC FREQ. We've used the latter two a little bit, but we'll look in more depth at what they can do, and also take a look at PROC UNIVARIATE.

PROC UNIVARIATE
This procedure summarizes your data one variable at a time. The "summary" tends to be very extensive, so this can generate tons of output. If you've ever wanted to summarize four observations with a five-number summary...
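A minimal sketch of running PROC UNIVARIATE on a tiny data set. The four values are made up for illustration and are not the ones used in the lecture.

```sas
/* Toy data: four made-up observations */
data four;
  input x @@;   /* @@ reads several observations from one line */
  datalines;
2 4 7 11
;
run;

/* full univariate summary of x: moments, quantiles, tests, extremes */
proc univariate data=four;
  var x;
run;
```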

PROC UNIVARIATE: output
Usually the output is a lot more than you are interested in. Here is a guideline:
1. N, the sample size or number of observations
2. Mean, the sample average
3. Std Deviation, the usual formula $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$
4. Skewness, a measure of how asymmetric a distribution is. A symmetric distribution has skewness 0. Negative skew means that it is skewed to the left. Positive skew means that it is skewed to the right (like an exponential distribution). Skewness is based on the third moment of a distribution, $E[X^3]$. A formula for skewness is $\frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\frac{(x_i-\bar{x})^3}{s^3}$, where $s$ is the sample standard deviation.

PROC UNIVARIATE: output
1. Uncorrected SS, the uncorrected sum of squares, the sum of the squared observations, $\sum_{i=1}^{n} x_i^2$
2. Coefficient of variation, the estimated standard deviation over the estimated mean, $s/\bar{x}$ (SAS reports it as a percentage, $100\,s/\bar{x}$); this is used in industrial applications, quality control, and engineering
3. Sum of weights, normally this is the same as the sample size unless you have observations weighted by how frequently they appear in a separate variable
4. Sum of observations, $\sum_{i=1}^{n} x_i$
5. Variance, the sample variance
6. Kurtosis, a measure of how peaked a distribution is, based on the fourth moment, $E[X^4]$. Theoretically, this is defined as $E[(X-\mu)^4]/\sigma^4$, where $\sigma^4 = (\sigma^2)^2$ is the square of the variance. For a standard normal distribution, the kurtosis is 3, but SAS subtracts 3 from the kurtosis so that if your data are standard normal, the reported kurtosis is close to 0.

Kurtosis for different distributions
All of these distributions have identical first, second, and third moments, but can be distinguished by their fourth moments. Distributions are Laplace (double exponential), hyperbolic secant, logistic, normal, raised cosine, Wigner semicircular, and uniform.

PROC UNIVARIATE
1. Std Error Mean. The standard error of the mean is the sample standard deviation divided by the square root of the sample size, $s/\sqrt{n}$, which is what you use for constructing confidence intervals.
2. A single-sample t test is done automatically, as well as some nonparametric tests, testing whether the data are different from 0. The Sign Test, for example, asks how likely it is to see a split between positive and negative signs as extreme as the one observed. If data were equally likely to be positive or negative, then there would be a $2 \times (1/2)^4 = 1/8$ chance that all 4 observations would have the same sign, hence the p-value of 1/8.
3. Extreme observations are also highlighted to show the highest and lowest observations. In this case, since there are only 4 observations, it lists them all, but for larger data sets, this can be useful for checking for outliers or values that are not within acceptable limits (negative heights or lengths, magnitude 67 earthquakes, etc.)

PROC UNIVARIATE
You can also use the plot option in PROC UNIVARIATE, which generates histograms or stem-and-leaf plots and plots for checking for normality.
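A sketch of the plot option, assuming a data set temperature with a numeric variable temp (both names are hypothetical):

```sas
/* Assumed names: data set temperature with numeric variable temp */
proc univariate data=temperature plot;
  var temp;
run;
```

What the plots look like depends on the output destination: listing output gives text-based plots, while graphical destinations render them as images.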

PROC UNIVARIATE plot option
Charming text graphics in the .lst file from Linux SAS.

PROC UNIVARIATE BY statement
You can also run PROC UNIVARIATE BY some variable, for example sex for the temperature data. This will generate twice as much output, so it can of course generate a ridiculous amount of output.
Typically, you might run UNIVARIATE initially when exploring your data to understand its range, look for outliers, and count missing values for each variable (but it doesn't describe patterns of missingness for joint random variables). So you might use PROC UNIVARIATE initially to explore your data, but then not include it in your final SAS code.
Note that PROC UNIVARIATE describes only quantitative variables, not character variables. PROC FREQ is more useful for describing character variables.
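A sketch of BY-group processing, assuming a data set temperature with variables sex and temp (hypothetical names). BY processing expects the data to be sorted by the BY variable, so the sketch sorts first.

```sas
/* Assumed names: data set temperature with variables sex and temp */
proc sort data=temperature out=temp_sorted;
  by sex;
run;

/* one full set of univariate output per value of sex */
proc univariate data=temp_sorted;
  by sex;
  var temp;
run;
```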

PROC MEANS
We've encountered just a little bit of what PROC MEANS can do before. We'll take a more thorough look now. PROC MEANS, like UNIVARIATE, is also useful for quantitative variables. The default behavior is to compute the MEAN, STANDARD DEVIATION, number of nonmissing values, and MIN and MAX values. PROC MEANS generates less output than PROC UNIVARIATE and is also useful for catching outliers.
Some options for PROC MEANS include MAXDEC= (the maximum number of decimal places when printing), FW= (the field width used for each statistic in the output), and SUM to calculate the sum of the observations.
The CLASS statement allows you to compute MEANS and other statistics within the levels of a class variable (i.e., separate means for men versus women).
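A sketch of PROC MEANS with a CLASS statement, again assuming a data set temperature with variables sex and temp (hypothetical names). Listing statistics on the PROC statement replaces the default set.

```sas
/* Assumed names: data set temperature with variables sex and temp */
proc means data=temperature n mean std min max sum maxdec=2;
  class sex;   /* separate statistics for each value of sex; no sorting needed */
  var temp;
run;
```

Unlike BY, the CLASS statement does not require the data to be sorted first.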

PROC MEANS: output data set
You can use PROC MEANS to create an output data set.

PROC MEANS: output data set
You can then extract relevant information from this dataset if desired.
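A sketch of creating and inspecting an output data set. The names temperature, temp, tempstats, avgtemp, and sdtemp are all hypothetical.

```sas
/* Assumed names: data set temperature with numeric variable temp */
proc means data=temperature noprint;
  var temp;
  /* write the mean and standard deviation to a new data set */
  output out=tempstats mean=avgtemp std=sdtemp;
run;

/* the output data set also contains the automatic variables _TYPE_ and _FREQ_ */
proc print data=tempstats;
run;
```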

PROC MEANS: getting the average into a new column
Suppose we want the average temperature to be available for the temperature data, and to be in the same dataset as the other temperature data. How can we do this?
The easiest thing would be to run PROC MEANS, write down the average on a piece of paper, and then hard-code it by hand into the original data as a fixed column (every row has the same value). This might be fine for most applications, but if you needed to repeat this every month or every week, you might prefer a more automatic solution.

PROC MEANS: getting the average into a new column
Another solution is to compute the mean in PROC MEANS, and then read that in to a copy of the data set.
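One common way to do this is sketched here; it is not necessarily the code used in the lecture. The names temperature, temp, tempmean, avgtemp, and temperature2 are hypothetical.

```sas
/* Assumed names: data set temperature with numeric variable temp */
proc means data=temperature noprint;
  var temp;
  output out=tempmean mean=avgtemp;
run;

data temperature2;
  /* read the single row of tempmean once; variables read by SET are
     retained, so avgtemp keeps its value on every later row */
  if _n_ = 1 then set tempmean(keep=avgtemp);
  set temperature;
run;
```

Every row of temperature2 then carries the same value of avgtemp, and rerunning the program recomputes it automatically when the data change.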

PROC MEANS: Options
These are from the book. Another useful option is noprint, which suppresses output. This is useful if you are mostly interested in creating an output data set.

PROC MEANS: naming variables
You can either have SAS automatically name variables in the output dataset or you can name them yourself. Consider the following:
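A sketch of both naming styles, assuming a data set temperature with numeric variables temp and heartrate (hypothetical names). The AUTONAME option builds names of the form variable_statistic.

```sas
/* Assumed names: data set temperature with numeric variables temp and heartrate */

/* let SAS name the output variables */
proc means data=temperature noprint;
  var temp heartrate;
  output out=auto_named mean= std= / autoname;
run;

/* name the output variables yourself, in the order of the VAR list */
proc means data=temperature noprint;
  var temp heartrate;
  output out=hand_named mean=avg_temp avg_hr std=sd_temp sd_hr;
run;
```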

PROC MEANS: _TYPE_
This seems a bit obscure. If you use chartype as an option, the 0 or 1 tells you whether it is computing the marginal versus cell means. 0 indicates a marginal mean and 1 indicates a cell mean. Thus 0 means that it is taking an overall average. If you tell the procedure to do means for individual sexes, a 1 will indicate that it is giving means for that particular sex. You can also tell this from _FREQ_, since 129 is the number of observations in the data set.

PROC MEANS: different stats for different variables

PROC FREQ
PROC FREQ is most useful for categorical data or quantitative data with few values. We've already seen some use of PROC FREQ before. Now we'll look at some options that can be used. The most basic use of PROC FREQ is

    proc freq data=mydata;
      tables myvar;   /* Use tables instead of var */
    run;

PROC FREQ: options

The COMPRESS option is useful when you are generating .lst files and don't want output to be too verbose.

Example with Earthquake data
Notice that the categories are left-justified. This is true even if I classify quakesize numerically (without quotes).

Example with Earthquake data
Putting an asterisk between two variables in the TABLES statement makes 2-way contingency tables instead of analyzing each variable separately.

Example with Earthquake data
Note that something funny happened with the names of the magnitude categories. The 6 category got truncated to have length 1 because the first value of the category had length 1. This can be fixed by a LENGTH statement in the original datastep, put before quakesize is first used. (Usually LENGTH statements occur before INPUT.) Putting

    length quakesize $ 2;

fixes the problem.

Example with earthquake data
The FORMCHAR option allows you to adjust the vertical, horizontal, and intersection characters in the outputted table. Many journals accept horizontal lines but not vertical separators between columns. Unfortunately, this has no effect in SAS Studio.

Example with earthquake data
If you use LaTeX, you can make your life easier.

Example with earthquake data
How it looks in SAS Studio.

Example with earthquake data: NLEVELS option

Example with earthquake data
The NLEVELS option is useful for debugging. In this case, no earthquakes were recorded below magnitude 1.0, which is why my category of "0" is a missing level.
For eventtype, PROC FREQ told us quickly from the output that there were three levels (earthquake, quarry blast, and out of network of interest), but if there are more categories, it might be difficult to catch this by eye. For example, you might have data that were recorded at 30 field stations, or 3000 counties across the US, and you need to make sure that data from each one is in your analysis.
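A sketch of the NLEVELS option. The data set name quakes is an assumption; eventtype and quakesize are the variables discussed above.

```sas
/* Assumed name: data set quakes containing eventtype and quakesize */
proc freq data=quakes nlevels;
  tables eventtype quakesize;
run;
```

The extra table at the top of the output reports the number of distinct levels for each variable in the TABLES statement.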

Example with earthquake data: TABLES options
You can reduce the output with some of the options for the TABLES statement.
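For instance, a two-way table can be trimmed to show counts only. This is a sketch with an assumed data set name (quakes), and these may not be the exact options shown in the lecture.

```sas
/* Assumed name: data set quakes containing eventtype and quakesize */
proc freq data=quakes;
  /* drop row percentages, column percentages, and overall percentages,
     leaving only the cell counts */
  tables eventtype*quakesize / norow nocol nopercent;
run;
```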

Example with earthquake data: TABLES options
A three-way table is presented as a sequence of two-way tables. This was generated from tables eventtype*quakesize*depthsize.

Example: birthmonth distribution
Suppose you wanted to do a chi-square test using PROC FREQ on the following data counting the number of babies born by birthmonth and sex.

Example: birthmonth distribution
How can you enter the data to be read in by PROC FREQ? There are two possibilities. One is to have one row for each birth, like this:

sex month
F   June
M   May
M   April
F   April
F   June
...

This will have 88273 rows, the number of observations. A second way is to have weights, or counts, for each combination of the categorical variables:

sex month    count
F   January  3537
F   February 3407
...
F   December 3371
M   January  3743
...
M   December 3761

Both approaches are legitimate, and which one is more convenient might depend on how you initially received the data.

Using the WEIGHT statement in PROC FREQ
ORDER=DATA preserves the order of the values encountered in the data instead of alphabetizing.
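A sketch of reading count data and weighting by the count. Only four of the 24 cells (the January and December counts quoted earlier) are entered here to keep the example short, so the test result is not the one from the lecture. The data set name births is an assumption.

```sas
/* Subset of the birth counts: January and December only */
data births;
  length month $ 9;          /* long enough for "September" */
  input sex $ month $ count;
  datalines;
F January 3537
F December 3371
M January 3743
M December 3761
;
run;

proc freq data=births order=data;
  weight count;              /* each row stands for count births */
  tables sex*month / chisq;  /* chi-square and likelihood-ratio chi-square */
run;
```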

$\chi^2$ versus Likelihood-ratio $\chi^2$
The $\chi^2$ statistic is computed using the usual formula

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where the sum is over all cells in the table. The expected count for a cell is its row total times column total divided by overall sample size.
The Likelihood-ratio $\chi^2$ is also called $G^2$, and it also has an asymptotically $\chi^2$ distribution. Its formula is

$$G^2 = 2 \sum_i O_i \log(O_i / E_i)$$

using natural logs. The values are often very similar for the two test statistics.

Other tests in PROC FREQ
In addition to $\chi^2$ and $G^2$ tests, PROC FREQ has statements for Fisher's exact tests, odds ratios, and the Cochran-Armitage test for trend (for ordinal data).
The test for trend is useful for 2xK contingency tables, where the association between the two-valued variable and the K-valued variable is thought to be changing over time. In the case of the birth data, this might mean that the ratio of male-to-female births (which should be constant but not necessarily 50-50 if there is no association) could be changing linearly throughout the year. Perhaps the discrepancy is highest in the spring and smallest in the fall.

PROC FREQ: test of trends
Note that the statistic is sensitive to the order of the categories (as it should be), and we get a different result looking for a trend from Jan-Dec compared to Apr-Mar.
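A sketch of requesting the Cochran-Armitage trend test, assuming a data set births with variables sex, month, and count, with the rows entered in the month order you want to test (hypothetical names and layout):

```sas
/* Assumed names: data set births with variables sex, month, count,
   with months appearing in the intended order (e.g. January first) */
proc freq data=births order=data;
  weight count;
  tables sex*month / trend;   /* Cochran-Armitage test for a 2 x K table */
run;
```

Because ORDER=DATA uses the order in which values appear, re-entering the rows starting from April instead of January changes the scores assigned to the months and therefore the test result.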

ORDER option for PROC FREQ
In addition to ORDER=DATA you can use ORDER=FREQ so that the table is presented in decreasing order of frequency. We did this in Week 2 with the Romeo and Juliet data. You can also use ORDER=FORMATTED to order values by their formatted labels rather than by the values in the original data (this wouldn't make a difference for our example).

Alternative to the WEIGHT statement in PROC FREQ

Skinny versus wide data
Often there's a choice of how to organize your data: a skinny format with lots of observations and fewer variables versus a wide format with more variables and fewer observations. Using the weight column makes the data more compact, but sometimes data is more easily organized in the skinny representation. Different SAS procedures might prefer one versus the other representation for input, so sometimes you have to convert from one to the other.

Another place this comes up is with repeated measures data. Suppose you have patients who are measured at 3 time points. Your data could look like this:

patientid time1 time2 time3
0001      147   145   142
0002      135   125   125
0003      162   155   156

versus

patientid time bp
0001      1    147
0001      2    145
0001      3    142
0002      1    135
0002      2    125
0002      3    125
0003      1    162
0003      2    155
0003      3    156
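A sketch of converting the wide layout above into the skinny one with a data step and an array. The data set names wide and long are made up for illustration; PROC TRANSPOSE is another way to do this.

```sas
data wide;
  input patientid $ time1 time2 time3;
  datalines;
0001 147 145 142
0002 135 125 125
0003 162 155 156
;
run;

data long;
  set wide;
  array t{3} time1-time3;   /* treat the three columns as one array */
  do time = 1 to 3;
    bp = t{time};
    output;                 /* write one row per patient per time point */
  end;
  keep patientid time bp;
run;
```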

Controlling category labels with PROC FORMAT
Instead of having "f" and "m" appear in our tables (which is how I coded the data), I might want to have "male" and "female". Similarly, I might want "January" to appear instead of "jan". This could be achieved in a data step using

    if month = "jan" then month2 = "January";

This creates extra variables, and if your data set has 88000 observations, uses a lot of extra memory or makes your program slower. Also, you might decide that sometimes you want to display "jan" and sometimes "January" and sometimes just "J" for space reasons (like cramming those words in the x-axis of a time series). This can be handled using PROC FORMAT.
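A sketch of how such formats might be defined and applied. The format names $sexfmt and $monthfmt and the data set births are hypothetical, and only two months are listed to keep it short.

```sas
/* Hypothetical format names; the stored values stay "f", "m", "jan", ... */
proc format;
  value $sexfmt   "f" = "female"
                  "m" = "male";
  value $monthfmt "jan" = "January"
                  "feb" = "February";   /* remaining months omitted */
run;

/* Assumed names: data set births with character variables sex and month */
proc freq data=births;
  format sex $sexfmt. month $monthfmt.;  /* changes the display only */
  tables sex*month;
run;
```

Since the format is applied per procedure, the same data can be displayed with a different format elsewhere without creating new variables.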

PROC FORMAT

PROC FORMAT: grouping variables
Another common use for PROC FORMAT is to group the values of a variable. Suppose we want to classify births as Winter, Spring, Summer, and Fall. This could be done by creating a variable season in the data step and using IF statements. Another approach is to use a format to do the grouping.
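A sketch of a grouping format, assuming months are coded as three-letter lowercase strings. The format name $season, the data set births, and the assignment of months to seasons are my assumptions, not the lecture's.

```sas
/* Hypothetical format mapping month codes to seasons */
proc format;
  value $season "dec", "jan", "feb" = "Winter"
                "mar", "apr", "may" = "Spring"
                "jun", "jul", "aug" = "Summer"
                "sep", "oct", "nov" = "Fall";
run;

/* Assumed names: data set births with variables sex, month, count */
proc freq data=births;
  weight count;
  format month $season.;   /* PROC FREQ tabulates by the formatted values */
  tables sex*month;
run;
```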

PROC FORMAT: numeric variables with ranges
You can use PROC FORMAT to group data into ranges instead of defining them in a data step. Suppose we want to define cities to be either small (under 500,000), medium (500,000–1,000,000), or large (over 1,000,000).
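A sketch of such a format for integer-valued populations. The format name citysize and the data set and variable names are hypothetical, and this is a reconstruction rather than the code from the lecture.

```sas
/* Ranges with a plain hyphen include both endpoints, which is fine for integers */
proc format;
  value citysize low - 499999     = "small"
                 500000 - 1000000 = "medium"
                 1000001 - high   = "large";
run;

/* Assumed names: data set cities with numeric variable population */
proc freq data=cities;
  format population citysize.;
  tables population;
run;
```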

PROC FORMAT: numeric variables with ranges
The syntax for the previous example was appropriate for integer values. For floating point values with decimals, we need to use inequalities. The syntax is a little weird. Writing 55 -< 60 creates an interval that is half-open and closed on the left, [55, 60). To make half-open intervals closed on the right, such as (55, 60], use <- instead.

    proc format;
      value age low -< 50 = "less than 50"
                50 -< 55  = "50 to less than 55"
                55 -< 60  = "55 to less than 60"
                60 -< 65  = "60 to less than 65"
                65 -< 70  = "65 to less than 70"
                70 -< 80  = "70 to less than 80"
                80 - high = "80 and over"
                other     = "missing";
    run;

Some Uses of Formats
The main use of formats is perhaps to make the output prettier, but here are some more statistically valuable uses:
1. Formats can be a way of collapsing categories in contingency tables (if cell counts are too low for $\chi^2$ tests)
2. You can use them to deal with sloppy/inconsistent coding of the data. For example, if survey data has a mix of responses such as "Y", "y", "Yes", then you can format them to all be the same value. Or if sex is coded as "M", "m", "man", "male", etc. Another way to deal with this might be to read in just the first character and use the UPCASE function. However, if states are sometimes coded as "NM" and sometimes "New Mexico", the FORMAT approach might be handy.
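A sketch of using a format to clean up inconsistent coding. The format name $yesno, the data set survey, the variable response, and the "No" variants are assumptions for illustration.

```sas
/* Hypothetical format that maps several spellings to one label */
proc format;
  value $yesno "Y", "y", "Yes" = "Yes"
               "N", "n", "No"  = "No";
run;

/* Assumed names: data set survey with character variable response */
proc freq data=survey;
  format response $yesno.;
  tables response;   /* the variants are counted together under one label */
run;
```

Values not listed in the format are displayed unchanged, so any remaining odd spellings show up as their own rows in the table.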

PROC FORMAT and graphics
Formats are also a good way to make your plots more readable. You can also use a LABEL statement to change how a variable name appears in a plot.
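A sketch combining a format and a LABEL statement in a plot. The data set temperature, its variables, the format name, and the label text are all hypothetical.

```sas
proc format;
  value $sexfmt "f" = "female"
                "m" = "male";
run;

/* Assumed names: data set temperature with variables sex, temp, heartrate */
proc sgplot data=temperature;
  scatter x=heartrate y=temp / group=sex;
  format sex $sexfmt.;                 /* legend shows female/male */
  label temp = "Body temperature"      /* axis labels instead of variable names */
        heartrate = "Heart rate";
run;
```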

More on LABEL statements and Formats
Label statements can be done in individual procedures, in a format statement or in a datastep, depending on what is most convenient.
More advanced uses of formats are to make permanent formats in a user-defined library. Instead of having a PROC FORMAT that you use repeatedly (such as converting states to their two-letter abbreviations), you can have SAS search your format library using

    options fmtsearch=(myfmts);

so that you can reuse the same format for multiple programs. To me, this makes your program less portable and less self-contained (it might make it hard to switch between different computers), but if you are reliably at the same computer all the time, this might save a lot of time programming and make your code shorter.
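A sketch of setting up such a library. The directory path and the format name $statefmt are placeholders; myfmts is the library name used above.

```sas
/* Placeholder path: point this at a directory you can write to */
libname myfmts "/path/to/formats";

/* store the format permanently in the myfmts library */
proc format library=myfmts;
  value $statefmt "New Mexico" = "NM"
                  "NM"         = "NM";   /* other states omitted */
run;

/* in later programs: assign the libname again, then tell SAS where to look */
options fmtsearch=(myfmts);
```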
