Public Health & Intelligence - Information Services Division


Public Health & Intelligence
SPSS syntax to R

Document Control
Version: Version 1.1
Date Issued: 26th June 2019
Authors: James McMahon, Chris Deans and Gavin Clark
Comments to: NSS.rusergroupinbox@nhs.net

Version   Date           Comment             Author(s)
1.0       21 June 2019   Initial version     James McMahon, Chris Deans and Gavin Clark
1.1       26 June 2019   Comments from SAG   James McMahon

Contents
Scope
Prior Knowledge
Similarities and Differences
    RStudio
    Some differences
Syntax Examples comparison
    Reading files
    Saving data to file
    Reading SMRA data
Data manipulation
    Variables
    Conditionals
    Filtering
    Sorting
    Frequencies/Crosstabs
    Aggregation
    Combining Datasets
    Transforming Data
Appendices
    A1 – R Resources
    A2 – R Packages used
    A3 – Example SPSS syntax and equivalent R script

Scope

The aim of this paper is to highlight the similarities between SPSS syntax and R. In it we give example solutions to common tasks in SPSS, along with their equivalents in R.

This paper is not meant to teach R syntax or best practice. It does not replace an actual programming course in R. Also, in most cases it will not be possible to fully translate a piece of SPSS syntax into R using this guide, and the resulting R code may not be ‘best practice’ or the most efficient way of completing the task. This paper should be used as a starting / reference point only. For further resources see the appendices.

Prior Knowledge

This paper assumes knowledge of SPSS syntax and experience conducting simple analyses using SPSS. It would also be useful to have some knowledge of basic R syntax, for example by completing an introductory online course. The R examples given, where possible, use ‘tidyverse’ functions, therefore it would also be useful to have some knowledge of the tidyverse.

Similarities and Differences

RStudio

Technically “R” refers to the name of the programming language, whereas the analysts in PHI will use R within the RStudio environment. In reality, “R” and “RStudio” tend to be used interchangeably. RStudio provides lots of useful functionality to make using the R programming language much easier and more efficient. It is similar in many ways to development in SPSS, with windows for code, outputs, and the ability to view your data. Where SPSS has menus to perform certain commands, this is generally not possible with R / RStudio, with the majority of tasks being done purely with code.

Some differences

Case sensitive

In R all syntax is case sensitive; this includes commands, functions, variable names etc. RStudio can be a massive help here as it will offer to auto-complete function and variable names (by pressing the TAB key).

Packages and functions

One of the main differences between R and SPSS is that R has the majority of its functions available in “packages”. This contrasts with SPSS, where all functions are available for use and can be referenced in the script. With R you have access to a universe of community-written functions, provided in packages which need to be downloaded and installed before use. This means that for most tasks you may need to accomplish, there is probably code or a package available to make the job much easier.

R also makes it easier to reuse code. Code which you write to do one job can be easily turned into a function and then reused to complete the same or a similar job in the future. While this can be similar to macros in SPSS, the option to create functions and packages in R is considerably more flexible and powerful.

For most of the examples below, tidyverse is used; this is a collection of R packages designed for data analysis. It can be installed with install.packages("tidyverse"), and packages can be loaded for use collectively using library(tidyverse), or individually using, for example, library(dplyr).

Working with multiple datasets

In R, many objects (datasets, lookups, filepaths etc.) can be held in the analysis environment at the same time and accessed straightforwardly using their name. While still possible, in SPSS this is considerably clumsier, as most commands can only be performed on whichever single dataset has been designated as the active one. This frequently leads to SPSS syntax saving and loading temporary files instead. In R, analysis can be more easily ‘self contained’ and therefore easier to reproduce.

Descriptive terms

Some functions have similar names to those in SPSS but may not perform the same task. For example, select() from dplyr will select columns (i.e. the equivalent of dropping or keeping variables), whereas select in SPSS will remove cases. In R the function to remove cases (also from dplyr) is called filter().
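The difference between select() and filter() described above can be seen in a minimal sketch. It assumes dplyr is installed; the dataset patients and its columns are made up for illustration and do not come from any real data source.

```r
library(dplyr)

# Hypothetical example data, for illustration only
patients <- tibble(
  Name = c("Ann", "Bob", "Cara"),
  Age  = c(34, 51, 27),
  Sex  = c("F", "M", "F")
)

# select() keeps or drops columns (similar to /KEEP and /DROP in SPSS)
names_and_ages <- select(patients, Name, Age)

# filter() keeps or drops rows (similar to SELECT IF in SPSS)
over_30 <- filter(patients, Age > 30)

# R is case sensitive: patients$Age exists, whereas patients$age does not
print(names_and_ages)
print(over_30)
```

Both objects sit in the environment alongside the original patients dataset, so no temporary files are needed to keep them.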

Syntax Examples comparison

In the following syntax examples we have used highlighting to show the correspondence between the SPSS and R code. Note that this syntax highlighting is not standard to R or SPSS.

Most of the R examples below use functions from external packages; there is a table of the functions used and their respective packages in the appendices.

Reading files

CSV files

SPSS syntax:

    GET DATA /TYPE = TXT
      /FILE = "/Path/To/Data.csv"
      /DELCASE = LINE
      /DELIMITERS = ","
      /ARRANGEMENT = DELIMITED
      /FIRSTCASE = 2
      /VARIABLES = Name A28
        Age F3.0
        DOB edate11.
    CACHE.
    EXECUTE.

R code:

    datafile <- read_csv("/Path/To/Data.csv")
    # This uses the names in the first row and guesses the types
    # based on the data.

    datafile <- read_csv("/Path/To/Data.csv",
                         col_names = c("Name", "Age", "DOB"),
                         col_types = "ciD",
                         skip = 1)
    # The variable types are specified using a compact representation:
    # "c" for character (string), "i" for integer, "D" for date.
    # See the help for read_csv for a full list of available types.

Excel files

SPSS syntax:

    GET DATA
      /TYPE = XLSX
      /FILE = "/Path/To/Data.xlsx"
      /SHEET = name "GP Contact Details"
      /CELLRANGE = RANGE "A6:P199"
      /READNAMES = ON.

R code:

    datafile <- read_excel("/Path/To/Data.xlsx",
                           sheet = "GP Contact Details",
                           range = "A6:P199")

SPSS SAV files

SPSS syntax:

    GET FILE = "/Path/To/Data.sav".

R code:

    datafile <- read_sav("/Path/To/Data.sav")
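Returning to the CSV example, the compact col_types specification can be tried without a real data file. The sketch below assumes readr is installed; it writes a small made-up file to a temporary location (the names and dates are invented) and reads it back with the same arguments as above.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,DOB",
             "Ann,34,1985-02-17",
             "Bob,51,1968-11-03"),
           csv_path)

# skip = 1 drops the header row, because names are supplied explicitly
datafile <- read_csv(csv_path,
                     col_names = c("Name", "Age", "DOB"),
                     col_types = "ciD",
                     skip = 1)

# Show the resulting column types
str(datafile)
```

Note that "D" expects dates in year-month-day order by default; other layouts would need a different column specification.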

Saving data to file

CSV files

SPSS syntax:

    SAVE TRANSLATE
      /OUTFILE = "/Path/To/File.csv"
      /TYPE = CSV
      /REPLACE
      /FIELDNAMES.

R code:

    write_csv(data, "/Path/To/File.csv", na = "")
    # This is for a dataset named data.
    # By default missing values are written as "NA" in the new file;
    # here they are written as blanks instead.

RDS files

SPSS syntax:

    N/A

R code:

    # There is a special file format for storing R data objects, RDS.
    # The commands for this are:
    write_rds(data, "/Path/To/File.rds")
    data <- read_rds("/Path/To/File.rds")

Reading SMRA data

SPSS syntax:

    INSERT FILE = pass.sps.

    GET DATA
      /TYPE = ODBC
      /CONNECT = !connect
      /SQL = "SELECT LOCATION, ADMISSION_DATE, DISCHARGE_DATE, "
             "HBTREAT_CURRENTDATE, LINK_NO, AGE_IN_YEARS, ADMISSION_TYPE, "
             "CIS_MARKER, ADMISSION, DISCHARGE, URI "
             "FROM ANALYSIS.SMR01_PI "
             "WHERE ADMISSION_DATE >= '2006-41'".
    CACHE.
    EXECUTE.

R code:

    SMRA_connect <- dbConnect(odbc(),
                              dsn = "SMRA",
                              uid = rstudioapi::showPrompt(title = "Username",
                                                           message = "Username:"),
                              pwd = rstudioapi::askForPassword("SMRA Password:"))

    Query_SMR01 <- "SELECT LOCATION, ADMISSION_DATE,
                    DISCHARGE_DATE, HBTREAT_CURRENTDATE,
                    LINK_NO, AGE_IN_YEARS,
                    ADMISSION_TYPE, CIS_MARKER,
                    ADMISSION, DISCHARGE, URI
                    FROM ANALYSIS.SMR01_PI
                    WHERE ADMISSION_DATE >= '01-APR-2006'"

    data <- dbGetQuery(SMRA_connect, Query_SMR01) %>%
      as_tibble()

    dbDisconnect(SMRA_connect)
    rm(SMRA_connect)
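The file-saving commands from earlier in this section can be tried safely using temporary files. This is a minimal sketch assuming readr is installed; the dataset and its values are made up for illustration.

```r
library(readr)

# Hypothetical dataset with one missing value
data <- data.frame(id = 1:3, value = c(1.5, NA, 3))

# Write to CSV, storing missing values as blanks rather than "NA"
csv_path <- tempfile(fileext = ".csv")
write_csv(data, csv_path, na = "")
csv_lines <- readLines(csv_path)
print(csv_lines)

# Round trip through the RDS format, which preserves column types
rds_path <- tempfile(fileext = ".rds")
write_rds(data, rds_path)
restored <- read_rds(rds_path)
print(identical(restored, data))
```

Unlike CSV, an RDS file can only be read by R, so it is best suited to intermediate files within an R analysis.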

Data manipulation

Comments

SPSS syntax:

    * Comment in SPSS.
    compute x = 0. /* Comment after command.

R code:

    # Comment in R
    x <- 0 # Comment after command

Variables

Create new variable/assign value to existing variable

SPSS syntax:

    compute Var1 = 0.
    compute Var2 = 2 * (Var1 + 5).

R code:

    data <- mutate(data, Var1 = 0,
                   Var2 = 2 * (Var1 + 5))

Rename variables

SPSS syntax:

    rename variables OldVar = NewVar.

R code:

    data <- rename(data, NewVar = OldVar)

Delete/Drop/Keep variables

SPSS syntax:

    delete variables Var1 Var2.

R code:

    data <- select(data, -Var1, -Var2)

SPSS syntax:

    * As part of GET or SAVE etc.
    /DROP = Var1 Var2.

R code:

    data <- select(data, -Var1, -Var2)

SPSS syntax:

    * As part of GET or SAVE etc.
    /KEEP = Var3 Var4.

R code:

    data <- select(data, Var3, Var4)

Changing Variable Type

SPSS syntax:

    * Number to string.
    alter type Year (a4).

R code:

    data <- mutate(data, Year = as.character(Year))

SPSS syntax:

    * String to number.
    alter type Var1 (f2.0).

R code:

    data <- mutate(data, Var1 = as.numeric(Var1))

SPSS syntax:

    * Dates.
    compute DOB = date.dmy(day, month, year).

R code:

    data <- mutate(data,
                   DOB = dmy(paste0(day, "-", month, "-", year)))

Rounding

SPSS syntax:

    * Round a number.
    compute Var2 = rnd(Var1).

R code:

    # Use round_half_up() from janitor, not the default round()
    data <- mutate(data, Var2 = round_half_up(Var1))
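The rounding advice above matters because base R's round() rounds exact halves to the nearest even number, whereas rnd() in SPSS rounds them away from zero. The sketch below shows the difference on made-up values, along with the date construction from the previous example; it assumes dplyr, janitor and lubridate are installed.

```r
library(dplyr)
library(janitor)
library(lubridate)

# Hypothetical data, chosen so that Var1 contains exact halves
data <- tibble(
  day   = c(1, 15),
  month = c(4, 12),
  year  = c(2006, 2018),
  Var1  = c(0.5, 2.5)
)

data <- mutate(data,
               # Build a date from separate day, month and year columns
               DOB = dmy(paste0(day, "-", month, "-", year)),
               # Base R: halves go to the nearest even number
               Var2_base = round(Var1),
               # janitor: halves go away from zero, as in SPSS
               Var2 = round_half_up(Var1))

print(data)
```

The two rounding columns differ only where a value sits exactly on a half, which is why the discrepancy can go unnoticed when comparing SPSS and R outputs.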

Conditionals

SPSS syntax:

    compute Var1 = 0.
    if Var2 = "Yes" Var1 = 1.

R code:

    data <- mutate(data, Var1 = if_else(Var2 == "Yes", 1, 0))

SPSS syntax:

    if (Var1 = 1 and Year = 2018)
      or Var1 ~= 1 Var2 = "Yes".

R code:

    data <- mutate(data, Var2 = if_else(
      (Var1 == 1 & Year == 2018) | Var1 != 1, "Yes", Var2))

SPSS syntax:

    recode age
      (lo thru 20 = 1)
      (21 thru 50 = 2)
      (51 thru hi = 3)
      into age_grp.

R code:

    data <- mutate(data,
                   age_grp = case_when(age <= 20 ~ 1,
                                       between(age, 21, 50) ~ 2,
                                       age >= 51 ~ 3))

Filtering

SPSS syntax:

    select if Var1 = 1.

R code:

    data <- filter(data, Var1 == 1)
    # Note that, depending on the packages you have loaded, filter()
    # can become masked and refer to a different function.
    # In this case try this instead:
    data <- dplyr::filter(data, Var1 == 1)

Sorting

SPSS syntax:

    sort cases by Var1.

R code:

    data <- arrange(data, Var1)

SPSS syntax:

    sort cases by Var1 (d) Var2 (a).

R code:

    data <- arrange(data, desc(Var1), Var2)

Frequencies/Crosstabs

SPSS syntax:

    frequencies Var1.

R code:

    count(data, Var1)

SPSS syntax:

    crosstabs
      /tables = Var1 by Var2.

R code:

    count(data, Var1, Var2) %>%
      spread(Var2, n)
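The recode and crosstab examples above can be combined into one short runnable sketch. It assumes dplyr and tidyr are installed; the ages and the sex column are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data covering the boundaries of each age group
data <- tibble(
  age = c(5, 20, 21, 50, 51, 80),
  sex = c("Female", "Male", "Female", "Female", "Male", "Male")
)

# Equivalent of the SPSS recode: conditions are checked in order
data <- mutate(data,
               age_grp = case_when(age <= 20 ~ 1,
                                   between(age, 21, 50) ~ 2,
                                   age >= 51 ~ 3))

# Equivalent of a crosstab: count each combination, then spread the
# values of sex across the columns
crosstab <- count(data, age_grp, sex) %>%
  spread(sex, n)

print(crosstab)
```

Combinations that do not occur in the data appear as NA in the result; spread() has a fill argument (for example fill = 0) if zero counts are preferred.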

Aggregation

SPSS syntax:

    aggregate outfile = *
      /break Year Category
      /count = n
      /los = sum(los).

R code:

    data <- data %>%
      group_by(Year, Category) %>%
      summarise(count = n(),
                los = sum(los)) %>%
      ungroup()

SPSS syntax:

    aggregate
      mode = addvariables
      /break Category
      /highest_value = max(value).

R code:

    data <- data %>%
      group_by(Category) %>%
      mutate(highest_value = max(value)) %>%
      ungroup()

SPSS syntax:

    * Number of patients.
    aggregate outfile = *
      /break Year Category PatientID
      /dup = n.
    aggregate outfile = *
      /break Year Category
      /patients = n.

R code:

    data <- data %>%
      group_by(Year, Category) %>%
      summarise(patients = n_distinct(PatientID)) %>%
      ungroup()

Combining Datasets

SPSS syntax:

    add files /file = "/Path/To/File_A.sav"
      /file = "/Path/To/File_B.sav".

R code:

    data_A <- read_csv("/Path/To/File_A.csv")
    data_B <- read_csv("/Path/To/File_B.csv")
    data <- bind_rows(data_A, data_B)

SPSS syntax:

    match files /file = "/Path/To/File.sav"
      /table = "/Path/To/Lookup.sav"
      /by Year Category.

R code:

    data <- read_sav("/Path/To/File.sav")
    lookup <- read_sav("/Path/To/Lookup.sav")
    data <- left_join(data, lookup,
                      by = c("Year", "Category"))

Transforming Data

SPSS syntax:

    varstocases /make Value from Var1 Var2 Var3
      /index = Key.

R code:

    data <- gather(data, "Key", "Value",
                   Var1, Var2, Var3)

SPSS syntax:

    casestovars /id = GroupVar1
      /index = Key.
    * Other variables in the dataset: Value.

R code:

    data <- spread(data, Key, Value)
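The aggregation and lookup-matching steps above can be chained together in a single pipeline. The following is a sketch on made-up data, assuming dplyr is installed; the Description column in the lookup is hypothetical.

```r
library(dplyr)

# Hypothetical episode-level data: one row per hospital stay
data <- tibble(
  Year      = c(2017, 2017, 2017, 2018, 2018),
  Category  = c("A", "A", "B", "A", "A"),
  PatientID = c(1, 1, 2, 1, 3),
  los       = c(2, 5, 1, 4, 3)
)

# Hypothetical lookup with one row per Year/Category combination
lookup <- tibble(
  Year        = c(2017, 2017, 2018),
  Category    = c("A", "B", "A"),
  Description = c("First", "Second", "Third")
)

totals <- data %>%
  group_by(Year, Category) %>%
  summarise(count    = n(),                     # number of rows
            los      = sum(los),                # total length of stay
            patients = n_distinct(PatientID)) %>%  # unique patients
  ungroup() %>%
  # Add the lookup columns, matching on both variables
  left_join(lookup, by = c("Year", "Category"))

print(totals)
```

Counting distinct patients with n_distinct() replaces the two consecutive aggregate commands needed in SPSS, and no sorting is required before the join.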

Appendices

A1 – R Resources

- R usergroup - nss.Rusergp@nhs.net
- Slack channels
    o http://datasciencescotland.slack.com
    o http://health-ds.slack.com
- Transforming Publications team
    o GitHub organisation page
    o nss.isdtransformingpublishing@nhs.net
- R-Resources (Health and Social Care Scotland) GitHub repository
- Online R tutorial(s)
- Stack Overflow
- RStudio cheat sheets
- Templates
    o R project structure or automatically using this package
    o ggplot theme
- R style guide

A2 – R Packages used

This is a list of the packages required for the examples in this document; they are listed according to order of appearance. You should load a package with library(package_name).

R package            Function(s)
readr                read_csv(), write_csv()
readxl               read_excel()
haven                read_sav()
odbc                 dbConnect(), odbc(), dbGetQuery()
magrittr or dplyr    %>%
dplyr                as_tibble(), mutate(), rename(), select(), if_else(),
                     between(), case_when(), filter(), arrange(), count(),
                     group_by(), summarise(), ungroup(), n_distinct(),
                     bind_rows(), left_join()
janitor              round_half_up()
lubridate            dmy()
tidyr                spread(), gather()

A3 – Example SPSS syntax and equivalent R script

SPSS syntax: SPSS to R.sps
R code: SPSS to R.R
