Getting Started In Data Analysis Using Stata

Transcription

Getting Started in Data Analysisusing Stata(v. 6.0)Oscar Torres-Reynaotorres@princeton.eduDecember 2007http://dss.princeton.edu/training/

Stata Tutorial Topics What is Stata?Stata screen and general descriptionFirst steps: Setting the working directory (pwd and cd .) Log file (log using ) Memory allocation (set mem ) Do-files (doedit) Opening/saving a Stata datafile Quick way of finding variables Subsetting (using conditional “if”) Stata color coding systemFrom SPSS/SAS to StataExample of a dataset in ExcelFrom Excel to Stata (copy-and-paste, *.csv)Describe and summarizeRenameVariable labelsAdding value labelsCreating new variables (generate)Creating new variables from other variables (generate)Recoding variables (recode)Recoding variables using egenChanging values (replace)Indexing (using n and N) Creating ids and ids by categories Lags and forward values Countdown and specific valuesSorting (ascending and descending order)Deleting variables (drop)Dropping cases (drop if)Extracting characters from regular expressions MergeAppendMerging fuzzy text (reclink)Frequently used Stata commandsExploring data: Frequencies (tab, table) Crosstabulations (with test for associations) Descriptive statistics (tabstat)Examples of frequencies and crosstabulationsThree way crosstabsThree way crosstabs (with average of a fourth variable)Creating dummiesGraphs Scatterplot Histograms Catplot (for categorical data) Bars (graphing mean values)Data preparation/descriptive statistics(open a differentfile): inear Regression (open a different 101.pdfPanel data (fixed/random effects) (open a differentfile): ilevel Analysis (open a different 101.pdfTime Series (open a different seful sites (links only) Is my model OK? I can’t read the output of my model!!! Topics in Statistics Recommended booksPU/DSS/OTR

What is Stata? It is a multi-purpose statistical package to help you explore, summarize and analyze datasets. It is widely used in social science research.A dataset is a collection of several pieces of information called variables (usually arranged by columns). A variable can have one or several values (information forone or several cases).FeaturesLearningcurveSPSSSASStataJMP (SAS)RPython(Pandas)GradualPretty steepGradualGradualPretty steepSteepProgramming/Point-andProgramming point-andclickclickUser interfacePoint-andclickDatamanipulationStrongVery strongStrongStrongVery strongStrongData analysisVery strongVery strongVery strongStrongVery strongStrongGoodGoodVery goodVery goodExcellentGoodExpensive(perpetual,cost only withnew ual,cost only withnew amming ProgrammingOpen source Open source(free)(free)Free studentStudent disc.Student disc. version, 2014 Student disc.Released196819721985198919952008PU/DSS/OTR

Stata’s previous screensStata 10 and olderStata 11

Stata 12/13 screenVariables in dataset hereOutput hereHistory ofcommands, thiswindow?Files will besaved hereWrite commands hereProperty of eachvariable herePU/DSS/OTR

First steps: Working directoryTo see your working directory, typepwd. pwdh:\statadataTo change the working directory to avoid typing the whole path whencalling or saving files, type:cd c:\mydata. cd c:\mydatac:\mydataUse quotes if the new directory has blank spaces, for examplecd “h:\stata and data”. cd "h:\stata and data"h:\stata and dataPU/DSS/OTR

First steps: log fileCreate a log file, sort of Stata’s built-in tape recorder and where you can:1) retrieve the output of your work and 2) keep a record of your work.In the command line type:log using mylog.logThis will create the file ‘mylog.log’ in your working directory. You canread it using any word processor (notepad, word, etc.).To close a log file type:log closeTo add more output to an existing log file add the option append, type:log using mylog.log, appendTo replace a log file add the option replace, type:log using mylog.log, replaceNote that the option replace will delete the contents of the previousversion of the log.PU/DSS/OTR

First steps: memory allocationStata 12 will automatically allocate the necessary memory to open a file. It is recommended touse Stata 64-bit for files bigger than 1 g.If you get the error message “no room to add more observations ”, (usually in olderStata versions, 11 or older) then you need to manually set the memory higher. You can type, forexampleset mem 700mOr something higher.If the problem is in variable allocation (default is 5,000 variables), you increase it by typing, forexample:set maxvar 10000To check the initial parameters typequery memory

First steps: do-fileDo-files are ASCII files that contain of Stata commands to run specific procedures. It is highly recommended to usedo-files to store your commands so do you not have to type them again should you need to re-do your work.You can use any word processor and save the file in ASCII format, or you can use Stata’s ‘do-file editor’ with theadvantage that you can run the commands from there. Either , in the command window type:doeditOr, click on the icon here:You can write the commands, to run them select the line(s), and click on the last icon in the do-file windowCheck the following site for more info on do-files: http://www.princeton.edu/ otorres/Stata/

First steps: Opening/saving Stata files (*.dta)To open files already in Stata with extension *.dta, run Stata and you can either: Go to file- open in the menu, or Type use “c:\mydata\mydatafile.dta”If your working directory is already set to c:\mydata, just typeuse mydatafileTo save a data file from Stata go to file – save as or just type:save, replaceIf the dataset is new or just imported from other format go to file – save as orjust type:save mydatafile /*Pick a name for your file*/For ASCII data please see U/DSS/OTRPU/DSS/OTR

First steps: Quick way of finding variables (lookfor)You can use the command lookfor to find variables in a dataset, for example youwant to see which variables refer to education, type:lookfor educ. lookfor educvariable nameeducstorage displaytypeformatbyte%10.0gvaluelabelvariable labelEducation of R.lookfor will look for the keyword ‘educ’ in the variable name and labels. Youwill need to be creative with your keyword searches to find the variables youneed.It always recommended to use the codebook that comes with the dataset tohave a better idea of where things are.PU/DSS/OTRPU/DSS/OTR

First steps: Subsetting using conditional ‘if’Sometimes you may want to get frequencies, crosstabs or run a model just for aparticular group (lets say just for females or people younger than certain age).You can do this by using the conditional ‘if’, for example:/*Frequencies of var1 when gender 1*/tab var1 if gender 1, column row/*Frequencies of var1 when gender 1 and age 33*/tab var1 if gender 1 & age 33, column row/*Frequencies of var1 when gender 1 and marital status single*/tab var1 if gender 1 & marital 2 marital 3 marital 4, column row/*You can do the same with crosstabs: tab var1 var2 *//*Regression when gender 1 and age 33*/regress y x1 x2 if gender 1 & age 33, robust/*Scatterplots when gender 1 and age 33*/scater var1 var2 if gender 1 & age 33“if” goes at the end of the command BUT before the comma that separatesthe options from the command.PU/DSS/OTRPU/DSS/OTR

First steps: Stata color-coded systemAn important step is to make sure variables are in their expected format.Stata has a color-coded system for each type. Black is for numbers, red is for text or stringand blue is for labeled variables.Var2 is a string variable even though yousee numbers. You can’t do any statisticalprocedure with this variable other thansimple frequenciesFor var1 a value 2 has thelabel “Fairly well”. It is still anumeric variableVar3 is a numeric You can do any statisticalprocedure with this variableVar4 is clearly a string variable.You can do frequencies andcrosstabulations with this butnot statistical procedures.PU/DSS/OTRPU/DSS/OTR

First steps: starting the log file using the menuLog files help you to keep a record of your work, and lets you extract output. When usingextension *.log any word processor can open the file.Click on “Save as type:” right below ‘File name:” andselect Log (*.log). This will create the file *.log whichcan be read by any word processor or by Stata (go to File– Log – View). If you save it as *.smcl (Formatted Log)only Stata can read it. It is recommended to save the logfile as *.log

From SPSS/SAS to StataIf you have a file in SAS XPORT format you can use fduse (or go to file-import).If your data is already in SPSS format (*.sav) or SAS(*.sas7bcat). Two options:Option A) Use Stat/Transfer, see .pdfOption B) You can use the command usespss to read SPSS files in Stata or the command usesasto read SAS files.For SPSS and SAS, you may need to install it by typingssc install usespssssc install usesasOnce installed just typeusespss using “c:\mydata.sav”usesas using “c:\mydata.sas7bcat”Type help usespss or help usesas for more details.For ASCII data please see U/DSS/OTRPU/DSS/OTR

Example of a dataset in Excel.Variables are arranged by columns and cases by rows. Each variable has more than one valuePath to the file: http://www.princeton.edu/ otorres/Stata/Students.xlsPU/DSS/OTR

From Excel to Stata using copy-and-pasteIn Excel, select and copy the data you want. Then, in Stata type edit in the command line to open the data editor.Point the cursor to the first cell, then right-click, select ‘Paste’.

Saving data as Stata fileChange the working directorySaving as Stata datafileData will besaved in thisfolderNOTE: You can also use the menu, go toFile - Save AsSaving as Stata datafile

Excel to Stata (using insheet) step 1Another way to bring excel data into Stata is by saving the Excel file as *.csv (commaseparated values) and import it in Stata using the insheet command.In Excel go to File- Save as and save the Excel file as *.csv:You may get the following messages, click OK andYES Go to the next page PU/DSS/OTR

Excel to Stata (insheet using *.csv, - step 2)From *.csv using the menuFrom *.xls(x) using the menuimport delimited "H:\students.csv", clearinsheet using "H:\students.csv", clearimport excel "H:\Students.xlsx", sheet(“Sheet1") firstrow clear

Command: describeTo get a general description of the dataset and the format for each variable typedescribe. describeContains data from 30vars:1429 Sep 2009 17:12size:2,580 (99.9% of memory free)storagevariable enderstudent statusmajorcountryagesataveragescoreg eheightinnewspaperread eintbytebytebytevaluelabelvariable labelIDLast NameFirst NameCityStateGenderStudent StatusMajorCountryAgeSATAverage score (grade)Height (in)Newspaper readershipType help describe for more information PU/DSS/OTR

Command: summarizeType summarize to get some basic descriptive statistics. summarizeVariableObsMeanStd. genderstudentsta smajorcountryage00003025.2sataveragesco eheightinnewspaperr 3230996757Zeros indicate string variablesType help summarize for more information Use ‘min’ and ‘max’ values to check for avalid range in each variable. For example,‘age’ should have the expected values(‘don’t know’ or ‘no answer’ are usuallycoded as 99 or 999)PU/DSS/OTR

Exploring data: frequenciesFrequency refers to the number of times a value is repeated. Frequencies are used to analyzecategorical data. The tables below are frequency tables, values are in ascending order. In Stata usethe command tab varname.variable. tab ’ provides a raw count of each value. In this case 10students for each major.‘Percent’ gives the relative frequency for each value. Forexample, 33.33% of the students in this group are econmajors.‘Cum.’ is the cumulative frequency in ascending order ofthe values. For example, 66.67% of the students areecon or math majors.variable. tab .6790.00100.00Total30100.00‘Freq.’ Here 6 students read the newspaper 3 days aweek, 9 students read it 5 days a week.‘Percent’. Those who read the newspaper 3 days a weekrepresent 20% of the sample, 30% of the students in thesample read the newspaper 5 days a week.‘Cum.’ 66.67% of the students read the newspaper 3 to 5days a week.Type help tab for more details.PU/DSS/OTR

Exploring data: frequencies and descriptive statistics (using table)Command table produces frequencies and descriptive statistics per category. For more info and a list ofall statistics type help table. Here are some examples, typetable gender, contents(freq mean age mean score). table gender, contents(freq mean age mean 523.227.278.7333382The mean age of females is 23 years, for males is 27. The mean score is 78 for females and 82 formales. Here is another example:table major, contents(freq mean age mean sat mean score mean readnews). table major, contents(freq meanage mean sat meanscore e)mean(read .279.885.14.45.34.9PU/DSS/OTR

Exploring data: crosstabsAlso known as contingency tables, crosstabs help you to analyze the relationship between two ormore categorical variables. Below is a crosstab between the variable ‘ecostatu’ and ‘gender’. We usethe command tab var1 var2The first value in a cell tells you the number ofobservations for each xtab. In this case, 90respondents are ‘male’ and said that theeconomy is doing ‘very well’, 59 are ‘female’and believe the economy is doing ‘very well’Options ‘column’, ‘row’ gives you thecolumn and row percentages.var1var2. tab ecostatu gender, column rowKeyfrequencyrow percentagecolumn percentageStatus ofNat'l EcoGender of RespondentMaleFemaleTotalVery well9060.4014.335939.607.92149100.0010.85Fairly well33750.3053.6633349.7044.70670100.0048.80Fairly badly13939.9422.1320960.0628.05348100.0025.35Very badly5729.849.0813470.1617.99191100.0013.91Not 26100.001,373100.00100.00The second value in a cell gives you rowpercentages for the first variable in the xtab.Out of those who think the economy is doing‘very well’, 60.40% are males and 39.60% arefemales.The third value in a cell gives you columnpercentages for the second variable in the xtab.Among males, 14.33% think the economy isdoing ‘very well’ while 7.92% of females havethe same opinion.NOTE: You can use tab1 for multiple frequencies or tab2 torun all possible crosstabs combinations. Type help tab forfurther details.PU/DSS/OTR

Exploring data: crosstabs (a closer look)You can use crosstabs to compare responses among categories in relation to aggregateresponses. In the table below we can see how opinions for males and females divergefrom the national average. tab ecostatu gender, column rowKeyfrequencyrow percentagecolumn percentageStatus ofNat'l EcoGender of RespondentMaleFemaleTotalVery well9060.4014.335939.607.92149100.0010.85Fairly well33750.3053.6633349.7044.70670100.0048.80Fairly badly13939.9422.1320960.0628.05348100.0025.35Very badly5729.849.0813470.1617.99191100.0013.91Not .000.0062845.74100.0074554.26100.00RefusedTotalAs a rule-of-thumb, a margin of error of 4 percentage points can beused to indicate a significant difference (some use 3).For example, rounding up the percentages, 11% (10.85) answer ‘verywell’ at the national level. With the margin of error, this gives a rangeroughly between 7% and 15%, anything beyond this range could beconsidered significantly different (remember this is just anapproximation). It does not appear to be a significant bias betweenmales and females for this answer.In the ‘fairly well’ category we have 49%, with range between 45%and 53%. The response for males is 54% and for females 45%. Wecould say here that males tend to be a bit more optimistic on theeconomy and females tend to be a bit less optimistic.If we aggregate responses, we could get a better picture. In the tablebelow 68% of males believe the economy is doing well (comparing to60% at the national level, while 46% of females thing the economy isbad (comparing to 39% aggregate). Males seem to be more optimisticthan females.RECODE ofecostatu(Status ofNat'l Eco)Gender of 19636.3631.2134363.6446.04539100.0039.26Not 5.74100.0074554.26100.001,373100.00100.00recode ecostatu (1 2 1 "Well") (3 4 2 "Bad") (5 6 3 "Not sure/ref"), gen(ecostatu1) label(eco)PU/DSS/OTR

Exploring data: crosstabs (test for associations)To see whether there is a relationship between two variables you can choose a number oftests. Some apply to nominal variables some others to ordinal. I am running all of themhere for presentation purposes.tab ecostatu1 gender, column row nokey chi2 lrchi2 V exact gamma taubLikelihood-ratio χ2(chi-square)X2(chi-square)Goodman & Kruskal’s γ (gamma)Cramer’s VKendall’s τb (tau-b). tab ecostatu1 gender, column row nokey chi2 lrchi2 V exact gamma taubEnumerating sample-spacestage 3: enumerations stage 2: enumerations stage 1: enumerations RECODE ofecostatu(Status ofNat'l Eco)combinations:1160Gender of RespondentMaleFemaleFisher’s exact .65Bad19636.3631.2134363.6446.04539100.0039.26Not .81620.15630.30950.1553Pr 0.000Pr 0.000Pearson chi2(2)likelihood-ratio chi2(2)Cramér's VgammaKendall's tau-bFisher's exact ASE 0.050ASE 0.0260.000– For nominal data use chi2, lrchi2, V– For ordinal data use gamma and taub– Use exact instead of chi2 whenfrequencies are less than 5 across thetable.X2(chi-square) tests for relationships between variables. The nullhypothesis (Ho) is that there is no relationship. To reject this we need aPr 0.05 (at 95% confidence). Here both chi2 are significant. Thereforewe conclude that there is some relationship between perceptions of theeconomy and gender. lrchi2 reads the same way.Cramer’s V is a measure of association between two nominal variables. Itgoes from 0 to 1 where 1 indicates strong association (for rXc tables). In2x2 tables, the range is -1 to 1. Here the V is 0.15, which shows a smallassociation.Gamma and taub are measures of association between two ordinalvariables (both have to be in the same direction, i.e. negative to positive,low to high). Both go from -1 to 1. Negative shows inverse relationship,closer to 1 a strong relationship. Gamma is recommended when thereare lots of ties in the data. Taub is recommended for square tables.Fisher’s exact test is used when there are very few cases in the cells(usually less than 5). It tests the relationship between two variables. Thenull is that variables are independent. Here we reject the null andconclude that there is some kind of relationship between variablesPU/DSS/OTR

Exploring data: descriptive statisticsFor continuous data use descriptive statistics. These statistics are a collection of measurements of:location and variability. Location tells you the central value the variable (the mean is the most commonmeasure of this) . Variability refers to the spread of the data from the center value (i.e. variance,standard deviation). Statistics is basically the study of what causes such variability. We use thecommand tabstat to get these stats.tabstat age sat score heightin readnews, s(mean median sd var count range min max). tabstatage sat score heightin readnews, s(mean median sd var count range min 54.86666751.2793681.63678230437Type help tabstat for acomplete list of descriptivestatistics The mean is the sum of the observations divided by the total number of observations. The median (p50 in the table above) is the number in the middle . To get the median you have to order the datafrom lowest to highest. If the number of cases is odd the median is the single value, for an even number of casesthe median is the average of the two numbers in the middle. The standard deviation is the squared root of the variance. Indicates how close the data is to the mean. Assuminga normal distribution, 68% of the values are within 1 sd from the mean, 95% within 2 sd and 99% within 3 sd The variance measures the dispersion of the data from the mean. It is the simple mean of the squared distancefrom the mean. Count (N in the table) refers to the number of observations per variable. Range is a measure of dispersion. It is the difference between the largest and smallest value, max – min. Min is the lowest value in the variable. Max is the largest value in the variable.PU/DSS/OTR

Exploring data: descriptive statisticsYou could also estimate descriptive statistics by subgroups (i.e. gender, age, etc.)tabstat age sat score heightin readnews, s(mean median sd var count range min max) by(gender). tabstatage sat score heightin readnews, s(mean median sd var count range min max) by(gender)Summary statistics: mean, p50, sd, variance, N, range, min, maxby categories of: gender 01659754.86666751.2793681.63678230437Type help tabstat for more options.PU/DSS/OTR

Examples of frequencies and crosstabulationsCrosstabulations (tab with two variables)Frequencies (tab command). tab gender studentstatus, column row. tab 0Total30100.00Cum.frequencyrow percentagecolumn percentage50.00100.00GenderIn this sample we have 15 females and 15 males. Each represents50% of the total cases.Student StatusGraduate l1550.00100.001550.00100.0030100.00100.00. tab gender major, sum(sat)Average SAT scores by gender andmajor. Notice, ‘sat’ variable is acontinuous variable. The first cellreads the average SAT score for afemale whose major is econ is1952.3333 with a standard deviation312.43, there are only 3 females witha major in econ.Means, Standard Deviations and Frequencies of SS/OTR

Three way crosstabs. bysortstudentstatus: tab gender major, column row- studentstatus GraduateKeybysort var3: tab var1 var2, colum rowbysort studentstatus: tab gendermajor, colum rowfrequencyrow percentagecolumn 0.00320.00100.00853.33100.0015100.00100.00Total- studentstatus UndergraduateKeyfrequencyrow percentagecolumn /DSS/OTR

Three way crosstabs with summary statistics of a fourth variable. bysortstudentstatus: tab gender major, sum(sat)- studentstatus GraduateMeans, Standard Deviations and Frequencies of SATGenderAverage SAT scores by gender andmajor for graduate andundergraduate students. The thirdcell reads: The average SAT scoreof a female graduate student whosemajor is politics is 2092.6667 with astandard deviation of 2.82.13, thereare 3 graduate female students witha major in 6981841.2300.3821915- studentstatus UndergraduateMeans, Standard Deviations and Frequencies of .554.44722221856.6257.7268215PU/DSS/OTR

Renaming variables and adding variable labelsBeforeRenaming variables, type:Afterrename [old name] [new 4var5idcountrypartyimportsexportsAdding/changing variable labels, type:BeforeAfterlabel variable [var name] evariablevariablevariableid "Unique identifier"country "Country name"party "Political party in power"imports "Imports as % of GDP"exports "Exports as % of GDP"PU/DSS/OTR

Assigning value labelsAdding labels to each category in a variable is a two step process in Stata.Step 1: You need to create the labels using label define, type:label define label1 1 “Agree” 2 “Disagree” 3 “Do not know”Setp 2: Assign that label to a variable with those categories using label values:label values var1 label1If another variable has the same corresponding categories you can use the samelabel, typelabel values var2 label1Verify by running frequencies for var1 and var2 (using tab)If you type labelbook it will list all the labels in the datafile.NOTE: Defining labels is not the same as creating variablesPU/DSS/OTR

Creating new variablesTo generate a new variable use the command generate (gen for short), typegenerate [newvar] [expression] results for the first five students generate score2 score/100generate readnews2 readnews*4You can use generate to create constant variables. For example: results for the first five students generate x 5generate y 4*15generate z y/xYou can also use generate with string variables. For example: results for the first five students generate fullname last “, “ firstlabel variable fullname “Student full name”browse id fullname last firstPU/DSS/OTR

Creating variables from a combination of other variablesTo generate a new variable as a conditional from other variables type:generate newvar (var1 1 & var2 1)generate newvar (var1 1 & var2 26)NOTE: & and, or. gen fem less25 (gender 1 & age 26). tab. gen fem grad (gender 1 & status 1). tab fem gradfem l30100.00fem less25fem otal30100.00. tabage gender. tab gender statusGenderStudent StatusGraduate S/OTR

Recoding variables1.- Recoding ‘age’ into three groups. tab .3376.6780.0090.0093.3396.67100.00Total30100.002.- Use recode command, typeType help recode for more detailsrecode age (18 19 1 “18 to 19”) ///(20/29 2 “20 to 29”) ///(30/39 3 “30 to 39”) (else .), generate(agegroups) label(agegroups)3.- The new variable is called ‘agegroups’:. tab agegroupsRECODE ofage (Age)Freq.PercentCum.18 to 1920 to 2930 to 0PU/DSS/OTR

Recoding variables using egenYou can recode variables using the command egen and options cut/group.egen newvariable cut (oldvariable), at (break1, break2, break3, etc.)Notice that the breaks show ranges. Below we type four breaks. The first starts at 18 and ends before 20, thesecond starts at 20 and ends before 30, the third starts at 30 and ends before 40. egen agegroups2 cut(age), at(18, 20, 30, 40). tab .0036.67Total30100.00Cum.33.3363.33100.00You could also use the option group, which specifies groups with equal frequency (you have to add valuelabels:egen newvariable cut (oldvariable), group(# of groups). egen agegroups3 cut(age), group(3). tab 36.67Total30100.00Cum.33.3363.33100.00For more details and options type help egenPU/DSS/OTR

Changing variable values (using replace)BeforeAfter. tab read, missing. tab 0036.6766.67100.00Total30100.

Stata Tutorial Topics What is Stata? Stata screen and general description First steps: Setting the working directory ( pwd and cd .) Log file (log using ) Memory allocation ( set mem ) Do-files (doedit) Openi