Exploratory Factor Analysis - Principal Components Analysis

Causal Modeling in Social Research

TOPICS
1) Exploratory factor analysis
2) An overview of the SPSS factor analysis procedure
3) Worked PCA examples:
   a) Checking the dimensionality of a well-being scale
   b) More than one dimension: the ‘getting ahead’ scales
4) Desirable job characteristics dimensions (optional quick example of exploratory factor analysis with categorical indicators - Mplus)

REQUIRED READINGS:
Pohlmann, John T. (2004) ‘Use and Interpretation of Factor Analysis in "The Journal of Educational Research": 1992 - 2002’, The Journal of Educational Research 98(1): 14-22.

OPTIONAL READINGS:
Treiman, Donald J. (2009) Quantitative data analysis: doing social research to test ideas. San Francisco, CA: Jossey-Bass. Chapter 11 – Scale Construction.
Field, Andy P. (2009) Discovering statistics using SPSS: (and sex and drugs and rock 'n' roll). Los Angeles; London: Sage. Chapter 17 – Exploratory Factor Analysis.

Factor analysis is a mathematical/statistical technique used for linking a set of observed variables to a smaller number of latent dimensions → it allows one to use several observed variables to operationalize one concept/dimension.

The resulting latent dimension is defined by what the observed variables have in common.

Exploratory Factor Analysis (EFA) – Characteristics:
- Does not employ a model that specifies the way in which observed variables are linked to latent variables. The structure of relationships is inferred after the analysis is completed, using the sizes of the loadings.
- Does not specify the number of latent variables prior to the analysis. The number of factors is determined after the analysis is completed, using certain conventions (e.g. eigenvalues over 1 or “the elbow rule”).
- In most types of EFA, both in the initial solution and in the solution after extraction, each and every factor determines each and every observed variable.
- Error terms, if included in the model, cannot be correlated (Principal Components Analysis, PCA, one of the most often used types of EFA, does not include error terms in the model at all).
- In most cases, EFA is used to extract orthogonal (uncorrelated) factors.
- The EFA model is under-identified: there is no unique solution, but an infinite number of solutions, each of which has an equally good fit to the data. From among these solutions, the analyst chooses one solution, deemed to be more interpretable. This solution is called the “simple structure solution”: it is the solution in which each variable has high loadings on a single factor but very low loadings on the other factors, which makes the factors more interpretable.
- Required measurement level for the observed variables in EFA: interval level (ordinal level variables are accepted under the assumption of equal distances between categories).

Principal Components Analysis (PCA)

- Works with standardized variables → each variable has a variance of 1.
- The initial “model”: all observed variables load on all components/factors [1], and the number of factors is equal to the number of observed variables. PCA looks for a first factor, such that this factor is a linear combination of all the observed variables in the analysis and the factor explains the maximum possible variance out of the total variance (in the following model, the total variance = 6). After the first factor is found, PCA looks for the second factor (also a linear combination of all observed variables) that would explain the maximum possible variance out of the leftover variance (variance not explained by the first factor). The procedure is repeated until the last factor is found.

The initial model:

$Var_1 = p_{11}F_1 + p_{12}F_2 + \dots + p_{16}F_6$
$Var_2 = p_{21}F_1 + p_{22}F_2 + \dots + p_{26}F_6$
$\vdots$
$Var_6 = p_{61}F_1 + p_{62}F_2 + \dots + p_{66}F_6$

Figure 1. PCA – Before Extraction

Factor extraction: two alternative rules of thumb:
- Eigenvalues over 1: one extracts/keeps only the factors with eigenvalues (the amount of variance explained by the factor out of the total variance) greater than 1.
- The elbow rule: one retains the factors above “the elbow” in the scree plot.

Figure 2. PCA – After Extraction

[1] Technically, PCA works with “components”, which are similar to factors. I use the two terms, “components” and “factors”, interchangeably.
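The extraction logic described above can be sketched in a few lines of Python. This is only an illustration, not the SPSS implementation: the 6 x 6 correlation matrix `R` is made up (two blocks of three variables), and the helper name `pca_from_correlation` is hypothetical. The sketch obtains the components from the eigendecomposition of the correlation matrix and then applies the 'eigenvalues over 1' rule.

```python
import numpy as np

def pca_from_correlation(R):
    """Return eigenvalues (largest first) and unrotated component loadings."""
    eigenvalues, eigenvectors = np.linalg.eigh(R)   # eigh: symmetric input, ascending order
    order = np.argsort(eigenvalues)[::-1]           # re-sort so the largest comes first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Loading = eigenvector scaled by the square root of its eigenvalue
    loadings = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))
    return eigenvalues, loadings

# Hypothetical correlation matrix: variables 1-3 and 4-6 form two clusters
R = np.full((6, 6), 0.1)
R[:3, :3] = 0.6
R[3:, 3:] = 0.6
np.fill_diagonal(R, 1.0)

eigenvalues, loadings = pca_from_correlation(R)
total_variance = eigenvalues.sum()                  # equals the number of variables
n_keep = int((eigenvalues > 1.0).sum())             # 'eigenvalues over 1' rule
share_explained = eigenvalues[:n_keep].sum() / total_variance

print(np.round(eigenvalues, 3), n_keep, round(share_explained, 3))
```

With all six components kept, the loadings reproduce the correlation matrix exactly; keeping only the components with eigenvalues over 1 trades some of that fit for a smaller number of dimensions.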

Graphic representation of the 2 extracted factors:
- The 2 axes represent the 2 factors extracted in the above example; the observed variables are placed in the graph according to their relationships to the factors (their loadings on the factors, estimated using PCA).
- This solution, estimated using PCA, is not the only possible solution. By rotating the axes, one may get different solutions, each of which has an equally good fit to the data.

Figure 3. PCA – Before Rotation

- Axis/factor rotation: out of the infinite number of possible solutions, one looks for a more interpretable solution. In VARIMAX rotation, for example, one looks for the solution that makes the factors more interpretable (each of the observed variables has a high loading on one factor and very low loadings on the rest of the factors).

Figure 4. PCA – Rotation
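The statement that rotated solutions fit the data equally well can be checked numerically. The sketch below uses a made-up 6 x 2 loading matrix and an arbitrary rotation angle (both are assumptions for illustration only). It shows that an orthogonal rotation of the axes changes the individual loadings but leaves the reproduced correlations and the communalities untouched.

```python
import numpy as np

# Hypothetical loadings of 6 observed variables on 2 extracted factors
loadings = np.array([
    [0.7, 0.3],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.4, 0.6],
])

def rotate(loadings, degrees):
    """Rotate a two-factor loading matrix by the given angle (orthogonal rotation)."""
    theta = np.radians(degrees)
    T = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return loadings @ T

rotated = rotate(loadings, 30.0)

# Fit to the data: the correlations implied by the two factors
reproduced_before = loadings @ loadings.T
reproduced_after = rotated @ rotated.T

# Communalities: variance in each variable explained by both factors together
communalities_before = (loadings ** 2).sum(axis=1)
communalities_after = (rotated ** 2).sum(axis=1)

print(np.allclose(reproduced_before, reproduced_after))
print(np.allclose(communalities_before, communalities_after))
```

Because every angle gives the same fit, the choice among rotations has to rest on interpretability, which is what criteria such as Varimax formalize.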

The factors after rotation:

Figure 5. PCA – Rotation

Factor scores:
- Factor scores are predicted scores for the latent dimensions; they may be used as scales in regression analyses.
- To compute factor scores, one “reverses” the direction of causality from the factor analysis model:

$F_1 = b_{11}Var_1 + b_{12}Var_2 + \dots + b_{16}Var_6$
$F_2 = b_{21}Var_1 + b_{22}Var_2 + \dots + b_{26}Var_6$

AN OVERVIEW OF THE SPSS FACTOR ANALYSIS PROCEDURE

Accessing the factor analysis procedure in SPSS: Analyze → Dimension Reduction → Factor

Descriptives – A preliminary look at the data:

It is useful to take a look first at the correlation matrix for the variables being analyzed (in the SPSS “Descriptives” options, ask for: correlation matrix coefficients and significance levels). Ideally, if you expect multiple factors to emerge from the data being analyzed, the observed indicators within a factor should be moderately/highly correlated, while the observed indicators from different factors should be more weakly correlated.

In order to assess whether factor analysis is appropriate for your variables, you can examine two tests: KMO and Bartlett’s test of sphericity.

- KMO (the Kaiser-Meyer-Olkin measure of sampling adequacy) compares the size of the observed correlations among the variables with the size of their partial correlations. Values close to 1 indicate that the partial correlations are small relative to the observed correlations, i.e. that the variables share enough common variance. Conventionally, KMO values lower than 0.6 suggest that the variables won’t ‘factor’ well.
- Bartlett’s test has the following null hypothesis (H0): the correlation matrix for the observed variables is an identity matrix. In other words, the null hypothesis being tested is that all of the observed variables being analyzed are uncorrelated with each other (a situation that is not desirable when doing factor analysis). A statistically significant Bartlett’s test suggests that at least one of the correlations between your observed variables is different from zero (the H0 of the test can be rejected).

Note: In practice, if the purpose of the analysis is to extract factors that will be used in other analyses (e.g. regression analyses) as independent or dependent variables, it is also useful to take a look at the relationships between each of the variables used in the factor analysis and the other variables that will be employed in the final model (Treiman, 2009: 247).

Also, as a preliminary step to running a factor analysis, you should make sure that the measurement level of the variables employed in the analysis is appropriate. Interval-level measurement for the observed variables is required in most types of factor analyses (the SPSS factor analysis procedure is technically designed for this type of variables only). In practice, ordinal level observed variables may be used as well, as long as one can make the assumption that these variables are ‘interval-like’ variables. If you are dealing with other types of indicators (for example dichotomous variables), there is specialized software that can be used to run factor analysis (e.g. Mplus).

Extraction methods – the method/mathematical model used in order to extract the factors. SPSS provides several options here. The most commonly used (and the default in SPSS) is Principal Components Analysis (PCA). It is not a statistical model per se, but a mathematical model for extracting the factors, and it is the simplest and most intuitive model. It extracts uncorrelated (orthogonal) factors, although correlated factors can be obtained through a method of oblique rotation (see below), and it disregards the error part/unique variance/noise in the variables being included in the analysis.

Cut-off criteria for extracting factors:
- Eigenvalues over one criterion (SPSS default, found under the ‘Extract’ heading in the ‘Extraction’ window)
- Fixed number of factors (the user specifies the number of factors to be extracted, found under the ‘Extract’ heading in the ‘Extraction’ window)
- The ‘elbow’ rule (applied by inspecting the scree plot, found under the ‘Display’ heading in the ‘Extraction’ window)

Rotation methods – methods used in order to choose one solution from the infinite number of solutions from an exploratory factor analysis. There are several options here as well:

Methods that result in orthogonal (uncorrelated) factors:
- Varimax – makes the factors more interpretable
- Quartimax – makes the variables more interpretable
- Equamax – a hybrid between the two previous methods

Methods that result in oblique (correlated) factors:
- Direct Oblimin – you can control the size of the correlation between factors by manipulating the ‘delta’ parameter (the default is 0, and results in a solution in which the factors are prevented from being highly correlated; you can increase the value of the delta parameter up to 0.8, resulting in the most oblique solution, or you can decrease the value of the delta parameter down to -0.8, resulting in the least oblique solution)
- Promax – a similar procedure to the one above, designed for larger datasets

Constructing and saving factor scores (options found in the ‘Scores’ window).

To have SPSS construct a factor score based on your factor analysis results, check the option ‘Save as variables’ in the ‘Scores’ window. If you want to see the factor score coefficients used in the computation of factor scores, check the option ‘Display factor score coefficient matrix’ in the ‘Scores’ window. Available methods for saving factor scores/computing factor score coefficients:

- The regression method (SPSS default): the simplest method, in which the factor loadings resulting from the analysis are adjusted to take into account the initial correlations between the observed variables. The disadvantage of the method: it may result in factor scores that are correlated with each other or with true factors other than their own, even if the original factor analysis extracted orthogonal factors. This is the best option when you are not particularly interested in constructing independent factor scores (Field, 2009: 635).
- The Bartlett method constructs factor scores that correlate only with their own factor, but factor scores can still correlate with each other.
- The Anderson-Rubin method produces uncorrelated factor scores. This is the best option when you need to construct independent factor scores (Field, 2009: 635).

Other options – Finally, you can control some cosmetic but also some substantive aspects of your factor analysis in the ‘Options’ window.

The ‘cosmetic’ options: the rotated solution table is more easily interpretable if the coefficients in that table are sorted by size (check the option ‘Sorted by size’ under the ‘Coefficient display format’ heading); you may also use the option to ‘Suppress small coefficients’, which will omit from the table coefficients smaller than a user-specified value (use this option with caution, especially if you plan to input a different cut-off value and you are not exactly sure what this option does).

The ‘substantive’ option: how to handle missing data in your analysis. SPSS provides three choices: listwise deletion, pairwise deletion, and replace with mean. Depending on the amount of missing data and patterns of missing data in your variables, each choice may result in dramatically different factor analysis results, and what is worse, all three may result in biased factor analysis results. If the above statement seems very vague, that is because it is intended to be. There is a lengthy discussion about missing data and ways to handle it in statistical analyses, and we’ll have that discussion in a couple of lectures from now. SPSS’ default is listwise deletion. That’s fine, as long as you don’t lose a high percentage of cases from your sample by employing it (you can check the effective remaining sample size by asking for ‘Univariate descriptives’ in the Factor Analysis ‘Descriptives’ window).
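As a rough illustration of the regression method for factor scores described in this overview, the sketch below computes score coefficients as the inverse of the correlation matrix times the loadings, and then applies them to the standardized variables. Everything in it is assumed for the example: the data are simulated, two components are retained by fiat, and none of the variable names come from SPSS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 500 cases, 6 variables driven by 2 hypothetical latent factors
n = 500
latent = rng.standard_normal((n, 2))
weights = np.array([[0.8, 0.8, 0.8, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 0.8, 0.8, 0.8]])
X = latent @ weights + 0.6 * rng.standard_normal((n, 6))

# PCA works with standardized variables, i.e. with the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(R)
keep = np.argsort(eigenvalues)[::-1][:2]            # retain the two largest components
loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

# Regression method: score coefficients B = R^-1 * loadings
B = np.linalg.solve(R, loadings)
scores = Z @ B

print(np.round(scores.mean(axis=0), 6))
print(np.round(np.cov(scores, rowvar=False), 6))
```

For unrotated principal components the coefficients reduce to each loading divided by the eigenvalue of its component, and the saved scores come out standardized (mean 0, standard deviation 1) and mutually uncorrelated. The correlated scores mentioned above arise with other extraction methods or after oblique rotation.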

EXAMPLE 1: CHECKING THE DIMENSIONALITY OF A WELL-BEING SCALE

Data for this analysis: Family Life [Viaţa de Familie] (2008) Data file, Survey designed and executed by Soros Foundation Romania. If you are interested, you can find out more details about this survey here: http://www.fundatia.ro/en/family-life-2008. This is a nationally representative survey conducted in 2008 in Romania (N = 1,400). The example will employ an 80% random sample from the nationally representative sample. You can find the data file for the example on the class website (filename: FL2008 random80 wellbeing.sav)

Variables: Among the many variables designed to capture aspects of family life, the survey employed a series of indicators of personal well-being. The wording of the questions is given below.

Note: If you are interested in this dimension and you want to analyze data for other countries, ESS (European Social Survey) employed this scale in Round 2 (2004). You can find out more details about ESS and links to data downloads here: http://www.europeansocialsurvey.org/

Firstly, I am going to read out a list of statements about how you may have been feeling recently. For each statement, using this card, I would like you to say how often you have felt like this over the last two weeks. Please use this card.

G1. I have felt cheerful and in good spirits
G2. I have felt calm and relaxed
G3. I have felt active and vigorous
G4. I have woken up feeling fresh and rested
G5. My daily life has been filled with things that interest me

Response codes (the same for G1-G5): All of the time = 6; Very often = 5; Often = 4; Seldom = 3; Very seldom = 2; Never = 1; DK = 8; NA = 9

Theoretically, all indicators should load on a single underlying/latent dimension measuring personal well-being. In order to check this assumption, one can run an exploratory factor analysis.

As preliminary steps to the factor analysis, you should run some frequencies to see if any recodes are necessary (i.e., in this case, assigning missing values to the DK/NA responses). For this example, we will recode variables G1-G5 into the new variables: CHEERFUL, CALM, ACTIVE, RESTED, INTERESTING

Note: The original database contains a weight variable (weight vs), designed to adjust the sample composition according to the age by sex distribution in the population. This variable is provided in the FL2008 random80 wellbeing dataset and you might want to weight the data (Data – Weight Cases – Weight Cases by – weight vs) before proceeding to run the factor analysis in order to ensure the generalizability of results to the entire Romanian population.

Factor analysis options for this analysis:

Output interpretation for this analysis:

Listwise deletion, employed here as a method of dealing with incomplete data, has resulted in a weighted sample size for the analysis of 1,116 cases (out of the total weighted number of cases: 1,137). The descriptive statistics table below also shows means and standard deviations for the unstandardized version of the variables in the analysis. Since we are using the correlation matrix as an input for the analysis, the actual factor analysis results will pertain to the standardized version of these variables.

[Table: Descriptive Statistics (Mean, Std. Deviation, Analysis N). Recoverable values: Analysis N = 1116 for each of the five variables; INTERESTING: Mean = 3.3939, Std. Deviation = 1.23631.]

The correlation matrix shows statistically significant, moderate to high correlations among the observed indicators used in the analysis.

[Table: Correlation Matrix (Correlation and Sig. panels for CHEERFUL, CALM, ACTIVE, RESTED, INTERESTING). Recoverable values: all significance levels are .000.]

The KMO measure and Bartlett’s test both suggest that the variables ‘hang together’ well (the observed correlations are large relative to the partial correlations, as suggested by the KMO value, and at least one bivariate correlation among variables is non-zero, as suggested by Bartlett’s test).

KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy.           .858
Bartlett's Test of Sphericity    Approx. Chi-Square    2745.915
                                 df                          10
                                 Sig.                      .000

The communalities table shows the proportion of variance in each observed variable explained by all of the components/factors. In the initial solution, where the number of factors is equal to the number of observed variables, the 5 factors explain all of the variance in each of the observed variables. After extraction, the smaller number of retained factors explains lower proportions of variance in each of the observed variables.

[Table: Communalities (Initial, Extraction). Recoverable values: INTERESTING: Initial = 1.000, Extraction = .581.]
Extraction Method: Principal Component Analysis.

The eigenvalues show the amount of variance (out of the total variance) explained by each of the factors. Since the observed variables are standardized for this analysis (by default, the input data is the correlation matrix), each of the observed variables in the analysis will have a mean of 0 and a variance of 1. As a result, the total variance (in all observed variables) will equal the number of observed variables in the analysis (5, in this case). The left half of the table below presents information about the eigenvalues for all of the factors in the initial solution, while the right half presents the same information only for the factor(s) retained after extraction. In this case, we can see that a single factor was retained after extraction (using the default cutoff criterion for extraction: eigenvalues over 1), and that factor explains approximately 67% of the total variance in our 5 observed variables. The remaining factors from the initial solution all have eigenvalues smaller than 1, and are subsequently ‘discarded’ from the final factor analysis solution.
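For readers who want to see where the KMO and Bartlett statistics in this kind of output come from, here is a minimal sketch based on the standard textbook formulas. It is not SPSS code: the function names are hypothetical, the correlation matrix is made up (five variables, all pairwise correlations set to 0.6), and the sample size of 1116 is borrowed from the example only to have a realistic n.

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """Chi-square statistic and df for H0: the correlation matrix is an identity matrix."""
    p = R.shape[0]
    _, log_det = np.linalg.slogdet(R)               # log of the determinant of R
    chi_square = -(n - 1 - (2 * p + 5) / 6.0) * log_det
    df = p * (p - 1) // 2
    return chi_square, df

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)               # partial correlations
    off_diagonal = ~np.eye(R.shape[0], dtype=bool)
    r2 = (R[off_diagonal] ** 2).sum()               # squared observed correlations
    q2 = (partial[off_diagonal] ** 2).sum()         # squared partial correlations
    return r2 / (r2 + q2)

# Hypothetical correlation matrix for 5 indicators
R = np.full((5, 5), 0.6)
np.fill_diagonal(R, 1.0)

chi_square, df = bartlett_sphericity(R, n=1116)
kmo_value = kmo(R)
chi_square_identity, _ = bartlett_sphericity(np.eye(5), n=1116)

print(round(chi_square, 1), df, round(kmo_value, 3))
```

With five variables the test has 5 x 4 / 2 = 10 degrees of freedom, as in the SPSS table above. The significance level would additionally require the chi-square distribution (for instance from SciPy), which is left out to keep the sketch dependent on numpy only.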

[Table: Total Variance Explained (Component; Initial Eigenvalues: Total, % of Variance, Cumulative %; Extraction Sums of Squared Loadings: Total, % of Variance, Cumulative %). The cell values are not preserved in this transcription.]
Extraction Method: Principal Component Analysis.

The scree plot (which can be used as an alternative criterion for determining the number of factors to retain after extraction), in this particular case, also suggests that a one-factor solution is appropriate (the ‘elbow’ occurs at the second component, and we may retain the one component that is placed above the elbow). The two criteria for determining the number of factors to be extracted (the ‘eigenvalues over 1’ and the ‘elbow rule’) will not always suggest the same number of factors to extract. In that case, you should decide theoretically and based on the results of a preliminary factor analysis how many factors to extract. You can choose the number of factors to be extracted by using the ‘Fixed number of factors’ option in the ‘Extraction’ window.

In terms of our particular interest in running this factor analysis, at this point, we already have an answer to our question: the five observed indicators all load on a single dimension/measure a single underlying, latent concept. We could label this latent concept as ‘personal well-being’, based on the particular things that the observed indicators measure (self-evaluations of states of mind associated with a feeling of well-being).

One last thing that we have to do to ensure that all of the observed variables are highly associated with the underlying factor is to check the loadings of these variables on the factor. In the case of a factor analysis that has extracted only one factor, these loadings are found in the ‘Component Matrix’ table. In this example, all loadings are high (over .7). There are various criteria being used to determine what constitutes a high or moderate loading in current research, and rules of thumb are usually different depending on the research topic.

[Table: Component Matrix (Component 1). Recoverable loadings: CALM = .813, INTERESTING = .762.]
Extraction Method: Principal Component Analysis.
a. 1 components extracted.

Finally, if you are interested in using the well-being dimension in further analyses, as a summary measure instead of using each and all of the observed indicators, factor analysis provides you with the option of saving predicted factor scores for each individual in the analysis. The factor score coefficients used in the computation of these factor scores are displayed in the ‘Component Score Coefficient Matrix’ table. The resulting factor score variable, saved by SPSS, is a standardized variable, with a mean of 0 and a standard deviation of 1.

[Table: Component Score Coefficient Matrix (Component 1). Recoverable coefficients: RESTED = .245, INTERESTING = .227.]
Extraction Method: Principal Component Analysis.
Component Scores.

Note: If you would like to try some further analyses on your own, using the constructed factor score, the database provided with the example includes a couple of other variables (age, sex, marital status, income, and education).

MORE THAN ONE DIMENSION: THE ‘GETTING AHEAD’ SCALES

Data: Class Structure and Social Stratification in Present Day Romania, 2010 (CNMP, grant: 131/20.11.2008). This is a nationally representative survey conducted in 2010 in Romania (N = 4,508). The example will employ a 50% random sample from the nationally representative sample. The database used in this example is available on the class website (filename: STRAT2010 random50 ahead.sav)

Variables: Opinions about prevailing successful strategies for getting ahead in life in Romania. The wording of the questions is given below. Theoretically, the observed indicators should capture 4 dimensions: strategies based on ascription, merit, discrimination, and corruption.

Note: if you are interested in these variables in an international context, they are also employed in the ISSP (International Social Survey Programme), Social Inequality Modules (1992, 1999, 2009). For more details about ISSP and links to data downloads, you can consult the survey webpage: http://www.issp.org/ .

V16. How important are each of the following for a person in present-day Romania to get ahead in life?

a. coming from a wealthy family
b. having well-educated parents
c. having a good education yourself
d. having ambition
e. hard work
f. knowing the right people
g. having political connections
h. giving bribes
i. belief in God?
j. a person’s ethnicity
k. a person’s religion
l. a person’s gender

[Response scale, the same for all items. Recoverable codes: Very important = 5; Fairly important = 4; Neither/nor = 3. The labels of the two remaining categories are truncated in the source (“Fairly ...”, “Not ...”); the code -9 also appears for every item.]

Preliminary steps: variable recodes (missing labels for NA responses) result in the following new variables: wealthfam, educpar, youreduc, ambition, hardwrk, rightppl, polconnect, bribes, gdbelief, pethnic, prelig, pgender. Preliminary analyses suggest that gdbelief does not behave as expected (it loads on a different dimension than expected), so we’ll leave it out of the analyses.

Factor analysis options: same as before, with a couple of minor differences. Since we know that theoretically, 4 dimensions are expected to emerge, we can require that 4 factors be extracted. Also, since we are dealing with a multiple factor solution, we can choose a rotation method (we’ll try an orthogonal rotation method, Varimax, and an oblique rotation method, Oblimin).

Differences from a one-factor solution:
- A new panel is added to the ‘Total Variance Explained’ table, containing results for the rotated solution.
- In orthogonal rotation variants, factor loadings are presented in the unrotated solution (found in the ‘Component Matrix’ table) and the rotated solution (‘Rotated Component Matrix’). Since the rotated solution provides more interpretable factors, usually one interprets the loadings in the Rotated Component Matrix table and disregards the loadings in the Component Matrix table.

- In oblique rotation variants, there are three versions of factor loadings presented: loadings in the unrotated solution (in the ‘Component Matrix’ table), loadings (as regression coefficients) in the rotated solution (in the ‘Pattern Matrix’ table), and loadings (as correlation coefficients) in the rotated solution (in the ‘Structure Matrix’ table). Usually, the factor pattern coefficients are the most informative.

(Substantive) interpretation of loadings (other tests, e.g. KMO, and coefficients, e.g. eigenvalues, are interpreted in the same way as in the one-factor solution)

Using Varimax rotation, the rotated solution diverges somewhat from an ideal ‘simple structure’ solution (two variables load highly on more than one component: the importance of coming from a wealthy family and the importance of your education). Also, the emerging grouping of variables diverges somewhat from the theoretical expectations (the same two variables mentioned before load on an additional, theoretically unrelated dimension).

[Table: Rotated Component Matrix (a). Components: 1 - Corruption, 2 - Discrimination, 3 - Merit, 4 - (label truncated in the source). Recoverable row: youreduc = -.100, .019, .571, .671.]
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

The Oblimin solution is very similar to the Varimax solution: the same discrepancies arise.

[Table: Pattern Matrix (a). Components: 1 - Corruption, 2 - Merit, 3 - Discrimination, 4 - (label truncated in the source). Recoverable row: wealthfam = .496, -.201, -.054, -.587.]
Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 9 iterations.

A possible solution is to revise:
(a) the interpretation of the ‘importance of coming from a wealthy family’ indicator (it could be considered as both an indicator of strategies based on ascription and a proxy for strategies based on corruption, at least at the public opinion level)
(b) the label for what was supposed to be the ‘ascription-based strategies’ dimension. Given that this dimension is captured by the importance of parental education, of one’s own education, and of coming from a wealthy family, it seems to measure strategies based on socio-economic resources. An alternative explanation could be that some respondents view the educational system as a mechanism for transmitting the advantages of the family of origin (the parents’ education and wealth) rather than as a meritocratic mechanism.
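To make the idea behind Varimax more concrete, the sketch below implements a common SVD-based formulation of the criterion in numpy. It is an assumption-laden illustration, not the SPSS routine: it omits Kaiser normalization, the function names are hypothetical, and the loading matrix is artificial (a perfect simple structure that has been deliberately rotated away by 35 degrees).

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-10):
    """Orthogonal Varimax rotation (no Kaiser normalization); returns rotated loadings and T."""
    p, k = loadings.shape
    T = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ T
        # Gradient-like target of the Varimax criterion
        target = rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        T = u @ vt                                  # closest orthogonal matrix
        current = s.sum()
        if current < previous * (1 + tol):          # no further improvement
            break
        previous = current
    return loadings @ T, T

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return (loadings ** 2).var(axis=0).sum()

# Artificial simple structure: each variable loads on exactly one factor
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
theta = np.radians(35.0)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix                            # every variable now loads on both factors

rotated, T = varimax(unrotated)
print(np.round(rotated, 3))
```

In this artificial case the rotation should bring back one dominant loading per variable (possibly with columns swapped or signs flipped). With real data, such as the ‘getting ahead’ items above, some variables may keep sizeable loadings on more than one component even after rotation.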

DESIRABLE JOB CHARACTERISTICS DIMENSIONS (OPTIONAL QUICK EXAMPLE OF EXPLORATORY FACTOR ANALYSIS WITH CATEGORICAL INDICATORS - MPLUS)

Data: Romanian sample from the EVS (European Values Study), the 2008 wave. You can find out more details about this survey here: http://www.europeanvaluesstudy.eu/ . The survey is a cross-national survey, and the Romanian subsample is a nationally representative sample for the Romanian population.

Variables: job characteristics that people consider as being important. Theoretically, the variables (question wording given below) are considered to tap two underlying dimensions: extrinsic and intrinsic motivations in job performance.

Here are some aspects of a job that people say are important. Please look at them and tell me which ones you personally think are important in a job?

                                                               Mentioned   Not mentioned
Good pay (GOODPAY)                                                 1             0
Good job security (SECURE)                                         1             0
Not too much pressure (PRESSURE)                                   1             0
Good hours (GOODHRS)                                               1             0
Generous holidays (HOLIDAY)                                        1             0
An opportunity to use initiative (INITIAT)                         1             0
A job in which you feel you can achieve something (ACHIEVE)        1             0
A responsible job (RESPONSE)                                       1             0
A job that is interesting (INTEREST)                               1             0
A job that meets one’s abilities (ABILITY)                         1             0

Mplus code:

    TITLE:    EFA 2008
    DATA:     File is Y:\01 Date\sample2.dat ;
    VARIABLE:
      Names are
        Myunid SURVYR BIRTHYR AGE goodpay pressure secur
        Goodhrs initiat holiday achieve response interest ability respect
        Pleasppl promote useful meetppl learn bal say equal weight;
      ! analyze only the ten job-characteristics items
      Usevariables are goodpay - ability;
      ! the items are dichotomous (mentioned / not mentioned)
      Categorical are goodpay - ability;
      Missing ARE all (-1234) ;
      ! keep only the 2008 wave
      USEOBS = SURVYR EQ 2008;
      WEIGHT IS weight;
    ANALYSIS:
      ! exploratory solutions with 1 to 3 factors
      TYPE = efa 1 3;
      ESTIMATOR = wlsmv ;

Mplus output:

Summary loading plot:
