Principal Component Analysis (PCA) For Clustering Gene Expression Data

Transcription

Principal component analysis (PCA) for clustering gene expression data. Ka Yee Yeung, Walter L. Ruzzo. Bioinformatics, v17 #9 (2001), pp 763-774.

Outline of talk: Background and motivation; Design of our empirical study; Results; Summary and conclusions.

Principal Component Analysis (PCA)
- Reduce dimensionality while retaining as much variation as possible
- A linear transformation of the original variables
- Principal components (PC's) are uncorrelated and ordered

Definition of PC's
- First principal component: the direction which maximizes the variability of the data when projected on that axis
- Second PC: the direction, among those orthogonal to the first, maximizing variability
- Fact: they are eigenvectors of A^T A; the eigenvalues are the variances
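The definition above can be checked numerically. A minimal numpy sketch (not from the talk; the data and sizes are illustrative): center the data matrix A, take the eigenvectors of A^T A, and confirm that the first PC carries the largest projected variance and that the PCs are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))          # 100 observations, 5 variables (toy data)
A = A - A.mean(axis=0)                 # center so A^T A is proportional to the covariance

# Eigenvectors of A^T A are the principal axes; eigenvalues give the variances
# (up to the 1/(n-1) scaling).
evals, evecs = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]        # sort PCs by decreasing variance
evals, evecs = evals[order], evecs[:, order]

scores = A @ evecs                     # projections onto the PCs

# The first PC maximizes the projected variance.
proj_var = scores.var(axis=0, ddof=1)
assert proj_var[0] == proj_var.max()
```

`eigh` is used because A^T A is symmetric; an SVD of the centered matrix gives the same axes.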

Motivation
- Chu et al. [1998] identified 7 clusters using Eisen et al.'s CLUSTER software (hierarchical centroid-link) on the yeast sporulation data set.
- Raychaudhuri et al. [2000] applied PCA to the sporulation data, and claimed that the data showed a unimodal distribution in the space of the first 2 PC's.

PC's in Sporulation Data
- 1st two PC's: 90% of variance
- 1st three PC's: 95% of variance

"The unimodal distribution of expression in the most informative two dimensions suggests the genes do not fall into well-defined clusters." -- Raychaudhuri et al.


PCA and clustering
- Euclidean distance: using all p variables, the Euclidean distance between a pair of genes is unchanged after PCA [Jolliffe 1986]; using m variables (m < p), it is an approximation.
- Correlation coefficient: no general relationship before and after PCA.
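Jolliffe's observation is easy to verify: full PCA is an orthogonal rotation of the (centered) data, so with all p components pairwise Euclidean distances are unchanged, while keeping m < p components only approximates them. A small numpy check (illustrative data, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)            # centering does not change pairwise distances

# Project onto all p principal components via the SVD rotation.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                      # scores on all p = 6 PCs

i, j = 3, 17
d_before = np.linalg.norm(Xc[i] - Xc[j])
d_after  = np.linalg.norm(Z[i] - Z[j])
assert np.isclose(d_before, d_after)    # unchanged with all p PCs

# With only the first m < p components the distance can only shrink.
d_m = np.linalg.norm(Z[i, :3] - Z[j, :3])
assert d_m <= d_after + 1e-12
```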

Intuitively, PCA helps clustering. But...
- Under some assumptions, Chang [1983] showed that the set of PC's with the largest eigenvalues does not necessarily capture the cluster structure information.

Outline of talk: Background and motivation; Design of our empirical study; Results; Summary and conclusions.

Our empirical study
Goal: compare the clustering results with and without PCA to an external criterion:
- expression data set with external criterion
- synthetic data sets
- methodology to compare to an external criterion
- clustering algorithms
- similarity metrics

Ovary data (Michel Schummer)
- Randomly selected cDNA's on membrane arrays
- Subset of data: 235 clones, 24 experiments (7 from normal tissues, 4 from blood samples, 13 from ovarian cancers)
- The 235 clones correspond to 4 genes (sizes 58, 88, 57, 32)
- The four genes form the 4 classes (external criterion)

PCA on ovary data
Number of PC's to adequately represent the data:
- 14 PC's cover 90% of the variation
- [scree plot: variance against component number]
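Picking the number of PCs that cover 90% of the variation amounts to a cumulative sum over the sorted eigenvalues of the covariance matrix. A sketch with a random stand-in matrix of the same shape as the ovary data (the real expression values are not reproduced here, so the resulting count will differ from the talk's 14):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the ovary matrix: 235 clones x 24 experiments (random values).
X = rng.normal(size=(235, 24))
Xc = X - X.mean(axis=0)

# Eigenvalues of the sample covariance = variances of the PCs, descending.
evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
frac = np.cumsum(evals) / evals.sum()

# Smallest number of PCs whose cumulative variance fraction reaches 90%.
m = int(np.searchsorted(frac, 0.90) + 1)
print(m, "components cover", round(float(frac[m - 1]), 3))
```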

Synthetic data sets (1): mixture of normal distributions
- Compute the mean vector and covariance matrix for each class in the ovary data
- Generate a random mixture of normal distributions using the mean vectors, covariance matrices, and size distributions from the ovary data
- [figures: histograms of expression level for a tumor class and a normal class]
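The mixture-of-normals generation can be sketched as follows. Since the real per-class means and covariances are not reproduced here, random stand-in parameters are used; only the class sizes (58, 88, 57, 32 over 24 experiments) come from the talk:

```python
import numpy as np

rng = np.random.default_rng(3)

sizes = [58, 88, 57, 32]   # class sizes from the ovary data
p = 24                     # number of experiments
classes = []
for n_k in sizes:
    mean = rng.normal(size=p)                    # stand-in class mean vector
    Arand = rng.normal(size=(p, p))
    cov = Arand @ Arand.T / p + np.eye(p) * 0.1  # stand-in SPD covariance matrix
    # Draw n_k observations from this class's normal distribution.
    classes.append(rng.multivariate_normal(mean, cov, size=n_k))

X = np.vstack(classes)
labels = np.repeat(np.arange(len(sizes)), sizes)
assert X.shape == (235, 24) and labels.shape == (235,)
```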

Synthetic data sets (2): randomly permuted ovary data
- Randomly sample (with replacement) the expression levels in the same class
- Empirical distribution preserved
- But covariance matrix not preserved
- [figure: data matrix of clones (observations) by experimental conditions (variables), resampled one class at a time]
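This resampling scheme, sketched for a single class block (stand-in data): each experimental condition (column) is sampled with replacement from the observed values in that class, so the per-condition empirical distributions survive while the between-condition covariance is destroyed.

```python
import numpy as np

rng = np.random.default_rng(4)
# One class of the data: n_k clones x p conditions (random stand-in values).
block = rng.normal(size=(58, 24))

# Resample each column (condition) with replacement within the class:
# marginal empirical distributions are preserved, covariance is not.
resampled = np.column_stack(
    [rng.choice(block[:, j], size=block.shape[0], replace=True)
     for j in range(block.shape[1])]
)
assert resampled.shape == block.shape
# Every resampled value occurs in the same column of the original class.
assert all(np.isin(resampled[:, j], block[:, j]).all()
           for j in range(block.shape[1]))
```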

Clustering algorithms and similarity metrics
- CAST [Ben-Dor and Yakhini 1999] with correlation: build one cluster at a time; add or remove genes from clusters based on similarity to the genes in the current cluster
- k-means with correlation and with Euclidean distance, initialized with hierarchical average-link
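The add/remove loop of CAST can be sketched compactly. This is a minimal illustrative version of the heuristic, not the talk's implementation: it seeds a cluster, repeatedly adds the unassigned gene with the highest total affinity to the cluster (if it clears the threshold t times the cluster size) and drops low-affinity members, then closes the cluster and repeats.

```python
import numpy as np

def cast(S, t):
    """Minimal CAST-style sketch (after Ben-Dor and Yakhini 1999).
    S is a similarity matrix scaled to [0, 1]; t is the affinity threshold."""
    n = S.shape[0]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Seed a new cluster with the unassigned gene of highest total similarity.
        seed = max(unassigned, key=lambda g: S[g].sum())
        cluster = [seed]
        unassigned.discard(seed)
        changed = True
        while changed:
            changed = False
            if unassigned:                       # ADD step: highest-affinity gene
                g = max(unassigned, key=lambda h: S[h, cluster].sum())
                if S[g, cluster].sum() >= t * len(cluster):
                    cluster.append(g); unassigned.discard(g); changed = True
            if len(cluster) > 1:                 # REMOVE step: lowest-affinity member
                g = min(cluster, key=lambda h: S[h, cluster].sum() - S[h, h])
                if S[g, cluster].sum() - S[g, g] < t * (len(cluster) - 1):
                    cluster.remove(g); unassigned.add(g); changed = True
        clusters.append(cluster)
    return clusters

# Two obvious blocks: high within-block, low between-block similarity.
S = np.full((6, 6), 0.1); S[:3, :3] = 0.9; S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)
clusters = cast(S, t=0.5)
print(sorted(sorted(c) for c in clusters))      # expect [[0, 1, 2], [3, 4, 5]]
```

Affinity here is the summed similarity to current cluster members (excluding a gene's self-similarity in the remove step); the real algorithm adds stabilization details omitted in this sketch.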

Outline of talk: Background and motivation; Design of our empirical study; Results; Summary and conclusions.

Our approach
- Cluster the original data; cluster with the first m PC's (m = m0, ..., p); compare each result to the external criterion
- Cluster with sets of PC's with "high" adjusted Rand indices:
  - greedy approach: exhaustive search for m0 components, then greedily add the next component
  - modified greedy approach
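The comparison to the external criterion uses the adjusted Rand index: pairwise agreement between a clustering and the known classes, corrected for chance (Hubert and Arabie's form). A self-contained numpy sketch of the score used to rank PC subsets (illustrative labelings, not the talk's data):

```python
import numpy as np
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same objects."""
    a = np.asarray(labels_a); b = np.asarray(labels_b)
    ca, cb = np.unique(a), np.unique(b)
    # Contingency table between the two labelings.
    n_ij = np.array([[np.logical_and(a == i, b == j).sum() for j in cb]
                     for i in ca])
    sum_ij = sum(comb(int(x), 2) for x in n_ij.ravel())
    sum_a = sum(comb(int(x), 2) for x in n_ij.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in n_ij.sum(axis=0))
    n_pairs = comb(len(a), 2)
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 2, 2, 2]
assert adjusted_rand(truth, truth) == 1.0        # perfect agreement scores 1
print(adjusted_rand(truth, [0, 0, 1, 1, 1, 2, 2, 2]))
```

A greedy search then simply keeps whichever PC subset yields the highest such score against the external classes.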

Results: ovary data, k-means with correlation
[plot: adjusted Rand index (0.4 to 0.75) vs. number of components (3 to 24), for no PCA, first, greedy, and modified greedy]
- Adjusted Rand index for the first m (m >= 7) PC's is higher than without PCA
- Adjusted Rand index with all 24 PC's is higher than without PCA

Results: ovary data, k-means with Euclidean distance
[plot: adjusted Rand index (0.4 to 0.75) vs. number of components (2 to 24), for first, no PCA, greedy, and modified greedy]
- Sharp drop of the adjusted Rand index from the first 3 to the first 4 PC's

Results: ovary data, CAST with correlation
[plot: adjusted Rand index (0.4 to 0.75) vs. number of components (3 to 24), for first, no PCA, greedy, and modified greedy]
- Adjusted Rand index on the first components is no higher than without PCA
- The greedy or modified greedy approach usually achieves a higher adjusted Rand index than without PCA

[figure: Ovary data, CAST with correlation]

Real data: did PCA help?
- 2 data sets (ovary, cell cycle), 5 algorithms (CAST with correlation; k-means with correlation and with Euclidean distance; average-link with correlation and with Euclidean distance)
- A check mark in the table means clustering using the first c PC's helped (for some c); on both data sets, most algorithm/metric combinations did not improve

Synthetic data: did PCA help?
p-values from the Wilcoxon signed rank test (values significant at the 5% level, bold in the original, are marked *):

Synthetic data     Alternative       CAST    k-means  k-means  ave-link  ave-link
                   hypothesis        corr.   corr.    dist.    corr.     dist.
Mixture of normal  no PCA > first    0.039*  0.995    0.268    0.929     0.609
Mixture of normal  no PCA < first    0.969   0.031*   0.760    0.080     0.418
Random resampled   no PCA > first    0.243   0.909    0.824    0.955     0.684
Random resampled   no PCA < first    0.781   0.103    0.200    0.049*    0.337
Cyclic data        no PCA > first    0.023*  NA       0.053    0.799     0.296
Cyclic data        no PCA < first    0.983   NA       0.956    0.220     0.732
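The paired, one-sided Wilcoxon signed rank comparison can be sketched with scipy. The adjusted-Rand scores below are made-up stand-ins (constructed so that PCA looks slightly better), not the talk's results; only the test setup mirrors the slide.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(5)
# Paired adjusted-Rand scores over repeated synthetic data sets:
# without PCA vs. with the first m PC's (illustrative numbers).
no_pca = rng.uniform(0.4, 0.7, size=20)
first_m = no_pca + rng.normal(0.05, 0.02, size=20)   # PCA slightly better here

# One-sided test of the alternative "no PCA < first m PC's".
stat, p = wilcoxon(no_pca, first_m, alternative="less")
print(f"p = {p:.4f}")
assert 0.0 <= p <= 1.0
```

The `alternative=` parameter selects the direction of the alternative hypothesis, matching the two one-sided rows of the table.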

Some successes
- Alter, O., Brown, P.O. and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101-10106.
- Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R. and Fedoroff, N.V. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad. Sci. USA, 97, 8409-8414.
- Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D. and Brown, P. (2000) 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol., 1(2):RESEARCH0003.

Outline of talk: Background and motivation; Design of our empirical study; Results; Summary and conclusions.

Summary & Conclusions (1)
- PCA may not improve cluster quality: the first PC's may be worse than without PCA; another set of PC's may be better than the first PC's
- The effect of PCA depends on the clustering algorithm and similarity metric:
  - CAST with correlation: the first m PC's are usually worse than without PCA
  - k-means with correlation: PCA usually helps
  - k-means with Euclidean distance: worse after the first few PC's

Summary & Conclusions (2)
- No general trends in the components chosen by the greedy or modified greedy approach; usually the first 2 components are chosen by the exhaustive search step
- Results on the synthetic data are similar to the real data

Bottom line
Successes by other groups make it a technique worth considering, but it should not be applied blindly.

Acknowledgements
- Ka Yee Yeung
- Michèl Schummer
- UW CSE Computational Biology Group

Mathematical definition of PCA
Data: $n$ genes (observations) by $p$ experimental conditions (variables), with entries $X_{i,j}$; $x_j$ is the $j$-th variable.
- The $k$-th PC: $z_k = \sum_{j=1}^{p} a_{jk}\, x_j$
- First PC: maximize $\operatorname{var}(z_1) = a_1^T S a_1$ such that $a_1^T a_1 = 1$, where $S$ is the covariance matrix
- $k$-th PC: maximize $\operatorname{var}(z_k) = a_k^T S a_k$ such that $a_k^T a_k = 1$ and $a_k^T a_i = 0$ for $i < k$

More details on PCA
- It can be shown that $a_k$ is an eigenvector of $S$ corresponding to the $k$-th largest eigenvalue $l_k$; $\operatorname{var}(z_k) = l_k$
- Use the sample covariance matrix: $S(j,k) = \dfrac{\sum_{i=1}^{n} (x_{i,j} - m_{x_j})(x_{i,k} - m_{x_k})}{n-1}$, where $m_{x_j} = \dfrac{\sum_{i=1}^{n} x_{i,j}}{n}$
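The identity var(z_k) = l_k is easy to confirm numerically: project centered data onto the eigenvectors of the sample covariance matrix and compare the score variances to the eigenvalues. A short numpy check on stand-in data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))          # genes x conditions (toy stand-in)

# Sample covariance S(j,k) = sum_i (x_ij - m_j)(x_ik - m_k) / (n - 1)
S = np.cov(X, rowvar=False)

evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]        # k-th largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

# The variance of the k-th PC score z_k equals the k-th largest eigenvalue l_k.
Z = (X - X.mean(axis=0)) @ evecs
assert np.allclose(Z.var(axis=0, ddof=1), evals)
```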
