Transcription
Principal component analysis (PCA) for clustering gene expression data
Ka Yee Yeung, Walter L. Ruzzo
Bioinformatics, v17 #9 (2001), pp 763-774
Outline of talk
- Background and motivation
- Design of our empirical study
- Results
- Summary and Conclusions
Principal Component Analysis (PCA)
- Reduce dimensionality; retain as much variation as possible
- Linear transformation of the original variables
- Principal components (PCs) are uncorrelated and ordered
[Figure: data projected onto PC1 and PC2]
Definition of PCs
- First principal component: the direction which maximizes the variability of the data when projected on that axis
- Second PC: the direction, among those orthogonal to the first, maximizing variability
- Fact: they are eigenvectors of A^T A; the eigenvalues are the variances
Motivation
- Chu et al. [1998] identified 7 clusters using Eisen et al.'s CLUSTER software (hierarchical centroid-link) on the yeast sporulation data set.
- Raychaudhuri et al. [2000] applied PCA to the sporulation data, and claimed that the data showed a unimodal distribution in the space of the first 2 PCs.
PCs in Sporulation Data
- First two PCs: 90% of variance
- First three PCs: 95% of variance
"The unimodal distribution of expression in the most informative two dimensions suggests the genes do not fall into well-defined clusters." -- Raychaudhuri et al.
PCA and clustering
- Euclidean distance:
  - using all p variables, the Euclidean distance between a pair of genes is unchanged after PCA [Jolliffe 1986]
  - using m variables (m < p): an approximation
- Correlation coefficient: no general relationship before and after PCA
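The Jolliffe observation above is easy to verify numerically: projecting centered data onto all p eigenvectors of the covariance matrix is an orthogonal transformation, so pairwise Euclidean distances are preserved exactly. A minimal sketch on toy data (the data and sizes here are illustrative, not from the ovary set):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # toy data: 20 genes x 5 conditions

# Full PCA: project centered data onto all p eigenvectors of the covariance.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]     # sort components by decreasing variance
Z = Xc @ eigvecs[:, order]            # scores in PC space

# Distances between a pair of genes are unchanged when all p PCs are kept,
# because the change of basis is a rotation.
d_before = np.linalg.norm(Xc[0] - Xc[1])
d_after = np.linalg.norm(Z[0] - Z[1])
print(abs(d_before - d_after) < 1e-10)   # True
```

Keeping only the first m < p columns of Z gives the approximation mentioned above.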
Intuitively, PCA helps clustering. But...
- Under some assumptions, Chang [1983] showed that the set of PCs with the largest eigenvalues does not necessarily capture the cluster structure information.
Outline of talk
- Background and motivation
- Design of our empirical study
- Results
- Summary and Conclusions
Our empirical study
Goal: compare the clustering results with and without PCA to an external criterion:
- expression data set with external criterion
- synthetic data sets
- methodology to compare to an external criterion
- clustering algorithms
- similarity metrics
Ovary data (Michèl Schummer)
- Randomly selected cDNAs on membrane arrays
- Subset of data: 235 clones, 24 experiments (7 from normal tissues, 4 from blood samples, 13 from ovarian cancers)
- The 235 clones correspond to 4 genes (sizes 58, 88, 57, 32)
- The four genes form the 4 classes (external criterion)
PCA on ovary data
Number of PCs needed to adequately represent the data:
- 14 PCs cover 90% of the variation
[Figure: scree plot of variance against component number]
Synthetic data sets (1)
Mixture of normal distributions:
- Compute the mean vector and covariance matrix for each class in the ovary data
- Generate a random mixture of normal distributions using the mean vectors, covariance matrices, and size distributions from the ovary data
[Figure: histograms of expression levels for a tumor class and a normal class]
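The recipe above can be sketched in a few lines. This is an illustration, not the authors' code: the `classes` here are random stand-ins, and only the class sizes (58, 88, 57, 32) come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def synth_mixture(class_data, class_sizes, rng):
    """Draw a mixture of multivariate normals whose mean vector and
    covariance matrix are estimated from each real class."""
    samples = []
    for X, n in zip(class_data, class_sizes):
        mu = X.mean(axis=0)                 # per-class mean vector
        cov = np.cov(X, rowvar=False)       # per-class covariance matrix
        samples.append(rng.multivariate_normal(mu, cov, size=n))
    return np.vstack(samples)

# Toy stand-ins for the four ovary classes (30 genes x 24 conditions each).
classes = [rng.normal(loc=c, size=(30, 24)) for c in range(4)]
synthetic = synth_mixture(classes, [58, 88, 57, 32], rng)
print(synthetic.shape)   # (235, 24), matching the ovary data dimensions
```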
Synthetic data sets (2)
Randomly permuted ovary data:
- Randomly sample (with replacement) the expression levels within the same class
- The empirical distribution is preserved
- But the covariance matrix is not preserved
[Figure: n clones (observations) by p experimental conditions (variables), resampled within each class]
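A minimal sketch of this second recipe, assuming (as the slide suggests) that each cell is drawn independently from its own column within the class, so column marginals survive but between-column covariance does not:

```python
import random

def resample_within_class(class_rows, rng):
    """For each experimental condition (column), sample expression levels
    with replacement from the same class.  Each column's empirical
    distribution is preserved; correlations between columns are not."""
    n = len(class_rows)
    p = len(class_rows[0])
    # Pick an independent random donor row for every cell.
    return [[class_rows[rng.randrange(n)][j] for j in range(p)]
            for _ in range(n)]

rng = random.Random(2)
cls = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # toy class: 3 clones x 2 conditions
resampled = resample_within_class(cls, rng)
print(len(resampled), len(resampled[0]))         # same shape as the input class
```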
Clustering algorithms and similarity metrics
- CAST [Ben-Dor and Yakhini 1999] with correlation
  - builds one cluster at a time
  - adds or removes genes from clusters based on similarity to the genes in the current cluster
- k-means with correlation and with Euclidean distance
  - initialized with hierarchical average-link
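The CAST add/remove loop can be sketched as follows. This is a simplified illustration of the idea on the slide, not Ben-Dor and Yakhini's exact algorithm: the affinity threshold `t` and the toy similarity matrix are assumptions.

```python
def cast(S, t):
    """Sketch of CAST: build one cluster at a time; add the unassigned gene
    with the highest average similarity (affinity) to the open cluster if it
    exceeds t, remove a member whose affinity falls below t, and repeat
    until the cluster stabilizes."""
    def affinity(g, cluster):
        others = [h for h in cluster if h != g]
        if not others:
            return 1.0                      # an empty cluster accepts any seed
        return sum(S[g][h] for h in others) / len(others)

    unassigned = set(range(len(S)))
    clusters = []
    while unassigned:
        cluster = set()
        changed = True
        while changed:
            changed = False
            if unassigned:                  # ADD the highest-affinity outsider
                best = max(unassigned, key=lambda g: affinity(g, cluster))
                if affinity(best, cluster) >= t:
                    cluster.add(best); unassigned.discard(best); changed = True
            if len(cluster) > 1:            # REMOVE the lowest-affinity member
                worst = min(cluster, key=lambda g: affinity(g, cluster))
                if affinity(worst, cluster) < t:
                    cluster.remove(worst); unassigned.add(worst); changed = True
        clusters.append(sorted(cluster))
    return clusters

# Toy similarity matrix with two obvious blocks: {0,1} and {2,3}.
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.9],
     [0.1, 0.1, 0.9, 1.0]]
print(cast(S, 0.5))
```

In the study the similarity entries would be correlation coefficients between gene expression profiles.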
Outline of talk
- Background and motivation
- Design of our empirical study
- Results
- Summary and Conclusions
Our approach
- Cluster the original data
- Cluster with the first m PCs (m = m0, ..., p)
- Cluster with sets of PCs with "high" adjusted Rand indices:
  - greedy approach: exhaustive search for m0 components, then greedily add the next component
  - modified greedy approach
- Compare to the external criterion
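The comparison to the external criterion uses the adjusted Rand index, which measures agreement between two partitions of the same genes, corrected for chance (1 means identical partitions, around 0 means random agreement). A self-contained implementation from the standard contingency-table formula:

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Perfect agreement scores 1; the label names themselves do not matter.
print(adjusted_rand([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
```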
Results: ovary data, k-means with correlation
[Figure: adjusted Rand index (0.4-0.75) vs. number of components (3-24), for no PCA, first, greedy, and modified greedy]
- The adjusted Rand index for the first m (m >= 7) PCs is higher than without PCA
- The adjusted Rand index with all 24 PCs is higher than without PCA
Results: ovary data, k-means with Euclidean distance
[Figure: adjusted Rand index (0.4-0.75) vs. number of components (2-24), for no PCA, first, greedy, and modified greedy]
- Sharp drop in the adjusted Rand index from the first 3 to the first 4 PCs
Results: ovary data, CAST with correlation
[Figure: adjusted Rand index (0.4-0.75) vs. number of components (3-24), for no PCA, first, greedy, and modified greedy]
- The adjusted Rand index on the first components is lower than without PCA
- The greedy or modified greedy approach usually achieves a higher adjusted Rand index than without PCA
Ovary/CAST, correlation [figure]
Real data: did PCA help?
2 data sets, 5 algorithms. A check mark means clustering using the first c PCs helped (for some c):

              CAST      k-means   k-means   Ave-link  Ave-link
              (corr.)   (corr.)   (dist.)   (corr.)   (dist.)
Ovary           -         yes       yes       -         -
Cell cycle      -         yes       -         -         -
Synthetic data: did PCA help?
p-values from the Wilcoxon signed rank test (values with p < 5% are marked *):

Synthetic data     Alternative        CAST     k-means  k-means  Ave-link Ave-link
                   hypothesis         (corr.)  (corr.)  (dist.)  (corr.)  (dist.)
Mixture of normal  no PCA > first     *0.039*  0.995    0.268    0.929    0.609
Mixture of normal  no PCA < first     0.969    *0.031*  0.760    0.080    0.418
Random resampled   no PCA > first     0.243    0.909    0.824    0.955    0.684
Random resampled   no PCA < first     0.781    0.103    0.200    *0.049*  0.337
Cyclic data        no PCA > first     *0.023*  NA       0.053    0.799    0.296
Cyclic data        no PCA < first     0.983    NA       0.956    0.220    0.732
Some successes
- Alter, O., Brown, P.O. and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101-10106.
- Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R. and Fedoroff, N.V. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad. Sci. USA, 97, 8409-8414.
- Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D. and Brown, P. (2000) 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol., 1(2):RESEARCH0003.
Outline of talk
- Background and motivation
- Design of our empirical study
- Results
- Summary and Conclusions
Summary & Conclusions (1)
- PCA may not improve cluster quality
  - the first PCs may be worse than without PCA
  - another set of PCs may be better than the first PCs
- The effect of PCA depends on the clustering algorithm and similarity metric
  - CAST with correlation: the first m PCs are usually worse than without PCA
  - k-means with correlation: PCA usually helps
  - k-means with Euclidean distance: worse after the first few PCs
Summary & Conclusions (2)
- No general trends in the components chosen by the greedy or modified greedy approach
  - usually the first 2 components are chosen by the exhaustive search step
- Results on the synthetic data are similar to the real data
Bottom Line
Successes by other groups make PCA a technique worth considering, but it should not be applied blindly.
Acknowledgements
Ka Yee Yeung, Michèl Schummer, and the UW CSE Computational Biology Group
Mathematical definition of PCA
Data: $X$ is an $n \times p$ matrix of $n$ genes (observations) by $p$ experimental conditions (variables); $x_j$ is the $j$-th variable.
- The $k$-th PC: $z_k = \sum_{j=1}^{p} a_{jk} x_j$
- First PC: maximize $\mathrm{var}(z_1) = a_1^T S a_1$ such that $a_1^T a_1 = 1$, where $S$ is the covariance matrix
- $k$-th PC: maximize $\mathrm{var}(z_k) = a_k^T S a_k$ such that $a_k^T a_k = 1$ and $a_k^T a_i = 0$ for $i < k$
More details on PCA
- It can be shown that $a_k$ is an eigenvector of $S$ corresponding to the $k$-th largest eigenvalue $l_k$, and $\mathrm{var}(z_k) = l_k$
- Use the sample covariance matrix: $S(j,k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i,j} - m_{x_j})(x_{i,k} - m_{x_k})$, where $m_{x_j} = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}$
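These two facts are straightforward to check numerically: eigendecompose the sample covariance matrix and compare the variances of the resulting scores $z_k$ against the eigenvalues. A sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                   # toy data: n genes x p conditions

# Sample covariance S(j,k) = sum_i (x_ij - m_j)(x_ik - m_k) / (n - 1)
S = np.cov(X, rowvar=False)

# a_k is the eigenvector of S for the k-th largest eigenvalue l_k.
l, A = np.linalg.eigh(S)                       # eigh returns ascending order
l, A = l[::-1], A[:, ::-1]                     # largest eigenvalue first
Z = (X - X.mean(axis=0)) @ A                   # PC scores z_k

# var(z_k) == l_k for every component.
print(np.allclose(Z.var(axis=0, ddof=1), l))   # True
```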