Gene Selection for Cancer Classification using Support Vector Machines
Aixia Guo
Department of Mathematics, UH
May 08, 2014

Content
- Motivation
- Methodology
- Problem Description and Prior Work
- Support Vector Machines (SVMs) and Recursive Feature Elimination (RFE)
- Experimental Results
- Conclusion

Motivation: Cancer vs. Normal Tissues
- Micro-array technology measures the expression levels of 10,000 to 30,000 genes simultaneously in a single experiment
- Micro-array devices generate bewildering amounts of raw data
- Methods are needed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues

Methodology
Address the problem by a new method of gene selection utilizing Support Vector Machine (SVM) methods based on Recursive Feature Elimination (RFE):
(1) To select a small subset of genes from broad gene expression data
(2) To build a classifier by using available training examples from cancer and normal patients

Terminology
Gene = feature = attribute = column
Pattern: a vector of n components (features)

Patient | Gene 1 | Gene 2  | ... | Gene n  | Class
1       | 100.30 | 200.52  | ... | 1000.11 | +
2       | 220.56 | 500.31  | ... | 600.75  | -
...     |        |         |     |         |
m       | 150.24 | 1000.20 | ... | 300.33  | -

Example of gene expressions
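The representation above (one row per patient, one column per gene, plus a two-class label) can be sketched in NumPy. The values below mirror the slide's toy table; the ±1 encoding of the class labels is an assumption for illustration, not something the slides specify.

```python
import numpy as np

# Hypothetical gene-expression matrix: rows are patients (patterns),
# columns are gene features; values follow the slide's example table.
X = np.array([
    [100.30,  200.52, 1000.11],   # patient 1
    [220.56,  500.31,  600.75],   # patient 2
    [150.24, 1000.20,  300.33],   # patient m
])

# Class labels, encoded here as +1 / -1 (e.g. cancer vs. normal)
y = np.array([+1, -1, -1])

m, n = X.shape  # m patients, each a pattern of n gene features
print(m, n)     # 3 3
```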

Prior Work on Space Dimensionality Reduction
A common method to reduce the feature-space dimension is to project onto the first few principal directions of the data (see, e.g., Duda, 1973). The new features obtained are linear combinations of the original features.
Disadvantage: none of the original input features can be discarded.
New pruning techniques are needed: eliminate some of the original input features and retain a minimum subset of features that yields the best classification performance.

Feature-Ranking Technique
Feature ranking with correlation coefficients:
- Select the genes that individually classify the training data best
- Eliminate genes that are useless for discrimination
Evaluating how well an individual feature contributes to the separation (e.g. cancer vs. normal) can produce a simple feature (gene) ranking.

Various correlation coefficients are used as ranking criteria. The coefficient used in Golub (1999) is defined as:

w_i = (mu_i(+) - mu_i(-)) / (sigma_i(+) + sigma_i(-))

where mu_i and sigma_i are the mean and standard deviation of the gene expression values of gene i for all the patients of class (+) or class (-), i = 1, ..., n. Large positive w_i values indicate strong correlation with class (+), whereas large negative w_i values indicate strong correlation with class (-).
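The correlation-coefficient ranking above can be sketched in a few lines of NumPy. The toy dataset below is made up for illustration (it is not the leukemia data), with one gene deliberately shifted to be strongly (+)-correlated.

```python
import numpy as np

def golub_correlation(X, y):
    """Per-gene ranking score w_i = (mu_i(+) - mu_i(-)) / (sigma_i(+) + sigma_i(-))."""
    pos, neg = X[y == +1], X[y == -1]
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    sd_pos, sd_neg = pos.std(axis=0), neg.std(axis=0)
    return (mu_pos - mu_neg) / (sd_pos + sd_neg)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 patients, 5 "genes" of noise
y = np.array([+1] * 10 + [-1] * 10)
X[y == +1, 0] += 3.0                  # gene 0: strong (+)-class signal

w = golub_correlation(X, y)
ranking = np.argsort(-np.abs(w))      # best-discriminating genes first
print(int(ranking[0]))                # gene 0 ranks first
```

Note that this criterion scores each gene individually, which is exactly the limitation SVM-RFE addresses: it cannot account for redundancy or interactions between genes.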

Recursive Feature Elimination
1) Train the classifier
2) Compute the ranking criterion for all features
3) Remove the feature with the smallest ranking criterion
4) Repeat

Feature Ranking with Support Vector Machines (SVM)
The idea is to use the weights of a classifier to produce a feature ranking. In this paper, the classifier used is a linear SVM (Boser, 1992; Vapnik, 1998). Presently, SVM is one of the best-known classification techniques, with computational advantages over its contenders (Cristianini, 1999).

SVM Recursive Feature Elimination (SVM-RFE)
SVM-RFE is an application of RFE using the weight magnitude as the ranking criterion. Below is an outline of the algorithm in the linear case, using SVM-train in Eq. (5).
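A minimal sketch of the linear-case SVM-RFE loop is given below. It uses scikit-learn's linear SVM as the SVM-train step and w_i^2 as the ranking criterion; the library, the toy data, and the choice to eliminate one feature per iteration are assumptions for illustration, not details fixed by the slides.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=2, C=1.0):
    """Recursively eliminate the feature with the smallest w_i^2."""
    remaining = list(range(X.shape[1]))
    eliminated = []                        # worst feature first
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()              # weight vector of the linear SVM
        worst = int(np.argmin(w ** 2))     # ranking criterion: w_i^2
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))               # 40 patients, 6 "genes"
y = np.array([+1] * 20 + [-1] * 20)
X[y == +1, 0] += 2.0                       # genes 0 and 1 carry the signal
X[y == +1, 1] -= 2.0

kept, dropped = svm_rfe(X, y, n_keep=2)
print(sorted(kept))                        # the informative genes should survive
```

Eliminating chunks of features per iteration (as done in the paper for the 7129-gene data) only changes how many indices are popped per pass.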

Experimental Results
1) A small subset of selected features has the best classification results
2) The features selected matter more than the classifier used
3) SVM-RFE selects relevant genes

Gene Expression Dataset and the Classification Problem
Leukemia data is available on-line. The problem is to distinguish between two variants of leukemia (ALL and AML). The data is split into two subsets: a training set, used to select genes and adjust the weights of the classifiers, and an independent test set, used to estimate the performance of the system obtained.
Their training set consists of 38 samples (27 ALL and 11 AML) from bone marrow specimens. Their test set has 34 samples (20 ALL and 14 AML), prepared under different experimental conditions and including 24 bone marrow and 10 blood sample specimens. All samples have 7129 features, corresponding to some normalized gene expression value extracted from the micro-array image.

A small subset of selected features has the best classification results

# of genes | Train accuracy | Test accuracy
All        | .97            |
16         | 1.00           | 1.00
8          | 1.00           | 1.00
4          | 0.97           | 0.91
2          | 0.97           | 0.88
1          | 0.92           | 0.79

Features selected matter more than the classifier used
Fewer genes selected by SVM-RFE give better classification results compared to the genes selected by correlation coefficients.

Features selected matter more than the classifier used
The baseline classifier (not an SVM) has better classification results with the SVM-RFE features compared to using the baseline genes (correlation-selected genes).

SVM-RFE selects relevant genes

Conclusion
- It is the genes selected by SVM-RFE, rather than the classifier, that yield the better classification performance
- The selected genes are closely related to the diseases
- In contrast with the baseline method, their method eliminates gene redundancy automatically and yields better and more compact gene subsets

References
[1] Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144-152). Pittsburgh: ACM.
[2] Cristianini, N. & Shawe-Taylor, J. (1999). An introduction to support vector machines. Cambridge, MA: Cambridge University Press.
[3] Duda, R. O. & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
[4] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., & Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537. The data is available on-line at http://www.genome.wi.mit.edu/MPR/data_set_ALL_AML.html.
[5] Vapnik, V. N. (1998). Statistical learning theory. Wiley Interscience.

Thank you
