Analysis of Mass Spectrometry Data for Protein Identification in Complex Biological Mixtures


Analysis of Mass Spectrometry Data for Protein Identification
in Complex Biological Mixtures

by

Marina Spivak

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy

Department of Computer Science
New York University
September 2010

Leslie Greengard

© Marina Spivak, All Rights Reserved, 2010

Acknowledgements

I thank my collaborators Jason Weston, William Stafford Noble and Michael MacCoss for excellent supervision and friendship. I also thank my advisor Leslie Greengard for help with writing this dissertation, as well as for valuable scientific guidance and encouragement. Finally, many thanks to my parents and friends for moral support during all the difficult years in graduate school.

Abstract

Mass spectrometry is a powerful technique in analytical chemistry that was originally designed to determine the composition of small molecules in terms of their constituent elements. In the last several decades, it has begun to be used for much more complex tasks, including the detailed analysis of the amino acid sequence that makes up an unknown protein and even the identification of multiple proteins present in a complex mixture. The latter problem is largely unsolved and is the principal subject of this dissertation.

The fundamental difficulty in the analysis of mass spectrometry data is that of ill-posedness: there are multiple solutions consistent with the experimental data, and the data are subject to significant amounts of noise. In this work, we have developed application-specific machine learning algorithms that (partially) overcome this ill-posedness. We make use of labeled examples of a single class of peptide fragments and of the unlabeled fragments detected by the instrument. This places the approach within the broader framework of semi-supervised learning.

Recently, there has been considerable interest in classification problems of this type, where the learning algorithm only has access to labeled examples of a single class and unlabeled data. The motivation for such problems is that in many applications, examples of one of the two classes are easy and inexpensive

to obtain, whereas the acquisition of examples of a second class is difficult and labor-intensive. For example, in document classification, positive examples are documents that address a specific subject, while unlabeled documents are abundant. In movie rating, the positive data are the movies chosen by clients, while the unlabeled data are all remaining movies in a collection. In medical imaging, positive (labeled) data correspond to images of tissue affected by a disease, while the remaining available images of the same tissue comprise the unlabeled data. Protein identification using mass spectrometry is another variant of this general problem.

In this work, we propose application-specific machine learning algorithms to address this problem. The reliable identification of proteins from mixtures using mass spectrometry would provide an important tool in both biomedical research and clinical practice.

Contents

Acknowledgements  iii
Abstract  iv
List of Figures  xi
List of Tables  xxi

1 Introduction  1

2 Mass Spectrometry For Proteomics  6
  2.1 Introduction  6
  2.2 Mass Spectrometry Pipeline  8
    2.2.1 Tandem Mass Spectrometry (MS/MS)  9
  2.3 From Spectra to Proteins  11
    2.3.1 Database Search  15

    2.3.2 Target-Decoy Strategy  17
    2.3.3 Q-value Estimation  19
    2.3.4 Peptide-Spectrum Match Verification: PeptideProphet and Percolator  22
    2.3.5 Composing the Protein Set  24
  2.4 Conclusion  26

3 PSM Verification  29
  3.1 Learning with Positive and Unlabeled Examples  31
    3.1.1 Two-Step Strategy: First Step  32
    3.1.2 Two-Step Strategy: Second Step  36
    3.1.3 Fully Supervised Setting with Noisy Negatives  39
  3.2 Algorithms for Mass Spectrometry  40
    3.2.1 PeptideProphet  40
    3.2.2 Percolator  43
  3.3 Fully-Supervised Approach  48
    3.3.1 Motivation for Percolator Algorithm  48
    3.3.2 Choosing Loss Functions in Fully Supervised Setting  50
    3.3.3 Supervised Learning Yields Performance Comparable to Percolator  55
  3.4 Conclusion  58

4 Ranking  60
  4.1 Previous Work: Review  62
    4.1.1 Optimizing Area Under the ROC Curve  62
    4.1.2 OWA-based Optimization  63
    4.1.3 ROC Optimization at a Single Point  66
    4.1.4 Lambda Rank  69
  4.2 Algorithm for Direct Q-value Optimization: Q-ranker  71
    4.2.1 Ordered Weighted Average (OWA) Operator  71
    4.2.2 Loss Function Definition  72
    4.2.3 Training Heuristics  75
    4.2.4 Weight Decay  76
    4.2.5 Use of Non-Linear Models  76
    4.2.6 Algorithm  77
    4.2.7 Comparison of Algorithms Across Multiple Data Sets  79
  4.3 Conclusion  89

5 Protein Identification  91
  5.1 Previous Work  93
    5.1.1 ProteinProphet  94
    5.1.2 IDPicker  95
  5.2 The Algorithm: Protein Q-ranker  97

    5.2.1 Input  99
    5.2.2 PSM Scoring Function  100
    5.2.3 Peptide Scoring Function  101
    5.2.4 Protein Scoring Function  101
  5.3 Training the Model  102
    5.3.1 Target-Decoy Protein Identification Problem  102
    5.3.2 Loss Function  106
    5.3.3 Training the Model  108
    5.3.4 Reporting Final Results: Parsimony Rules  109
  5.4 Results  111
    5.4.1 Data Set Description  111
    5.4.2 Main Result  113
    5.4.3 Validation Against Alternative Experimental Techniques  116
    5.4.4 Overlap between ProteinProphet and Protein Q-ranker  119
    5.4.5 Length of Identified Proteins  122
    5.4.6 Multitask Learning  123
  5.5 Conclusion  126

6 Learning Parameters of Theoretical Spectrum Generation  128
  6.1 Introduction  128
  6.2 Parameterizing the Theoretical Spectrum  130

    6.2.1 SEQUEST-style Search  130
    6.2.2 Learning the Theoretical Spectrum Peak Heights  132
    6.2.3 Learning Model Parameters  134
  6.3 Results  136
    6.3.1 Learning in the Context of Protein Identification  137
  6.4 Conclusion  140

7 Conclusions and Future Work  145

Bibliography  148

List of Figures

2.1  Mass Spectrometry Pipeline. A protein population is prepared from a biological source such as a cell culture. The proteins are first separated from a mixed sample by means of an SDS gel. They are then digested by a proteolytic enzyme such as trypsin. The resulting peptides are loaded onto an HPLC column coupled to a mass spectrometer. The peptides are ionized before entering the mass spectrometer [Steen and Mann, 2004].  8

2.2  Fragmentation of a Tetrapeptide Under Collision Dissociation [P. Hernandez and Appel, 2006]. At low collision energies, fragmentation mainly occurs along the peptide backbone bonds, i.e. the N-C bond between amino acids.  12

2.3  Idealized Set of b- and y-Ions due to Collision Dissociation. The expectation is that each molecule of a given peptide will break in a single place, so that many molecules of the same peptide will generate nested sets of fragments. The figure shows the ions for peptide ABCDEFG.  12

2.4  Example of an Annotated MS Spectrum. The information about the peptide sequence can be inferred from the mass differences between the peaks [Jonscher, 2005].  13

2.5  Example of Theoretical and Observed Spectrum. Theoretical (A) and experimental (B) spectra for the peptide sequence DLRSWTAADTAAQISQ [Eng et al., 1994].  15

2.6  Q-value Assignment. The ranking is induced by the discriminant function of a classifier. To assign a q-value to an example, set the threshold such that all the examples ranking above it are considered "accepted" by the classifier, and all those below are considered "rejected".  21

3.1  Schematic Description of the Percolator Algorithm [Käll et al., 2007]. Percolator first selects an initial set of positive peptide-spectrum matches based on the SEQUEST cross-correlation score. It then proceeds in an iterative manner: 1) learns an SVM classifier; 2) re-assigns positive labels based on the discriminant function scores.  45

3.2  Comparison Between Percolator and a Linear SVM. Each panel plots the number of distinct peptides as a function of q value. The series correspond to two different algorithms, including variants of each that use 17 features and 37 features (see table 3.1).  51

3.3  Three Types of Loss Function. Each panel plots the loss as a function of the difference in the true and predicted label. The squared loss L(f(x), y) = (f(x) - y)^2 is often used in regression problems, but also in classification [LeCun et al., 1998]. The hinge loss L(f(x), y) = max(0, 1 - yf(x)) is used as a convex approximation to the zero-one loss in support vector machines [Cortes and Vapnik, 1995]. The sigmoid loss L(f(x), y) = 1/(1 + e^{yf(x)}) is perhaps less commonly used, but is discussed in Mason et al. [2000], Shen et al. [2003].  53

3.4  Comparison of Loss Functions. Each panel plots the number of accepted peptide-spectrum matches for the yeast (A) training set and (B) test set as a function of the q value threshold. Each series corresponds to one of the three loss functions shown in Figure 3.3, with series for Percolator and SEQUEST included for comparison.  56

3.5  "Cutting" the Hinge Loss Makes a Sigmoid-like Loss Called the Ramp Loss. Making the hinge loss have zero gradient when z = y_i f(x) < s, for some chosen value s, effectively makes a piecewise-linear version of a sigmoid function.  57

4.1  Optimal ROC Curve for Q-value Threshold Optimization. While ROC curve B may reflect better ranking performance of a classifier on an entire data set, ROC curve A is more desirable as an output of a q-value optimization procedure. ROC curve A crosses the line defined by the specified q-value threshold at a point that will give more true positives with the same statistical confidence level.  61

4.2  Optimization of a Single Point on the ROC Curve. A) On ROC curve B, the target values ρ_CA and ρ_CR correspond to two distinct points on the curve. The goal of the algorithms in [Mozer et al., 2002] is to find a classifier producing ROC curve A such that the target values ρ_CA and ρ_CR correspond to a single point on the curve. The first family of algorithms emphasizes the examples lying between the two points on ROC curve B to arrive at a classifier with ROC curve A. (Alternatively, it de-emphasizes irrelevant examples.) B) The second family of algorithms defines a constrained optimization problem which maximizes the correct rejection (CR) rate while keeping the correct acceptance (CA) rate fixed to ρ_CA, for example.  67

4.3  Comparison of PeptideProphet, Percolator and Q-ranker on Training Set. Each panel plots the number of accepted target peptide-spectrum matches as a function of q value on the training (trn) set. The series correspond to the three different algorithms, including two variants of Q-ranker that use 17 features and 37 features.  82

4.4  Comparison of PeptideProphet, Percolator and Q-ranker on Testing Set. Each panel plots the number of accepted target peptide-spectrum matches as a function of q value on the testing (tst) set. The series correspond to the three different algorithms, including two variants of Q-ranker that use 17 features and 37 features.  83

4.5  Comparison of Training Optimization Methods (Iteration vs. Error Rate). The Q-ranker optimization starts from the best result of sigmoid loss optimization achieved during the course of training and continues for a further 300 iterations. These results are on the training set. Note that for each q value choice, Q-ranker improves the training error over the best result from the classification algorithm.  84

4.6  Comparison of PeptideProphet, Percolator and Q-ranker on Training Set in Terms of the Number of Unique Peptides Identified Over the Range of Q-values. Each panel plots the number of unique real database peptides as a function of q value on the training (trn) set. The series correspond to the three different algorithms.
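The loss functions catalogued in Figure 3.3, and the ramp loss of Figure 3.5, can be written out directly. The sketch below is illustrative only: the function names and the default cutoff s = -1 are assumptions, not definitions taken from the dissertation, which specifies only the formulas.

```python
import math

# Illustrative sketch of the losses named in Figures 3.3 and 3.5.
# Function names are hypothetical; only the formulas come from the text.

def squared_loss(fx, y):
    """Squared loss: L(f(x), y) = (f(x) - y)^2."""
    return (fx - y) ** 2

def hinge_loss(fx, y):
    """Hinge loss: L(f(x), y) = max(0, 1 - y*f(x))."""
    return max(0.0, 1.0 - y * fx)

def sigmoid_loss(fx, y):
    """Sigmoid loss: L(f(x), y) = 1 / (1 + exp(y*f(x)))."""
    return 1.0 / (1.0 + math.exp(y * fx))

def ramp_loss(fx, y, s=-1.0):
    """Ramp loss (Figure 3.5): the hinge loss "cut" at margin z = s, so the
    loss is flat (zero gradient) once z = y*f(x) falls below s."""
    z = y * fx
    return max(0.0, min(1.0 - s, 1.0 - z))
```

A confidently correct prediction (margin z well above 1) incurs zero hinge and ramp loss, while the ramp loss saturates at 1 - s for badly misclassified points, which is what makes it robust to outliers compared with the unbounded hinge.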
