Bioinformatics - WordPress

Transcription

Bioinformatics

Adaptive Computation and Machine LearningThomas Dietterich, EditorChristopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns,Associate EditorsBioinformatics: The Machine Learning Approach, Pierre Baldi and Søren BrunakReinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Pierre BaldiSøren BrunakBioinformaticsThe Machine Learning ApproachA Bradford BookThe MIT PressCambridge, MassachusettsLondon, England

c 2001Massachusetts Institute of TechnologyAll rights reserved. No part of this book may be reproduced in any formby any electronic or mechanical means (including photocopying, recording,or information storage and retrieval) without permission in writing from thepublisher.This book was set in Lucida by the authors and was printed and bound in theUnited States of America.Library of Congress Cataloging-in-Publication DataBaldi, Pierre.Bioinformatics : the machine learning approach / Pierre Baldi,Søren Brunak.—2nd ed.p. cm.—(Adaptive computation and machine learning)"A Bradford Book"Includes bibliographical references (p. ).ISBN 0-262-02506-X (hc. : alk. paper)1. Bioinformatics. 2. Molecular biology—Computer simulation. 3. Molecularbiology—Mathematical models. 4. Neural networks (Computer science). 5.Machine learning. 6. Markov processes. I. Brunak, Søren. II. Title. III. Series.QH506.B35 2001572.8 01 13—dc212001030210

Series ForewordThe first book in the new series on Adaptive Computation and Machine Learning, Pierre Baldi and Søren Brunak’s Bioinformatics provides a comprehensiveintroduction to the application of machine learning in bioinformatics. Thedevelopment of techniques for sequencing entire genomes is providing astronomical amounts of DNA and protein sequence data that have the potentialto revolutionize biology. To analyze this data, new computational tools areneeded—tools that apply machine learning algorithms to fit complex stochastic models. Baldi and Brunak provide a clear and unified treatment of statistical and neural network models for biological sequence data. Students and researchers in the fields of biology and computer science will find this a valuableand accessible introduction to these powerful new computational techniques.The goal of building systems that can adapt to their environments andlearn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience,and cognitive science. Out of this research has come a wide variety of learningtechniques that have the potential to transform many scientific and industrialfields. Recently, several research communities have begun to converge on acommon set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation andMachine Learning seeks to unify the many diverse strands of machine learningresearch and to foster high quality research and innovative applications.Thomas Dietterichix

ContentsSeries ForewordixPrefacexi1 Introduction1.1Biological Data in Digital Symbol Sequences1.2Genomes—Diversity, Size, and Structure1.3Proteins and Proteomes1.4On the Information Content of Biological Sequences1.5Prediction of Molecular Function and Structure1171624432 Machine-Learning Foundations: The Probabilistic Framework2.1Introduction: Bayesian Modeling2.2The Cox Jaynes Axioms2.3Bayesian Inference and Induction2.4Model Structures: Graphical Models and Other Tricks2.5Summary4747505360643 Probabilistic Modeling and Inference: Examples3.1The Simplest Sequence Models3.2Statistical Mechanics6767734 Machine Learning Algorithms4.1Introduction4.2Dynamic Programming4.3Gradient Descent4.4EM/GEM Algorithms4.5Markov-Chain Monte-Carlo Methods4.6Simulated Annealing4.7Evolutionary and Genetic Algorithms4.8Learning Algorithms: Miscellaneous Aspects818182838487919394v

viContents5 Neural Networks: The Theory5.1Introduction5.2Universal Approximation Properties5.3Priors and Likelihoods5.4Learning Algorithms: Backpropagation99991041061116 Neural Networks: Applications6.1Sequence Encoding and Output Interpretation6.2Sequence Correlations and Neural Networks6.3Prediction of Protein Secondary Structure6.4Prediction of Signal Peptides and Their Cleavage Sites6.5Applications for DNA and RNA Nucleotide Sequences6.6Prediction Performance Evaluation6.7Different Performance Measures1131141191201331361531557 Hidden Markov Models: The Theory7.1Introduction7.2Prior Information and Initialization7.3Likelihood and Basic Algorithms7.4Learning Algorithms7.5Applications of HMMs: General Aspects1651651701721771848 Hidden Markov Models: Applications8.1Protein Applications8.2DNA and RNA Applications8.3Advantages and Limitations of HMMs1891892092229 Probabilistic Graphical Models in Bioinformatics9.1The Zoo of Graphical Models in Bioinformatics9.2Markov Models and DNA Symmetries9.3Markov Models and Gene Finders9.4Hybrid Models and Neural Network Parameterization ofGraphical Models9.5The Single-Model Case9.6Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction22522523023410 Probabilistic Models of Evolution: Phylogenetic Trees10.1Introduction to Probabilistic Models of Evolution10.2Substitution Probabilities and Evolutionary Rates10.3Rates of Evolution10.4Data Likelihood10.5Optimal Trees and Learning265265267269270273239241255

viiContents10.610.711 imonyExtensionsGrammars and LinguisticsIntroduction to Formal GrammarsFormal Grammars and the Chomsky HierarchyApplications of Grammars to Biological SequencesPrior Information and InitializationLikelihoodLearning AlgorithmsApplications of SCFGsExperimentsFuture Directions27327527727727828428828929029229329512 Microarrays and Gene Expression12.1Introduction to Microarray Data12.2Probabilistic Modeling of Array Data12.3Clustering12.4Gene Regulation29929930131332013 Internet Resources and Public Databases13.1A Rapidly Changing Set of Resources13.2Databases over Databases and Tools13.3Databases over Databases in Molecular Biology13.4Sequence and Structure Databases13.5Sequence Similarity Searches13.6Alignment13.7Selected Prediction Servers13.8Molecular Biology Software Links13.9Ph.D. Courses over the Internet13.10 Bioinformatics Societies13.11 HMM/NN simulator323323324325327333335336341343344344A 51352352353Decision Theory and Loss FunctionsQuadratic Loss FunctionsThe Bias/Variance Trade-offCombining EstimatorsError BarsSufficient StatisticsExponential FamilyAdditional Useful Distributions

viiiContentsA.9Variational Methods354B Information Theory, Entropy, and Relative EntropyB.1EntropyB.2Relative EntropyB.3Mutual InformationB.4Jensen’s InequalityB.5Maximum EntropyB.6Minimum Relative Entropy357357359360361361362C Probabilistic Graphical ModelsC.1Notation and PreliminariesC.2The Undirected Case: Markov Random FieldsC.3The Directed Case: Bayesian Networks365365367369D HMM Technicalities, Scaling, Periodic Architectures,State Functions, and Dirichlet MixturesD.1ScalingD.2Periodic ArchitecturesD.3State Functions: BendabilityD.4Dirichlet Mixtures375375377380382E Gaussian Processes, Kernel Methods, and SupportVector MachinesE.1Gaussian Process ModelsE.2Kernel Methods and Support Vector MachinesE.3Theorems for Gaussian Processes and SVMs387387389395F Symbols and Abbreviations399References409Index447

This page intentionally left blank

PrefaceWe have been very pleased, beyond our expectations, with the reception ofthe first edition of this book. Bioinformatics, however, continues to evolvevery rapidly, hence the need for a new edition. In the past three years, fullgenome sequencing has blossomed with the completion of the sequence ofthe fly and the first draft of the Human Genome Project. In addition, severalother high-throughput/combinatorial technologies, such as DNA microarraysand mass spectrometry, have considerably progressed. Altogether, these highthroughput technologies are capable of rapidly producing terabytes of datathat are too overwhelming for conventional biological approaches. As a result, the need for computer/statistical/machine learning techniques is todaystronger rather than weaker.Bioinformatics in the Post-genome EraIn all areas of biological and medical research, the role of the computer hasbeen dramatically enhanced in the last five to ten year period. While the firstwave of computational analysis did focus on sequence analysis, where manyhighly important unsolved problems still remain, the current and future needswill in particular concern sophisticated integration of extremely diverse setsof data. These novel types of data originate from a variety of experimentaltechniques of which many are capable of data production at the levels of entirecells, organs, organisms, or even populations.The main driving force behind the changes has been the advent of new, efficient experimental techniques, primarily DNA sequencing, that have led to anexponential growth of linear descriptions of protein, DNA and RNA molecules.Other new data producing techniques work as massively parallel versions oftraditional experimental methodologies. Genome-wide gene expression measurements using DNA microrarrays is, in essence, a realization of tens of thousands of Northern blots. As a result, computational support in experiment design, processing of results and interpretation of results has become essential.xi

xiiPrefaceThese developments have greatly widened the scope of bioinformatics.As genome and other sequencing projects continue to advance unabated,the emphasis progressively switches from the accumulation of data to its interpretation. Our ability in the future to make new biological discoveries willdepend strongly on our ability to combine and correlate diverse data sets alongmultiple dimensions and scales, rather than a continued effort focused in traditional areas. Sequence data will have to be integrated with structure andfunction data, with gene expression data, with pathways data, with phenotypicand clinical data, and so forth. Basic research within bioinformatics will haveto deal with these issues of system and integrative biology, in the situationwhere the amount of data is growing exponentially.The large amounts of data create a critical need for theoretical, algorithmic,and software advances in storing, retrieving, networking, processing, analyzing, navigating, and visualizing biological information. In turn, biological systems have inspired computer science advances with new concepts, includinggenetic algorithms, artificial neural networks, computer viruses and syntheticimmune systems, DNA computing, artificial life, and hybrid VLSI-DNA genechips. This cross-fertilization has enriched both fields and will continue to doso in the coming decades. In fact, all the boundaries between carbon-basedand silicon-based information processing systems, whether conceptual or material, have begun to shrink [29].Computational tools for classifying sequences, detecting weak similarities,separating protein coding regions from non-coding regions in DNA sequences,predicting molecular structure, post-translational modification and function,and reconstructing the underlying evolutionary history have become an essential component of the research process. This is essential to our understandingof life and evolution, as well as to the discovery of new drugs and therapies.Bioinformatics has emerged as a strategic discipline at the frontier betweenbiology and computer science, impacting medicine, biotechnology, and societyin many ways.Large databases of biological information create both challenging datamining problems and opportunities, each requiring new ideas. In this regard,conventional computer science algorithms have been useful, but are increasingly unable to address many of the most interesting sequence analysis problems. This is due to the inherent complexity of biological systems, broughtabout by evolutionary tinkering, and to our lack of a comprehensive theoryof life’s organization at the molecular level. Machine-learning approaches (e.g.neural networks, hidden Markov models, vector support machines, belief networks), on the other hand, are ideally suited for domains characterized bythe presence of large amounts of data, “noisy” patterns, and the absence ofgeneral theories. The fundamental idea behind these approaches is to learnthe theory automatically from the data, through a process of inference, model

Prefacexiiifitting, or learning from examples. Thus they form a viable complementaryapproach to conventional methods. The aim of this book is to present a broadoverview of bioinformatics from a machine-learning perspective.Machine-learning methods are computationally intensive and benefitgreatly from progress in computer speed. It is remarkable that both computerspeed and sequence volume have been growing at roughly the same ratesince the late 1980s, doubling every 16 months or so. More recently, with thecompletion of the first draft of the Human Genome Project and the advent ofhigh-throughput technologies such as DNA microarrays, biological data hasbeen growing even faster, doubling about every 6 to 8 months, and further increasing the pressure towards bioinformatics. To the novice, machine-learningmethods may appear as a bag of unrelated techniques—but they are not. Onthe theoretical side, a unifying framework for all machine-learning methodsalso has emerged since the late 1980s. This is the Bayesian probabilisticframework for modeling and inference. In our minds, in fact, there is littledifference between machine learning and Bayesian modeling and inference, except for the emphasis on computers and number crunching implicit in the firstterm. It is the confluence of all three factors—data, computers, and theoreticalprobabilistic framework—that is fueling the machine-learning expansion, inbioinformatics and elsewhere. And it is fair to

Bioinformatics : the machine learning approach / Pierre Baldi, Søren Brunak.—2nd ed. p. cm.—(Adaptive computation and machine learning) "A Bradford Book" Includes bibliographical references (p. ). ISBN 0-262-02506-X (hc. : alk. paper) 1. Bioinformatics. 2. Molecular biology—Computer simulation. 3. Molecular biology—Mathematical models .