Introduction To Bioinformatics - University Of Rajshahi


INTRODUCTION TO BIOINFORMATICS
FOURTH EDITION
Arthur M. Lesk
The Pennsylvania State University

In nature’s infinite book of secrecy
A little I can read.
Antony and Cleopatra

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries
© Arthur M. Lesk 2014
The moral rights of the author have been asserted
First Edition copyright 2002
Second Edition copyright 2005
Third Edition copyright 2008
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above
You must not circulate this work in any other form and you must impose this same condition on any acquirer
British Library Cataloguing in Publication Data
Data available
ISBN 978–0–19–965156–6
Printed in Italy by L.E.G.O. S.p.A—Lavis TN
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Dedicated to Eda, with whom I have merged my genes.

CONTENTS

Preface to the first edition
Preface to the second edition
Preface to the third edition
Preface to the fourth edition
Plan of the book
Introduction to bioinformatics on the web
Acknowledgements

1 Introduction
Life in space and time
Phenotype genotype environment life history epigenetics
Evolution is the change over time in the world of living things
Dogmas: central and peripheral
Statics and dynamics
Networks
Observables and data archives
A database without effective modes of access is merely a data graveyard
Information flow in bioinformatics
Curation, annotation, and quality control
The world-wide web
Electronic publication
Computers and computer science
Programming
Biological classification and nomenclature
Use of sequences to determine phylogenetic relationships
Use of SINES and LINES to derive phylogenetic relationships
Searching for similar sequences in databases: PSI-BLAST
Introduction to protein structure
The hierarchical nature of protein architecture
Classification of protein structures
Protein structure prediction and engineering
Critical Assessment of Structure Prediction
Protein engineering
Proteomics and transcriptomics
DNA microarrays
Transcriptomics and RNA sequencing
Mass spectrometry
Systems biology
Clinical implications
The future
Recommended reading
Exercises and problems

2 Genome organization and evolution
Genomes, transcriptomes, and proteomes
Genes
Proteomics and transcriptomics
Eavesdropping on the transmission of genetic information
Identification of genes associated with inherited diseases
Mappings between the maps
High-resolution maps
Genome-wide association studies
Picking out genes in genomes
Genome-sequencing projects
Genomes of prokaryotes
The genome of the bacterium Escherichia coli
The genome of the archaeon Methanococcus jannaschii
The genome of one of the simplest organisms: Mycoplasma genitalium
Metagenomics: the collection of genomes in a coherent environmental sample
The human microbiome
Genomes of eukarya
Gene families
The genome of Saccharomyces cerevisiae (baker's yeast)
The genome of Caenorhabditis elegans
The genome of Drosophila melanogaster
The genome of Arabidopsis thaliana
The genome of Homo sapiens (the human genome)
Protein-coding genes
Repeat sequences
RNA
Single-nucleotide polymorphisms and haplotypes
Systematic measurements and collections of single-nucleotide polymorphisms
Ethical, legal, and social issues
Genetic diversity in anthropology
DNA sequences and languages
Genetic diversity and personal identification
Evolution of genomes
Please pass the genes: horizontal gene transfer
Comparative genomics of eukarya
Recommended reading
Exercises and problems

3 Scientific publications and archives: media, content, and access
The scientific literature
Economic factors governing access to scholarly publications
Open access
The Public Library of Science
Traditional and digital libraries
How to populate a digital library
The information explosion
The web: higher dimensions
New media: video, sound
Searching the literature
Bibliography management
Databases
Database contents
The literature as a database
Database organization
Annotation
Database quality control
Database access
Links
Database interoperability
Data mining
Programming languages and tools
Traditional programming languages
Scripting languages
Program libraries specialized for molecular biology
Java: computing over the web
Markup languages
Natural language processing
Natural language processing and mining the biomedical literature
Applications of text mining
Recommended reading
Exercises and problems

4 Archives and information retrieval
Database indexing and specification of search terms
Follow-up questions
Analysis and processing of retrieved data
The archives
Nucleic acid sequence databases
Genome databases and genome browsers
Protein sequence databases
Databases of protein families
Databases of structures
Classifications of protein structures
Accuracy and precision of protein structure determinations
Specialized, or ‘boutique’, databases
Expression and proteomics databases
Bibliographic databases
Surveys of molecular biology databases and servers
Gateways to archives
Access to databases in molecular biology
ENTREZ
The Protein Identification Resource
ExPASy: Expert Protein Analysis System
Where do we go from here?
Recommended reading
Exercises and problems

5 Alignments and phylogenetic trees
Introduction to sequence alignment
The dotplot
Dotplots and sequence alignments
Measures of sequence similarity
Scoring schemes
Derivation of substitution matrices: PAM and BLOSUM matrices
Computing the alignment of two sequences
Variations and generalizations
Approximate methods for quick screening of databases
The dynamic-programming algorithm for optimal pairwise sequence alignment
Significance of alignments
Multiple sequence alignment
Applications of multiple sequence alignments and database searching
Profiles
PSI-BLAST
Hidden Markov models
Phylogeny
Determination of taxonomic relationships from molecular properties
Phylogenetic trees
Clustering methods
Cladistic methods
Reconstruction of ancestral sequences
The problem of varying rates of evolution
Are trees the correct way to present phylogenetic relationships?
Computational considerations
Putting it all together
Recommended reading
Exercises and problems

6 Structural bioinformatics and drug discovery
Introduction
Protein stability and folding
The Sasisekharan–Ramakrishnan–Ramachandran plot describes allowed mainchain conformations
The sidechains
Protein stability and denaturation
Protein folding
Applications of hydrophobicity
Coiled-coil proteins
Superposition of structures, and structural alignments
DALI and MUSTANG
Evolution of protein structures
Classifications of protein structures
Protein structure prediction and modelling
A priori and empirical methods
Critical Assessment of Structure Prediction
Secondary structure prediction
Homology modelling
Fold recognition
Conformational energy calculations and molecular dynamics
Assignment of protein structures to genomes
Prediction of protein function
Divergence of function: orthologues and paralogues
Drug discovery and development
The lead compound
Improving on the lead compound: quantitative structure-activity relationships
Bioinformatics in drug discovery and development
Molecular modelling in drug discovery
Recommended reading
Exercises and problems

7 Introduction to systems biology
Introduction
Networks and graphs
Connectivity in networks
Dynamics, stability, and robustness
Some sources of ideas for systems biology
Complexity of sequences
Computational complexity
Static and dynamic complexity
Chaos and predictability
Recommended reading
Exercises and problems

8 Metabolic pathways
Classification and assignment of protein function
The Enzyme Commission
The Gene Ontology Consortium protein function classification
Catalysis by enzymes
Active sites
Cofactors
Protein–ligand binding equilibria
Enzyme kinetics
Measures of effectiveness of enzymes
How do proteins evolve new functions?
Control over enzyme activity
Structural mechanisms of evolution of altered or novel protein functions
Protein evolution at the level of domain assembly
Databases of metabolic pathways
EcoCyc
The Kyoto Encyclopedia of Genes and Genomes
Evolution and phylogeny of metabolic pathways
Pathway comparison
Alignment of metabolic pathways
Comparing linear metabolic pathways
Comparing nonlinear metabolic pathways: the pentose phosphate pathway and the Calvin–Benson cycle
Dynamics of metabolic networks
Robustness of metabolic networks
Dynamic modelling of metabolism
Recommended reading
Exercises and problems

9 Gene expression and regulation
DNA microarrays
Microarray data are quantitative but imprecise
Analysis of microarray data
Mass spectrometry
Identification of components of a complex mixture
Protein sequencing by mass spectrometry
Measuring deuterium exchange in proteins
Genome sequence analysis by mass spectrometry
Protein complexes and aggregates
Properties of protein–protein complexes
Protein interaction networks
Regulatory networks
Signal transduction and transcriptional control
Structures of regulatory networks
Structural biology of regulatory networks
The genetic switch of bacteriophage λ
What are the characteristics of the switch that must be implemented by DNA–protein interactions?
The materials
How to ‘throw’ the switch
The genetic regulatory network of Saccharomyces cerevisiae
Adaptability of the yeast regulatory network
Recommended reading
Exercises and problems

Conclusion
Index

PREFACE TO THE FIRST EDITION

On June 26, 2000, the sciences of biology and medicine changed forever. Prime Minister of the United Kingdom Tony Blair and President of the United States Bill Clinton held a joint press conference, linked via satellite, to announce the completion of the draft of the Human Genome. The New York Times ran a banner headline: ‘Genetic Code of Human Life is Cracked by Scientists’. The sequence of 3 billion bases was the culmination of over a decade of work, during which the goal was always clearly in sight and the only questions were how fast the technology could progress and how generously the funding would flow. The Table shows some of the landmarks along the way.

Next to the politicians stood the scientists. John Sulston, Director of the Wellcome Trust Sanger Institute in the UK, had been a key player since the beginning of high-throughput sequencing methods. He had grown with the project from the earliest ‘one man and a dog’ stages to the current international consortium. In the US, appearing with President Clinton were Francis Collins, director of the US National Human Genome Research Institute, representing the US publicly-funded efforts; and J. Craig Venter, President and Chief Scientific Officer of Celera Genomics Corporation, representing the commercial sector. It is difficult to introduce these two without thinking, ‘In this corner ... and in this corner ...’. Although never actually coming to blows, there was certainly intense competition, in the later stages a race.

The race was more than an effort to finish first and receive scientific credit for priority. Indeed, it was a race after which the contestants would be tested not for whether they had taken drugs, but whether they and others could discover them. Clinical applications were a prime motive for support of the Human Genome Project. Once the courts had held that gene sequences were patentable—with enormous potential payoffs for drugs based on them—the commercial sector rushed to submit patents on sets of sequences that they determined, and the academic groups rushed to place each bit of sequence that they determined into the public domain to prevent Celera—or anyone else—from applying for patents.

The academic groups lined up against Celera were a collaborating group of laboratories primarily but not exclusively in the UK and USA. These included the Wellcome Trust Sanger Institute in England, Washington University in St. Louis, Missouri, the Whitehead Institute at the Massachusetts Institute of Technology in Cambridge, Massachusetts, Baylor College of Medicine in Houston, Texas, the Joint Genome Institute at Lawrence Livermore National Laboratory in Livermore, California, and the RIKEN Genomic Sciences Center, now in Yokohama, Japan.

Both sides could dip into deep pockets. Celera had its original venture capitalists; its current parent company, PE Corporation; and, after going public, anyone who cared to take a flutter. The Wellcome Trust Sanger Institute was supported by the UK Medical Research Council and The Wellcome Trust. The US academic labs were supported by the US National Institutes of Health and Department of Energy.

On June 26, 2000 the contestants agreed to declare the race a tie, or at least a carefully out-of-focus photo finish.

Landmarks in the Human Genome Project

1953               Watson–Crick structure of DNA published.
1975               F. Sanger, and independently A. Maxam and W. Gilbert, develop methods for sequencing DNA.
1977               Bacteriophage ϕX-174 sequenced: first ‘complete genome’.
1980               US Supreme Court holds that genetically-modified bacteria are patentable. This decision was the original basis for patenting of genes.
1981               Human mitochondrial DNA sequenced: 16 569 base pairs.
1984               Epstein–Barr virus genome sequenced: 172 281 base pairs.
1990               International Human Genome Project launched: target horizon 15 years.
1991               J. Craig Venter and colleagues identify active genes via expressed sequence tags, sequences of initial portions of DNA complementary to messenger RNA.
1992               Complete low-resolution linkage map of the human genome.
1992               Beginning of the Caenorhabditis elegans sequencing project.
1992               Wellcome Trust and UK Medical Research Council establish the Sanger Centre for large-scale genomic sequencing, directed by J. Sulston.
1992               J. Craig Venter forms The Institute for Genomic Research (TIGR), associated with plans to exploit sequencing commercially through gene identification and drug discovery.
1995               First complete sequence of a bacterial genome, Haemophilus influenzae, by TIGR.
1996               High-resolution map of human genome: markers spaced by 600 000 base pairs.
1996               Completion of yeast genome, first eukaryotic genome sequence.
May 1998           Celera claims to be able to finish human genome by 2001. Wellcome responds by increasing funding to Sanger Centre.
1998               Caenorhabditis elegans sequence published.
September 1, 1999  Drosophila melanogaster genome sequence announced, by Celera Genomics; released Spring 2000.
1999               Human Genome Project states goal: working draft of human genome by 2001 (90% of genes sequenced to 95% accuracy).
December 1, 1999   Sequence of first complete human chromosome published.
June 26, 2000      Joint announcement of complete draft sequence of human genome.
2003               Fiftieth anniversary of discovery of the structure of DNA. Announcement of completion of human genome sequence.

The human genome is only one of the many complete genome sequences known. Taken together, genome sequences from organisms distributed widely among the branches of the tree of life give us a sense, only hinted at before, of the very great unity in detail of all life on Earth. They have changed our perceptions, much as the first pictures of the Earth from space engendered a unified view of our planet.

The sequencing of the human genome ranks with the Manhattan project that produced atomic weapons during the Second World War, and the space program that sent people to the Moon, as one of the great bursts of technological achievement of the last century. These projects share a grounding in fundamental science, and large-scale and expensive engineering development and support. For biology, neither the attitudes nor the budgets will ever be the same. Soon a ‘one man and a dog project’ will refer only to an afternoon’s undergraduate practical experiment in sequencing and comparison of two mammalian genomes.

The human genome is fundamentally about information, and computers were essential both for the determination of the sequence and for the applications to biology and medicine that are already flowing from it. Computing contributed not only the raw capacity for processing and storage of data, but also the mathematically-sophisticated methods required to achieve the results. The marriage of biology and computer science has created a new field called bioinformatics.

Today bioinformatics is an applied science. We use computer programs to make inferences from the data archives of modern molecular biology, to make connections among them, and to derive useful and interesting predictions.

This book is aimed at students and practising scientists who need to know how to access the data archives of genomes and proteins, the tools that have been developed to work with these archives, and the kinds of questions that these data and tools can answer. In fact, there are a lot of sources of this information. Sites treating topics in bioinformatics are sprawled out all over the Web. The challenge is to select an essential core of this material and to describe it clearly and coherently, at an introductory level.

It is assumed that the reader already has some knowledge of modern molecular biology, and some facility at using a computer. The purpose of this book is to build on and develop this background. It is suitable as a textbook for advanced undergraduates or beginning postgraduate students. Many worked-out examples are integrated into the text, and references to useful web sites and recommended reading are provided.

Problems test and consolidate understanding, provide opportunities to practise skills, and explore additional subjects. Three types of problems appear at the ends of chapters. Exercises are short and straightforward applications of material in the text. Problems also require no information not contained in the text, but require lengthier answers or in some cases calculations. The third category, ‘Weblems,’ require access to the Worldwide Web. Weblems are designed to give readers practice with the tools required for further study and research in the field.

What has made it possible to try to write such a book now is the extent to which the Worldwide Web has made easily accessible both the archives themselves and the programs that deal with them. In the past, it was necessary to install programs and data on one’s own system, and run calculations locally. Of course this meant that everything was dependent on the facilities available. Now it is possible to channel all the work through an interface to the Web. The web site linked with this book will ease the transition. To ensure that readers will be able freely to pursue discussions in the book onto the Web, descriptions of and references to commercial software have been avoided, although many commercial packages are of very high quality.

A serious problem with the web is its volatility. Sites come and go, leaving trails of dead links in their wake. There are so many sites that it is necessary to try to find a few gateways that are stable: not only continuing to exist but also kept up-to-date in both their contents and links. I have suggested some such sites, but many others are just as good. The problem is not to create a long list of useful sites—this has been done many times, and is relatively easy—but to create a short one—this is much harder!

Some computing is introduced in this book based on the widely available language PERL. Examples of simple PERL programs appear in the context of biological problems. Many simple PERL tasks are assigned as exercises or problems at the ends of the chapters.

Where might the reader turn next? This book is designed as a companion volume—in current parlance, a ‘prequel’—to Introduction to Protein Architecture: the Structural Biology of Proteins (Oxford University Press, 2000), and that title is of course recommended. Other books on sequence analysis range from those oriented towards biology to others in the field of computer science. The goal is that each reader will come to recognize his or her own interests, and be equipped to follow them up.

PREFACE TO THE SECOND EDITION

Bioinformatics has grown since the first edition of this book appeared.

The most striking change has been a refocus on integration; that is, of trying to see life processes as unified systems. As I wrote at the end of Introduction to Protein Science: Architecture, Function and Genomics, ‘During the last century, molecular biologists have been taking living things apart. Our task now is to understand how to put them back together.’ We have had large amounts of data. Now we are trying to see how they interrelate. At the heart of life processes are complicated patterns of interaction among the components, in space and in time. To understand these patterns the field has moved towards combining information into networks, and trying to understand their structures and dynamics.

Supporting this venture are the growing streams of data. The human genome, available in draft form when the first edition appeared, is now complete. It is joined by the complete genomes of 18 archaea, 155 bacteria, over 30 eukarya, and many other organelle and viral sequences. These genomes illuminate each other. One story that they tell is about unsuspected underlying unities of all living things, despite the obvious and profound differences in morphology and lifestyle.

Genomic sequences are supplemented by other data streams, notably the proteome. Knowing patterns of gene expression, and networks of regulatory interactions, shows how cells and organisms implement the information in the DNA. The potential for the life of an organism is contained in its genome, but it would be impossible to deduce a biography from it. Genomes are not formulas or scripts. It is in the proteins, and their interactions with themselves and with DNA, that we must seek the set of activities, contingent on, and responsive to, the environment. Proteomics is giving us the information we need to see how the system works.

Research and applications require that the data be available in useful form. It is not enough to make the data public. The information must be subjected to quality control and annotation, and a logical structure must be imposed on it to make information retrieval possible. For this we are indebted to the institutions that archive, curate, organize and distribute the data. A recent trend has seen mergers of these groups into collaborative projects spanning the continents. In accord with the need to integrate the study of different types of data, we are moving in the direction of a single biological data repository. Individual scientists will be able to define ‘virtual databanks’ tailoring access to the information to suit particular needs and interests.

A gratifying consequence of academic bioinformatics is its contributions to applications in medicine, agriculture and technology. A better understanding of life processes empowers us to deal with them when they go wrong.

PREFACE TO THE THIRD EDITION

Major changes in molecular biology since the second edition most prominently involve the great growth in new complete genome sequences that have become available. These are results of enhancements in methods of sequence determination. The extension to metagenomics—the survey of distributions of sequences in a region of the earth or ocean—is new.

Major changes in information distribution involve the accelerating transition from paper to electronic libraries. A new chapter treating this subject appears in this edition. The implications for scientific research are only a part of the great social revolution that has flowed from the development of the Web; comparable to, if not exceeding, the one impelled by the printing press 500 years ago.

There are many different possible points of view from which to present molecular biology. Bioinformatics is one of them. I have also written about genomics, and about proteins, in companion volumes also published by Oxford University Press: Introduction to Protein Science: Architecture, Function and Genomics and Introduction to Genomics. As a result, this book is focussed more tightly on the applied science of bioinformatics. Readers are urged to put the books together for a more rounded appreciation of the pageant and mechanisms of life.

PREFACE TO THE FOURTH EDITION

The natural habitat of bioinformatics is the web. Previous versions of this book recognized this, to some extent, with an Online Resource Centre supplementing the text. With this edition, the online material assumes a full partnership.

To learn bioinformatics means to understand basic concepts and principles, and to develop a set of skills. The paper text contains an exposition of the concepts and principles; the Online Resource Centre is the equivalent of a ‘laboratory’ or ‘practical’ component of the course. An icon in the text indicates the appearance in the Online Resource Centre of material related to the current discussion.

The data of bioinformatics are accessible on the web. Programs to analyse them are available on the web. Indeed, many authors of programs provide web servers for remote access to the calculations. Links from databases to servers streamline the passage from data retrieval to data analysis. Such facilities supersede the old procedure of ‘download the data onto your computer, install the program on your computer, and run it locally’.

All research in contemporary molecular biology depends on data, and programs to retrieve and analyse them. There is consensus that all biomedical scientists must achieve a minimum of programming skills, but there is vigorous debate over what this minimum level should be. The point of view expressed in this book is that molecular biologists based primarily in a ‘wet’ lab must dip no more than their toes into the stream; those based primarily at a computer must wade in up to their waist perhaps; but only those specializing in computer science and software development must undergo total immersion.

Indeed, one of the arguments for the suggestion that sophisticated programming skills are not generally required is the great panoply of freely available programs, written by acknowledged professionals. What is essential is developing skill in using these programs, and in intelligent interpretation of the results that they produce.

This is the goal of the problems and projects in the Online Resource Centre. Many of them are ‘weblems’ based on data and facilities on the web. Some are programming exercises, based on the PERL language. PERL is a relatively simple but extremely effective programming language. It is one of the languages popular in the bioinformatics community. Similar languages include PYTHON and RUBY; each of these has its adherents. For PERL (and for the other languages), an extensive repertoire of utilizable program components is available, both general (see, e.g. T. Christiansen and N. Torkington, Perl Cookbook, 2nd edn, O'Reilly Media, Sebastopol, CA, 2003) and specialized (www.bioperl.org).

Some of the PERL exercises in the Online Resource Centre involve modifying programs. Such challenges can be more focused than writing programs from scratch. Some of the exercises, problems, and weblems, although not requiring any programming, can be solved more easily by writing short PERL programs. Readers are encouraged to try this approach whenever appropriate.

In addition to PERL, the minimal computing skills essential for a biomedical scientist would include facility with using social media for communication (it is assumed that readers are familiar with Facebook and YouTube, but there are others that are in use for communication among scientists), and the ability to create a website. Studying from this book and the Online Resource Centre affords an opportunity to practise these skills. You might, for instance, ‘turn in’ the answers to homework assignments by gathering them into a web page. Questions about statements that you and the other students found unclear in your instructor's lectures—or, conceivably, even in this book—could be shared and discussed in a blog. Indeed, there is now a trend to integrating websites and social media. However, there are security issues. Your instructor might be unhappy if everyone copied the answers to the exercises from the first student to post them. A class taught from this book would afford a fine opportunity to explore the possibilities and challenges.
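As an illustration of the kind of short program the preface has in mind, the following is a minimal sketch that tabulates the base composition of a DNA sequence. It is written in Python, one of the similar languages named above, rather than in PERL, and it is not taken from the book or its Online Resource Centre. The function names `base_composition` and `gc_content` and the example sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def base_composition(sequence: str) -> dict[str, float]:
    """Return the fraction of A, C, G, and T in a DNA sequence.

    Characters other than A, C, G, and T (for example N or gaps)
    are ignored. Hypothetical helper, not part of any library.
    """
    seq = sequence.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(sequence: str) -> float:
    """Return the combined fraction of G and C in the sequence."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]


if __name__ == "__main__":
    example = "ATGCGCTA"  # hypothetical sequence, for illustration only
    print(base_composition(example))
    print(f"GC content: {gc_content(example):.2f}")
```

A task of this size, run on a sequence retrieved from one of the archives, is typical of what can be scripted in a few lines instead of being counted by hand.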

PLAN OF THE BOOK

Chapter 1 sets the stage and introduces all of the major players: DNA and protein sequences and structures, genomes and proteomes, databases and information retrieval, the worldwide web, computer programming. Before developing individual topics in detail it is important to see the framework of their interactions.

Chapter 2 presents the nature of individual genomes, including the human genome, and the relationships among them, from the biological point of view.

Chapter 3 describes the current state of the scientific literature as it makes the transition from paper to electronic form. This transition has many consequences, both intellectual and practical. It has had profound effects on research in bioinformatics.

Chapter 4 imparts basic skills in using the web in bioinformatics. It describes archival databanks and leads the reader through sample sessions involving information retrieval from some of the major archival databases in molecular biology.

Chapter 5 treats the analysis of relationships among sequences: alignments and phylogenetic trees. These methods underlie some of the major computational challenges of bioinformatics: detecting distant re
