The Origins Of Bioinformatics - Bcb.unl.edu

Transcription

PERSPECTIVESnomic environment that includes varioussocial values, research practices and businesspressures. We are mindful that, in some situations, modifying patent law may reduce oneproblem (such as permitting more competition), while magnifying others (such asreducing incentives to conduct research anddevelopment). Nevertheless, although caremust be taken, this debate needs to progressto ensure that patenting practices, as appliedto genetic material, fulfil the ultimate objective of encouraging the development ofgenetic technologies into products for thepublic’s good.Update – note added in proofThe recent announcement that scientistswill share a patent over a disease-relatedgene43 with a patient advocacy group, whoprovided the researchers with blood andtissue samples 44, is a positive sign thatresearchers take seriously their moralresponsibility to donors. Such steps are inagreement with recent policy statementsissued by HUGO45. Binding legal measureswould help to ensure that researchers andcompanies who comply with this type ofethical norm do not face unfair competition from those who do not.Timothy Caulfield is at the Health Law Institute,University of Alberta, Edmonton, Alberta, T6G2H5, Canada. E. Richard Gold is at the Faculty ofLaw, University of Western Ontario, LondonOntario, N6A 3K7 , Canada. Mildred K. Cho is atthe Center for Biomedical Ethics, StanfordUniversity, Stanford, California 94304 , USA.Correspondence to : .17.18.19.20.21.22.23.24.25.d’éthique en désaccord avec la directive européenne. LeMonde 15 June (2000).Kolata, G. Special Report: Who owns your genes? NewYork Times 15 May (2000).Ramirez, A. School given patent to clone humans.National Post 16 May (2000).Sagar, A., Daemmrich, A. & Ashiya, M. The tragedy ofcommoners: biotechnology and its publics. NatureBiotechnol. 18, 2–4 (2000).Angell, M. Is academic medicine for sale? N. Engl. J.Med. 20, 1516–1518 (2000).Pottagem, A. The inscription of life in law: gene, patents,and bio-politics. The Modern Law Review 61, 740–765(1998).Gold, E. R. Body Parts: Property Rights and theOwnership of Human Biological Materials (GeorgetownUniv. Press, Washington DC, 1996).Caulfield, T. & Gold, E. R. Whistling in the wind: reframingthe genetic patent debate. Forum for Applied Researchand Public Policy 15, 75–79 (2000).Ernst and Young’s Fourth Report on the CanadianBiotechnology Industry. Can. Biotechol. ‘97: Coming ofAge (Ernst and Young, 1997).President and Fellows of Harvard v. Commissioner ofPatents (August 3, 2000) No. A-334–398 (Fed. Crt ofAppeals).Nottingham, S. Eat Your Genes (St. Martin’s, New York,1999).Roberts, T. Why not patent plants? Patent World 113,14–16 (1999).Schehr, R. & Fox, J. Human genome bombshell. NatureBiotechnol. 18, 365 (2000).Marcus, A. Owning a gene: patent pending. Nature Med.2, 728–729 (1996).Nelkin, D. & Andrews, L. Homo economicus:Commercialization of body tissue in the age ofbiotechnology. Hastings Center Report 28, 30–39 (1998).Heller, M. & Eisenberg, R. Can patents deter innovation?The anticommons in biomedical research. Science 280,698–701 (1998).Knoppers, B. M. Status, sale and patenting of humangenetic material: an international survey. Nature Genet.22, 23–26 (1999).Bunk, S. Researchers feel threatened by disease genepatents. The Scientist 13, 7 (1999)Academy of Clinical Laboratory Physicians andScientists. ACLPS Resolution: Exclusive Licenses forDiagnostic Tests Approved by the ACLPS ExecutiveCouncil 06/03/99 cense/htmCho, M. K. in Preparing for the Millennium: LaboratoryMedicine in the 21st Century, December 4–5, 1998, 2ndedn 47–53 (AACC, Washington DC, 1998).26. Caulfield, T. & Gold, E. R. Genetic testing, ethicalconcerns, and the role of patent law. Clin. Genet. 57,370–375 (2000).27. Bruzzone, L. The research exemption: a proposal. Am.Intell. Prop. Law Assoc. QL 21, 52 (1993).28. Parker, D. Patent infringement exemptions for life scienceresearch. Houston J. Intl Law 16, 615 (1994).29. Gold, E. R. in Commercialization of Genetic Research:Ethical, Legal and Policy Issues (eds Caulfield, T. &Williams–Jones, B.) 63–78 (Plenum, New York, 1999).30. Schissel, A., Merz, J. F. & Cho, M. K. Survey confirmsfear about licensing of genetic tests. Nature 402, 118(1999).31. Blumenthal, D. et al. Withholding Research Results inAcademic Life Science: Evidence From a National Surveyof Faculty J. Am. Med. Assoc. 277, 1224 (1997).32. Caulfield, T. The commercialization of human genetics: adiscussion of issues relevant to Canadian consumers. J.Consumer Policy 21, 483–526 (1998).33. Packer, K. & Webster, A. Patenting culture in science:reinventing the scientific wheel of credibility. Science,Technology and Human Values 21, 425–445(1996).34. Blumenthal, D. Academic–industry relationships in the lifesciences. J. Am. Med. Assoc. 268, 3344 (1992).35. Straus, J. Intellectual property issues in genomeresearch. Genome Digest 3, 1–2 (1996).36. Barton, J. Reforming the patent system. Science 287,1933–1934 (2000).37. United States Patent and Trade Mark Office. Interim UtilityGuidelines (1999).38. Holtzman, N. Are genetic tests adequately regulated?Science 286, 409 (1999).39. Kodish, E. Commentary: Risks and benefits, testing andscreening, cancer, genes and dollars. J. Law Med. Ethics25, 252–255 (1997).40. Brower, V. News: Testing, testing, testing? Nature Med.3, 131–132 (1997).41. Weiss, R. Genetic testing’s human toll. Washington Post21 July (1999).42. Cowan, D. Tort liability of patentee licensors. J. PatentOffice Soc. 64, 87–104 (1982).43. Le Saux, O. et al. Mutations in a gene encoding an ABCtransporter cause pseudoxanthoma elasticum. NatureGenet. 25, 223–227 (2000).44. Smaglik, P. Tissue donors use their influence in deal overgene patent terms. Nature 407, 821 (2000).45. Human Genome Organization Ethics Committee. Geneticbenefit sharing. Science 290, 49 (2000).LinksTIMELINEDATABASE LINKS BRCA1 APOEFURTHER INFORMATION American College ofMedical Genetics Unesco’s 1997 UniversalDeclaration on the Human Genome andHuman Rights European Patent Office United States Patent Office Canadian PatentOffice Japanese Patent Office patent on the‘onco-mouse’ European Patent Convention Incyte Pharmaceuticals United StatesSupreme court case of Diamond versusChakrabarty United States Patent Office’srecent interim guidelines1.2.3.4.5.6.Rifkin, J. The Biotech Century (Penguin Putnam, NewYork, 1998).American College of Medical Genetics, PositionStatement on Gene Patents and Accessibility of GeneTesting (1999). www.faseb.org/genetics/acmg/pol34.htmSarma, L. Biopiracy: Twentieth century imperialism in theform of international agreements. Temple Internationaland Comparative Law Journal 13, 107–136 (1999).Thomas, S. et al. Ownership of the human genome.Nature 380, 387–388 (1996).Thomas, S. in The Commercialization of GeneticResearch: Ethical, Legal and Policy Issues (eds Caulfield,T. & Williams-Jones, B.) 55–62 (KluwerAcademic/Plenum Publishing, New York, 1999).Nau, J. Y. Brevetabilité des gènes humains: le comitéThe origins of bioinformaticsJoel B. HagenBioinformatics is often described as beingin its infancy, but computers emerged asimportant tools in molecular biology duringthe early 1960s. A decade before DNAsequencing became feasible,computational biologists focused on therapidly accumulating data from proteinbiochemistry. Without the benefits ofsupercomputers or computer networks,these scientists laid important conceptualand technical foundations forbioinformatics today.It is tempting to trace the origins of bioinformatics to the recent convergence of DNAsequencing, large-scale genome projects, theNATURE REVIEWS GENETICSinternet and supercomputers1–3. However,some scientists who claim that bioinformaticsis in its infancy acknowledge that computerswere important tools in molecular biology adecade before DNA sequencing became feasible4. Although the pioneers of computationalbiology did not use the term ‘bioinformatics’to describe their work, they had a clear visionof how computer technology, mathematicsand molecular biology could be fruitfullycombined to answer fundamental questionsin the life sciences.Three important factors facilitated theemergence of computational biology duringthe early 1960s. First, an expanding collectionof amino-acid sequences provided both aVOLUME 1 DECEMBER 2000 2 3 1 2000 Macmillan Magazines Ltd

PERSPECTIVESdeveloped from weapons research programmes during the Second World War,finally became widely available to academicbiologists. Not all biologists had — or wantedto have — access to these machines but, by1960, scarcity of computers was no longer aserious stumbling block for the developmentof computational biology.Sequencing proteinsFigure 1 Frederick Sanger at the Nobel prizeceremony in 1980.(Photograph kindly provided by the MRC, Laboratoryof Molecular Biology, Cambridge, UK.)source of data and a set of interesting problems that were infeasible to solve without thenumber-crunching power of computers.Second, the idea that macromolecules carryinformation became a central part of the conceptual framework of molecular biology.Although some historians and philosophershave questioned the theoretical significance ofthis idea for modern molecular biology5–7, itseems likely that thinking in terms of macromolecular information provided an important conceptual link between molecular biology and the computer science from whichformal information theory had arisen. Third,high-speed digital computers, which hadThe idea that proteins carry informationencoded in linear sequences of amino acids iscommonplace today, but it has a relativelyshort history. This idea first emerged duringthe decades following the Second World War,a time that one main participant, Emil Smith,later described as a “heroic period” in proteinbiochemistry8. The watershed event of thisperiod was the first successful sequencing of acomplete protein, INSULIN, by Frederick Sangerand his colleagues8–10 at Cambridge Universityduring the decade 1945–1955 (FIG. 1).Sanger’s achievement, for which he wasawarded the 1958 Nobel Prize in chemistry,firmly established the polypeptide theory ofprotein structure. First formulated in 1902,this theory had faced considerable scepticism and competition from alternative theories9 (FIG. 2). Analytical techniques in protein biochemistry had improved greatlyduring the 1930s and 1940s, but beforeSanger’s work, practically nothing wasknown about the order of amino acids inany protein. One could, therefore, still clingto the belief that proteins were structurallysimple or even that they had no definitestructure at all. As the biochemist PaulZamecnik later recalled, these lingering con-cerns were “blown away” by Sanger’s work,which quickly dispelled any doubts thateach protein was characterized by a uniqueprimary structure11.Sequencing insulin was a case of problemsolving by a master chemist who used greatscientific skill in separating and identifying thefragments of protein degradation12. At thesame time, however, other biochemists weredeveloping more refined methods that wouldtransform the laborious analytical processused by Sanger and his co-workers. TheEdman degradation reaction, by which biochemists could sequentially remove and identify individual amino acids from the aminoterminus of a short peptide, was a greatimprovement over the more tedious methodsused by Sanger8,9. The use of ion exchangecolumns and other innovations in CHROMATOGRAPHY and electrophoresis also made sequencing more efficient. Just as significantly, theentire process of separating and identifyingamino acids was rapidly becoming automated.Using semi-automated techniques, researchersled by Stanford Moore and William Stein atthe Rockefeller Institute were able to sequencethe 124 amino acids in RIBONUCLEASE in abouthalf the time that Sanger’s group had spentdeciphering the sequence of the 51 aminoacids in insulin13,14. Automation sent a shockwave through the biochemical community,because it promised to transform sequencinginto a routine procedure carried out, not bymaster chemists, but by competent laboratorytechnicians8. By the late 1960s, Pehr Edmanhad designed the ‘sequenator’, a fully automated sequencing machine that implemented hisalready widely used degradation reaction15.Timeline Some early milestones in protein and peptide sequencingOxytocin,Vasopressin9Insulin (α-chain)211951Tobacco mosaicvirus coat159Glucagon291953Insulin (β-chain)*30Cytochrome 62Haemoglobin (α-chain)141Haemoglobin -3-phosphatedehydrogenase340(1969) Humangrowth hormone188*The complete primary structure of insulin, including the positions of the disulphide bonds, was published in 1955.(Dates in parentheses are for revisions of the originally published sequences; numbers in bold are the numbers of amino acids.)Source: L.R. Croft, Handbook of Protein Sequence Analysis: A Compilation of Amino Acid Sequences of Proteins with an Introduction to the Methodology (John Wiley, Chichester, 1980).232 DECEMBER 2000 VOLUME 1www.nature.com/reviews/genetics 2000 Macmillan Magazines Ltd

OHHCRC ONHFigure 2 Alternative theories of proteinstructure. Although the polypeptide theory waswidely accepted by the time of the Second WorldWar, there remained cautious doubt about proteinstructure when Frederick Sanger began his workon insulin. a An influential group of biochemistsduring the early decades of the twentieth centuryhad argued that proteins were amorphouscolloids, with no definite molecular structure.Although this idea steadily lost ground, a fewbiochemists clung to the idea that polypeptideswere merely artefacts formed when colloidalproteins were denatured. Other structural theoriessurfaced during the 1930s. b According to thecyclol theory, proteins were honeycomb-likefabrics formed from interlocking cyclical subunits.These alternative theories were not stronglysupported but, until insulin was sequenced, theywere not completely ruled out. c According toperiodicity theories, proteins were simple,repetitive chains of amino acids.Rc Gly Ala Other amino acidThese innovations encouraged many laboratories to begin sequencing proteins and rapidly expanded the library of amino-acidsequences (TIMELINE).Macromolecular informationOnce the polypeptide theory became firmlyestablished and methods for sequencing proteins were readily available, the idea of proteins as information-carrying macromolecules became widespread. This general ideadeveloped within three broadly overlappingcontexts: the genetic code, the three-dimensional structure of a protein in relation to itsfunction, and protein evolution.The genetic code. Concurrent developmentsin the molecular biology of the gene provideda compelling theoretical context for discussing how genetic information was transferred from a sequence of nucleotides to asequence of amino acids. However, thesequencing of DNA and RNA presented formidable technical hurdles that were not fullyovercome until the early 1970s (REF. 16). So,although molecular biologists learned a greatdeal about the genetic code, the actualnucleotide sequences of genes remainedlargely unknown during the 1960s. With agrowing collection of amino-acid sequences,the idea of molecular information couldtherefore be explored with proteins in waysnot applicable to nucleic acids.Protein structure. From a purely biochemical perspective, one could ask about thecausal relationship between the informationcarried in the primary structure of a proteinand the three-dimensional configuration ofthe active molecule. Experiments carriedout by Christian Anfinsen and his colleagues at the National Institutes of Healthin the late 1950s showed that, after beingdenatured, ribonuclease spontaneouslyrefolded to regain its original enzymaticactivity17. This was taken as compelling evidence that the sequence of amino acidscompletely specified the three-dimensionalstructure of the protein. Of course, in practical terms, knowing the sequence did notnecessarily allow biochemists to correctlypredict the secondary and tertiary structures of a protein. But sequence data playeda key role in interpreting the X-ray diffraction images used by John Kendrew and MaxPerutz (FIG. 3) to determine the three-dimensional structures of MYOGLOBIN and HAEMO18GLOBIN . Combining the biochemical techniques of sequence analysis with thebiophysical techniques of X-RAY CRYSTALLOGRAPHY seemed to hold the key to understanding how the molecular information ina sequence of amino acids causes a proteinto fold into a specific, often highly complex,three-dimensional configuration13,14,17.Protein evolution. The idea that linear information could determine the structure andfunction of proteins fits squarely within adominant tradition in twentieth-century biochemistry — the ‘protein paradigm’19. Forseveral decades before 1960, biochemists hadNATURE REVIEWS GENETICSfocused their efforts on providing mechanistic explanations for how enzymes, proteinhormones, antibodies and respiratory pigments worked. This episode has attractedconsiderable historical interest9,12,20, but historians have paid much less attention to howmacromolecular information could bethought of in an explicitly phylogenetic context. This phylogenetic approach was a significant departure from traditional biochemicalthinking about structure and function, whichhad largely ignored evolutionary questions.Indeed, historians have often stressed the conflicts between evolutionary biology and themore experimental sciences such as biochemistry. However, during the 1960s, biochemistsand molecular biologists were increasinglydrawn to evolutionary questions. For example, Emile Zuckerkandl and Linus Paulingreferred to proteins and nucleic acids as“semantides”, whose information-carryingsequences of subunits could be used to document evolutionary history21. Derived from‘semanteme’, the fundamental unit of meaning used by linguists to study human speech,semantides were to be the analogous biochemical units (hence the chemical suffix —ide) for evolutionary studies.How would the evolutionary informationcarried by semantides be used? Zuckerkandland Pauling imagined a new field of ‘paleogenetics’ that would use the rigorous laboratorytechniques of biochemistry and molecularbiology to answer evolutionary questions traditionally studied by palaeontologists andnaturalists. Paleogeneticists, who soonFigure 3 Max Perutz, who shared the 1962Nobel prize in chemistry with John Kendrew.(Photograph kindly provided by the MRC, Laboratoryof Molecular Biology, Cambridge, UK.)VOLUME 1 DECEMBER 2000 2 3 3 2000 Macmillan Magazines Ltd

PERSPECTIVESbecame more commonly referred to as molecular evolutionists, had several approaches attheir disposal. Comparisons of similar proteins, such as myoglobin and haemoglobin,provided evidence for molecular evolution bygene duplication. Comparison of homologous proteins drawn from various speciescould be used to trace phylogenetic relationships among both the proteins themselvesand the species that carried them. In somecases, such comparisons could also be used torecreate the ancestral proteins from whichpresent-day molecules evolved. Assumingthat amino-acid substitution rates were relatively constant within a given protein, paleogeneticists had a ‘MOLECULAR CLOCK’ by whichevolutionary events might be reliably dated.These claims, particularly the idea of amolecular clock, were enormously controversial and provided a source of conflict betweenmolecular evolutionists and traditional naturalists22–24. Sequence analysis also had to compete with well-established molecular techniques, such as the immunological measuresused by Morris Goodman, Allan Wilson,Vincent Sarich and others to unravel phylogenetic relationships23. Encounters amongcompeting groups of biologists at professional meetings could be bruising, but theseconfrontations should not eclipse theimportant synthesis of evolutionary biology,protein biochemistry and computer sciencethat was beginning to emerge during the early1960s, which laid an evolutionary foundationfor the bioinformatics of today24–26.Emergence of computational biologyHistorical studies of protein biochemistryhave emphasized the importance of instrumentation8,9,19 but, with the exception of JohnKendrew’s use of computers for elucidatingthe three-dimensional structure ofmyoglobin18,27, the historical role of digitalcomputers during the 1960s has been virtually ignored. Even in the case of Kendrew, computers have not been viewed as contributingdecisively to the discovery process.Nonetheless, digital computers were well suited to deal with the types of numerical datathat protein biochemists were generating ingrowing amounts.By the early 1960s, computers were becoming widely available to academic researchers.According to surveys conducted at the beginning of the decade, 15% of colleges and universities in the United States had at least one computer on campus, and most principal researchuniversities were purchasing so-called ‘secondgeneration’ computers, based on transistors, toreplace the older vacuum-tube models28. Thefirst high-level programming language, FORTRAN (formula translation), was introducedby the International Business Machines (IBM)corporation in 1957. It was particularly wellsuited to scientific applications, and comparedwith the earlier machine languages, it was relatively easy to learn. For the first time, detailedknowledge of computer architecture was notneeded to write a computer program. Thisimportant innovation in computer softwareencouraged the growth of computational biology. At the same time, there was a concertedeffort by governmental agencies and the computer industry to foster the development ofacademic computing in the life sciences29,30.The attraction of computers is well illus-GlossaryCHROMATOGRAPHYinsulin-dependent diabetes mellitus.A chemical analysis technique that uses a process ofseparating gases, liquids or solids from mixtures orsolutions by selective adsorption.MOLECULAR CLOCKCYTOCHROMESProteins whose function is to carry electrons or protons(hydrogen ions) by virtue of the reversiblecharging/discharging of an iron atom or iron/sulphuratoms in the centre of the protein. Cytochromes arecentral molecules of electron transport in the processknown as oxidative phosphorylation. Cytochromes aredivided into four groups (a, b, c, d) according to theirability to absorb or transmit certain colours of light.The hypothesis that, in any given gene or DNAsequence, mutations accumulate at an approximatelyconstant rate in all evolutionary lineages as long as thegene or the DNA sequence retains its original function.MYOGLOBINAn oxygen-carrying muscle protein that makes oxygenavailable to the muscles for contraction.RIBONUCLEASEA enzyme that hydrolyses RNA.HAEMOGLOBINX-RAY CRYSTALLOGRAPHYProtein present in red blood cells that reversibly bindsoxygen for transport to tissues.Study of the molecular structure of crystallinecompounds through X-ray diffraction techniques.When an X-ray beam bombards a crystal, the atomicstructure of the crystal causes the beam to scatter(diffract) in a specific pattern. X-ray crystallographyprovides information on the positions of individualatoms in the crystal, the distances between atoms, theangles of the atomic bonds and other features ofmolecular geometry.INSULINA protein hormone secreted by β cells of the pancreas.Insulin is important in the regulation of glucosemetabolism, generally promoting the cellular use ofglucose. It is also an important regulator of protein andlipid metabolism. Insulin is used as a drug to control234 DECEMBER 2000 VOLUME 1trated by the career of Margaret OakleyDayhoff31,32. Trained in quantum chemistryand mathematics, she became interested inproteins and molecular evolution around1960. As associate director of the newly established National Biomedical ResearchFoundation, an organization founded specifically to encourage the development of computer applications, Dayhoff was well situatedto explore mathematical approaches foranalysing amino-acid sequence data (FIG. 4).Continuously funded by grants from theNational Institutes of Health throughout the1960s and with further support from theNational Science Foundation, the NationalAeronautics and Space Administration, andthe IBM corporation, Dayhoff moved on several fronts. Her initial project was writing aseries of FORTRAN programs to determinethe amino-acid sequences of protein molecules33,34. Taking the overlapping peptide fragments from the partial digestion of a protein,the programs deduced all of the possiblesequences that were consistent with the data.Conceptually similar to the puzzle-solvingapproach that biochemists claimed to haveused in the early sequencing investigations ofinsulin and ribonuclease10,13,14,35, Dayhoff ’scomputer programs arrived at the correctsequence for a small protein (ribonuclease)within a few minutes. The same feat hadtaken a team of humans many months toaccomplish. Similar programs written byother computational biologists at about thesame time claimed to successfully sequencehypothetical proteins up to 750 amino acidsin length36. Significantly, even during the earlydevelopment of these programs, Dayhoff andher contemporaries realized that the logic ofsequence analysers could also be directlyapplied to nucleic acids when gene sequencesfinally became available.Computer programs for sequence analysisfollowed the principles initiated by the automatic amino-acid analyser used by Stein andMoore37,38. In both cases, the objective was todevelop quickly a library of sequences thatcould be used for studies in comparative biochemistry and molecular evolution. To promote this end, Dayhoff established the Atlas ofProtein Sequence and Structure, an annualpublication that attempted to catalogue allknown amino-acid sequences. Althoughrudimentary by today’s standards, the Atlasserved as the first database for molecular biology, and it became an indispensable resourcefor early computational research25,31,32,39. Iteventually evolved into a major online database, the Protein Information Resource (PIR),established in 1983, and it provided animportant point of departure for other com-www.nature.com/reviews/genetics 2000 Macmillan Magazines Ltd

PERSPECTIVESFigure 4 The IBM 7090 computer, which Margaret Dayhoff used for her early work. This famouscomputer was one of the first solid-state machines and was used widely in business and defencesettings, as well as scientific applications.(Photograph courtesy of IBM archives.)putational biologists, who soon began building their own molecular databases40.Critics have pointed out that most of theentries in Dayhoff ’s Atlas were interspecificvariations of a few well-studied proteins suchas cytochrome c, and that the number ofknown protein sequences remained fairlysmall throughout the 1960s (REF. 40). Whatshould not be overlooked, however, is howthe early comparative studies of homologousproteins opened up the field of molecularevolution. Although phylogenetic analysis ofamino-acid sequences could be done byhand41, computers proved immensely valuable in this regard. From the beginning, theoretical biologists realized that in most casesthe number of possible phylogenetic trees wasso great that it would be infeasible for ahuman to evaluate even a small fraction ofthem. If every amino acid in even a small protein was to be considered a separate character,then finding the most likely tree was clearly anappropriate task for digital computers. Earlymolecular evolutionists such as RussellDoolittle, who began studying protein phylogenies without computers, quickly addedthem to their research programmes by thelate 1960s.The potential for using computers forphylogenetic analysis was dramaticallydemonstrated for cytochrome c, the respiratory pigment found in all aerobic cells. By themid-1960s, the protein had been sequencedfor a wide variety of plants, animals, fungiand microbes. In a now classic article, WalterFitch and Emanuel Margoliash showed howthis data could be used to build a phylogenetic tree that was remarkably similar to thosebased on more traditional taxonomic characters42. In this, and in the similar computerprograms concurrently devised by Dayhoff’steam43,44, pairwise comparisons were madeamong homologous amino-acid sites onCYTOCHROMES drawn from 20 species. Thecomputer calculated the mutation distances,or the minimum number of steps required toconvert one cytochrome to another. Startingwith a simple three-branched tree, subsequent branches were added sequentially in away that would minimize the mutation distances. These early computer programs didnot attempt exhaustive searches for the simplest phylogenetic tree, but left this partly tohuman intuition. The investigator chose adifferent initial subset of three proteins, thenthe computer constructed a second tree. Thenew tree was discarded if it turned out to beless satisfactory than the original one. Duringthe research leading to their article, Fitch andMargoliash examined 40 trees in an attemptto find the optimal one.The molecular phylogenies constructed byearly computational biologists rested on theassumption that the proteins being comparedwere homologous. In cytochrome c trees, theproteins from various species were so similarthat there was no question that they shared aclose common ancestor. However, detectinghomology, and distinguishing it from chancesimilarity, in more distantly related proteinswas recognized as an important problem bymolecular evolutionists. During the lateNATURE REVIEWS GENETICS1960s, several biologists developed computeralgorithms for determining sequence homology and aligning related sequences to acc

bioinformatics today. It is tempting to trace the origins of bioinfor-matics to the recent convergence of DNA sequencing,large-scale genome projects,the internet and supercomputers1–3.However, some scientists who claim that bioinformatics is in its infancy acknowledge t