Articles Initial Sequencing And Analysis Of The Human Genome

Transcription

articlesInitial sequencing and analysis of thehuman genomeInternational Human Genome Sequencing Consortium** A partial list of authors appears on the opposite page. Af liations are listed at the end of the paper.The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution.Here we report the results of an international collaboration to produce and make freely available a draft sequence of the humangenome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.The rediscovery of Mendel's laws of heredity in the opening weeks ofthe 20th century1 3 sparked a scienti c quest to understand thenature and content of genetic information that has propelledbiology for the last hundred years. The scienti c progress madefalls naturally into four main phases, corresponding roughly to thefour quarters of the century. The rst established the cellular basis ofheredity: the chromosomes. The second de ned the molecular basisof heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and withthe invention of the recombinant DNA technologies of cloning andsequencing by which scientists can do the same.The last quarter of a century has been marked by a relentless driveto decipher rst genes and then entire genomes, spawning the eldof genomics. The fruits of this work already include the genomesequences of 599 viruses and viroids, 205 naturally occurringplasmids, 185 organelles, 31 eubacteria, seven archaea, onefungus, two animals and one plant.Here we report the results of a collaboration involving 20 groupsfrom the United States, the United Kingdom, Japan, France,Germany and China to produce a draft sequence of the humangenome. The draft genome sequence was generated from a physicalmap covering more than 96% of the euchromatic part of the humangenome and, together with additional sequence in public databases,it covers about 94% of the human genome. The sequence wasproduced over a relatively short period, with coverage rising fromabout 10% to more than 90% over roughly fteen months. Thesequence data have been made available without restriction andupdated daily throughout the project. The task ahead is to produce a nished sequence, by closing all gaps and resolving all ambiguities.Already about one billion bases are in nal form and the task ofbringing the vast majority of the sequence to this standard is nowstraightforward and should proceed rapidly.The sequence of the human genome is of interest in severalrespects. It is the largest genome to be extensively sequenced so far,being 25 times as large as any previously sequenced genome andeight times as large as the sum of all such genomes. It is the rstvertebrate genome to be extensively sequenced. And, uniquely, it isthe genome of our own species.Much work remains to be done to produce a complete nishedsequence, but the vast trove of information that has becomeavailable through this collaborative effort allows a global perspectiveon the human genome. Although the details will change as thesequence is nished, many points are already clear.X The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposableelements, GC content, CpG islands and recombination rate. Thisgives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poorregions of the human genome, probably re ecting the very complex860coordinate regulation of the genes in the clusters.X There appear to be about 30,000 40,000 protein-coding genes inthe human genomeÐonly about twice as many as in worm or y.However, the genes are more complex, with more alternativesplicing generating a larger number of protein products.X The full set of proteins (the proteome') encoded by the humangenome is more complex than those of invertebrates. This is due inpart to the presence of vertebrate-speci c protein domains andmotifs (an estimated 7% of the total), but more to the fact thatvertebrates appear to have arranged pre-existing components into aricher collection of domain architectures.X Hundreds of human genes appear likely to have resulted fromhorizontal transfer from bacteria at some point in the vertebratelineage. Dozens of genes appear to have been derived from transposable elements.X Although about half of the human genome derives from transposable elements, there has been a marked decline in the overallactivity of such elements in the hominid lineage. DNA transposonsappear to have become completely inactive and long-terminalrepeat (LTR) retroposons may also have done so.X The pericentromeric and subtelomeric regions of chromosomesare lled with large recent segmental duplications of sequence fromelsewhere in the genome. Segmental duplication is much morefrequent in humans than in yeast, y or worm.X Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, andsuggests that there may be strong selection in favour of preferentialretention of Alu elements in GC-rich regions and that these sel sh'elements may bene t their human hosts.X The mutation rate is about twice as high in male as in femalemeiosis, showing that most mutation occurs in males.X Cytogenetic analysis of the sequenced clones con rms suggestions that large GC-poor regions are strongly correlated with darkG-bands' in karyotypes.X Recombination rates tend to be much higher in distal regions(around 20 megabases (Mb)) of chromosomes and on shorterchromosome arms in general, in a pattern that promotes theoccurrence of at least one crossover per chromosome arm in eachmeiosis.X More than 1.4 million single nucleotide polymorphisms (SNPs)in the human genome have been identi ed. This collection shouldallow the initiation of genome-wide linkage disequilibriummapping of the genes in the human population.In this paper, we start by presenting background information onthe project and describing the generation, assembly and evaluationof the draft genome sequence. We then focus on an initial analysis ofthe sequence itself: the broad chromosomal landscape; the repeatelements and the rich palaeontological record of evolutionary andbiological processes that they provide; the human genes andproteins and their differences and similarities with those of other 2001 Macmillan Magazines LtdNATURE VOL 409 15 FEBRUARY 2001 www.nature.com

articlesGenome Sequencing Centres (Listed in order of total genomicsequence contributed, with a partial list of personnel. A full list ofcontributors at each centre is available as SupplementaryInformation.)Whitehead Institute for Biomedical Research, Center for GenomeResearch: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*,Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1,Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*,Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1,John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1,Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*,Cher Miranda1, William Morris1, Jerome Naylor1,Christina Raymond1, Mark Rosetti1, Ralph Santos1,Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1,Nikola Stojanovic1, Aravind Subramanian1& Dudley Wyman1The Sanger Centre: Jane Rogers2, John Sulston2*,Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2,Christopher Clee2, Nigel Carter2, Alan Coulson2,Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2,Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2,Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2,Matthew Jones2, Christine Lloyd2, Amanda McMurray2,Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*,Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2& Sarah Sims2Washington University Genome Sequencing Center:Robert H. Waterston3*, Richard K. Wilson3, LaDeana W. Hillier3*,John D. McPherson3, Marco A. Marra3, Elaine R. Mardis3,Lucinda A. Fulton3, Asif T. Chinwalla3*, Kymberlie H. Pepin3,Warren R. Gish3, Stephanie L. Chissoe3, Michael C. Wendl3,Kim D. Delehaunty3, Tracie L. Miner3, Andrew Delehaunty3,Jason B. Kramer3, Lisa L. Cook3, Robert S. Fulton3,Douglas L. Johnson3, Patrick J. Minx3 & Sandra W. Clifton3US DOE Joint Genome Institute: Trevor Hawkins4,Elbert Branscomb4, Paul Predki4, Paul Richardson4,Sarah Wenning4, Tom Slezak4, Norman Doggett4, Jan-Fang Cheng4,Anne Olsen4, Susan Lucas4, Christopher Elkin4,Edward Uberbacher4 & Marvin Frazier4Baylor College of Medicine Human Genome Sequencing Center:Richard A. Gibbs5*, Donna M. Muzny5, Steven E. Scherer5,John B. Bouck5*, Erica J. Sodergren5, Kim C. Worley5*, Catherine M.Rives5, James H. Gorrell5, Michael L. Metzker5,Susan L. Naylor6, Raju S. Kucherlapati7, David L. Nelson,& George M. Weinstock8RIKEN Genomic Sciences Center: Yoshiyuki Sakaki9,Asao Fujiyama9, Masahira Hattori9, Tetsushi Yada9,Atsushi Toyoda9, Takehiko Itoh9, Chiharu Kawagoe9,Hidemi Watanabe9, Yasushi Totoki9 & Todd Taylor9Genoscope and CNRS UMR-8030: Jean Weissenbach10,Roland Heilig10, William Saurin10, Francois Artiguenave10,Philippe Brottier10, Thomas Bruls10, Eric Pelletier10,Catherine Robert10 & Patrick Wincker10GTC Sequencing Center: Douglas R. Smith11,Lynn Doucette-Stamm11, Marc Ruben eld11, Keith Weinstock11,Hong Mei Lee11 & JoAnn Dubois11Department of Genome Analysis, Institute of MolecularNATURE VOL 409 15 FEBRUARY 2001 www.nature.comBiotechnology: Andre Rosenthal12, Matthias Platzer12,Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12Beijing Genomics Institute/Human Genome Center:Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14& Jun Gu15Multimegabase Sequencing Center, The Institute for SystemsBiology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16Stanford Genome Technology Center: Ronald W. Davis17,Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17Stanford Human Genome Center: Richard M. Myers18,Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18& David R. Cox18University of Washington Genome Center: Maynard V. Olson19,Rajinder Kaul19 & Christopher Raymond19Department of Molecular Biology, Keio University School ofMedicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20& Shinsei Minoshima20University of Texas Southwestern Medical Center at Dallas:Glen A. Evans21², Maria Athanasiou21 & Roger Schultz21University of Oklahoma's Advanced Center for GenomeTechnology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22Max Planck Institute for Molecular Genetics: Juliane Ramser23,Hans Lehrach23 & Richard Reinhardt23Cold Spring Harbor Laboratory, Lita Annenberg Hazen GenomeCenter: W. Richard McCombie24, Melissa de la Bastide24& Neilay Dedhia24GBFÐGerman Research Centre for Biotechnology:Helmut BloÈcker25, Klaus Hornischer25 & Gabriele Nordsiek25* Genome Analysis Group (listed in alphabetical order, alsoincludes individuals listed under other headings):Richa Agarwala26, L. Aravind26, Jeffrey A. Bailey27, Alex Bateman2,Sera m Batzoglou1, Ewan Birney28, Peer Bork29,30, Daniel G. Brown1,Christopher B. Burge31, Lorenzo Cerutti28, Hsiu-Chuan Chen26,Deanna Church26, Michele Clamp2, Richard R. Copley30,Tobias Doerks29,30, Sean R. Eddy32, Evan E. Eichler27,Terrence S. Furey33, James Galagan1, James G. R. Gilbert2,Cyrus Harmon34, Yoshihide Hayashizaki35, David Haussler36,Henning Hermjakob28, Karsten Hokamp37, Wonhee Jang26,L. Steven Johnson32, Thomas A. Jones32, Simon Kasif38,Arek Kaspryzk28, Scot Kennedy39, W. James Kent40, Paul Kitts26,Eugene V. Koonin26, Ian Korf3, David Kulp34, Doron Lancet41,Todd M. Lowe42, Aoife McLysaght37, Tarjei Mikkelsen38,John V. Moran43, Nicola Mulder28, Victor J. Pollara1,Chris P. Ponting44, Greg Schuler26, JoÈrg Schultz30, Guy Slater28,Arian F. A. Smit45, Elia Stupka28, Joseph Szustakowki38,Danielle Thierry-Mieg26, Jean Thierry-Mieg26, Lukas Wagner26,John Wallis3, Raymond Wheeler34, Alan Williams34, Yuri I. Wolf26,Kenneth H. Wolfe37, Shiaw-Pyng Yang3 & Ru-Fang Yeh31Scienti c management: National Human Genome ResearchInstitute, US National Institutes of Health: Francis Collins46*,Mark S. Guyer46, Jane Peterson46, Adam Felsenfeld46*& Kris A. Wetterstrand46; Of ce of Science, US Department ofEnergy: Aristides Patrinos47; The Wellcome Trust: Michael J.Morgan48 2001 Macmillan Magazines Ltd861

articlesorganisms; and the history of genomic segments. (Comparisonsare drawn throughout with the genomes of the budding yeastSaccharomyces cerevisiae, the nematode worm Caenorhabditiselegans, the fruit y Drosophila melanogaster and the mustard weedArabidopsis thaliana; we refer to these for convenience simply asyeast, worm, y and mustard weed.) Finally, we discuss applicationsof the sequence to biology and medicine and describe next steps inthe project. A full description of the methods is provided asSupplementary Information on Nature's web site (http://www.nature.com).We recognize that it is impossible to provide a comprehensiveanalysis of this vast dataset, and thus our goal is to illustrate therange of insights that can be gleaned from the human genome andthereby to sketch a research agenda for the future.(4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery bySchimmel12 and Schimmel and Sutcliffe13, later dubbed expressedsequence tags (ESTs) and pursued with automated sequencing byVenter and others14 20.The idea of sequencing the entire human genome was rstproposed in discussions at scienti c meetings organized by theUS Department of Energy and others from 1984 to 1986 (refs 21,22). A committee appointed by the US National Research Councilendorsed the concept in its 1988 report23, but recommended abroader programme, to include: the creation of genetic, physicaland sequence maps of the human genome; parallel efforts in keymodel organisms such as bacteria, yeast, worms, ies and mice; thedevelopment of technology in support of these objectives; andresearch into the ethical, legal and social issues raised by humangenome research. The programme was launched in the US as a jointeffort of the Department of Energy and the National Institutes ofHealth. In other countries, the UK Medical Research Council andthe Wellcome Trust supported genomic research in Britain; theCentre d'Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France;government agencies, including the Science and Technology Agencyand the Ministry of Education, Science, Sports and Culture supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably theprogramme to sequence the yeast genome. By late 1990, the HumanGenome Project had been launched, with the creation of genomecentres in these countries. Additional participants subsequentlyjoined the effort, notably in Germany and China. In addition, theHuman Genome Organization (HUGO) was founded to provide aforum for international coordination of genomic research. Severalbooks24 26 provide a more comprehensive discussion of the genesisof the Human Genome Project.Through 1995, work progressed rapidly on two fronts (Fig. 1).The rst was construction of genetic and physical maps of thehuman and mouse genomes27 31, providing key tools for identi cation of disease genes and anchoring points for genomic sequence.The second was sequencing of the yeast32 and worm33 genomes, asBackground to the Human Genome ProjectThe Human Genome Project arose from two key insights thatemerged in the early 1980s: that the ability to take global views ofgenomes could greatly accelerate biomedical research, by allowingresearchers to attack problems in a comprehensive and unbiasedfashion; and that the creation of such global views would require acommunal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projectshelped to crystallize these insights, including:(1) The sequencing of the bacterial viruses FX1744,5 and lambda6, theanimal virus SV407 and the human mitochondrion8 between 1977and 1982. These projects proved the feasibility of assembling smallsequence fragments into complete genomes, and showed the valueof complete catalogues of genes and other functional elements.(2) The programme to create a human genetic map to make itpossible to locate disease genes of unknown function based solely ontheir inheritance patterns, launched by Botstein and colleagues in1980 (ref. 9).(3) The programmes to create physical maps of clones covering theyeast10 and worm11 genomes to allow isolation of genes and regionsbased solely on their chromosomal position, launched by Olson andSulston in the 199920002001Discussion and debatein scientific communityNRC reportOther organismsBacterial genome sequencingH. fluS. cerevisiae sequencingC. elegans sequencingD. melanogaster sequencingA. thaliana sequencingGenetic mapsMouse39 speciesE. coliMicrosatellitesSNPsPhysical mapsESTscDNA sequencingGenomic sequencingHumanGenetic mapsFull lengthPilotsequencingSNPsMicrosatellitesPhysical mapsESTscDNA sequencingFull lengthPilot project,15%Genomic sequencingWorkingdraft, 90% Finishing, 100%Chromosome 22 Chromosome 21Figure 1 Timeline of large-scale genomic analyses. Shown are selected components ofwork on several non-vertebrate model organisms (red), the mouse (blue) and the human862(green) from 1990; earlier projects are described in the text. SNPs, single nucleotidepolymorphisms; ESTs, expressed sequence tags. 2001 Macmillan Magazines LtdNATURE VOL 409 15 FEBRUARY 2001 www.nature.com

articleswell as targeted regions of mammalian genomes34 37. These projectsshowed that large-scale sequencing was feasible and developed thetwo-phase paradigm for genome sequencing. In the rst, shotgun',phase, the genome is divided into appropriately sized segments andeach segment is covered to a high degree of redundancy (typically,eight- to tenfold) through the sequencing of randomly selectedsubfragments. The second is a nishing' phase, in which sequencegaps are closed and remaining ambiguities are resolved throughdirected analysis. The results also showed that complete genomicsequence provided information about genes, regulatory regions andchromosome structure that was not readily obtainable from cDNAstudies alone.In 1995, genome scientists considered a proposal38 that wouldhave involved producing a draft genome sequence of the humangenome in a rst phase and then returning to nish the sequence ina second phase. After vigorous debate, it was decided that such aplan was premature for several reasons. These included the need rstto prove that high-quality, long-range nished sequence could beproduced from most parts of the complex, repeat-rich humangenome; the sense that many aspects of the sequencing processwere still rapidly evolving; and the desirability of further decreasingcosts.Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced nished sequence with 99.99% accuracy and no gaps39. They alsointroduced bacterial arti cial chromosomes (BACs)40, a new largeinsert cloning system that proved to be more stable than the cosmidsand yeast arti cial chromosomes (YACs)41 that had been usedpreviously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the humangenome sequence. With successful completion of this phase, thehuman genome sequencing effort moved into full-scale productionin March 1999.The idea of rst producing a draft genome sequence was revivedat this time, both because the ability to nish such a sequence was nolonger in doubt and because there was great hunger in the scienti ccommunity for human sequence data. In addition, some scientistsfavoured prioritizing the production of a draft genome sequenceover regional nished sequence because of concerns about commercial plans to generate proprietary databases of human sequencethat might be subject to undesirable restrictions on use42 44.The consortium focused on an initial goal of producing, in a rstproduction phase lasting until June 2000, a draft genome sequencecovering most of the genome. Such a draft genome sequence,although not completely nished, would rapidly allow investigatorsto begin to extract most of the information in the human sequence.Experiments showed that sequencing clones covering about 90% ofthe human genome to a redundancy of about four- to vefold ( halfshotgun' coverage; see Box 1) would accomplish this45,46. The draftgenome sequence goal has been achieved, as described below.The second sequence production phase is now under way. Itsaims are to achieve full-shotgun coverage of the existing clonesduring 2001, to obtain clones to ll the remaining gaps in thephysical map, and to produce a nished sequence (apart fromregions that cannot be cloned or sequenced with currently availabletechniques) no later than 2003.Genomic DNABAC libraryOrganizedmapped largeclone contigsBAC to besequencedShotgunclonesAssembly .ACCGTAAATGGGCTGATCATGCTTAAACCCTGTGCATCCTACTG.Soon after the invention of DNA sequencing methods47,48, theshotgun sequencing strategy was introduced49 51; it has remainedthe fundamental method for large-scale genome sequencing52 54 forthe past 20 years. The approach has been re ned and extended tomake it more ef cient. For example, improved protocols forfragmenting and cloning DNA allowed construction of shotgunNATURE VOL 409 15 FEBRUARY 2001 www.nature.comHierarchical shotgun sequencingShotgun ATCCTACTG.sequenceStrategic issuesHierarchical shotgun sequencinglibraries with more uniform representation. The practice of sequencing from both ends of double-stranded clones ( double-barrelled'shotgun sequencing) was introduced by Ansorge and others37 in1990, allowing the use of linking information' between sequencefragments.The application of shotgun sequencing was also extended byapplying it to larger and larger DNA moleculesÐfrom plasmids(, 4 kilobases (kb)) to cosmid clones37 (40 kb), to arti cial chromosomes cloned in bacteria and yeast55 (100 500 kb) and bacterialgenomes56 (1 2 megabases (Mb)). In principle, a genome of arbitrary size may be directly sequenced by the shotgun method,provided that it contains no repeated sequence and can be uniformly sampled at random. The genome can then be assembledusing the simple computer science technique of hashing' (in whichone detects overlaps by consulting an alphabetized look-up table ofall k-letter words in the data). Mathematical analysis of theexpected number of gaps as a function of coverage is similarlystraightforward57.Practical dif culties arise because of repeated sequences andcloning bias. Small amounts of repeated sequence pose littleproblem for shotgun sequencing. For example, one can readilyassemble typical bacterial genomes (about 1.5% repeat) or theeuchromatic portion of the y genome (about 3% repeat). Bycontrast, the human genome is lled (. 50%) with repeatedsequences, including interspersed repeats derived from transposableelements, and long genomic regions that have been duplicated intandem, palindromic or dispersed fashion (see below). Theseinclude large duplicated segments (50 500 kb) with high sequenceidentity (98 99.9%), at which mispairing during recombinationcreates deletions responsible for genetic syndromes. Such featurescomplicate the assembly of a correct and nished genome sequence.There are two approaches for sequencing large repeat-richgenomes. The rst is a whole-genome shotgun sequencingapproach, as has been used for the repeat-poor genomes of viruses,bacteria and ies, using linking information and computationalFigure 2 Idealized representation of the hierarchical shotgun sequencing strategy. Alibrary is constructed by fragmenting the target genome and cloning it into a largefragment cloning vector; here, BAC vectors are shown. The genomic DNA fragmentsrepresented in the library are then organized into a physical map and individual BACclones are selected and sequenced by the random shotgun strategy. Finally, the clonesequences are assembled to reconstruct the sequence of the genome. 2001 Macmillan Magazines Ltd863

articlesanalysis to attempt to avoid misassemblies. The second is the hierarchical shotgun sequencing' approach (Fig. 2), also referredto as map-based', BAC-based' or clone-by-clone'. This approachinvolves generating and organizing a set of large-insert clones(typically 100 200 kb each) covering the genome and separatelyperforming shotgun sequencing on appropriately chosen clones.Because the sequence information is local, the issue of long-rangemisassembly is eliminated and the risk of short-range misassemblyis reduced. One caveat is that some large-insert clones may sufferrearrangement, although this risk can be reduced by appropriatequality-control measures involving clone ngerprints (see below).The two methods are likely to entail similar costs for producing nished sequence of a mammalian genome. The hierarchicalapproach has a higher initial cost than the whole-genome approach,owing to the need to create a map of clones (about 1% of the totalcost of sequencing) and to sequence overlaps between clones. Onthe other hand, the whole-genome approach is likely to requiremuch greater work and expense in the nal stage of producing a nished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting inunder-representation of some regions in either large-insert orsmall-insert clone libraries.There was lively scienti c debate over whether the humangenome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers58 stimulated thesediscussions with a speci c proposal for a whole-genome shotgunapproach, together with an analysis suggesting that the methodcould work and be more ef cient. Green59 challenged these conclusions and argued that the potential bene ts did not outweigh thelikely risks.In the end, we concluded that the human genome sequencingeffort should employ the hierarchical approach for several reasons.First, it was prudent to use the approach for the rst project tosequence a repeat-rich genome. With the hierarchical approach, theultimate frequency of misassembly in the nished product wouldprobably be lower than with the whole-genome approach, in whichit would be more dif cult to identify regions in which the assemblywas incorrect.Second, it was prudent to use the approach in dealing with anoutbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two differentcopies of the human genome. Accurate sequence assembly could becomplicated by sequence variation between these two copiesÐbothSNPs (which occur at a rate of 1 per 1,300 bases) and larger-scalestructural heterozygosity (which has been documented in humanchromosomes). In the hierarchical shotgun method, each largeinsert clone is derived from a single haplotype.Third, the hierarchical method would be better able to deal withinevitable cloning biases, because it would more readily allowtargeting of additional sequencing to under-represented regions.And fourth, it was better suited to a project shared among membersof a diverse international consortium, because it allowed work andresponsibility to be easily distributed. As the ultimate goal hasalways been to create a high-quality, nished sequence to serve as afoundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additionalcost, if any.A biotechnology company, Celera Genomics, has chosen toincorporate the whole-genome shotgun approach into its ownefforts to sequence the human genome. Their plan60,61 uses amixed strategy, involving combining some coverage with wholegenome shotgun data generated by the company together with thepublicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium. If the rawsequence reads from the whole-genome shotgun component aremade available, it may be possible to evaluate the extent to which thesequence of the human genome can be assembled without the need864for clone-based information. Such analysis may help to re nesequencing strategies for other large genomes.Technology for large-scale sequencingSequencing the human genome depended on many technologicalimprovements in the production and analysis of sequence data. Keyinnovations were developed both within and outside the HumanGenome Project. Laboratory innovations included four-colour uorescence-based sequence detection62, improved uorescentdyes63 66, dye-labelled terminators67, polymerases speci callydesigned for sequencing68 70, cycle sequencing71 and capillary gelelectrophoresis72 74. These studies contributed to substantialimprovements in the automation, quality and throughput ofcollecting raw DNA sequence75,76. There were also importantadvances in the development of software packages for the analysisof sequence data. The PHRED software package77,78 introduced theconcept of assigning a base-quality score' to each base, on the basisof the probability of an erroneous call. These quality scores make itpossible to monitor raw data quality and also assist in determiningwhether two similar sequences truly overlap. The PHRAP computerpackage

Genome Sequencing Centres (Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is available as Supplementary Information.) Whitehead Institute for Biomedical Research, Center for Genome Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*,