BASICS ON MOLECULAR BIOLOGY - Helsinki

Transcription

BASICS ON MOLECULAR BIOLOGYCell – DNA – RNA – proteinSequencing methodsarising questions for handling the data, making sense of itnext two week lectures: sequence alignment and genomeassembly

Cells 2Fundamental working units of every living system.Every organism is composed of one of two radically different types of cells:– prokaryotic cells– eukaryotic cells which have DNA inside a nucleus.Prokaryotes and Eukaryotes are descended from primitive cells and the results of3.5 billion years of evolution.

Prokaryotes and Eukaryotes According to the most recentevidence, there are threemain branches to the tree oflife Prokaryotes include Archaea(“ancient ones”) and bacteria Eukaryotes are kingdomEukarya and includes plants,animals, fungi and certainalgaeLecture: Phylogenetic trees,this topic in more detail3

All Cells have common Cycles Born, eat, replicate, and die4

Common features of organisms Chemical energy is stored in ATP Genetic information is encoded by DNA Information is transcribed into RNA There is a common triplet genetic code–some variations are known, howeverTranslation into proteins involves ribosomes Shared metabolic pathways Similar proteins among diverse groups of organisms5

All Life depends on 3 critical molecules DNAs (Deoxyribonucleic acid)– Hold information on how cell works RNAs (Ribonucleic acid)– Act to transfer short pieces of information to different parts of cell– Provide templates to synthesize into protein Proteins– Form enzymes that send signals to other cells and regulate geneactivity– Form body’s major components6

DNA structure DNA has a double helix structurewhich is composed of– sugar molecule– phosphate group– and a base (A,C,G,T) By convention, we read DNAstrings in direction oftranscription: from 5’ end to 3’end5’ ATTTAGGCC 3’3’ TAAATCCGG 5’7

DNA is contained in chromosomesIn eukaryotes, DNA is packed into linear chromosomesIn prokaryotes, DNA is usually contained in a single, age:Chromatin Structures.png

Human chromosomes Somatic cells (cells in all, exceptthe germline, tissues) in humanshave 2 pairs of 22 chromosomes XX (female) or XY (male) totalof 46 chromosomes Germline cells have 22chromosomes either X or Y total of 23 chromosomesKaryogram of human male using Giemsa staining(http://en.wikipedia.org/wiki/Karyotype)9

RNA RNA is similar to DNA chemically. It is usually only a single strand.T(hyamine) is replaced by U(racil) Several types of RNA exist for different functions in the cell.tRNA linear and 3D ial/trna/trna.gif

DNA, RNA, and the Flow of InformationReplicationTranscription”The central dogma”TranslationIs this true?11Denis Noble: The principles of Systems Biology illustrated using the virtual eccs07 dresden/noble denis/eccs07 noble psb 01.ppt

Proteins 12Proteins are polypeptides (stringsof amino acid residues)Represented using strings ofletters from an alphabet of 20:AEGLV WKKLAGTypical length 50 1000 residuesUrease enzyme from Helicobacter pylori

/Amino acids 2.pngAmino acids13

How DNA/RNA codes for protein? 14DNA alphabet contains fourletters but must specify protein,or polypeptide sequence of 20letters.Trinucleotides (triplets) allow 43 64 possible trinucleotidesTriplets are also called codons

Proteins 20 different amino acids– different chemical properties cause the protein chains to fold up into specificthree-dimensional structures that define their particular functions in the cell. Proteins do all essential work for the cell– build cellular structures– digest nutrients– execute metabolic functions– mediate information flow within a cell and among cellular communities. Proteins work together with other proteins or nucleic acids as "molecularmachines"– structures that fit together and function in highly specific, lock-and-key ways.15

Genes “A gene is a union of genomic sequences encoding a coherent set ofpotentially overlapping functional products” A DNA segment whose information is expressed either as an RNAmolecule or protein(translation)(folding)MSG (transcription)5’ a t g a g t g g a 3’3’ t a c t c a c c t 5’16http://fold.it

Genes & alleles A gene can have different variants The variants of the same gene are calledallelesMSG MSR 5’ a t g a g t g g a 5’ a t g a g t c g a 3’ t a c t c a c c t 3’ t a c t c a g c t 17

Genes can be found on both strands5’3’3’5’18

Exons and introns & splicingExons5’3’3’5’Introns are removed from RNA after transcriptionExons are joined:This process is called splicing19

Alternative splicingDifferent splice variants may be generated5’AB3’3’5’AAC 20CBCBC

DNA and continuum of life. Prokaryotes are typically haploid:they have a single (circular)chromosomeDNA is usually inherited vertically(parent to daughter)Inheritance is clonal– Descendants are faithful copiesof an ancestral DNA– Variation is introduced viamutations, transposableelements, and horizontal transferof DNAChromosome map of S. dysenteriae, the nine ringsdescribe different properties of the genomehttp://www.mgc.ac.cn/ShiBASE/circular Sd197.htm21

Biological string manipulation Point mutation: substitution of a base– ACGGCT ACGCCT Deletion: removal of one or more contiguous bases(substring)– TTGATCA TTTCA Insertion: insertion of a substring– GGCTAG GGTCAACTAG Lecture: Sequence alignmentLecture: Genome rearrangements22

Genome sequencing & assembly DNA sequencing– How do we obtain DNA sequence information from organisms? Genome assembly– What is needed to put together DNA sequence information from sequencing? First statement of sequence assembly problem:– Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for some string matchingproblems arising in molecular genetics. Proc. 9th IFIP World ComputerCongress, 198323

Recovery of shredded newspaper?24

DNA sequencing 25DNA sequencing: resolving a nucleotide sequence (whole-genome or less)Many different methods developed– Maxam-Gilbert method (1977)– Sanger method (1977)– High-throughput methods, ”next-generation” methods

Sanger sequencing: sequencing by synthesis 26A sequencing technique developed by 1977Also called dideoxy sequencingA DNA polymerase is an enzymethat catalyzes DNA synthesisDNA polymerase needs a primerSynthesis proceeds always in 5’- 3’ directionIn Sanger sequencing, chain-terminatingdideoxynucleoside triphosphates (ddXTPs) are employed– ddATP, ddCTP, ddGTP, ddTTPlack the 3’-OH tail of dXTPsA mixture of dXTPs with small amount of ddXTPsis given to DNA polymerase with DNA template and primerddXTPs are given fluorescent labelsWhen DNA polymerase encounters a ddXTP, the synthesiscannot proceedThe process yields copied sequences of different lengthsEach sequence is terminated by a labeled ddXTP

Determining the sequence Sequences are sorted according tolength by capillary electrophoresis Fluorescent signals corresponding tolabels are registered Base calling: identifying which basecorresponds to each position in aread– Non-trivial problem!27Output sequences frombase calling are called reads

Reads are short! Modern Sanger sequencers can produce quality reads up to 750 bases1– Instruments provide you with a quality file for bases in reads, in addition toactual sequence dataCompare the read length against the size of the human genome (2.9x109 bases) Reads have to be assembled! 28

Problems 29Sanger sequencing error rate per base varies from 1% to 3%1Repeats in DNA– For example, 300 base longs Alu sequence repeated is over million times inhuman genome– Repeats occur in different scalesWhat happens if repeat length is longer than read length?Shortest superstring problem– Find the shortest string that ”explains” the reads– Given a set of strings (reads), find a shortest string that contains all of them

Sequence assembly and combination locks 30What is common with sequence assembly and opening keypad locks?

Whole-genome shotgun sequence Whole-genome shotgun sequence assembly starts with a large sample ofgenomic DNA1.2.3.4.31Sample is randomly partitioned into inserts of length 500 basesInserts are multiplied by cloning them into a vector which is used to infectbacteriaDNA is collected from bacteria and sequencedReads are assembled

Assembly of reads with Overlap-LayoutConsensus algorithm 32Overlap– Finding potentially overlapping readsLayout– Finding the order of reads along DNAConsensus (Multiple alignment)– Deriving the DNA sequence from the layoutNext, the method is described at a very abstract level, skipping a lot of details

Finding overlaps First, pairwise overlap alignment ofreads is resolved Reads can be from either DNA strand:The reverse complement r* of eachread r has to be consideredacggagtccagtccgcgcttr15’ a t g a g t g g a 3’3’ t a c t c a c c t 5’r233r1: tgagt, r1*: actcar2: tccac, r2*: gtgga

Example sequence to assemble5’ – AAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3’ 20 TCACTCATCGCGTCGATGCGCTTTCACTT

Finding overlaps Overlap between two reads canbe found with a dynamicprogramming algorithmOverlap(1, 6) 36 ATGCGCAT– Errors can be taken into account 12 ATCGTGATDynamic programming will bediscussed more during the nexttwo weeksOverlap scores stored into theoverlap matrix– Entries (i, j) below the diagonaldenote overlap of read ri and rj*351 CATCGTCAOverlap(1, 12) 7161237

Finding layout & consensus Method extends the assemblygreedily by choosing the bestoverlaps Both orientations are considered Sequence is extended as far aspossibleconsensus sequence36Ambiguous bases7*GACCGCAT6 6* ATGCGCAT14GCATCGTG1CATCGTGA12ATCGTGAT19GCGCATCG13* CGCAGCGC--------------------CGCATCGTGAT

Finding layout & consensus We move on to next bestoverlaps and extend thesequence from there The method stops when there areno more overlaps to consider A number of contigs is produced Contig stands for contiguoussequence, resulting from TC--------------------ATGCGGTCGGTGAAG

Whole-genome shotgun sequencing:summaryOriginal genome sequence ReadsNon-overlappingreadOverlapping reads Contig Ordering of the reads is initially unknown Overlaps resolved by aligning the reads In a 3x109 bp genome with 500 bp reads and 5x coverage, there are 107 reads and 107(107-1)/2 5x1013 pairwise sequence comparisons38

Repeats in DNA and genome assemblyTwo instances of the same repeat39

Repeats in DNA cause problems insequence assembly Recap: if repeat length exceeds read length, we might not get the correctassemblyThis is a problem especially in eukaryotes– 3.1% of genome consists of repeats in Drosophila, 45% in humanPossible solutions1.2.40Increase read length – feasible?Divide genome into smaller parts, with known order, and sequence partsindividually

”Divide and conquer” sequencingapproaches: BAC-by-BACWhole-genome shotgun sequencingGenomeGenomeBAC library41Divide-and-conquer

BAC-by-BAC sequencing 42Each BAC (Bacterial Artificial Chromosome) is about 150 kbpCovering the human genome requires 30000 BACsBACs shotgun-sequenced separately– Number of repeats in each BAC is significantly smaller than in the wholegenome.– .needs much more manual work compared to whole-genome shotgunsequencing

Hybrid method 43Divide-and-conquer and whole-genome shotgun approaches can be combined– Obtain high coverage from whole-genome shotgun sequencing for shortcontigs– Generate of a set of BAC contigs with low coverage– Use BAC contigs to ”bin” short contigs to correct placesThis approach was used to sequence the brown Norway rat genome in 2004

First whole-genome shotgun sequencingproject: Drosophila melanogaster 44http://en.wikipedia.org/wiki/Drosophila melanogasterFruit fly is a common model organismin biological studiesWhole-genome assembly reported inEugene Myers, et al., A WholeGenome Assembly of Drosophila,Science 24, 2000Genome size 120 Mbp

Sequencing of the Human Genome The (draft) human genome waspublished in 2001 Two efforts:– Human Genome Project (publicconsortium)– Celera (private company) HGP: BAC-by-BAC approach Celera: whole-genome shotgunsequencingHGP: Nature 15 February 2001Vol 409 Number 6822Celera: Science 16 February 2001Vol 291, Issue 550745

Sequencing of the Human Genome The (draft) human genomewas published in 2001 Two efforts:– Human Genome Project(public consortium)– Celera (private company) HGP: BAC-by-BAC approach Celera: whole-genomeshotgun sequencingHGP: Nature 15 February 2001Vol 409 Number 6822Celera: Science 16 February 2001Vol 291, Issue 550746

Next-gen sequencing: 454 Sanger sequencing is the prominent first-generation sequencing methodMany new sequencing methods are emerging Genome Sequencer FLX (454 Life Science / Roche)– 100 Mb / 7.5 h run– Read length 250-300 bp– 99.5% accuracy / base in a single run– 99.99% accuracy / base in consensus47

The method used by the Roche/454 sequencerto amplify single-stranded DNA copies from afragment library on agarose beads.A mixture of DNA fragments with agarose beadscontaining complementary oligonucleotides to theadapters at the fragment ends are mixed in anapproximately 1:1 ratio.The mixture is encapsulated by vigorousvortexing into aqueous micelles that contain PCRreactants surrounded by oil, and pipetted into a96-well microtiter plate for PCR amplification.The resulting beads are decorated withapproximately 1 million copies of the originalsingle-stranded fragment, which providessufficient signal strength during thepyrosequencing reaction that follows to detectand record nucleotide incorporation events.sstDNA, single-stranded template DNA.

Next-gen sequencing: Illumina Solexa 49Illumina / Solexa Genome Analyzer– Read length 35 - 50 bp– 1-2 Gb / 3-6 day run– 98.5% accuracy / base in a single run– 99.99% accuracy / consensus with 3x coverage

The Illumina sequencing-by-synthesisapproach. Cluster strands created by bridgeamplification are primed and all fourfluorescently labeled, 3 -OH blockednucleotides are added to the flow cell with DNApolymerase. The cluster strands are extendedby one nucleotide. Following the incorporationstep, the unused nucleotides and DNApolymerase molecules are washed away, ascan buffer is added to the flow cell, and theoptics system scans each lane of the flow cellby imaging units called tiles. Once imaging iscompleted, chemicals that effect cleavage ofthe fluorescent labels and the 3 -OH blockinggroups are added to the flow cell, whichprepares the cluster strands for another roundof fluorescent nucleotide incorporation.

Next-gen sequencing: SOLiD 51SOLiD– Read length 25-30 bp– 1-2 Gb / 5-10 day run– 99.94% accuracy / base– 99.999% accuracy / consensus with 15x coverage

The ligase-mediated sequencing approach of the Applied Biosystems SOLiDsequencer. In a manner similar to Roche/454 emulsion PCR amplification, DNAfragments for SOLiD sequencing are amplified on the surfaces of 1- m magneticbeads to provide sufficient signal during the sequencing reactions, and are thendeposited onto a flow cell slide. Ligase-mediated sequencing begins by annealinga primer to the shared adapter sequences on each amplified fragment, and thenDNA ligase is provided along with specific fluorescent-labeled 8mers, whose 4thand 5th bases are encoded by the attached fluorescent group. Each ligation stepis followed by fluorescence detection, after which a regeneration step removesbases from the ligated 8mer (including the fluorescent group) and concomitantlyprepares the extended primer for another round of ligation. (b) Principles of twobase encoding. Because each fluorescent group on a ligated 8mer identifies atwo-base combination, the resulting sequence reads can be screened for basecalling errors versus true polymorphisms versus single base deletions by aligningthe individual reads to a known high-quality reference sequence.

Modern Sanger sequencers can produce quality reads up to 750 bases1 – Instruments provide you with a quality file for bases in reads, in addition to actual sequence data Compare the read length against the size of the