Introduction BIOINFORMATICS

Transcription

BIOINFORMATICS
Introduction
Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

Bioinformatics
Biological Data
Computer Calculations

What is Bioinformatics?
(Molecular) Bio - informatics
One idea for a definition?
  Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
Bioinformatics is “MIS” for Molecular Biology Information

Molecular Biology: an Information Science
Central Dogma of Molecular Biology
  DNA -> RNA -> Protein -> Phenotype -> DNA
  Molecules: Sequence, Structure, Function
  Processes: Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
  Large Amounts of Information: Standardized, Statistical
(idea from D Brutlag, Stanford, graphics from S Strobel)
Most cellular functions are performed or facilitated by proteins.
  Primary biocatalyst
  Cofactor transport/storage
  Mechanical motion/support
  Immune protection
  Control of growth/differentiation
Genetic material
Information transfer (mRNA)
Protein synthesis (tRNA/mRNA)
Some catalytic activity

Molecular Biology Information - DNA
Raw DNA Sequence
  Coding or Not?
  Parse into genes?
  4 bases: AGCT
  1 K in a gene, 2 M in
gatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . . . caattacaacaagatcctctttcttgcacttgg

Molecular Biology Information: Protein Sequence
20 letter alphabet
  ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of 300 aa in an average protein (in bacteria), 200 aa in a domain
200 K known protein
GGAQIFTAFKDDV

Molecular Biology Information: Macromolecular Structure
DNA/RNA/Protein
  Almost all protein
(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Molecular Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
  200 residues/domain - 200 CA atoms, separated by 3.8 Å
  Avg. Residue is Leu: 4 backbone atoms, 4 sidechain atoms, 150 cubic Å
  About 1500 xyz triplets (roughly 8 x 200) per protein domain
  10 K known domain, 300

... of the World of Sequences
1995: Bacteria, 1.6 Mb, 1600 genes [Science 269: 496]
1997: Eukaryote, 13 Mb, 6K genes [Nature 387: 1]
1998: Animal, 100 Mb, 20K genes [Science 282: 1945]
2000?: Human, 3 Gb, 100K genes [?]

Molecular Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
Integrative Data
  1995, HI (bacteria): 1.6 Mb & 1600 genes done
  1997, yeast: 13 Mb & 6000 genes for yeast
  1998, worm: 100 Mb with 19 K genes
  1999: 30 completed genomes!
  2003, human: 3 Gb & 100 K genes.
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)

Gene Expression Datasets: the Transcriptosome
Young/Lander, Chips, Abs. Exp.
Brown, µarray, Rel. Exp. over Timecourse
Also: SAGE; Samson and Church, s, Protein Exp.

Array Data
Yeast Expression Data in Academia: levels for all 6000 genes!
Can only sequence genome once but can do an infinite variety of these array experiments
at 10 time points, 6000 x 10 = 60K floats
telling signal from background
(courtesy of J Hager)

Other Whole Genome Experiments
Systematic Knockouts
  Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6
2 hybrids, linkage maps
  Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52
For yeast: 6000 x 6000 / 2 = 18M interactions
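
The interaction estimate on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration (the function name `unordered_pairs` is an assumption, not something from the lecture); it compares the slide's 6000 x 6000 / 2 approximation with the exact number of distinct unordered gene pairs, n(n-1)/2.

```python
def unordered_pairs(n: int) -> int:
    """Exact number of distinct unordered pairs among n items (no self-pairs)."""
    return n * (n - 1) // 2


n_genes = 6000  # approximate yeast gene count quoted on the slide

approx = n_genes * n_genes // 2   # the slide's estimate: 6000 x 6000 / 2
exact = unordered_pairs(n_genes)  # excludes a gene paired with itself

print(approx)  # 18000000
print(exact)   # 17997000
```

The two numbers differ by only 3000, so the rounder 18M figure is a fair summary of the size of the pairwise search space.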

Molecular Biology Information: Other Integrative Data
Information to understand genomes
  Metabolic Pathways (glycolysis), traditional biochemistry
  Regulatory Networks
  Whole Organisms Phylogeny, traditional zoology
  Environments, Habitats, ecology
  The Literature (MEDLINE)
The Future.
(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk & Net
  As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
Driving Force in Bioinformatics
[Figure: Internet Hosts; CPU Instruction Time (ns), 1985 to 1995]
(Internet picture adapted from D Brutlag, ...)

Bioinformatics is born!
(courtesy of Finn Drablos)

Weber Cartoon

The Character of Molecular Biology Information: Redundancy and Multiplicity
  Different Sequences Have the Same Structure
  Organism has many similar genes
  Single Gene May Have Multiple Functions
  Genes are grouped into Pathways
  Genomic Sequence Redundancy due to the Genetic Code
  How do we find the similarities?
Integrative Genomics
  genes, structures, functions, pathways, expression levels, regulatory systems, ...
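
The item "Genomic Sequence Redundancy due to the Genetic Code" can be made concrete with a small sketch. The code below builds the standard genetic code table and shows two different DNA strings that translate to the same peptide. The example sequences and the helper name `translate` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon (a trailing partial codon is ignored)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Number of codons that encode each amino acid (the degeneracy of the code).
degeneracy = Counter(CODON_TABLE.values())

print(translate("TTAAGC"), translate("CTGTCT"))  # LS LS
print(degeneracy["L"], degeneracy["M"])          # 6 1
```

Because 64 codons map onto 20 amino acids plus stop, comparing sequences at the protein level hides differences that are visible at the DNA level.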

New Paradigm for Scientific Computing
Because of increase in data and improvement in computers, new calculations become possible
But Bioinformatics has a new style of calculation.
Two Paradigms
  Physics
    Prediction based on physical principles
    Exact Determination of Rocket Trajectory
    Supercomputer, CPU
  Biology
    Classifying information and discovering unexpected relationships
    globin, colicin, plastocyanin, repressor
    networks, “federated” database

General Types of “Informatics” in Bioinformatics
Databases
  Building, Querying
  Object DB
Text String Comparison
  Text Search
  1D Alignment
  Significance Statistics
  Alta Vista, grep
Finding Patterns
  AI / Machine Learning
  Clustering
  Datamining
Geometry
  Robotics
  Graphics (Surfaces, Volumes)
  Comparison and 3D Matching (Vision, recognition)
Physical Simulation
  Newtonian Mechanics
  Electrostatics
  Numerical Algorithms
  Simulation

Bioinformatics Topics - Genome Sequence
Finding Genes in Genomic DNA
  introns
  exons
  promoters
Characterizing Repeats in Genomic DNA
  Statistics
  Patterns
Duplications in the Genome

Bioinformatics Topics - Protein Sequence
Sequence Alignment
  non-exact string matching, gaps
  How to align two strings optimally via Dynamic Programming
  Local vs Global Alignment
  Suboptimal Alignment
  Hashing to increase speed (BLAST, FASTA)
  Amino acid substitution scoring matrices
Multiple Alignment and Consensus Patterns
  How to align more than one sequence and then fuse the result in a consensus representation
  Transitive Comparisons
  HMMs, Profiles
  Motifs
Scoring schemes and Matching statistics
  How to tell if a given alignment or match is statistically significant
  A P-value (or an e-value)?
  Score Distributions (extreme val. dist.)
  Low Complexity Sequences
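
To illustrate "how to align two strings optimally via Dynamic Programming", here is a minimal sketch of global alignment scoring in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity; real protein comparisons would use an amino acid substitution matrix, as the slide notes.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b by dynamic programming."""
    # prev[j] is the best score for aligning the prefix of a processed so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 2
```

Only two rows of the table are kept, which is enough for the score. Recovering the alignment itself would need the full table, or traceback pointers, and a local (Smith-Waterman style) variant would additionally clamp each cell at zero.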

Bioinformatics Topics - Sequence / Structure
Secondary Structure “Prediction”
  via Propensities
  Neural Networks, Genetic Alg.
  Simple Statistics
  TM-helix finding
  Assessing Secondary Structure Prediction
Tertiary Structure Prediction
  Fold Recognition
  Threading
  Ab initio
Function Prediction
  Active site identification
Relation of Sequence Similarity to Structural Similarity

Topics -- Structures
Basic Protein Geometry and Least-Squares Fitting
  Distances, Angles, Axes, Rotations
  Calculating a helix axis in 3D via fitting a line
  LSQ fit of 2 structures
  Molecular Graphics
Calculation of Volume and Surface
  How to represent a plane
  How to represent a solid
  How to calculate an area
  Docking and Drug Design as Surface Matching
  Packing Measurement
Structural Alignment
  Aligning sequences on the basis of 3D structure.
  DP does not converge, unlike sequences, what to do?
  Other Approaches: Distance Matrices, Hashing
Fold Library
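
The "Distances, Angles" item is the simplest piece of protein geometry to sketch in code. The helpers below use only the standard library; the coordinates are made-up points (the first pair is spaced 3.8 Å apart, echoing the CA-CA separation quoted earlier in the deck), and the function names are illustrative assumptions.

```python
import math
from typing import Sequence


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)


def angle_deg(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Angle at vertex b formed by the points a-b-c, in degrees."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


# Hypothetical atom positions in Å, chosen only for illustration.
ca1, ca2, ca3 = (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)

print(distance(ca1, ca2))        # 3.8
print(angle_deg(ca1, ca2, ca3))  # 90.0
```

Fitting a helix axis or superimposing two structures builds on the same vector operations, but needs a least-squares step that is not shown here.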

Topics - Databases
Relational Database Concepts
  Keys, Foreign Keys
  SQL, OODBMS, views, forms, transactions, reports, indexes
  Joining Tables, Normalization
  Natural Join as "where" selection on cross product
  Array Referencing (perl/dbm)
  Forms and Reports
  Cross-tabulation
Protein Units?
  What are the units of biological information?
  sequence, structure
  motifs, modules, domains
  How classified: folds, motions, pathways, functions?
Clustering and Trees
  Basic clustering
    UPGMA
    single-linkage
    multiple linkage
  Other Methods
    Parsimony, Maximum likelihood
  Evolutionary implications
The Bias Problem
  sequence weighting
  sampling
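
The phrase 'Natural Join as "where" selection on cross product' can be shown directly. The sketch below forms every pairing of rows from two tables and keeps only the pairings that agree on their shared columns. The table contents, the column names, and the function name `natural_join` are hypothetical, invented for this illustration.

```python
from itertools import product

# Hypothetical tables as lists of dicts; "protein_id" is the shared key column.
proteins = [
    {"protein_id": 1, "name": "P1"},
    {"protein_id": 2, "name": "P2"},
]
domains = [
    {"protein_id": 1, "fold": "fold_A"},
    {"protein_id": 2, "fold": "fold_B"},
    {"protein_id": 2, "fold": "fold_C"},
]


def natural_join(left, right):
    """Natural join: cross product, then keep rows that agree on all shared columns."""
    joined = []
    for l_row, r_row in product(left, right):          # cross product
        shared = l_row.keys() & r_row.keys()
        if all(l_row[k] == r_row[k] for k in shared):  # the "where" selection
            joined.append({**l_row, **r_row})
    return joined


for row in natural_join(proteins, domains):
    print(row)
```

A real database would not materialize the full cross product; it would use an index on the key. The result is the same set of rows, which is the point of the slide's phrasing.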

Topics -- Genomics
Expression Analysis
  Time Courses clustering
  Measuring differences
  Identifying Regulatory Regions
Large scale cross referencing of information
Function Classification and Orthologs
The Genomic vs. Single molecule Perspective
Genome Comparisons
  Ortholog Families, pathways
  Large-scale censuses
  Frequent Words Analysis
  Genome Annotation
  Trees from Genomes
  Identification of interacting proteins
Structural Genomics
  Folds in Genomes, shared & common folds
  Bulk Structure Prediction
Genome Trees
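
"Frequent Words Analysis" amounts to counting short overlapping words (k-mers) in a sequence. A minimal sketch follows, using the first fragment of the raw DNA sequence shown on the earlier "Raw DNA Sequence" slide; the function name `kmer_counts` and the choice of k = 3 are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq (case-insensitive)."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# First fragment of the example sequence from the "Raw DNA Sequence" slide.
fragment = "gatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc"

counts = kmer_counts(fragment, 3)
print(counts.most_common(5))  # the five most frequent 3-letter words

# A tiny case that is easy to verify by hand: "at" occurs 3 times, "ta" twice.
print(kmer_counts("atatat", 2))
```

On a whole genome the same counts, compared against what base composition alone would predict, point to over- and under-represented words.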

Topics -- Simulation
Molecular Simulation
  Geometry -> Energy -> Forces
  Basic interactions, potential energy functions
  Electrostatics
  VDW Forces
  Bonds as Springs
  How structure changes over time?
  How to measure the change in a vector (gradient)
  Molecular Dynamics & MC
  Energy Minimization
Parameter Sets
Number Density
Poisson-Boltzmann Equation
Lattice Models and Simplification
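
The chain Geometry -> Energy -> Forces and the "Bonds as Springs" item can be sketched with a single harmonic bond: the energy depends on the bond length, and the force is the negative derivative of that energy. The spring constant `k` and rest length `r0` below are arbitrary illustrative numbers, not parameters from any real force field.

```python
def bond_energy(r: float, k: float = 100.0, r0: float = 1.5) -> float:
    """Harmonic ("spring") bond energy, E = 1/2 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2


def bond_force(r: float, k: float = 100.0, r0: float = 1.5) -> float:
    """Force along the bond, F = -dE/dr = -k * (r - r0)."""
    return -k * (r - r0)


def numerical_force(r: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of -dE/dr, as a check on bond_force."""
    return -(bond_energy(r + h) - bond_energy(r - h)) / (2 * h)


r = 1.7  # a stretched bond (hypothetical length)
print(bond_force(r))       # negative: the force pulls the bond back toward r0
print(numerical_force(r))  # agrees with the analytical force to rounding error
```

Energy minimization and molecular dynamics both repeat this step for every interaction in the system: compute forces from the current geometry, then move the atoms accordingly.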

Bioinformatics Schematic

Background
Need to Know Today
  Math: Calculation of Standard Deviation, a Bell-shaped Distribution (of test scores), a 3D vector
  Biology: DNA, RNA, alpha helix, the cell nucleus, ATP
What You’ll Learn
  Math: Force is the Derivative (grad) of Energy, Rotation Matrices (3D), a P-value of .01 and an Extreme Value Distribution
  Biology: Proteins are tightly packed, sequence homology twilight zone, protein families
Not really necessary
  Math: Poisson-Boltzmann Equation, Design a Hashing Function, Write a Recursive Descent Parser
  Biology: What GroEL does, a worm is a metazoa, E. coli is gram negative, what chemokines are

Are They or Aren’t They Bioinformatics? (#1)
Digital Libraries
  Automated Bibliographic Search and Textual Comparison
  Knowledge bases for biological literature
Motif Discovery Using Gibbs Sampling
Methods for Structure Determination
  Computational Crystallography
  Refinement
  NMR Structure Determination
Distance Geometry
Metabolic Pathway Simulation
The DNA Computer

Are They or Aren’t They Bioinformatics? (#1, Answers)
(YES?) Digital Libraries
  Automated Bibliographic Search and Textual Comparison
  Knowledge bases for biological literature
(YES) Motif Discovery Using Gibbs Sampling
(NO?) Methods for Structure Determination
  Computational Crystallography
  Refinement
  NMR Structure Determination
(YES) Distance Geometry
(YES) Metabolic Pathway Simulation
(NO) The DNA Computer

Are They or Aren’t They Bioinformatics? (#2)
Gene identification by sequence inspection
  Prediction of splice sites
DNA methods in forensics
Modeling of Populations of Organisms
  Ecological Modeling
Genomic Sequencing Methods
  Assembling Contigs
  Physical and genetic mapping
Linkage Analysis
  Linking specific genes to various traits

Are They or Aren’t They Bioinformatics? (#2, Answers)
(YES) Gene identification by sequence inspection
  Prediction of splice sites
(YES) DNA methods in forensics
(NO) Modeling of Populations of Organisms
  Ecological Modeling
(NO?) Genomic Sequencing Methods
  Assembling Contigs
  Physical and genetic mapping
(YES) Linkage Analysis
  Linking specific genes to various traits

Are They or Aren’t They Bioinformatics? (#3)
RNA structure prediction
  Identification in sequences
Radiological Image Processing
  Computational Representations for Human Anatomy (visible human)
Artificial Life Simulations
  Artificial Immunology / Computer Security
  Genetic Algorithms in molecular biology
Homology modeling
Determination of Phylogenies Based on Nonmolecular Organism Characteristics
Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Are They or Aren’t They Bioinformatics? (#3, Answers)
(YES) RNA structure prediction
  Identification in sequences
(NO) Radiological Image Processing
  Computational Representations for Human Anatomy (visible human)
(NO) Artificial Life Simulations
  Artificial Immunology / Computer Security
  (NO?) Genetic Algorithms in molecular biology
(YES) Homology modeling
(NO) Determination of Phylogenies Based on Nonmolecular Organism Characteristics
(NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Major Application I: Designing Drugs
Understanding How Structures Bind Other Molecules (Function)
Designing Inhibitors
Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).

Major Application II: Finding Homologues
Find Similar Ones in Different Organisms
Human vs. Mouse vs. Yeast
  Easier to do Expts. on latter!
(Section from NCBI Disease Genes Database Reproduced Below.)

Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
Original columns: Human Disease; MIM #; Human Gene; GenBank Acc# for Human cDNA; BLASTX P-value; Yeast Gene; GenBank Acc# for Yeast cDNA; Yeast Gene Description

Human Disease                            Yeast Gene Description
Hereditary Non-polyposis Colon Cancer    DNA repair protein
Hereditary Non-polyposis Colon Cancer    DNA repair protein
Cystic Fibrosis                          Metal resistance protein
Wilson Disease                           Probable copper transporter
Glycerol Kinase Deficiency               Glycerol kinase
Bloom Syndrome                           Helicase
Adrenoleukodystrophy, X-linked           Peroxisomal ABC transporter
Ataxia Telangiectasia                    PI3 kinase
Amyotrophic Lateral Sclerosis            Superoxide dismutase
Myotonic Dystrophy                       Serine/threonine protein kinase
Lowe Syndrome                            Putative IPP-5-phosphatase
Neurofibromatosis, Type ...              Inhibitory regulator protein
Choroideremia                            GDP dissociation inhibitor
Diastrophic Dysplasia                    Sulfate permease
Lissencephaly                            Methionine metabolism
Thomsen Disease                          Voltage-gated chloride channel
Wilms Tumor                              Sulphite resistance protein
Achondroplasia                           Serine/threonine protein kinase
Menkes                                   Probable copper transporter

GenBank accessions (row assignment lost): X82013, L26505, Z23117, X67787, U07163, L36317

Major Application II: Finding Homologues (cont.)
Cross-Referencing, one thing to another thing
Sequence Comparison and Scoring
Analogous Problems for Structure Comparison
Comparison has two parts:
  (1) Optimally Aligning 2 entities to get a Comparison Score
  (2) Assessing Significance of this score in a given Context
Integrated Presentation
  Align Sequences
  Align Structures
  Score in a Uniform Framework

Major Application III: Overall Genome Characterization
Overall Occurrence of a Certain Feature in the Genome
  e.g. how many kinases in Yeast
Compare Organisms and Tissues
  Expression levels in Cancerous vs Normal Tissues
Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)

Simplifying Genomes with Folds, Pathways, &c
[Figure: 100000 genes (human); 1000 folds; 1000 genes (T. pallidum)]

[Figure labels: protein fold (Ig), 100Å; super-secondary structure (ββ, TM TM, αβαβ, ααα); strand; Drug]
Practical Relevance
(Pathogen only folds as possible targets)

Bioinformatics Topics --Protein Sequence Sequence Alignment non-exact string matching, gaps How to align two strings optimally via Dynamic Programming Local vs Global Alignment Suboptimal Alignment Hashing to increase speed (BLAST, FASTA) Amino acid substitution sco