CS612 - Algorithms In Bioinformatics

Transcription

CS612 - Algorithms in BioinformaticsIntroductionSeptember 8, 2020Nurit HaspelCS612 - Algorithms in Bioinformatics

Contact InformationInstructor: Nurit Haspelhttp://www.cs.umb.edu/ nurithnurith@cs.umb.edu or nurit.haspel@umb.eduCourse schedule: Tu Th 2:00–3:15 on Blackboard.Office hours – Tu Th 12:30-2:00 or by appointment, same ascourse link.Unless otherwise noted, lectures will be live and recorded.Nurit HaspelCS612 - Algorithms in Bioinformatics

Course DescriptionIntroduction to Molecular Biology.Protein structure – basic concepts.Structural representation and storage of protein molecules.Protein folding and docking.Geometric conformational search algorithms.Introduction to classification and machine learning methods.Bio-molecular simulations and Molecular dynamics.Introduction to systems biology, networks if time permits.Course website: http://www.cs.umb.edu/ nurith/cs612.Syllabus.Nurit HaspelCS612 - Algorithms in Bioinformatics

Course RequirementsPrerequisite: CS210, MATH260 or equivalent.Knowledge in biology is not required – all the concepts will betaught. It is an advantage, though!Homework/programming assignments – 5 during the course(40–50% total).You may consult with your friends, but the final work shouldbe individual.I strongly prefer typed homework. If handwritten – make itCLEAR.Term project – 30%.Presentations – 20-30%.Project will be term-long this time, instructions given later on.Nurit HaspelCS612 - Algorithms in Bioinformatics

Course Requirements (Cont.)The homework due date is strict. No late assignments will beaccepted without a good and documented reason.Homework submission will be done on gradescope.Final project will include a programming project andsummary/presentation.I will start introducing the project around October (when I getto the relevant subject).The project should be done by the end of the semester.Your final grade should be at least C (60%) to pass.Recommended text books: See syllabus.Nurit HaspelCS612 - Algorithms in Bioinformatics

General StuffThe course material will be available online and updatedregularly with class notes and assignments.Attendance is not required (but highly encouraged). You areresponsible for updating yourselves if you miss a class.Don’t be afraid to ask questions in or out of class. I won’tthink you are stupid and it won’t lower your grade.Don’t hesitate to send me e-mails. I expect e-mails. It won’tlower your grade.Nurit HaspelCS612 - Algorithms in Bioinformatics

Introduction to Molecular Biology and Biological Databasesbioalgorithms.info, cnx.org, and instructors across thecountry. Special thanks to Dr. Lydia Kavraki, Rice and Dr.Amarda Shehu, GMUNurit HaspelCS612 - Algorithms in Bioinformatics

Critical Events in Molecular Biology1869 – Discovery ofDNA: Johan FriedrichMiescher discoveredDNA and named itnuclein.1665 – Rise ofmicroscope: RobertHooke discoveredorganisms are madeof cells.1865 – Rise of Genetics:Gregor Mendel discoveredthat an organism has twoalternative hereditaryunits for a given trait:dominant vs. recessive.1881 – Edward Zacharias showed thatchromosomes are composed of nucleins.1899 – Richard Altmann renamednuclein to nucleic acid.1900 – Chemical structures of alltwenty classic amino acids had beenidentified.Nurit HaspelCS612 - Algorithms in Bioinformatics

Critical Events in Molecular Biology1902 – Emil Fischer wonthe Nobel Prize forshowing that amino acidslink to form proteins.Postulated: proteinproperties are defined byamino-acid compositionand arrangementThis postulate is now fact1911 – Thomas Morgandiscovered genes onchromosomes are discreteunits of heredityPheobus Lerenediscovered RNA1941 – George Beadle and Edward Tatumidentify that genes make proteinsNurit Haspel1952 – Alfred Hershey andMartha Chase make genesfrom DNA1952–1953 – James Watsonand Francis Crick deduce thedouble helical structure ofDNAWatson and Crick ”used“ theX-ray diffraction dataproduced by Rosalind Franklinto deduce the structure ofDNACS612 - Algorithms in Bioinformatics

Critical Events in Molecular Biology1986 – Leroy Hood developedautomated sequencingmechanism – Human GenomeInitiative announced1990 – Congress launchesHuman Genome Project1977 – Philip Sharp and Richard Robertsshow pre-mRNA is processed by excision ofintrons and splicing together of exons2000 – Complete sequence ofeuchromatic portion ofDrosophila Melanogastergenome2001 – First draft of humangenome published1995 – John Craig Ventersequences first bacterialgenomes – automatedsequencing1996 – First eukaryoticgenome, yeast, sequenced2003 – Mouse genomesequenced2007 – James Watson’sgenome sequencedNurit HaspelCS612 - Algorithms in Bioinformatics

Molecular Biology as Information Science 30, 000 genomes fully sequenced orpermanent draft, mostly bacterial(2017) 5x106 unique sequences available(probably more by now)What do we do with them?Compare them to find what iscommon and different amongorganisms (ComparativeGenomics)Find out how and which genesencode for which proteinsIdentify changes that lead todiseaseAssociate structural andfunctional information with newgene sequencesNurit Haspelsource: http://www.uniprot.orgNIH structural genomics projectProtein Structure Initiative (PSI)GOLD (Genomes Online Database)http://www.genomesonline.orgMost of the sequences do not have asolved structureExperiments lagging behindWay too much data for computerscientists to sit around doing nothingCS612 - Algorithms in Bioinformatics

Life as we Know it: The CellProkaryote cellSingle CellNo nucleusNo organellesOne piece of circular DNANo mRNA post transcriptional modificationNurit HaspelEukaryote cellSingle or multi cellNucleusOrganellesChromosomesExon–intron splicingCS612 - Algorithms in Bioinformatics

Signaling, Control, and Gene ActivityCells make decisions throughcomplex networks of chemicalreactions, called pathwaysSynthesize new materialsBreak material down forspare partsSignal to eat, die, or divideSignal to one another tocommunicateLarge-scale studies of gene regulatory, protein, metabolic networksin cells are conducted in systems biologyNurit HaspelCS612 - Algorithms in Bioinformatics

Cell Information and MachineryA cell stores all the information needed to replicate itselfThe human genome is around 3 billion base pairs longAlmost every cell in the human body contains the same set ofgenesWhat differentiates cells in your body?Not all genes are expressed at the same time in the same wayin all cellsA cell is a machineryIt collects and manufactures its own componentsIt carries out its own replicationIt kicks the start of its new offspringLife inside a cell: http://multimedia.mcb.harvard.edu/Nurit HaspelCS612 - Algorithms in Bioinformatics

The Three Life-Critical MoleculesDNARNANurit HaspelProteinCS612 - Algorithms in Bioinformatics

Overview Before Heading to Central DogmaNucleus Library.Chromosomes Bookshelves.Genes Books.Books represent the information (DNA) that every cell needs so itcan grow and carry out its own set of functions.Genome – an organism’s genetic material.Gene – discrete unit of hereditary information located onchromosomes and consisting of DNA.Genotype – genetic makeup of an organism.Phenotype – physical expressed trait of an organism.Nucleic acids – biological molecules (RNA and DNA) that alloworganisms to reproduce.Proteins – complex molecules made up of smaller subunits calledamino acids.Nurit HaspelCS612 - Algorithms in Bioinformatics

Central Dogma of Molecular BiologyInformation flows from DNA through RNA to SynthesizeProteins in cellsDNAHolds information on how the cell worksRNAActs to transfer short pieces of information to different parts ofthe cellProvides templates to synthesize into proteinsProteinsCan be enzymes that send signals to other cells and regulategene activityForm the body’s major components (hair, skin, etc)Often referred to as the workhorses of the cellNurit HaspelCS612 - Algorithms in Bioinformatics

Central Dogma of Molecular BiologyNurit HaspelCS612 - Algorithms in Bioinformatics

DNA: Sequence and StructureThe four fundamental units of DNA are: Adenine (A), Guanine(G), Thymine (T), and Cytosine (C) They pair up oncomplementary strands: A-T and C-G. Like a four-letter alphabet.Nurit HaspelCS612 - Algorithms in Bioinformatics

DNA: Sequence and StructureThe double helix structure is composedof: sugar molecules phosphate groupsbases (A, C, G, T)Base pairs form hydrogen bonds 2bonds link A to T 3 bonds link C to GDNA always reads from 5’ end to to 3’end for transcription and replication 5’ATTTAGGCC 3’ 3’ TAAATCCGG 5Nurit HaspelCS612 - Algorithms in Bioinformatics

DNA: Sequence and StructureHow DNA copies itself:http://www.youtube.com/watch?v 5VefaI0LrgE&NR 1Nurit HaspelCS612 - Algorithms in Bioinformatics

DNA: Sequence and SuperstructureLodish et al. Molecular Biology of the Cell (5th ed.). W.H. Freeman & Co., 2003.DNA in living cells is highly compact and structured.Transcription factors and RNA polymerase need access to DNA.Transcription is dependent on the structural state – sequence alone doesnot tell the whole story.Nurit HaspelCS612 - Algorithms in Bioinformatics

RNA: Sequence and StructureRNA is chemically similar to DNA,but T(hymine) is replaced withU(racil) and the ribose instead ofdeoxy-ribose.Some forms of RNA can fold tocreate secondary structures – hasimplication for functionDNA and RNA can pair with eachotherThere are several forms of RNA:1mRNA (messenger RNA) – carries a gene’s information out ofnucleus2tRNA (transfer RNA) – transfer’s mRNA’s information onto aprotein chain of amino acids3rRNA (ribosomal RNA) – part of the ribosome, where proteins aresynthesizedNurit HaspelCS612 - Algorithms in Bioinformatics

Transctiption: From DNA to mRNATranscription refers to the process of copying a piece of theDNA onto mRNACatalyzed by transcriptase enzymeThe enzyme recognizes a promoter region to begintranscriptionAbout 50 base pairs can be transcribed per second inbacteria multiple transcriptions can occurThe process of how the enzyme finds the promoter regions ispartially understood and related to the problem of motiffinding in bioinformaticsRepressor and inhibitor enzymes act in various ways to stoptranscription. This makes the regulation of genetranscription difficult to understand, model, or controlQuestion: How does splicing complicate the picture?Nurit HaspelCS612 - Algorithms in Bioinformatics

From DNA and RNA to ProteinsDNA contains introns and exons.Introns (”junk DNA“) – not fully understood, probably not junk. Exonsread to transfer information to mRNA Transcription (RNA synthesis)mRNA goes to the ribosome tRNAs line up to link amino acids in a chain Translation – Protein Synthesishttp://www.youtube.com/watch?v NJxobgkPEAoNurit HaspelCS612 - Algorithms in Bioinformatics

Translation – Codons and Amino AcidsDNA: strings of four letter codes A, T, C, GRNA: strings of four letter codes A, U, C, G (uracyl replacesthymine).Proteins: strings of twenty letter codes. The letters are thefundamental building blocks called amino acids.Each amino acid is coded by 3 nucleotides called codonThere are 20 amino acids, Many codons code for the same aminoacid. Some codons indicate when to stop reading the geneticinformationNurit HaspelCS612 - Algorithms in Bioinformatics

The Universal Genetic laNurit GlyCS612 - Algorithms in Bioinformatics

From mRNA to ProteinsmRNA travels to the ribosomeEach codon of mRNA determines what tRNA toline itself up in orderEach tRNA then transfers its aminoacyl group tothe growing chain of amino acidsExample: the tRNA with the anticodon AAGcorresponds with the codon UUC of mRNA andattaches Phenylalanine (PHE, F) onto the growingprotein chainNurit HaspelCS612 - Algorithms in Bioinformatics

Ribosome - Protein AssemblyBacterial ribosome(nigms.nih.gov)50S large subunit of theribosome. Proteins arein blue, RNA in orange,and active sites areshown in red(wikipedia.org)Nurit HaspelClosed depiction(above) and ”crackedopen” depiction (below)exposing the tRNAs.[adapted from MYusupovet al. (2001)Science]CS612 - Algorithms in Bioinformatics

Proteins: Workhorses of the CellIn the ribosome:the chain of amino acids folds(arranges itself in threedimension) into a unique structurewhere the protein is functionalThis is known as protein folding.Deviations from this structure canbe lethal and give rise to disease.This is known as misfolding.Proteins do all essential work in cells:a. build cellular structuresb. digest nutrientsc. execute metabolic functionsd. mediate information flow within a celland among cell communitiese. work together with other proteins ornucleic acids as molecular machinesNurit HaspelUnderstanding (computing) how the chainof amino acids determines whatstructure(s) a protein assumes in cells is animportant problem in computational(structural) biology: The protein foldingproblem (A variant: The structureprediction problem).CS612 - Algorithms in Bioinformatics

Basic Summary of What We KnowDNA, RNA, and Proteins are specified linearly (linear strings ofcharacters)DNA and RNA are constructed from nucleic acids (nucleotides)Strings written in a four-letter alphabet (C, G, A, T/U)Proteins are constructed from amino acidsStrings written in a twenty-letter alphabetThese strings fold into complex 3D structuresSequence BioinformaticsStructural Bioinformaticspatterns about the sequence can revealinsight into transcription, translation, andfunction of synthesized proteins.When sequence decoding reaches limitGenomic sequences represent a writtenlanguage of 4-letter alphabetDNA decoding techniques not verydifferent than those for decoding anancient languageNurit HaspelWhen structure reveals further information- relevance of DNA, RNA, Proteinstructure.When understanding how molecules fit tocreate machines requires more thansequence informationCS612 - Algorithms in Bioinformatics

Structure to FunctionStructure determines reactions in cellsStructures of proteins that are complementary fit with one anotherProblem: Using structure (and possibly sequence) infer active (infer) sitesSites where proteins interact with other moleculesProblem: Using structure (and possibly sequence) infer the structure andfunction of an amino-acid chain synthesized from a decoded gene sequenceInhibitors (green, yellow, purple) bind to (block) an HIVprotein mimic in three ”pockets” that are essential to thevirus’ ability to enter cells. (bnl.gov)Nurit HaspelLeft (anl.gov) and right (nigms.nih.gov) structuressuggest that the proteins are involved in DNA binding ortransferCS612 - Algorithms in Bioinformatics

Biological DatabasesAn increasing amount of biological and sequence data is freely availablein online databasesLarge amounts of data also pose an interesting computational problem ofhow to store themAn always improving, changing, increasing list of biological databases:NCBI Gene Bank – http://ncbi.nih.govcontains many subdatabasesnucleotide sequence database is most prominentProtein Data Bank http://www.rcsb.orgcontains protein structuresUniprot – http://www.uniprot.org/contains annotated protein sequencesProsite – http://kr.expasy.org/prositeDatabase of motifs of protein active sitesChapter 4 of Lesk’s book provides a good (albeit somewhat outdated)reference. Many more databases are available.Nurit HaspelCS612 - Algorithms in Bioinformatics

Employing Databases for Sequence AnalysisAnalyze biological sequences for patternsRNA splice sites – what are they and why?Open reading frames (ORFs) – what are they and why?Amino acid propensities in a protein – why?Conserved regions in proteins – possible active sitesConserved regions in DNA and RNA – possible protein bindingsitesPredict from sequenceProtein and RNA topology and 3D structure.Protein binding/active sitesProtein FunctionFundamental sequence question: How are genomes assembled,mapped, annotated?Nurit HaspelCS612 - Algorithms in Bioinformatics

Genome AssemblyLength of sequenced genomefragments is limitedGenome is fragmentedEnzymes splice/cutFragments then need to betaken and put back together inright orderNot easy to doNurit HaspelCS612 - Algorithms in Bioinformatics

Genome AssemblyShortest Common SuperstringProblem (SCS)Needed because fragments overlapFit overlapping sequences together tofind the shortest sequence thatincludes all fragment sequencesFragments may contain sequencingerrorsTwo complements of DNANeed to take into account both3’ and 5’Repeat problem:50% of human DNA is justrepeatsNurit HaspelCS612 - Algorithms in Bioinformatics

Genome Assembled – Now WhatTracing PhylogenyFind (evolutionary) family relationships by tracking similaritiesbetween speciesGene Annotation/Finding (comparative genomics)Comparison of Similar SpeciesDetermining Regulatory NetworksThe variables that control the body’s response to stimuliProteomicsFrom DNA sequence to a folded protein with known functionThe main part of this course will focus heavily oncomputational proteomicsNurit HaspelCS612 - Algorithms in Bioinformatics

Gene FindingIdentifying protein-encoding regions (exons) from “junk” DNA(introns)Ab-initio predicting methods – Hidden Markov Models,Machine learning approaches.Comparative genomics.Nurit HaspelCS612 - Algorithms in Bioinformatics

Comparative GenomicsHuman ChromosomesNurit HaspelCS612 - Algorithms in Bioinformatics

More Than Sequence and StructureNodes: MetabolitesEdges: Biochemical reactionAnalysis of this metabolicnetwork often employs toolsfrom graph theory incomputer scienceWhat are some questions wecan ask about a metabolicnetwork?Citric acid cycleNurit HaspelCS612 - Algorithms in Bioinformatics

Protein Interaction NetworksXXXXXXXXNodes: ProteinsEdges: Protein interactionszXAnalysis of protein interactionnetworks also employs tools fromgraph theory in computer scienceWhat are some questions we can askabout p2p networks?e.g.: how could we predict thefunction of a gene through such anetwork?Nurit HaspelCS612 - Algorithms in Bioinformatics

Signaling NetworksNodes: Different molecules:Proteins or neurotransmittersEdges: Activation ordeactivationAnalysis of signaling networksalso employs tools from graphtheory in computer scienceWhat are some questions wecan ask about p2p networks?e.g.: what does a path in thenetwork mean?sigmaaldrich.comNurit HaspelCS612 - Algorithms in Bioinformatics

Mining/Analyzing, Modeling, PredictingModeling biological processes (such as what?) allows us to test whether we fullyunderstand the process and whether we know all the variables that control itWe build models of proteins to explore the sequence-structure-functionrelationshipModels of molecular interactions to test whether molecules interact toachieve a biological function in cellsModels of gene regulatory networks to understand how genes interactwith one anotherWe build (systems biology) models of entire cellsWe use information from biologists, chemists, physicists, computer scientists tobuild such modelsWe then use computers to simulate the behavior or properties of molecules orcells being modeled over time and spaceIf the behavior or properties are different from those ”seen“ in the wet lab, wecorrect our model – this improves our (theoretical and computational)understandingIf the model is correct and we observe additional properties – we have crossedinto the discovery and prediction contribution of computational biologyThe complexity of the models and the simulations requires fast, efficient, accurate computer algorithms andpowerful machinesNurit HaspelCS612 - Algorithms in Bioinformatics

What do Bioinformaticians Do?Mining for information:Let scientists discover the biology of cells.Encourage the deposition of gene sequences, proteinsequences, protein structures decoded, resolved byexperimentalists in databases.Organize and cross-link databases so information can bequickly extracted and cross-referenced.Conduct fast large-throughout searches in sequence databases.Compare sequences of existing and novel genes, proteins toinfer knowledge about structure and function.ab initio modeling:Apply principles of physics to fold chains into 3D structures.Combination of the two:Statistical information from the databases with ab initiocomputingStudy large patterns in interaction networks.Nurit HaspelCS612 - Algorithms in Bioinformatics

The Future of Computational BiologyA significant part of our time is spent in improving the accuracy ofour modelingThe rest of that time is spent in extending the speed, accuracy, andapplicability of our algorithms in order to make meaningfulpredictions through computersBioinformatics and Computational Biology are still in infant stagesThere are a lot of things we do not understandA lot of questions to be resolvedThe main one revolves around control: how can we control thebehavior of a molecule, a group of molecules, and then a cell?These questions are often asked in the context of biomedicalresearch or bioengineering, where our focus is on building effectivetherapeutics (to improve the health of society) or novel functionalmaterials (to improve the living conditions of our citizens)Nurit HaspelCS612 - Algorithms in Bioinformatics

Sources CitedDebra Goldberg, Algorithms for Molecular Biology, Fall 2008www.bioalgorithms.info (lectures for students and faculty).Daniel Sam, Greedy Algorithm presentation.Glenn Tesler, Genome Rearrangements in Mammalian Evolution: Lessons fromHuman and Mouse Genomes presentation.Ernst Mayr, What evolution is.Neil C. Jones, Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms.Alberts, Bruce, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts,Peter Walter. Molecular Biology of the Cell. New York: Garland Science. 2002.Mount, Ellis, Barbara A. List. Milestones in Science & Technology. Phoenix:The Oryx Press. 1994.Voet, Donald, Judith Voet, Charlotte Pratt. Fundamentals of Biochemistry.New Jersey: John Wiley & Sons, Inc. 2002.Campbell, Neil. Biology, Third Edition. The Benjamin/Cummings PublishingCompany, Inc., 1993.Snustad, Peter and Simmons, Michael. Principles of Genetics. John Wiley &Sons, Inc, 2003.Nurit HaspelCS612 - Algorithms in Bioinformatics

Course schedule: Mo We 5:30{6:45 at M-2-207. Nurit Haspel CS612 - Algorithms in Bioinformatics. . Syllabus. Nurit Haspel CS612 - Algorithms in Bioinformatics. Course Requirements Prerequisite: CS210, MATH260 or equivalent. . Algorithms in Bioinformatics. Course Requirements (Cont.) The homework due date is strict. No late assignments will be