Introduction To Bioinformatics - San Jose State University

Transcription

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Introduction toBioinformaticsSami KhuriDepartment of Computer ScienceSan José State UniversitySan José, California, 2010 Sami KhuriWhat is Bioinformatics? The Human Genome 2010 Sami Khuri 2010 Sami KhuriProject (HGP) Mapping Model Organisms Types of Databases Applications ofBioinformatics Genome Research2.1

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010The Human Genome Project The HGP is a multinational effort, begun by the USA in1988, whose aim is to produce a complete physical mapof all human chromosomes, as well as the entire humanDNA sequence.– As part of the project, genomes of other organisms such asbacteria, yeast, flies and mice are also being studied. The primary goal of the project is to make a series ofdescriptive diagrams (called maps) of each humanchromosome at increasingly finer resolutions. 2010 Sami KhuriThe HGP Goal The ultimate goal of genome research is to find all thegenes in the DNA sequence and to develop tools forusing this information in the study of human biologyand medicine. Mapping involves:– dividing the chromosomes into smaller fragments that canbe propagated and characterized– ordering (mapping) them to correspond to their respectivelocations on the chromosomes. 2010 Sami Khuri 2010 Sami Khuri2.2

Yverdon Les BainsIntroduction to BioinformaticstelomereJuly 2010centromeretelomereCytogenetic mapof chromosome 19 2010 Sami KhuriGoals of the HGP To identify all the approximately 20,00025,000 genes in human DNA, To determine the sequences of the 3.2 billionchemical base pairs that make up human DNA, To store this information in databases, To improve tools for data analysis, To address the ethical, legal, and social issues(ELSI) that may arise from the project. 2010 Sami Khuri 2010 Sami Khuri2.3

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010HGP Finished Before Deadline In 1991, the USA Congress was toldthat the HGP could be done by 2005 for 3 billion. It ended in 2003 for 2.7 billion,because of efficient computationalmethods. 2010 Sami KhuriOther SpeciesAs part of the HGP, genomes of other organisms, such asbacteria, yeast, flies and mice are also being studied.Baker’s yeastC elegansp53 genepax6 geneDiabetesDNA repairCell divisionChimps are infected with SIVVery rarely progress to AIDS 2010 Sami Khuri 2010 Sami Khuri2.4

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Model Organisms A model organism is an organism that isextensively studied to understand particularbiological phenomena. Why have model organisms? The hope is thatdiscoveries made in model organisms will provideinsight into the workings of other organisms. Why is this possible? This works becauseevolution reuses fundamental biological principlesand conserves metabolic, regulatory, anddevelopmental pathways. 2010 Sami KhuriDavid Gilbert 2010 Sami Khuri 2010 Sami Khuri2.5

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Sequencing the ChimpanzeeNature, vol 418, August 2002 2010 Sami KhuriStudying Diseases 2010 Sami Khuri 2010 Sami Khuri2.6

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Studying Human DiseasesCopyright 2006 Pearson Prentice Hall, Inc. 2010 Sami KhuriFlies have orthologs tohumans disease-causinggenes in categories suchas: neurological renal immunological endocrine cardiovascular metabolic blood-vessel and cancerous disordersFlies can provide insights into human disease at thesystems level, revealing how different genes interact in vivo 2010 Sami Khuri 2010 Sami KhuriDiscovering Genomics, Campbell, 20072.7

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010What is Bioinformatics? Set of Tools The use of computers to collect,analyze, and interpret biologicalinformation at the molecular level. A set of software toolsfor molecular sequenceanalysis 2010 Sami KhuriWhat is Bioinformatics? A Discipline The field of science, in which biology,computer science, and informationtechnology merge into a single discipline.Definition of NCBI (National Center for Biotechnology Information) The ultimate goal of bioinformatics is toenable the discovery of new biological insightsand to create a global perspective from whichunifying principles in biology can be discerned. 2010 Sami Khuri 2010 Sami Khuri2.8

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Bioinformatics and the Internet The enormous increase in biological data hasmade it necessary to use computer informationtechnology to collect, organize, maintain, access,and analyze the data. Computer speed, memory, and exchange ofinformation over the Internet has greatlyfacilitated bioinformatics. The bioinformatics tools available over theInternet are accessible, generally well developed,fairly comprehensive, and relatively easy to use. 2010 Sami KhuriWhat do Bioinformaticians do? Analyze and interpret dataDevelop and implement algorithmsDesign user interfaceDesign databaseAutomate genome analysisAssist molecular biologists in dataanalysis and experimental design. 2010 Sami Khuri 2010 Sami Khuri2.9

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Why Study Bioinformatics? Bioinformatics is intrinsicallyinteresting Bioinformatics offers the prospect of findingbetter drug targets earlier in the drugdevelopment process.– By looking for genes in model organisms that aresimilar to a given human gene, researchers canlearn about protein the human gene encodes andsearch for drugs to block it. 2010 Sami Khuriuse docking algorithmsto design molecule thatcould bind the model structure@2002-09 2010SamiSami KhuriKhuri 2010 Sami KhuriRational drug designStructure-based drug design2.10

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Databases for Storage and Analysis- Databases store data that need to be analyzed- By comparing sequences, we discover:- How organisms are related to one another- How proteins function- How populations vary- How diseases occur- The improvement of sequencing methods generated a lot ofdata that need to be:- stored- organized- curated- annotated- managed- networked- accessed- assessed 2010 Sami KhuriTypes of Databases Sequence– Genbank, SwissProt, 3D structure, carbohydrates,organism specific, phylogenetic, sequence patterns Literature– Medline, OMIM, Patents, eJournals Graphical– Swiss2D-Page Expression Analysis Databases– Microarrays Protein Interaction Databases– Pathways 2010 Sami Khuri 2010 Sami Khuri2.11

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Three Major Databases GenBank from the NCBI(National Center ofBiotechnology Information),National Library of Medicinehttp://www.ncbi.nlm.nih.gov EBI (European BioinformaticsInstitute) from the EuropeanMolecular Biology Libraryhttp://www.ebi.ac.uk DDBJ (DNA DataBank of Japan)http://www.ddbj.nig.ac.jp 2010 Sami KhuriGenBank Taxonomic SamplingHomo sapiensMus musculusDrosophila melanogasterCaenorhabditis elegansArabidopsis thalianaOryza sativaRattus norvegicusDanio rerioSaccharomyces cerevisiae62.1%7.7%6.1%3.3%2.9%1.3%0.8%0.6%0.6% 2010 Sami Khuri 2010 Sami Khuri2.12

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Databanks SPFSSPGenBankPDBMOLPROBESWISS-PROTNRL amFlyGeneTrEMBLPIRTFACTORThe major DNA databases are updated and synchronized daily. 2010 Sami KhuriThe Annotation Challenge One of the biggest challenges for maintainers ofbiological databases is the annotation:– putting sufficient information in the databasesuch that there is no question of what thegene is,– creating the proper links between thatinformation and the gene sequence and serialnumber. Correct annotation of genomic data is an activearea of research. 2010 Sami Khuri 2010 Sami Khuri2.13

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Data Formats Most databases are relational DB. The formatcan be changed to export the data:– GenBank data is stored by NCBI in a Sybasedatabase, but made public in a flatfile format. Each database and each sequence analysisprogram store/accept data in a different format.– The most commonly used data format is FASTA gi 37222328 gb AY350722.1 Pan troglodytes masticatory myosin heavychain CCCATTTTGTCCGCTGTATTATCCCCAATGAGTTTAAGCAATCGG 2010 Sami KhuriReasons for Searching DatabasesSearching a database can answer the followingquestions:– A researcher has just sequenced a gene.Has someone already found it?– A researcher has a sequence of unknown function.Is there a homology with another sequence that has aknown function?– A researcher has found a new protein in a lowerorganism.Is there a homology in a higher species? 2010 Sami Khuri 2010 Sami Khuri2.14

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Protein Databases GenPept from NCBI. ExPASy (Expert Protein Analysis System)http://www.expasy.ch SwissProt, TrEMBLhttp://www.ebi.ac.uk PIR (Protein Identification Resource)http://www-nbrf.georgetown.edu/pirwww DISC - DNA Information and Stock Center, Japanhttp://www.dna.affrc.go.jp 2010 Sami KhuriSwissProt and SRS The SwissProt protein sequence database is atISREC (Swiss Institute for Experimental CancerResearch) in Epalinges, Lausanne. The Sequence Retrieval System (SRS) at theEuropean Bioinformatics Institute:– allows both simple and complex concurrent searches ofone or more sequence databases,– may also be used on a local machine to assist in thepreparation of local sequence databases. 2010 Sami Khuri 2010 Sami Khuri2.15

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010What does NCBI do?NCBI: established in 1988 as a national resourcefor molecular biology information.– it creates public databases,– it conducts research in computational biology,– it develops software tools for analyzing genome data,and– it disseminates biomedical information,all for the better understanding of molecularprocesses affecting human health and disease. 2010 Sami KhuriGenBankGenBank is the NIH genetic sequencedatabase of all publicly available DNAand derived protein sequences, withannotations describing the biologicalinformation these records contain. 2010 Sami Khuri 2010 Sami Khuri2.16

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010GenBank DNA Sequence Entry IThe main important fields of an entry are: Locus: name of locus, length and type ofsequence, classification of organism, date ofentry. Not maintained among other databases Definition: description of entry Accession: accession number of originalsource. A citable entity; does not changewhen record is updated. 2010 Sami KhuriGenBank DNA Sequence Entry II Keywords: keywords for cross referencingthis entry Source: source organism of DNA Organism: description of organism Comment: biological function or databaseinformation Features: information about sequence bybase position or range of positions 2010 Sami Khuri 2010 Sami Khuri2.17

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010GenBank DNA Sequence Entry III Base count: count of A, C, G, T and othersymbols Origin: text indicating the start of thesequence.The sequence entry is assumed by allcomputer programs to lie between theidentifiers ORIGIN and //. 2010 Sami KhuriDivision of OrganismsBCT BacterialFUN FungalHUM Homo sapiensINV InvertebrateMAM Other mammalianORG OrganellePHG odentSynthetic & chimericViralOther vertebrate 2010 Sami Khuri 2010 Sami Khuri2.18

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Interesting Databases UCSC Human Genome Browser– http://genome.ucsc.edu/ Organism specific information:– Yeast: http://genomewww.stanford.edu/Saccharomyces/– Arabidopis: http://www.tair.org/– Mouse: http://www.jax.org/– Fruit fly: http://www.fruitfly.org/– Nematode: http://www.wormbase.org/ 2010 Sami KhuriEuropean MolecularBiology Laboratory The European Molecular Biology Laboratory(EMBL) was established in 1974. It is supported by sixteen countries. EMBL consists of five facilities:– The main Laboratory in Heidelberg (Germany),– Outstations in Hamburg (Germany), Grenoble (France) andHinxton (the U.K.), and an external Research Programme inMonterotondo (Italy). 2010 Sami Khuri 2010 Sami Khuri2.19

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010NCBI – EMBL - DDJB 2010 Sami KhuriApplications ofGenome ResearchCurrent and potential applications of GenomeResearch include:––––Molecular MedicineMicrobial GenomicsRisk AssessmentBioarcheology, Anthropology, Evolution andHuman Migration– DNA Identification– Agriculture, Livestock Breeding andBioprocessing 2010 Sami Khuri 2010 Sami Khuri2.20

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Molecular Medicine Improve the diagnosis of disease Detect genetic predispositions to disease Create drugs based on molecularinformation Use gene therapy and control systems asdrugs Design custom drugs on individual geneticprofiles. 2010 Sami KhuriMicrobial Genomics Swift detection and treatment in clinics ofdisease-causing microbes: pathogens Development of new energy sources: biofuels Monitoring of the environment to detectchemical warfare Protection of citizens from biological andchemical warfare Efficient and safe clean up of toxic waste. 2010 Sami Khuri 2010 Sami Khuri2.21

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010DNA Identification I Identify potential suspects whose DNA maymatch evidence left at crime scenes Exonerate persons wrongly accused of crimes Establish paternity and other familyrelationships Match organ donors with recipients intransplant programs 2010 Sami KhuriLouis XVIILouis XVII: son of Louis XV1 and Marie-Antoinettewho died from tuberculosis in 1795 at the age of 12 2010 Sami Khuri 2010 Sami Khuri2.22

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010DNA Identification II Identify endangered and protected speciesas an aid to wildlife officials and also toprosecute poachers Detect bacteria and other organisms thatmay pollute air, water, soil, and food Determine pedigree for seed or livestockbreeds Authenticate consumables such as wineand caviar 2010 Sami KhuriAgriculture, Livestock Breedingand Bioprocessing Grow disease-resistant, insect-resistant,and drought-resistant crops Breed healthier, more productive,disease-resistant farm animals Grow more nutritious produce Develop biopesticides Incorporate edible vaccines into foodproducts 2010 Sami Khuri 2010 Sami Khuri2.23

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010What have we learned from HGP?A smallportion of thegenomecodes forproteins,tRNAsand rRNAs 2010 Sami KhuriWhat have we learned from HGP?The small number of genes 2010 Sami Khuri 2010 Sami Khuri2.24

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Alternative Splicing 2010 Sami KhuriGenomic Medicine by Guttmacher et al., NEJM, 2002The FutureConvert all this progress into real riches for science, society, and 2010patientsSami Khuri 2010 Sami Khuri2.25

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Objectives of Molecular Biology Extract the information in the genomes. Understand the structure of the genome. Apply this understanding to the diagnosis andtreatment of genetic diseases. Explain the process of evolution by comparinggenomes of related species. 2010 Sami KhuriGoals ofModern Molecular Biology Read the entire genomes of living things Identify every gene Match each gene with the protein itencodes Determine the structure and function ofeach protein. 2010 Sami Khuri 2010 Sami Khuri2.26

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Objectives of BioinformaticsDevelopment and use of mathematicaland computer science techniques tohelp solving the problems in molecularbiology. 2010 Sami KhuriBioinformatics Problems Reconstructing long DNA sequences from overlappingstring fragments. Comparing two or more sequences for similarities. Storing, retrieving and comparing DNA sequences andsubsequences in databases. Exploring frequently occurring patterns of nucleotides. Finding informative elements in protein and DNAsequences. Finding evolutionary relationships between organisms. 2010 Sami Khuri 2010 Sami Khuri2.27

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Main Aim of the Problems The aim of these problems is to learnabout the functionality and/or thestructure of protein without actuallyhaving to physically construct the proteinitself. The research is based on the assumptionthat similar sequences produce similarproteins. 2010 Sami KhuriFunctional: Coding v/s NoncodingCoding Sequence Non-CodingSequence(Genes)IdentifyingRelatively EasyVery HardComputational ToolsImproving ToolsPoor predictive toolsSignalsWhat to look forWe Have a Good Very little isUnderstandingknownComplementarydata we can useAvailable – Ex.UnavailableESTs and cDNAs 2010 Sami Khuri 2010 Sami Khuri2.28

Yverdon Les BainsIntroduction to BioinformaticsJuly 2010Post Human Genome Project Major role for comparative sequence analysis will bethe identification of functionally important, noncoding sequences. Need to study the relation between SequenceConservation and Sequence Function. Focus on the interpretation of the human genome. Learn the functional landscape of the human genome. Challenge: go from sequence to function– i.e., define the role of each gene and understand how thegenome functions as a whole. 2010 Sami KhuriImpact on Other Fields of Science Mathematical ModelsVisualization/Animation of molecular structuresVisualization/Animation of algorithmsMore sophisticated databasesData miningStatistical analysis of resultsMetabolic pathwaysPredictive algorithms 2010 Sami Khuri 2010 Sami Khuri2.29

Bioinformatics offers the prospect of finding better drug targets earlier in the drug development process. – By looking for genes in model organisms that are similar to a given human gene, researchers can learn about protein