Bioinformatics Databases - Shcollege.ac.in

Bioinformatics Databases
MV 2017

Data in Bioinformatics
DNA: sequences of nucleotides (A, T, G, C) that contain information in the form of triplet codons with specific reading frames and built-in control segments
RNA sequences (A, U, G, C): mRNA, tRNA, hnRNA
Protein sequences: strings of amino acids (e.g., Aspartate, Glycine, Histidine, Isoleucine, Leucine, Methionine, Serine, Threonine, Valine, Phenylalanine, Tyrosine)
Structure data (protein structure: primary, secondary, tertiary, quaternary; 3D views)
Images of 2D gel electrophoresis
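The idea of triplet codons and reading frames can be illustrated with a short Python sketch. The sequence below is a made-up example, and the function names `transcribe` and `codons` are hypothetical helpers introduced only for illustration.

```python
def transcribe(dna):
    """Return the RNA form of a coding-strand DNA string (T replaced by U)."""
    return dna.upper().replace("T", "U")


def codons(seq, frame=0):
    """Split a sequence into complete triplets, starting at offset frame (0, 1 or 2)."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    # Drop trailing bases that do not fill a whole codon in this frame.
    usable = len(seq) - (len(seq) - frame) % 3
    return [seq[i:i + 3] for i in range(frame, usable, 3)]


dna = "ATGGATGGCCATTAA"  # hypothetical example sequence
for frame in range(3):
    print(frame, codons(dna, frame))
print(transcribe(dna))
```

The same string yields a different set of triplets in each of the three frames, which is why the reading frame has to be known before a sequence can be translated.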

Bioinformatics Databases
Databases are a convenient system to properly store, search and retrieve any type of data
Databases fall into different types based on the nature of the information and the manner (complexity) of data storage

Types of databases
Based on the nature of the information, db are divided into
1. Generalized db: DNA, Protein (e.g., NCBI)
– a. Sequence db: nucleotides or amino acids
– b. Structure db: structure of macromolecules
2. Specialized db: Expressed Sequence Tags (EST), Single Nucleotide Polymorphisms (SNP)

Based on the manner of data storage, db are divided into
1. Primary or archival db: data in original form, taken as such from the source. E.g., GenBank, Swiss-Prot
2. Secondary db: value-added db with information derived from primary db
3. Composite db: combine several primary db
Redundant and non-redundant db: a redundant db may hold more than one copy of a sequence; a non-redundant db keeps a single copy of each
Boutique db: species-specific sequence data
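The difference between a redundant and a non-redundant collection can be sketched in a few lines of Python. This is a minimal illustration, assuming records are simple (identifier, sequence) pairs and that two records are duplicates only when their sequences are identical; the function name `make_non_redundant` is hypothetical, and real non-redundant databases usually also merge the annotation of the duplicates.

```python
def make_non_redundant(records):
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical redundant collection: "a" and "b" carry the same sequence.
redundant = [("a", "ATGC"), ("b", "atgc"), ("c", "GGCC")]
print(make_non_redundant(redundant))
```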

Db entries are composed of
– Core data: original sequence
– Supplementary data or annotation (source, author, date, method used, etc.)
Sequence formats
– PIR (Protein Information Resource)/NBRF (National Biomedical Research Foundation) - P, N
– FASTA (FAST-All) - >
– GDE (Genetic Data Environment) - %
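As an illustration of how an entry separates core data from annotation, here is a minimal Python sketch that reads FASTA-formatted text, where each header line begins with ">" and the following lines hold the sequence. The `Entry` class, the function names and the sample records are assumptions made for this example, not part of any database's software.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    identifier: str   # first word of the header line
    description: str  # rest of the header line (annotation)
    sequence: str     # core data


def _build(header, chunks):
    identifier, _, description = header.partition(" ")
    return Entry(identifier, description, "".join(chunks))


def parse_fasta(text):
    """Parse FASTA text into a list of Entry objects."""
    entries = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                entries.append(_build(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append(_build(header, chunks))
    return entries


sample = ">seq1 hypothetical protein\nMDGH\nILMS\n>seq2\nTVFY\n"
for entry in parse_fasta(sample):
    print(entry.identifier, entry.description, entry.sequence)
```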

Primary Databases
In original form, taken from the source
Original submission by the researcher
Contents controlled by the submitter
The data explosion in the 1980s led to the start of many repositories
1. Nucleic acid sequence db
2. Protein sequence db
3. Metabolite db

Secondary databases
Derivative db
Result of analyses of sequences in the primary db
Secondary db are built up from primary db
Secondary db are analyzed in a variety of ways and contain different information in different formats
Contents of secondary db are controlled by a third party
E.g., Prosite, Prints, Blocks

Nucleic acid sequence databases
Collection of nucleotide sequences
Organize and distribute nucleotide sequences from all available sources
In the form of a text file
Can be read by humans and computers
Many dbs are assembled from several publications, so they contain overlapping fragments of complete sequences
First sequence - Yeast tRNA with 77 bases in 1964

NCBI
National Center for Biotechnology Information
Established on November 4, 1988 as part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), USA
Headquarters in Bethesda, Maryland
Legislation sponsored by Senator Claude Pepper

Services
PubMed
GenBank
BLAST
Entrez
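Entrez records can also be requested programmatically through NCBI's E-utilities web interface. The sketch below only builds an `efetch` request URL and does not contact the server; the helper name `efetch_url` is hypothetical, and the accession is used purely as an illustrative placeholder that should be replaced with the record of interest.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Build an E-utilities efetch URL that returns a record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder accession, for illustration only.
print(efetch_url("NM_000207"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the record in the requested format, subject to NCBI's usage guidelines.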

GenBank
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences
GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

International Nucleotide Sequence Database Collaboration (INSDC)
INSDC consists of
1. EMBL
2. DDBJ
3. GenBank
Daily exchange of data

The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.

What is in it?
Annotated nucleotide sequences, including mRNA sequences with coding regions, segments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters
More than 100,000 organisms
Amino acid translations (CDS)

EMBL
The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 25 member states, four prospect and two associate member states. EMBL was constituted in 1974 and is an intergovernmental organisation funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. EMBL groups and laboratories perform basic research in molecular biology and molecular medicine, as well as training for scientists, students and visitors.

Stations
The Laboratory operates from six sites: the main laboratory in Heidelberg, and outstations in Hinxton, England (the European Bioinformatics Institute, EBI), Grenoble (France), Hamburg (Germany), Monterotondo (near Rome) and Barcelona (Spain).

European Molecular Biology Laboratory (EMBL)
From the European Bioinformatics Institute (EBI), UK
Collects and assembles data from
– Direct author submission
– Genome sequencing groups
– Patent applications
– Literature
Goal - integrate nucleotide sequence data and annotation into the wealth of bioinformatics resources
Through cross-references and the Sequence Retrieval System (SRS), data can be viewed at 200 local stations
2494 completed genomes

EMBL
The roots of the EMBL-EBI lie in the EMBL Nucleotide Sequence Data Library (now known as EMBL-Bank), which was established in 1980 at the EMBL laboratories in Heidelberg, Germany and was the world's first nucleotide sequence database.
The original goal was to establish a central computer database of DNA sequences, to supplement sequences submitted to journals.

The EMBL-EBI hosts a number of publicly open, free-to-use life science resources, including biomedical databases, analysis tools and bio-ontologies. These include:
ArrayExpress - archive of gene expression experiments
BioModels Database - a database of computational models relevant to the life sciences
BioStudies - a database that serves as a generic data archive at EMBL-EBI for biomolecular datasets
Chemical Entities of Biological Interest (ChEBI) - database and ontology of molecular entities
European Nucleotide Archive (ENA) - resource of nucleotide sequencing information
Ensembl project - genome databases for vertebrates and other eukaryotic species (joint with the Wellcome Trust Sanger Institute)
Europe PubMed Central - database offering free access to a collection of biomedical research literature

DNA Data Bank of Japan

Currently, the DDBJ Center is in operation at the National Institute of Genetics (NIG) in Mishima, Japan, with the endorsement of MEXT, the Japanese Ministry of Education, Culture, Sports, Science and Technology.
The DDBJ Center is reviewed and advised by its own advisory board, the DNA Database Advisory Committee (an outside committee of NIG), and also by the advisory board to INSDC, the International Advisory Committee.
Started in 1986

It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the INSDC. It exchanges its data with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis.
These three databanks contain the same data at any given time.

Protein sequence databases
SWISSPROT, PIR
UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB).
It is a high-quality, annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.
Since 2002, it has been maintained by the UniProt consortium and is accessible via the UniProt website.

Swiss-Prot
Established in 1986 by the Dept. of Biochemistry, University of Geneva
Maintained by the Swiss Institute of Bioinformatics (SIB) and EMBL
Database composed of 2 parts
1. Core data - sequence, reference and taxonomic details
2. Annotation - sequence variants, functions, secondary and tertiary structures
Provides high-level annotation, including functions of the protein
Maintains high quality and structure - first choice for most research purposes
Swiss-Prot has been supplemented by TrEMBL (translated EMBL) since 1996
TrEMBL has 2 sections
1. SP-TrEMBL - entries from EMBL that will be incorporated into Swiss-Prot
2. REM-TrEMBL - remaining entries that will not be included in Swiss-Prot

A well-defined manual curation process is essential to ensure that all manually annotated entries are handled in a consistent manner. This process consists of 6 major mandatory steps: (1) sequence curation, (2) sequence analysis, (3) literature curation, (4) family-based curation, (5) evidence attribution, (6) quality assurance and integration of completed entries. Curation is performed by expert biologists using a range of tools that have been iteratively developed in close collaboration with curators.

Protein Sequence Databases
1. PIR - Protein Information Resource
Established in 1984 by the National Biomedical Research Foundation (NBRF), Washington DC
Aim - identification and interpretation of protein sequence information
Investigating evolutionary relationships among proteins
Helps with search and similarity analysis
Provides an integrated environment for sequence analysis between 3 units

PIR is composed of 3 databases:
1. PSD - protein sequence database
2. NREF - non-redundant reference database
3. iProClass - provides structural and functional features of proteins
The PIR database is split into 4 sections, which differ in the quality of data and the level of annotation provided
1. fully classified and annotated entries
2. preliminary entries, not thoroughly reviewed
3. unverified entries, not reviewed
4. genetically engineered sequences

Structure databases

PDB
Protein Data Bank
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.
The data are typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy.

www.rcsb.org/
The data is freely accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and RCSB).
The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.
The PDB is a key resource in areas of structural biology, such as structural genomics. Most major scientific journals, and some funding agencies, now require scientists to submit their structure data to the PDB. Many other databases use protein structures deposited in the PDB.

A molecular graphics display, the Brookhaven RAster Display (BRAD), became available in 1968 and was used to visualize protein structures in 3-D.
The file format initially used by the PDB was called the PDB file format.
BRAD visualization, together with the growing number of X-ray crystallographic studies of proteins, led to the founding of the PDB in 1971
In October 1998, the PDB was transferred to the Research Collaboratory for Structural Bioinformatics (RCSB).
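The PDB file format is a fixed-column text format, so a coordinate record can be read by slicing each line at known positions. The sketch below is a minimal illustration for ATOM/HETATM records only; the function name `parse_atom_line` is hypothetical, and the sample record uses made-up values rather than data from a real entry. The sample line is assembled field by field so that the column widths stay visible.

```python
def parse_atom_line(line):
    """Extract selected fields from one fixed-column ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }


# Illustrative record with made-up values, built from its fixed-width fields.
sample = (
    "ATOM  " + "    1" + " " + " N  " + " " + "GLY" + " " + "A" + "   1" + "    "
    + "  -8.863" + "  16.944" + "  14.289" + "  1.00" + " 21.88" + " " * 10 + " N"
)
print(parse_atom_line(sample))
```

Production code would normally rely on an established structure-parsing library instead of hand-written slicing, since real files contain many other record types.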

In 2003, with the formation of the wwPDB, the PDB became an international organization. The founding members are PDBe (Europe), RCSB (USA), and PDBj (Japan).
Each of the three members of wwPDB can act as a deposition, data processing and distribution center for PDB data.
The data processing refers to the fact that wwPDB staff review and annotate each submitted entry.

INSULIN

NDB

The Nucleic Acid Database (NDB; Berman et al., 1992) was established in 1991 as a resource for specialists in the field of nucleic acid structure. Over the years, the NDB has developed generalized software for processing, archiving, querying and distributing structural data for nucleic acid-containing structures. The core of the NDB has been its relational database of nucleic acid-containing crystal structures.
It allows researchers to perform comparative analyses of nucleic acid-containing structures selected from the NDB.

Structures available in the NDB include RNA and DNA oligonucleotides with two or more bases, either alone or complexed with ligands, natural nucleic acids such as tRNA, and protein-nucleic acid complexes. The archive stores both primary and derived information about the structures.
The primary data include the crystallographic coordinate data, structure factors and information about the experiments used to determine the structures, such as crystallization information, data-collection and refinement statistics.

OMIM
Online Mendelian Inheritance in Man
A comprehensive, authoritative and timely compendium of human genes and genetic phenotypes
The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes.

Initiated in the early 1960s by Dr. Victor A. McKusick as a catalogue of Mendelian traits and disorders, entitled Mendelian Inheritance in Man
Internet version: 1995
