Genomics & Bioinformatics - Stanford University

Transcription

[...]formatics.html
Genomics, Bioinformatics & Medicine
http://biochem158.stanford.edu/
Doug Brutlag, Professor Emeritus of Biochemistry & Medicine
Stanford University School of Medicine
Doug Brutlag 2015

25th Anniversary Beckman [...]edule-review.html
Thanks to Matthew Lum

What is [...]olution, Selection, Populations, Biological Information

Computational Goals of Bioinformatics
Learn & Generalize: Discover conserved patterns (models) of sequences, structures, metabolism & chemistries from well-studied examples.
Prediction: Infer function or structure of newly sequenced genes, genomes, proteomes or proteins from these generalizations.
Organize & Integrate: Develop a systematic and genomic approach to molecular interactions, metabolism, cell signaling, gene expression. Basis of systems biology.
Simulate: Model gene expression, gene regulation, protein folding, protein-protein interaction, protein-ligand binding, catalytic function, metabolism. Goal of systems biology.
Engineer: Construct novel organisms or novel functions or novel regulation of genes and proteins. Basis of synthetic biology.
Target: Mutations, RNAi to specific genes and transcripts or drugs to specific protein targets. Practical biological and medical use of bioinformatics.

Central Paradigm of Molecular Biology: DNA -> RNA -> Protein -> Phenotype

Central Paradigm of Medicine: DNA -> RNA -> Protein -> Symptoms -> Opinions

Central Paradigm of [...] [figure: protein sequence fragment CVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH]

Challenges Understanding Genetic Information
[diagram labels: Genetic Information, [...]chemical Function, Phenotype]
Genetic information is redundant
Structural information is redundant

Soybean Leghemoglobin and Sperm Whale Myoglobin
[figure panels: Soybean Leghemoglobin; Sperm Whale Myoglobin]

Challenges Understanding Genetic Information
Genetic information is redundant
Structural information is redundant
Genes and proteins are meta-stable

Challenges Understanding Genetic Information
Genetic information is redundant
Structural information is redundant
Genes and proteins are meta-stable
Genes and proteins are one dimensional but their function depends on three-dimensional structure

Discovering Function from Protein Sequence
Sequences of Common Structure or Function
Sequence
[figure: pairwise alignment, positions 10-50, of --NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN against --DLSHGS...]

Dayhoff’s PAM 250 - Log Odds Amino Acid Replacement Matrix (1978)
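As a minimal sketch of what "log odds" means for a replacement matrix: each entry compares how often two residues are seen aligned in related proteins with how often they would pair by chance. The scaling used here (10 times log10, rounded to an integer) is the convention usually described for the PAM matrices. The frequencies are made-up illustrative numbers, not values from Dayhoff's data, and log_odds_score is a hypothetical helper name.

```python
import math

def log_odds_score(pair_freq: float, freq_a: float, freq_b: float) -> int:
    """Rounded 10*log10 of the observed pair frequency over the chance expectation."""
    expected = freq_a * freq_b  # pairing frequency if the two residues were independent
    return round(10 * math.log10(pair_freq / expected))

# Illustrative (made-up) numbers: a pair observed about 4x more often than chance
print(log_odds_score(0.004, 0.05, 0.02))
```

A positive score marks a replacement seen more often than chance in related proteins, a negative score one seen less often.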

Discovering Function from Protein Sequence
Consensus Sequences or Sequence Motifs: Zinc Finger (C2H2 type) C X{2,4} C X{12} H X{3,5} H
Sequences of Common Structure or Function
Sequence
[figure: pairwise alignment, positions 10-50, of --NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN against --DLSHGS...]
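The C2H2 zinc finger consensus on this slide reads almost directly as a regular expression if X is taken to mean "any residue". The sketch below scans a sequence for it. The function name find_c2h2 and the test sequence are hypothetical and built only to contain one match.

```python
import re

# C X{2,4} C X{12} H X{3,5} H, with X meaning any residue
C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_c2h2(seq: str):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in C2H2.finditer(seq)]

# Hypothetical sequence constructed to contain exactly one match
seq = "MK" + "C" + "AA" + "C" + "G" * 12 + "H" + "TTT" + "H" + "LL"
print(find_c2h2(seq))
```

A pattern like this either matches or it does not, so it carries no score; that is one reason weight matrices and profiles are introduced later in the deck.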

A Typical Motif: Zinc Finger DNA Binding Motif C.C.H.H

Zinc Finger DNA Binding Motif
Greif et al, Blood (July 12, 2012) 120

Protein Motifs from Multiple Sequence Alignments
EBI Course on Protein [...]ine/course/introduction-protein-classification-ebi

PROSITE Patterns http://expasy.org/prosite/
Active site of trypsin-like serine proteases: GDSGG
Zinc Finger (C2H2 type): C-X(2,4)-C-X(12)-H-X(3,5)-H
N-Glycosylation Site: N-{P}-[ST]-{P}
Homeobox Domain: [...](8)-[RK]
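PROSITE patterns can be translated mechanically into regular expressions. The sketch below handles only the elements needed for patterns like the zinc finger one above and for the N-glycosylation site written in standard PROSITE syntax as N-{P}-[ST]-{P}: X for any residue, [..] for allowed residues, {..} for excluded residues, and (n) or (n,m) repeat counts. Anchors and other PROSITE features are not handled, and prosite_to_regex is a hypothetical helper, not part of any PROSITE tool.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (12) or (2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core.upper() == "X":
            core = "."                      # X matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means any residue except P
        # [ST] and single letters are already valid regex as written
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

print(prosite_to_regex("C-X(2,4)-C-X(12)-H-X(3,5)-H"))
print(prosite_to_regex("N-{P}-[ST]-{P}"))
```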

Discovering Function from Protein Sequence
BLOCKs, PRINTs, PSSMs or Weight Matrices
Consensus Sequences or Sequence Motifs: Zinc Finger (C2H2 type) C X{2,4} C X{12} H X{3,5} H
Sequences of Common Structure or Function
Sequence
[figure: weight matrix with rows for amino acids A R N D C Q E G H I L K M F P S T W Y V and columns for positions 1-12; Query; Database]
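To make the idea of a weight matrix concrete, here is a minimal sketch of scoring a query against a position-specific scoring matrix: each column holds a score per residue, a window's score is the sum over its positions, and the matrix is slid along the sequence. The three-column matrix, the MISSING penalty and the query are all hypothetical values chosen for illustration; they are not taken from the matrix on the slide.

```python
# Hypothetical 3-position matrix; residues not listed in a column score MISSING.
PSSM = [
    {"C": 5, "S": 1},
    {"A": 2, "G": 2, "P": -3},
    {"H": 6, "N": 1},
]
MISSING = -4

def score_window(window: str) -> int:
    """Sum the per-position scores for a window as long as the matrix."""
    return sum(col.get(res, MISSING) for col, res in zip(PSSM, window))

def best_hit(seq: str):
    """Slide the matrix along seq; return (1-based start, score) of the top window.

    If several windows tie for the top score, the later one is returned.
    """
    width = len(PSSM)
    scores = [(score_window(seq[i:i + width]), i + 1)
              for i in range(len(seq) - width + 1)]
    score, start = max(scores)
    return start, score

print(best_hit("MKSGNCAHLL"))
```

Unlike a consensus pattern, the matrix gives every window a graded score, so near matches can be ranked instead of being rejected outright.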

Position-Specific Scoring Matrix for Prokaryotic Helix-Turn-Helix Motifs
[table: aligned Helix, Turn, Helix segments for the sequences RCRO LAMBD, RCRO BP434, RCRO BPP22, RPC1 LAMBD, RPC1 BP434, RPC1 BPP22, RPC2 LAMBD, LACR ECOLI, CRP ECOLI, TRPR ECOLI, RPC1 CPP22, GALR ECOLI, Y77 BPT7, TER3 ECOLI, VIVB BPT7, DEOR ECOLI, RP32 BACSU, Y28 BPT7, IMMRE [...]]

Blocks or Finger Prints from Multiple Sequence Alignments
EBI Course on Protein [...]ine/course/introduction-protein-classification-ebi

Finger Prints from Multiple Sequence Alignments
EBI Course on Protein [...]ine/course/introduction-protein-classification-ebi

Discovering Function from Protein Sequence
Profiles, PSI-BLAST, Hidden Markov Models
BLOCKs, PRINTs, PSSMs or Weight Matrices
Consensus Sequences or Sequence Motifs: Zinc Finger (C2H2 type) C X{2,4} C X{12} H X{3,5} H
Sequences of Common Structure or Function
Sequence
[figure: hidden Markov model with states D2-D5, I1-I5 and AA1-AA6; weight matrix with rows for amino acids A R N D C Q E G H I L K M F P S T W Y V and columns for positions 1-12; Query; Database]

Hidden Markov Models from Multiple Sequence Alignments
EBI Course on Protein [...]ine/course/introduction-protein-classification-ebi

Data Mining: The Search for Buried Treasure

Swiss Institute of Bioinformatics http://www.isb-sib.ch/

Expasy Bioinformatics Resource Portal http://expasy.org/

UniProt Knowledge Base http://www.uniprot.org/

UniProt Knowledge Base Advanced Search http://www.uniprot.org/

UniProt Human Opsin Entries http://www.uniprot.org/

UniProt Human Opsin Entries Reviewed http://www.uniprot.org/

UniProt Human Opsin OPN1MW Entry http://www.uniprot.org/uniprot/P04001

BLAST UniProt Human Opsin OPN1MW Entry http://www.uniprot.org/uniprot/P04001

BLAST UniProt Human OPN1MW Results http://www.uniprot.org/uniprot/P04001

NCBI BLAST Home Page http://blast.ncbi.nlm.nih.gov/

NCBI BLAST Parameters http://blast.ncbi.nlm.nih.gov/

Entrez Gene search for Colorblindness

Entrez Gene search for Opsins

BLAST Similarity Search http://www.ncbi.nlm.nih.gov/BLAST/

Choose Standard Protein-Protein BLAST http://www.ncbi.nlm.nih.gov/BLAST/

Paste Sequence, Choose SwissProt Database and BLAST!

Optional Parameters

BLAST Conserved Domain Output

Sequence Aligned with Domain

Most Significant Similarity Hits

Bovine Blue Opsin Similarity

Evaluation of PSSMs, Profiles and HMMs
[figure labels: Negative Proteins, Positive Proteins, T]

Evaluation of Profiles
Positive Predictive Value = TP/(TP + FP)
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
[figure labels: Negative Proteins, Positive Proteins, TN, TP, FN, FP]
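The three measures on this slide follow directly from the four counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). A minimal sketch, with hypothetical counts and a hypothetical function name:

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the slide's three measures from the four outcome counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true family members found
        "specificity": tn / (tn + fp),  # fraction of non-members correctly rejected
        "ppv": tp / (tp + fp),          # fraction of reported hits that are real
    }

# Hypothetical counts for a profile searched against 1,000 proteins
result = evaluate(tp=90, fp=5, tn=895, fn=10)
print(result)
```

Raising the score threshold typically trades sensitivity for specificity, which is why the measures are reported together.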

MyHits Local Motifs Search http://myhits.isb-sib.ch/

MyHits Local Motifs Query http://myhits.isb-sib.ch/

MyHits Local Motifs Summary http://myhits.isb-sib.ch/

MyHits Local Motif Hits http://myhits.isb-sib.ch/

MyHits Local Motifs Hits (Cont.) http://myhits.isb-sib.ch/

MyHits Local Motifs Hits (Cont.)

InterPro http://www.ebi.ac.uk/interpro/

InterPro Scan http://www.ebi.ac.uk/interpro/

InterPro Scan http://www.ebi.ac.uk/Tools/pfa/iprscan/

InterPro Scan Hour Glass http://www.ebi.ac.uk/InterProScan/

InterPro Scan Results http://www.ebi.ac.uk/InterProScan/

GO: Gene Ontology Database http://www.geneontology.org/

GO: Gene Ontology for Opsin OPN1MW http://www.geneontology.org/

GO: Sequence Information for OPN1MW http://www.geneontology.org/

GO: Annotations for OPN1MW http://www.geneontology.org/

GO: Gene Ontology Terms for OPN1MW http://www.geneontology.org/

GO: Gene Ontology Term GPCR http://www.geneontology.org/

GO: Gene Ontology GPCR Term http://www.geneontology.org/

Bioinformatics [...]ics.html
Homework Assignment
1) Select a protein from OMIM or from Entrez Gene or from UniProt concerning the disease of interest to you. Copy and save the FASTA format of the protein file.
2) Search your protein for motifs with the MyHits Motif Scan Query. Be sure to include Prosite Patterns, Prosite Frequent Patterns, Prosite Profiles, Profiles, Pfam HMMs (local models) in your search. Please send me the MyHits you think are biologically significant and at least 1 or 2 hits which you think are not statistically or biologically significant. Please note that only the Profiles have expectation values. The Patterns do not have a measure of statistical significance.
3) Search your protein for blocks using the InterPro database. Please send me a few of the InterPro domain hits you think are significant and at least 1 or 2 hits which you think are not statistically or biologically significant. Please note that the default graphic output of InterPro does not list expectation values. You must switch to the Tabular view to obtain the statistical significance.
4) Search your protein for homology using the BLAST method. Please report two or three hits which are both statistically and biologically significant. Also report two or three hits which you think are neither statistically nor biologically significant. If your protein family is very large, you may have to ask BLAST to return more hits to find statistically insignificant hits.
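For step 1, it can help to check the saved FASTA text before pasting it into the search forms. The sketch below parses FASTA text into header and sequence pairs. The function name read_fasta is hypothetical, and the example record is a placeholder, not a real database entry.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            header = line[1:]         # header line without the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line   # sequence may span several lines
    return records

# Placeholder record: header and residues are invented for illustration
example = """>example_protein placeholder description
MKSGNCAHLL
ACDEFGHIKL
"""
for name, seq in read_fasta(example).items():
    print(name, len(seq))
```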

Statistical vs. Biological Significance
Assignment
First, for each search (MyHits, InterPro and BLAST), I would like you to report some significant hits and describe why you think they are significant both statistically and biologically; also report some statistically insignificant hits (and why), and say whether any of your statistically insignificant hits are still significant biologically. To remind you what I said in class: a statistically significant find in the database search is always biologically significant, but a biologically significant result in the search is not necessarily always statistically significant.
Statistical significance and expectation values
Statistical significance is determined by the expectation value, which gives you a measure of how likely this finding is based on pure chance. A finding with an E-value of 1 or greater is not significant because it could occur by pure chance. A finding with an E-value less than 10^-3 (one chance in a thousand) is generally considered statistically significant (unless of course you are doing 1,000 searches!). So the lower the expectation value, the more significant the finding. Findings between 10^-3 and 1 are in the so-called twilight zone and require some further analysis or experiments to determine their validity.
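The E-value thresholds described above can be written as a small classifier. This is a sketch only: the labels and the function name are hypothetical, and multiplying the E-value by the number of searches is a simple assumed correction for the "1,000 searches" remark, not a method stated in the lecture.

```python
def classify_evalue(e_value: float, n_searches: int = 1) -> str:
    """Label a hit: below 1e-3 significant, 1 or greater not significant."""
    # Assumed correction: expected number of chance hits summed over all searches
    adjusted = e_value * n_searches
    if adjusted < 1e-3:
        return "significant"
    if adjusted < 1:
        return "twilight zone"
    return "not significant"

for e in (1e-10, 0.05, 3.0):
    print(e, classify_evalue(e))
print(classify_evalue(1e-4, n_searches=1000))
```

With a single search an E-value of 1e-4 counts as significant, but spread over 1,000 searches the same value falls into the twilight zone.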

Statistical vs. Biological Significance (cont.)
InterPro
Unlike most of the other methods, InterPro sets a very high level of significance for a finding before it will report it. This means that you will usually not find any statistically insignificant hits for this particular search.
Biological Significance
In order to determine biological significance you must read the biological properties (ontology terms are the most useful) of your protein and the biological properties of your findings. The findings may be significant because the finding defines a very closely related protein family (opsins for example) or a very broad family (G protein-coupled receptors or 7-transmembrane proteins) or a common structure (protein fold) or a specific function (retinal binding site) or a very specific catalytic activity. You should describe in words the level of the biological significance.

Statistical vs. Biological Significance (cont.)
MyHits
If you ask MyHits to return PATTERNs as well as motifs, you will notice that PATTERNs do not have E-values associated with them, so there is no easy way to judge statistical significance. With pattern findings you are left only with judging biological significance. Also, none of the Frequent patterns from MyHits are statistically significant.
BLAST
If you do not have any insignificant hits from the BLAST search, it means that your protein family is very large and you have to ask BLAST to return more results using the Advanced Options at the bottom of the form. Only when you see hits with E-values > 0.001 do you have insignificant findings.

Computational Goals of Bioinformatics Learn & Generalize: Discover conserved patterns (models) of sequences, structures, metabolism & chemistries from well-studied examples. Prediction: Infer function or structure of newly sequenced genes, g