Bioinformatics - Rensselaer Polytechnic Institute

Transcription

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyProtein Structure1

Experimental origins of sequence dataThe Sanger dideoxynucleotide methodFEach color is one lane of an electrophoresis gel.

New technology: Pyrosequencing http://www.youtube.com/watch?v nFfgWGFe0aA&NR 1 .or search youtube for “pyrosequencing” Whole genome sequencing in 1 day!!3

Aligning two sequences tells us how they are related.Plant. A AAG A G A TT CT GC T AG C G GT C G GBug.AGAGATGCTGCAGCGAGTCGGCCAn alignment is a one-to-one association, or a set ofone-to-one associations. Aligned sequences are assumedto be homologous (having a common ancestor).Furthermore, aligned positions within the sequences areassumed to have a common ancestor position.

Positions that align in sequence usually align in spacehave acommonancestorT G CTAT GCTAsuperimposein spaceTGCAAa Venn diagram

Simple alignment Simple similarity score:Identity match 1 pointmismatch 0 pointsgap -1 points Optimal alignment The highest-scoring alignment given the similarity score.

ding an alignment starts with a scoringmatrix. In its simplest form, a dot plot.Everything aligned to everything.

AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCAn alignmentis a path through the scoringmatrix, Galways proceeding to the right andGdown. (no non-sequential alignmentsCallowed.)C

A AAGA G A TT CT GC T AG C G GT C G GAGAGATGCTGCAGCGAGTCUnbrokendiagonals represent “blocks” ofGG sequence without indels.CC

AAAGAGATTCTGCTAGCGGTCGGdeletion of TAGAblocksGATGCmutation,T- GTGCAGC Ainsertion,GAGTCindelsGGCThe path records,and scores, all mutational events, incl. insertions, deletions, mutations.C

BLOSUM62: protein substitution matrix

PAM250

Protein versus DNA alignmentsAre protein alignment better? Protein alphabet 20, DNA alphabet 4.––––Protein alignment is more informativeLess chance of homoplasy with proteins.Homology detectable at greater edit distanceProtein alignment more informative Better Gold Standard alignments are availablefor proteins.– Better statistics from G.S. alignments. On the other hand, DNA alignments are moresensitive to short evolutionary distances.13

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyProtein Structure14

Database searchingone sequenceGenBank, PIR,Swissprot,GenEMBL, DDBJlots of sequencesWhy do a database search?Mol. Bio: Determination of gene function. Primer design.Pathology, epidemiology, ecology: Determination of species,strain, lineage, phylogeny.Biophysics: Prediction of RNA or protein structure, effect ofmutation.

Searching millions of sequencesGiven a protein or DNA sequence, we want to find all of thesequences in GenBank (over 17 million sequences!!) thathave a good alignment score.Each alignment score should be the optimal score (or a closeapproximation).How do we do it?

Fast Database SearchingBLASTS. Altschul et al.First make a set of lookup tables for all 3-letter(protein) or 11-letter (DNA) matches.Make another lookup table: the locations of all 3-letterwords in the database.Start with a match, extend to the left and right until thescore no longer increases.Very fast. Selective, but not as sensitive as slower searchmethods (SSEARCH). Reliable statistics. Heuristic, notoptimal.

BLAST, precalculationsAll 8000 possible 3-tuples.PGQPGQ PGR PGS .PGT PGV PGWPGY PAQPCQ PDQPEQPFQ .50high-scoring3-tuples.Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. Thetop scoring 50 are kept as that 3tuple’s “neighborhood words”

BLASTquery sequencetarget sequencea 3-tupleneighborhoodwords for 3-tupleidentitymatchesseedsHSPsFor every 3-residue window, we get the set of 50 nearestneighbors. Use each word to get identity matches (seeds). Thenextend the seed alignments as long as the score increases.

BLASTHSPsalignmentThe best extended seeds are called HSPs (high scoring pairs).The top scoring HSP is picked first, then the second (as long asit falls "northwest" or "southeast" of the first.), and so on.

Other forms of lastpproteinproteintblastnproteintranslated DNAblastxtranslated DNAproteintblastxtranslated DNA translated DNApsi-blastprotein, profileproteinphi-blastpatternproteintransitive blast*anyany*not really a blast. Just a way of using blast.21

Psi-BLAST: Blast with profilesPsi-BLAST searches the database iteratively.(Cycle 1) Normal BLAST (with gaps)(Cycle 2) (a) Construct a profile from the results of Cycle 1.(b) Search the database using the profile.(Cycle 3) (a) Construct a profile from the results of Cycle 2.(b) Search the database using the profile.And So On. (user sets the number of cycles)Psi-BLAST is much more sensitive than BLAST.Also more vulnerable to low-complexity.

PHI-BLAST -Patterned Hit Initiated BLAST23

DNA or Protein search? Advantages of searching DNA databasesLarger database. Does not assume a reading frame. Can find similarityin non-coding regions (introns, promotor regions). Can find frameshiftmutations. Can find pseudogenes. DisadvantagesSlower. Not as sensitive. Ignores selective pressure at the protein level. Advantages of searching protein sequencesFaster. More sensitive. More biologically relevant. DisadvantagesNot applicable to non-coding DNA (promotors, introns, etc)

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyProtein Structure25

How significant is that?.a specific inference.Thanks.as opposed to.how likely the data would nothave been the result of chance,.Please give me a number for.

Dayhoff's randomizationexperimentAligned scrambled Protein A versus scrambled Protein B100 times (re-scrambling each time).NOTE: scrambling does not change the AA composition!Results: A Normal Distributionsignificance of a score is measuredas the probability of gettingthis score in a random alignmentfreqscore

Lippman's randomization experimentAligned Protein A to 100 natural sequences, not scrambled.Results: A wider normal distribution (Std dev 3 times larger)WHY? Because natural sequences are different than random.Even unrelated sequences have similar local patterns, anduneven amino acid composition.freqWas the significanceover-estimated usingDayhoff's method?scoreLippman got a similar result if he randomized the sequences bywords instead of letters.

P(S x)E(M) gives us the expected length of the longest number ofmatches in a row. But, what we really want is the answer to thisquestion:How good is the score x? (i.e. how significant)So, we need to model the whole distribution of chance scores,then ask how likely is it that my score or greater comes fromthat model.freqscore

A normal distributionSuppose you had a Gaussian distribution “dart-board”. Youthrow 1000 darts randomly. Score your darts according thenumber on the X-axis where it lands. What is theprobability distribution of scores?Answer:The same Gaussian distribution! (duh)

Extreme values from a normaldistributionWhat if we throw 10 darts at a time and keep only thehighest-scoring dart (extreme value)? What is thedistribution of the extreme values?

The Extreme Value DistributionNormal distributions (Dayhoff, Lippman) overestimate significancewhen the scores are extreme values. EVD is the correct null model.

Fitting the EVD to random alignmentsEstimated P (integral of the EVD):P(S x) Kmne-λxwhere K constant, m size of database, n length of sequence, λ constantTaking the log,log(P(S x)) log(Kmn) - λx Generate a large number of known false alignment scores S,(all alignments with the same two lengths m and n), Plot log(P(S x)) versus x , fit to a line!logP(S x)xxThe slope is λ, the intercept is log(Kmn).x xNow we can calculate P for any score x.xxxxxxxxxxx xxxxx x

Pop-quizYou did a BLAST search using a sequence that hasabsolutely no homologs in the database. Absolutelynone.The BLAST search gave you false “hits” with the top evalues ranging from 0 to 20. You look at them and younotice a pattern in the e-values.How many of your hits have e-value 10.?

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyProtein Structure35

Evolutionary timeCladogramPhylogram6BCADno meaning11315Ultrametric treeBBCCAADDgenetic changetime(D:5,(A:1,(C:1,B:6):1):3)parenthesis (notation can have both labels and distances.

Multiple Sequence AlignmentA multiple sequence alignment is made using many pairwise sequence alignments

Multiple sequence alignment1. align all pairs2. pairwise align two most similar first3. align next most similar4. repeat until all sequences are alignedA G H I . W W P FA G H I I F W P YS(P,[W,F]) (1/2)(S(P,W) S(P,F))AWPY38

Construct a distance-based treeAABCDEFBCDEF97 81 82 59 3277 80 55 3190 65 4061 4233ABCDEFDraw tree heredistances

CLUSTALWJD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994 Start with unrooted tree, using Neighborjoining. choose root to get guide tree progressive alignment– matches are scored using sequence weights– gaps are position dependent GOP lower for polar residues GOP zero where there is already a gap40

CLUSTALW Position specific gap penalty41

Parsimony: Finding the tree with theminimum number of mutationsGiven a tree and a set of taxa, one-letter each (1) chooseoptional characters for each ancestor. (2) Select the rootcharacter that minimizes the number of mutations byselecting each and propagating it through the tree.CCTTTTCCCCT/CCT/CCT/C/CCT/CCminimum 2 mutationsminimum 1 mutation

Columns in a MSA have a common evolutionary historyBy aligning the sequences, we assert that the alignedresidues in each column had a common ancestor.

Orthologs/paralogsOrthologs: homologs originating from a speciation eventParalogs: homologs originating from a gene duplication event.duplicationspeciationSpecies treeSequence treefish Afish Bduck Aduck Bspeciationcrab Bclam Afish Bduck Bfish Acrab Bduck Aclam Afishcrabduckclamgene lossreconciled trees

How do I know it’s a paralog? If it’s a paralog, then at some point in evolutionary history, a speciesexisted with two identical genes in it. One may have been lost since then. (Descendants are stillparalogs!) Paralogs can be from different species.Paralogous genes have more than the expected sequence divergence. Because they are more likely to have different functionsBecause they diverged earlier than the speciation event.Without species information or functional information, it’s impossibleto tell

Life is not strictly a tree -horizontal gene transferDiscrete Steps Needed for Stability of Gene TransferBF Smets, T Barkay (2005) “Horizontalgene transfer: perspectives at acrossroads of scientific disciplines”Nature Reviews Microbiology.Stably incorporating horizontally transferred genes into a recipient genome involvesfive distinct steps (Fig. 1). 1. First, a particular segment of DNA or RNA is prepared fortransfer from the donor strain through one of several processes, including excision andcircularization of conjugative transposons, initiation of conjugal plasmid transfer bysynthesis of a mating pair-formation protein complex, or packaging of nucleic acidsinto phage virions. 2. Next, the segment is transferred either by conjugation, whichrequires contact between the donor and recipient cells, or by transformation andtransduction without direct contact. 3. During the third step, genetic material enters therecipient cell, where cell exclusion may abort the transfer. 4. Otherwise, during thefourth step, the incoming gene is integrated into the recipient genome by legitimate orsitespecific recombination or by plasmid circularization and complementary strandsynthesis. Barriers to transfer during this stepcome from restriction modification systems,failure to integrate and replicate within the newhost genome, and incompatibility with residentplasmids. 5. In the final step, transferred genesare replicated as part of the recipient genome andtransmitted to daughter cells in stable fashionover successive generations. Researchers fromdifferent disciplines tend to focus on specificstages within this five-step sequence. Thus,evolutionary biologists who examine microbialgenomes for evidence of past transfers tend tolook at HGTs from the perspective of step five.Molecular biologists are more likely to examinethe details of the transfer events, while microbialecologists look more broadly when they describethe magnitude and diversity of the mobile genepool, sometimes called the mobilome.46

“Boot strap analysis” A method to validate a phylogenetic tree,branchpoint by branchpoint. Requires a means to generate independenttrees. (For example trees generated from different regionsof the mitochondrial genome.) Choose the representative tree as the‘parent’. Calculate the following:For each branchpoint in the parent tree,For each tree, askIs there a branchpoint having the same subclade contents (i.e. same taxa,any order)Bootstrap value number of trees having the branchpoint / total trees.

Comparing branchpointsABBAC DC EEDAAC B EBE CDDBA C DEBAE CBA C EDBAC DDE P((A,B),C) 5/8For each branchpoint in the parent tree,For each tree, askIs there a branchpoint having the same subclade contents (i.e. same taxa,any order)Bootstrap value number of trees having the branchpoint / total trees.

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyProtein Structure49

Ontology Ontologies relate facts to knowledge facts– may be known/unknown/little known– not attached to knowers– unchanging Knowledge– attached to knower– may disappear50

Gene Ontology- Gene annotation system- Controlled vocabulary that can beapplied to all organisms- Used to describe gene products

What is the Gene Ontology?A (part of the) solution:- A controlled vocabulary that can be applied toall organisms- Used to describe gene products - proteins andRNA - in any organism

GO: Three ontologiesWhat does it do?Molecular FunctionWhat processes is itinvolved in?Biological ProcessWhere does it act?Cellular Componentgene product

Cellular Component where a gene product acts

Cellular ComponentMitochondrial membrane

Biological Process

Biological ProcessGluconeogenesis

Molecular Function A single reaction or activity, not a geneproduct A gene product may have several functions Sets of functions make up a biologicalprocess

Molecular Functionhexose kinase

Querying the GOSearch forGO terms orby Genesymbol/nameFilter queries byorganism, datasource orevidence

What can scientists do with GO? Access gene product functional information Find how much of a proteome is involved in a process/function/ component in the cell Map GO terms and incorporate manual annotations intoown databases Provide a link between biological knowledge and gene expression profiles proteomics data

analysis of high-throughput data according to GOMicroArray data analysistimeDefense responseImmune responseResponse to stimulusToll regulated genesJAK-STAT regulated genesPuparial adhesionMolting cyclehemocyaninAmino acid catabolismLipid metobolismPeptidase activityProtein catabloismImmune responseImmune responseToll regulated genesattacked controlBregje Wertheim at the Centre for Evolutionary Genomics,Department of Biology, UCL and Eugene Schuster Group, EBI.

Selec%ng microarray subsets based on GO reveals drug targetNo treatmentTreated cellsFigure modified �schema.jpgAnalysis of Func.onal Annota.on –Downregulated Genescourtesy of Shabana Shabeer, Albert Einsteing School of Medicine

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyCorrelationProtein structure64

Biology Grad Core course: Discussion TopicMerck Smith-Kline was the author on a study of Trioxx, an anti-inflammatory drug used to treat arthritis, for whichit was know to be effective. The study followed over 500 long-time Trioxx users and an equal number of controlsubjects who had never used the drug. Dr. Smith-Kline was looking for correlations between the use of Trioxx andthe incidence of any disease other than arthritis, in any demographic group. He noted in the study that TunisianAmericans, in the age range from 45-55, male or female, and who had been a vegetarian for more than 6 months atany time in their lives, had a "strong negative correlation" between the use of Trioxx and the incidence of restlessleg syndrome (RLS), and began touting Trioxx as an effective anti-RLS drug.The numbers were as follows:Total Tunisian American vegetarians age 45-55 : 32Total Tunisian American vegetarians age 45-55 Trioxx users : 16Total Tunisian American vegetarians age 45-55 Trioxx non-users : 16Total Tunisian American vegetarians age 45-55 who have RLS : 4Total Tunisian American vegetarians age 45-55 who do not have RLS : 28Trioxx nonusersTrioxx usersRLS40no RLS1216 (u u )(r r ) (u u ) (r r )iDr. Smith-Kline correctly calculated the correlation between Trioxx and RLS as follows:Corr , where ui 1 if subject i is a user, and 0 otherwise, ri is 1 if the subject has RLS and 0 otherwise.i2i2iThe sums were carried out over all 32 subjects in the subset, and the resulting correlation was -0.378. Thisconfidence level was cited as 99%, since the p-value for this correlation was 0.01, The sample size of 32 and theuneven distribution of subjects with RLS were taken into account.The data itself was collected correctly and the calculations were correct, both for the correlation and its confidence.Yet Merck Smith-Kline did something dishonest in this study. What was it and what specific question would you askhim to reveal his dishonesty?65

Correlationr i ( xi - x ) ( yi - y )22 i ( xi - x ) i ( yi - y )Pearson’s correlations coefficient.OrPearson's product moment correlation

Correlation using metric dataCorrelation cancels baseline and scale.Correlation is insensitive to non-linear relationships.

Non-linearity is not picked up bycorrelationAll of these examples have r 0.816Re-sampling will fix some of these.

Correlation Confidence byResampling Start with paired data (x,y), calculate r. Let’s sayr 0.511 Randomly associate x and y values.Calculate rranRepeat 10000 times.Significance is p number of times rran 0.511,divided by 10000. (If r is negative, count rran r)

Resampling exampleThe y-values have been randomly swapped, 10000 times. 10% of the time, r is 0.816, therefore p .331

Resampling example0.816-0.130Scrambling randomizes sampling in the gray shaded area.If there is one data point set apart from the others,correlation can be high by chance.

Correlation using Booelan datax000011000101y010011001101True is assigned 1, False 0.r ( x ii x ) ( y i( x i x 2) i y ( y ii) y )2 x 4/12 0.33 y 6/12 0.50r 0.71p 0.025x \ y10140026

Bioinformatics Sequence alignmentDatabase searchingSignificance, e-valuesTreesGene ontologyCorrelationProtein structure73

Photosystem I: 1JB0

Classes of membrane proteins Single transmembrane helix several transmembrane helices beta-barrel or channel Anchored by one (not-transmembrane) helix or a covalentlyattached fatty acid

Photosystem I: Guided tourDownload and display 1JB0.pdb (one jay bee zero)restrict not protein and not hohcolor cpkDisplay - ball and stickselect magnesiumlabel %rset fontsize 12set fontstroke 2color labels yellowFind the pseudo 2-fold axisHow many Mg are there?What are the residue numbers of the“special pair” of chlorophylls?

Photosystem I : Guided tour(select the special pair using select XXX or YYY)spacefillselect hetero and not hohHow are the B-factorslabels offdistributed?color temperatureWhich side is moreordered? Chain A orchain B?Was NCS 2-fold symmetryenforced during refinement?Guess what:2-fold symmetry was notenforced during evolution!

Photosystem I : Guided tourFind the name of the lipid that does not havea phosphate group.Unix shortcut: use grepgrep ”HETNAM” 1JB0.pdbCharacterize the environmentof the lipid. Could it have a role inthe light harvest process?select [LMG]restrict selectedcenter selectedselect within (11., [LMG]) and proteinDisplay - ball and stickcolor cpkselect within (11., [LMG]) and ligand

Photosystem I : Guided tourselect within (11., [LMG]) and ligandspacefill 1.5color greenselect within (11., [LMG]) and *.MGspacefill 1.5color whiteselect within (11., [LMG]) and [PQN]color redselect within (11., [LMG]) and solventspacefill 1.0What is PQN?color cyanHow close is it to the nearestmagnesium

Photosystem I : Guided tourrestrict ligandwireframecolor cpkDisplay- ball and stickselect [CL1] or [CL2]wireframecolor greenselect [PQN]color magentaspacefill 1.0select [BCR]color orangespacefill 1.5select *.MGspacefill 1.0color whiteLight harvesting complexTrace the path of the electronsfrom the special pair to the twoquinones.Are any of the pigmentsconnected to the special pair?

Photosystem I : Guided tourrestrict [PQN]Environment of the quinonesspacefillcolor cpkWhich quinone is moreselect within (11.,[PQN]) and proteinwireframe 0.5loosely-boundcolor cpkselect within (11.,[PQN]) and ligand and not [PQN]color greenwireframe 0.5select within (11.,[PQN]) and solventspacefill 0.6How does the electron getcolor cyanfrom one quinone to theother? What protein sidechainforms a bridge?

Bioinformatics Sequence alignment Database searching Significance, e-values Trees Gene ontology Protein Structure 1. Experimental origins of sequence data F Each color is one lane of an electrophoresis gel. The Sanger