Finding Regulatory Motifs In DNA Sequences - Bioinformatics

Transcription

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoFinding Regulatory Motifs inDNA Sequences

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoOutline Implanting Patterns in Random TextGene RegulationRegulatory MotifsThe Gold Bug ProblemThe Motif Finding ProblemBrute Force Motif FindingThe Median String ProblemSearch TreesBranch-and-Bound Motif SearchBranch-and-Bound Median String SearchConsensus and Pattern Branching: Greedy Motif SearchPMS: Exhaustive Motif Search

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoRandom cacgaagcttctgggtactgatagca

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoImplanting the Motif: tctaatagcacgaagcttAAAAAAAAGGGGGGGa

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoBut where is it cgaagcttaaaaaaaaggggggga

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoImplanting the Motif with Four tagcacgaagcttActAAAAAGGaGcGGa

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoOh, geez. Where is it acgaagcttactaaaaaggagcgga

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoWhy is Finding a (15,4) Motif acgaagcttActAAAAAGGaGcGGaAgAAgAAAGGttGGG. . . . cAAtAAAAcGGcGGG

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoChallenge Problem Find a motif in a sample of- 20 “random” sequences (e.g. 600 nt long)- each sequence containing an implantedpattern of length 15,- each pattern appearing with 4 mismatchesas (15,4)-motif.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoCombinatorial Gene Regulation A microarray experiment showed that whengene X is knocked out, 20 other genes arenot expressed How can one gene have such drasticeffects?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoRegulatory Proteins Gene X encodes regulatory protein, a.k.a. atranscription factor (TF) The 20 unexpressed genes rely on gene X’s TF toinduce transcription A single TF may regulate multiple genes

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoRegulatory Regions Every gene contains a regulatory region (RR) typicallystretching 100-1000 bp upstream of the transcriptionalstart site Located within the RR are the Transcription FactorBinding Sites (TFBS), also known as motifs, specificfor a given transcription factor TFs influence gene expression by binding to a specificlocation in the respective gene’s regulatory region TFBS

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoTranscription Factor Binding Sites A TFBS can be located anywhere within theRegulatory Region. TFBS may vary slightly across differentregulatory regions since non-essential basescould mutate

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotifs and Transcriptional Start Cgene

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoTranscription Factors and Motifs

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Logo Motifs can mutate on nonimportant basesThe five motifs in fivedifferent genes havemutations in position 3and 5Representations calledmotif logos illustrate theconserved and variableregions of a motifTGGGGGATGAGAGATGGGGGATGAGAGATGAGGGA

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Logos: An Example(http://www-lmmb.ncifcrf.gov/ toms/sequencelogo.html)

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoIdentifying Motifs Genes are turned on or off by regulatory proteins These proteins bind to upstream regulatoryregions of genes to either attract or block anRNA polymerase Regulatory protein (TF) binds to a short DNAsequence called a motif (TFBS) So finding the same motif in multiple genes’regulatory regions suggests a regulatoryrelationship amongst those genes

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoIdentifying Motifs: Complications We do not know the motif sequence We do not know where it is located relative tothe genes start Motifs can differ slightly from one gene to thenext How to discern it from “random” motifs?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoA Motif Finding Analogy The Motif Finding Problem is similar to theproblem posed by Edgar Allan Poe (1809– 1849) in his Gold Bug story

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Problem Given a secret message:53 !305))6*;4826)4 .)4 );806*;48!8 60))85;]8*: *8!83(88)5*!;46(;88*96*?;8)* (;485);5*!2:* (;4956*2(5*-4)8 8*;4069285);)6!8)4 ;1( 9;48081;8:8 1;48!85;4)485!528806*81( 9;48;(88;4( ?34;48)4 ;161;:188; ?; Decipher the message encrypted inthe fragment

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoHints for The Gold Bug Problem Additional hints: The encrypted message is in English Each symbol correspond to one letter in theEnglish alphabet No punctuation marks are encoded

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Problem: Symbol Counts Naive approach to solving the problem: Count the frequency of each symbol in theencrypted message Find the frequency of each letter in thealphabet in the English language Compare the frequencies of the previoussteps, try to find a correlation and map thesymbols to a letter in the alphabet

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoSymbol Frequencies in the Gold Bug Message Gold Bug Message:Symbol8 ;4 ) *5 6 ( ! 1 0 2 9 3 : ? - ] .Frequency34191512251614119876554432111 English Language:etaoinsrhldcumfpgwybvkxjqzMost frequentLeast frequent

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Message Decoding: First Attempt By simply mapping the most frequentsymbols to the most frequent letters of itdrdtpdeetiwt The result does not make sense

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Problem: l-tuple count A better approach: Examine frequencies of l-tuples,combinations of 2 symbols, 3 symbols, etc. “The” is the most frequent 3-tuple inEnglish and “;48” is the most frequent 3tuple in the encrypted text Make inferences of unknown symbols byexamining other frequent l-tuples

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Problem: the ;48 clue Mapping “the” to “;48” and substituting alloccurrences of the symbols:53 !305))6*the26)h .)h )te06*the!e 60))e5t]e*: *e!e3(ee)5*!th6(tee*96*?te)* (the5)t5*!2:* (th956*2(5*h)e e*th0692e5)t)6!e)h t1( 9the0e1te:e 1the!e5th)he5!52ee06*e1( 9thet(eeth( ?3hthe)h t161t:1eet ?t

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Message Decoding: Second Attempt Make inferences:53 !305))6*the26)h .)h )te06*the!e 60))e5t]e*: *e!e3(ee)5*!th6(tee*96*?te)* (the5)t5*!2:* (th956*2(5*h)e e*th0692e5)t)6!e)h t1( 9the0e1te:e 1the!e5th)he5!52ee06*e1( 9thet(eeth( ?3hthe)h t161t:1eet ?t “thet(ee” most likely means “the tree” Infer “(“ “r” “th( ?3h” becomes “thr ?3h” Can we guess “ ” and “?”?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Gold Bug Problem: The Solution After figuring out all the mappings, the finalmessage ETOUT

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Solution (cont’d) Punctuation is important:A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEA,TWENY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH,MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OFTHE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT,FIFTY FEET OUT.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoSolving the Gold Bug Problem Prerequisites to solve the problem: Need to know the relative frequencies ofsingle letters, and combinations of two andthree letters in English Knowledge of all the words in the Englishdictionary is highly desired to makeaccurate inferences

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Finding and The Gold Bug Problem: Similarities Nucleotides in motifs encode for a message in the“genetic” language. Symbols in “The Gold Bug”encode for a message in English In order to solve the problem, we analyze thefrequencies of patterns in DNA/Gold Bug message. Knowledge of established regulatory motifs makesthe Motif Finding problem simpler. Knowledge of thewords in the English dictionary helps to solvethe Gold Bug problem.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoSimilarities (cont’d) Motif Finding: In order to solve the problem, we analyze thefrequencies of patterns in the nucleotide sequences In order to solve the problem, we analyze thefrequencies of patterns in the nucleotide sequences Gold Bug Problem: In order to solve the problem, we analyze thefrequencies of patterns in the text written in English

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoSimilarities (cont’d) Motif Finding: Knowledge of established motifs reducesthe complexity of the problem Gold Bug Problem: Knowledge of the words in the dictionary ishighly desirable

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Finding and The Gold Bug Problem: DifferencesMotif Finding is harder than Gold Bug problem: We don’t have the complete dictionary of motifs The “genetic” language does not have astandard “grammar” Only a small fraction of nucleotide sequencesencode for motifs; the size of data is enormous

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem Given a random sample of DNA tggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc Find the pattern that is “implanted” in each ofthe individual sequences, namely, the motif

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem (cont’d) Additional information: The hidden sequence is (for example) oflength 8 The pattern is not exactly the same in eacharray because random point mutations mayoccur in the sequences

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem (cont’d) The patterns revealed with no acgtacgtConsensus String

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem (cont’d) The patterns with 2 point tggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem (cont’d) The patterns with 2 point Can we still find the motif, now that we have 2 mutations?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoDefining Motifs To define a motif, lets say we know where themotif starts in the sequenceThe motif start positions in their sequences canbe represented as s (s1,s2,s3, ,st)

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotifs: Profiles and 4 Line up the patterns bytheir start indexess (s1, s2, , st) Construct matrix profilewith frequencies of eachnucleotide in columnsConsensusA C G T A C G T Consensus nucleotide ineach position has thehighest score in column

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoConsensus Think of consensus as an “ancestor” motif,from which mutated motifs emerged The distance between a real motif and theconsensus sequence is generally less thanthat for two real motifs

An Introduction to Bioinformatics AlgorithmsConsensus (cont’d)www.bioalgorithms.info

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoEvaluating Motifs We have a guess about the consensussequence, but how “good” is this consensus? Need to introduce a scoring function tocompare different guesses and choose the“best” one.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoDefining Some Terms t - number of sample DNA sequences n - length of each DNA sequence DNA - sample of DNA sequences (t x n array) l - length of the motif (l-mer) si - starting position of an l-mer in sequence i s (s1, s2, st) - array of motif’s startingpositions

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoParametersl acaccggcaacctgaaacaaacgctcagaaccagaagtgct acgGcn 69ss1 26s2 21s3 3s4 56s5 60

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoScoring Motifsl a G g t a c T tC c A t a c g ta c g t T A g ta c g t C c A tC c g t a c g GGiven s (s1, st) andDNA:Score(s,DNA) ACGTConsensusScoret3 0 1 0 3 1 1 02 4 0 0 1 4 0 00 1 4 0 0 0 3 10 0 0 5 1 0 1 4a c g t a c g t3 4 4 5 3 4 3 4 30

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem If starting positions s (s1, s2, st) are given,finding consensus is easy even withmutations in the sequences because we cansimply construct the profile to find the motif(consensus)But the starting positions s are usually notgiven. How can we find the “best” profilematrix?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem: Formulation Goal: Given a set of DNA sequences, find a set of lmers, one from each sequence, that maximizes theconsensus score Input: A t x n matrix of DNA, and l, the length of thepattern to find Output: An array of t starting positionss (s1, s2, st) maximizing Score(s,DNA)

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Motif Finding Problem: Brute Force Solution Compute the scores for each possiblecombination of starting positions s The best score will determine the best profile andthe consensus pattern in DNA The goal is to maximize Score(s,DNA) by varyingthe starting positions si, where:si [1, , n-l 1]i [1, , t]

An Introduction to Bioinformatics rch1.2.3.4.5.6.7.BruteForceMotifSearch(DNA, t, n, l)bestScore - 0for each s (s1,s2 , . . ., st) from (1,1 . . . 1)to (n-l 1, . . ., n-l 1)if (Score(s,DNA) bestScore)bestScore - score(s, DNA)bestMotif - (s1,s2 , . . . , st)return bestMotif

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoRunning Time of BruteForceMotifSearch Varying (n - l 1) positions in each of tsequences, we’re looking at (n - l 1)t sets ofstarting positionsFor each set of starting positions, the scoringfunction makes l operations, so complexity isl (n – l 1)t O(l nt)That means that for t 8, n 1000, l 10 wemust perform approximately 1020computations – it will take billions years

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Median String Problem Given a set of t DNA sequences find apattern that appears in all t sequences withthe minimum number of mutations This pattern will be the motif

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoHamming Distance Hamming distance: dH(v,w) is the number of nucleotide pairsthat do not match when v and w arealigned. For example:dH(AAAAAA,ACAAAC) 2

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoTotal Distance: Example Given v “acgtacgt” and sdH(v, x) tgtgcgaatctatgcgtttccaaccatdH(v, x) tttdH(v, x) 0dH(v, x) ccctattacatcttacgtacgtatacadH(v, x) tcgtacgctcgatcgttaacgtaGgtcv is the sequence in red, x is the sequence in blue TotalDistance(v,DNA) 1 0 2 0 1 4

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoTotal Distance: Definition For each DNA sequence i, compute all dH(v, x),where x is an l-mer with starting position si(1 si n – l 1) Find minimum of dH(v, x) among all l-mers insequence iTotalDistance(v,DNA) is the sum of the minimumHamming distances for each DNA sequence iTotalDistance(v,DNA) mins dH(v, s), where s isthe set of starting positions s1, s2, st

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Median String Problem: Formulation Goal: Given a set of DNA sequences, find amedian stringInput: A t x n matrix DNA, and l, the length ofthe pattern to findOutput: A string v of l nucleotides thatminimizes TotalDistance(v,DNA) over allstrings of that length

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMedian String Search Algorithm1.MedianStringSearch (DNA, t, n, l)2.bestWord - AAA AbestDistance - for each l-mer s from AAA A to TTT Tif TotalDistance(s,DNA) bestDistancebestDistance -TotalDistance(s,DNA)bestWord - sreturn bestWord3.4.5.6.7.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Finding Problem Median String Problem The Motif Finding is a maximization problem whileMedian String is a minimization problemHowever, the Motif Finding problem and MedianString problem are computationally equivalentNeed to show that minimizing TotalDistanceis equivalent to maximizing Score

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoWe are looking for the same thingla G g t a c T tC c A t a c g ta c g t T A g ta c g t C c A tC c g t a c g GAlignmentProfileACGT3 0 1 0 3 1 1 02 4 0 0 1 4 0 00 1 4 0 0 0 3 10 0 0 5 1 0 1 4Consensusa c g t a c g tScore3 4 4 5 3 4 3 4TotalDistance 2 1 1 0 2 1 2 1Sum5 5 5 5 5 5 5 5t At any column iScorei TotalDistancei t Because there are l columnsScore TotalDistance l * t Rearranging:Score l * t - TotalDistance l * t is constant the minimizationof the right side is equivalent tothe maximization of the left side

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Finding Problem vs.Median String Problem Why bother reformulating the Motif Findingproblem into the Median String problem? The Motif Finding Problem needs toexamine all the combinations for s. That is(n - l 1)t combinations!!! The Median String Problem needs toexamine all 4l combinations for v. Thisnumber is relatively smaller

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMotif Finding: Improving the Running TimeRecall the BruteForceMotifSearch:3.BruteForceMotifSearch(DNA, t, n, l)4.bestScore ß 0for each s (s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l 1, . . ., n-l 1)5.6.7.8.9.if (Score(s,DNA) bestScore)bestScore ß Score(s, DNA)bestMotif ß (s1,s2 , . . . , st)return bestMotif

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoStructuring the Search How can we perform the linefor each s (s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l 1, . . ., n-l 1) ? We need a method for efficiently structuringand navigating the many possible motifsThis is not very different than exploring all tdigit numbers

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMedian String: Improving the Running Time1.MedianStringSearch (DNA, t, n, l)2.bestWord ß AAA AbestDistance ß for each l-mer s from AAA A to TTT Tif TotalDistance(s,DNA) Word ß sreturn bestWord3.4.5.6.7.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoStructuring the Search For the Median String Problem we need to considerall 4l possible l-mers:laa aaaa acaa agaa at.tt ttHow to organize this search?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoAlternative Representation of the Search Space Let A 1, C 2, G 3, T 4Then the sequences from AA A to TT T become:l11 1111 1211 1311 14.44 44 Notice that the sequences above simply list all numbersas if we were counting on base 4 without using 0 as adigit

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoLinked List Suppose l 2Startaa acagatcacccgctgagcgggttatctgttNeed to visit all the predecessors of asequence before visiting the sequence itself

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoLinked List (cont’d) Linked list is not the most efficient data structure formotif findingLet’s try grouping the sequences by their prefixesaaacagatcacccgctgagcgggttatctgtt

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoSearch Treeroot--a-aaacagc-atcacccgg-ctgagcgggtt-tatctgtt

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoAnalyzing Search Trees Characteristics of the search trees: The sequences are contained in its leaves The parent of a node is the prefix of itschildrenHow can we move through the tree?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoMoving through the Search Trees Four common moves in a search tree that weare about to explore: Move to the next leaf Visit all the leaves Visit the next node Bypass the children of a node

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoVisit the Next LeafGiven a current leaf a , we need to compute the “next” leaf:2.3.4.5.6.7.8.NextLeaf( a,L, k )for i ß L to 1if ai kai ß ai 1return aai ß 1return a// a : the array of digits// L: length of the array// k : max digit value

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoNextLeaf (cont’d) The algorithm is common addition in radix k: Increment the least significant digit “Carry the one” to the next digit position whenthe digit is at maximal value

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoNextLeaf: Example Moving to the next leaf:--Current Location1-1112132-142122233-243132334-3441424344

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoNextLeaf: Example (cont’d) Moving to the next leaf:--Next Location1-1112132-142122233-243132334-3441424344

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoVisit All Leaves 3.4.5.6.7.8.9.Printing all permutations in ascending order:AllLeaves(L,k) // L: length of the sequencea ß (1,.,1)// k : max digit valuewhile forever // a : array of digitsoutput aa ß NextLeaf(a,L,k)if a (1,.,1)return

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoVisit All Leaves: Example Moving through all the leaves in order:--Order of 4112421343144415

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoDepth First Search So we can search leaves How about searching all vertices of the tree? We can do this with a depth first search

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoVisit the Next Vertex1.2.3.4.5.6.7.8.9.10.NextVertex(a,i,L,k) // a : the array of digitsif i L// i : prefix lengtha i 1 ß 1// L: max lengthreturn ( a,i 1)// k : max digit valueelsefor j ß l to 1if aj kaj ß aj 1return( a,j )return(a,0)

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoExample Moving to the next vertex:Current Location1-111213--2-142122233-2431323

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Finding Regulatory Motifs in . e t a o i n s r h l d c u m f p g w y b v k x j q z Most frequent Least frequent Symbol 8 ; 4 ) * 5 6 ( ! 1 0 2 9 3 : ?