Computing Haplotype Frequencies And Haplotype Phasing Via The .

Transcription

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmComputing Haplotype Frequencies and HaplotypePhasing via the Expectation Maximization (EM)AlgorithmSorin IstrailDepartment of Computer ScienceBrown University, Providencesorin@cs.brown.eduSeptember 14, 2010Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmOutline1Outline2The Algorithm by one example, firstProblem definition3The solutionThe InputThe number of 00 haplotypes in the input(t 1)Computing θ004The EM AlgorithmSorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmProblem definitionEM by one ExampleProblem: Consider two loci with two allele 0 and 1 at eachlocus.Given: (We observe) the genotypes of the individuals at bothloci.Find: The estimate at the haplotype frequencies.Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe InputThe number of 00 haplotypes in the input(t 1)Computing θ00SolutionThere are a total of four possible haplotypes 00, 01, 10, 11 atthe two loci.Let us denote their frequencies by θ00 , θ01 , θ10 , θ11 .(t)(t)(t)(t)Suppose that we have computed already θ00 , θ01 , θ10 , θ11 .(t 1)We want to compute θ00Sorin Istrail(t)(t)(t)(t)as a function of θ00 , θ01 , θ10 , θ11 .Computing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe InputThe number of 00 haplotypes in the input(t 1)Computing θ00The Genotype Sample: several types A, B, C , D, E , FThere are nA genotypes or individuals of type 22 - we denoteYA the set of such genotypesThere are nB genotypes or individuals of type 02There are nC genotypes or individuals of type 20There are nD genotypes or individuals of type 00There are nE genotypes or individuals of type 11Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe InputThe number of 00 haplotypes in the input(t 1)Computing θ00The fraction of the genotypes in each category thatcontains the 00 haplotype(A) For the A group of nA individuals the possible haplotypes00or 01show as follows in explanations of the genotypes: 1110(the ”fractions” represent the separation of mother-father)chromosomes.(t) (t)(t) (t)P(YA ) 2θ00 θ11 2θ01 θ1000P( 11 YA ) (t) (t)2θ00 θ11(t) (t)(t) (t)2θ00 θ11 2θ01 θ10Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe InputThe number of 00 haplotypes in the input(t 1)Computing θ00The fraction of the genotypes in each category thatcontains the 00 haplotype (continued)For group B one haplotype is 00 and the other one is 01For group C one haplotype is 00 and the other one is 10For group D both haplotypes are 00For group E both haplotypes are 11Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe InputThe number of 00 haplotypes in the input(t 1)Computing θ00(t 1)Computing θ00Therefore the total expected number of 00 haplotypes are:(t 1)n00 nA P( 0011 YA ) nB nC 2nDso we update(t 1)θ00(t 1) n002nwhere n nA nB nC nD nESorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe EM AlgorithmThe EM algorithm is an iterative method to computesuccessive sets of haplotype frequencies p1 , p2 , ., pT starting(0) (0)(0)with some initial arbitrary values p1 , p2 , ., pTThose initial values are used as used as if they were theunknown true frequencies to estimate the explanationfrequencies P(hk hl )(0) . This is the Expectation step.These expected explanation frequencies are used in turn toestimate haplotype frequencies at the next iteration(1) (1)(1)p1 , p2 , ., pT . This is the Maximization step. and so on until convergence is reached (i.e., when thechanges in haplotype frequency in consecutive iterations areless than some small value ( ).Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmEM Algorithm initialization1All explanations are equally likelyPj (hk hl )(0) c1j , 1 j mwhere m is the total number of genotypes in the input; andn1 , n2 , ., nm are the counts for each genotype type.2All haplotypes are equally frequent in the sample.3Complete Linkage Equilibrium: Haplotype frequencies theproduct of single locus allele frequencies4Initial haplotype frequencies are picked at random.Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe E StepThe Expectation step at the tth iteration consists of using thehaplotype frequencies in the previous iteration to calculate theprobability of resolving each genotype into different possibleexplanations:PcjPcjPj i 1P(explanationi ) i 1P(hik hil )if k l then P(hk hl ) pk2if k 6 l then P(hk hl ) 2pk plwhere a1 is a constant term and pik and pil are the populationfrequencies of the corresponding haplotypes.Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe E Step (continued)The likelihood of the haplotype frequencies given thegenotype counts n1 , n2 , ., nm iscjm XYL(p1 , ., pT ) a1 (P(hik hil ))njj 1 i 1PTwhere i 1 1, and(hik hil ), 1 i cj are the set ofexplanations of the jth genotype that occurs nj times in theinput.Pcj(t)Let Pj i 1P(hik hil )(t)Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe E Step formulaThe E Step formula is:P(hk hl )(t)Pj (hk hl )(t) Pc(t)ji 1 PjSorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

OutlineThe Algorithm by one example, firstThe solutionThe EM AlgorithmThe M StepHaplotype frequencies are then computed for eachMaximization step: for 1 r Tm(t 1)prcj1 XXδir Pj (hik hil )(t) 2j 1 i 1where δit is an indicator variable equal to the number of timeshaplotype t is present in explanation i; and this number canbe 0, 1 or 2.Sorin IstrailComputing Haplotype Frequencies and Haplotype Phasing via the

haplotype frequencies in the previous iteration to calculate the probability of resolving each genotype into di erent possible explanations: P j P c j i 1 P(explanation i) P c j i 1 P(h ikh il) if k l then P(h kh l) p2 k if k 6 l then P(h kh l) 2p kp l where a 1 is a constant term and p