A Unified Haplotype-based Method For Accurate And Comprehensive Variant .

Transcription

A unified haplotype-based method for accurate and comprehensive variant calling
Paper presentation by Chuanyi Zhang (UIUC)
Paper authors: Daniel P Cooke (1,2), David C Wedge (2), and Gerton Lunter (1)
1 Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
2 Big Data Institute, University of Oxford, Oxford, UK
March 7, 2019
Chuanyi Zhang (UIUC), Octopus, 1 / 23

Motivation & Background: Table of Contents
1 Motivation & Background
    Variant Calling
    Somatic Variant Calling
2 Method
    Cancer Prior
    Subclone genotype model
    Cancer calling model
3 Result

Motivation & Background: Variant Calling
Ideal scenario: enough read depth.
Pipeline: (1) read processing, (2) mapping, (3) calling.
Haplotype analysis: HaplotypeCaller in the Genome Analysis Toolkit (GATK).

Motivation & Background: Somatic Variant Calling
Differences from germline calling:
    Allele frequency assumption: purity, multiple subclones, CNA
    Low VAF vs. artifacts
    Somatic vs. germline: matched tumor-normal sample

Method: Overview
Source: bioRxiv preprint first posted online Oct. 29, 2018; doi: http://dx.doi.org/10.1101/456103. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
[Figure 1 panels: BAM processing; allele generation (pileup, assembly, input VCF); haplotype generation (grow, prune); haplotype likelihoods; genotype models; calling model; genotype posteriors; haplotype posteriors; phasing; variant filtering; VCF output]
Example VCF output shown in the figure:
    1  100  .  A     C  200   .  .  GT:PS  1|0:100  0|0:100
    1  130  .  G     T  1000  .  .  GT:PS  1|0:100  1|1:100
    1  200  .  ACGT  A  500   .  .  GT:PS  0|1:100  0|1:100
Figure 1 Overview of the unified haplotype-based algorithm, showing joint calling of two samples with the population calling model. Two SNVs (blue and red) are detected from read pileups, a deletion from local re-assembly, and a third SNV (yellow) from input VCF. The first two SNVs are added to the haplotype-tree, which then contains four haplotypes. After computing likelihoods for read-haplotype pairs, the haplotype posterior distribution computed by the calling model is used to prune the haplotype-tree by removing one haplotype (containing just the blue SNV). Next, the haplotype-tree is extended with the deletion, and the process repeats. The polymorphic calling model is shown in the green box.

Method: Haplotype generation
[Figure 1, cropped to the haplotype generation (grow, prune), haplotype likelihoods, genotype models and haplotype posteriors panels]
Haplotype tree: built from candidate alleles.
Pruning happens in two stages: (1) pre: by haplotype likelihood; (2) post: by haplotype posterior.
Figure 1 caption, final sentences: Only the population genotype model (Online Methods) is shown, in plate notation. Calling models also compute any model-specific inferences.
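The grow and prune cycle described on this slide can be sketched in a few lines. This is only an illustration under simplifying assumptions: haplotypes are plain tuples of alleles, the function names `extend_haplotypes` and `prune_haplotypes` and the threshold are hypothetical, and the posteriors are made-up numbers rather than the output of a calling model.

```python
from itertools import product

def extend_haplotypes(haplotypes, site_alleles):
    """Grow: branch every existing haplotype on each allele at the new site."""
    return [hap + (allele,) for hap, allele in product(haplotypes, site_alleles)]

def prune_haplotypes(haplotypes, posteriors, min_posterior):
    """Prune: keep haplotypes whose posterior reaches the threshold."""
    return [h for h in haplotypes if posteriors.get(h, 0.0) >= min_posterior]

# Two biallelic SNV sites: the tree holds 2 x 2 = 4 haplotypes.
haps = [()]
for site in [("A", "C"), ("G", "T")]:
    haps = extend_haplotypes(haps, site)

# Hypothetical haplotype posteriors; one haplotype is implausible.
post = {("A", "G"): 0.45, ("C", "T"): 0.45, ("A", "T"): 0.099, ("C", "G"): 0.001}
kept = prune_haplotypes(haps, post, min_posterior=0.01)

# Extend the survivors with a deletion site (reference allele vs. deletion).
grown = extend_haplotypes(kept, ("ACGT", "A"))
```

Pruning before each extension keeps the tree from doubling at every site, which is the point of interleaving the two steps.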

Method: Cancer Prior
Genotype Prior Models
1 Uniform
2 Hardy-Weinberg-Equilibrium (HWE)
3 Coalescent-HWE
4 Trio
5 Cancer
For ploidy $m$, genotypes: $g = (h_1, \dots, h_m)$;
for $n$ populations inside a tumor, joint genotypes: $g = (g_1, \cdots, g_n)$.
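As an illustration of the second prior in the list, the sketch below evaluates the textbook Hardy-Weinberg probability of an unordered genotype of arbitrary ploidy, which is a multinomial in the haplotype frequencies. The slide only names the model; the function name and the frequencies used here are assumptions for the example.

```python
from collections import Counter
from math import factorial, prod

def hwe_genotype_prior(genotype, hap_freqs):
    """Multinomial HWE probability of an unordered genotype (h_1, ..., h_m)."""
    counts = Counter(genotype)
    coeff = factorial(len(genotype))
    for c in counts.values():
        coeff //= factorial(c)          # m! / (c_1! c_2! ...)
    return coeff * prod(hap_freqs[h] ** c for h, c in counts.items())

# Hypothetical haplotype frequencies for a diploid example (m = 2).
freqs = {"h1": 0.7, "h2": 0.3}
p_het = hwe_genotype_prior(("h1", "h2"), freqs)   # 2 * 0.7 * 0.3
p_hom = hwe_genotype_prior(("h1", "h1"), freqs)   # 0.7 ** 2
```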

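Method: Cancer Prior
Cancer
$$g_{\mathrm{cancer}} = (g_{\mathrm{germ}}, g_{\mathrm{som}})$$
$$p(g_{\mathrm{cancer}} \mid M_{\mathrm{cancer}}) = p(g_{\mathrm{germ}} \mid M_{\mathrm{germ}})\, p(g_{\mathrm{som}} \mid g_{\mathrm{germ}}, M_{\mathrm{som}})$$
$M_{\mathrm{germ}}$ can be the Coalescent-HWE prior model.
If there is only 1 somatic haplotype:
$$p(g_{\mathrm{som}} \mid g_{\mathrm{germ}}, M_{\mathrm{som}}) = \frac{1}{|g_{\mathrm{germ}}|} \sum_{i=1}^{|g_{\mathrm{germ}}|} p(g_{\mathrm{som}} \mid g_{\mathrm{germ},i}, M_{\mathrm{som}})$$
If there are multiple somatic haplotypes (assume all haplotypes originate from germline, independently):
$$p(g_{\mathrm{som}} \mid g_{\mathrm{germ}}, M_{\mathrm{som}}) = \prod_{j=1}^{|g_{\mathrm{som}}|} p(g_{\mathrm{som},j} \mid g_{\mathrm{germ}}) = \prod_{j=1}^{|g_{\mathrm{som}}|} \frac{1}{|g_{\mathrm{germ},j}|} \sum_{i=1}^{|g_{\mathrm{germ},j}|} p(g_{\mathrm{som},j} \mid g_{\mathrm{germ},j,i}, M_{\mathrm{som}})$$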
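The structure of the somatic prior, a uniform average over germline haplotypes of origin multiplied across somatic haplotypes, can be sketched as follows. The slide does not define $p(g_{\mathrm{som}} \mid g_{\mathrm{germ},i}, M_{\mathrm{som}})$, so the per-base mutation model below (`mutation_prob`, rate `mu`, equal-length haplotype strings) is purely a stand-in assumption, and every somatic haplotype is averaged over the same germline genotype.

```python
def mutation_prob(somatic, germline, mu=1e-3):
    """Hypothetical model: each base mutates independently with probability mu."""
    if len(somatic) != len(germline):
        raise ValueError("haplotypes must have equal length in this sketch")
    d = sum(a != b for a, b in zip(somatic, germline))
    return mu ** d * (1 - mu) ** (len(germline) - d)

def somatic_prior(somatic_haps, germline_haps, mu=1e-3):
    """Product over somatic haplotypes of a uniform mixture over germline origins."""
    prior = 1.0
    for som in somatic_haps:
        mixture = sum(mutation_prob(som, germ, mu) for germ in germline_haps)
        prior *= mixture / len(germline_haps)
    return prior

germline = ["ACGT", "ACGA"]
p_one = somatic_prior(["ACTT"], germline)           # single somatic haplotype
p_two = somatic_prior(["ACTT", "ACTT"], germline)   # two independent ones
```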

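Method: Subclone genotype model
Graphical model & joint posterior
[Figure: graphical models in plate notation. a Population. b Trio. c Subclone.]
We want to know the joint posterior distribution
$$p(g, \pi \mid R, \alpha, M_g) = \frac{p(\pi, g, \alpha, M_g, R)}{p(\alpha, M_g, R)} = \frac{p(R \mid \pi, g)\, p(\pi \mid \alpha)\, p(g \mid M_g)}{p(\alpha, M_g, R)}$$
$$\propto p(g \mid M_g) \prod_{s=1}^{S} p(R_s \mid \pi_s, g)\, p(\pi_s \mid \alpha_s)$$
For the subclone model (c), with mixture proportions $\phi_s$ integrated out:
$$p(g \mid M_g) \prod_{s=1}^{S} \int p(R_s \mid \phi_s, g)\, p(\phi_s \mid \alpha_s)\, d\phi_s = p(g \mid M_g) \prod_{s=1}^{S} \int \prod_{r \in R_s} \sum_{i=1}^{|g|} \phi_{si}\, p(r \mid h_i)\, p(\phi_s \mid \alpha_s)\, d\phi_s$$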
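For a fixed value of the mixture proportions, the integrand above is easy to evaluate: each read contributes a weighted sum of its haplotype likelihoods. The sketch below computes $\log p(R_s \mid \phi_s, g)$ in log space; the input layout (one row of per-haplotype log-likelihoods per read) and the numbers are assumptions for the example.

```python
from math import exp, log

def log_sum_exp(values):
    top = max(values)
    return top + log(sum(exp(v - top) for v in values))

def mixture_log_likelihood(read_hap_loglik, phi):
    """log p(R_s | phi_s, g) = sum_r log sum_i phi_si * p(r | h_i)."""
    total = 0.0
    for row in read_hap_loglik:
        total += log_sum_exp([log(w) + ll for w, ll in zip(phi, row) if w > 0])
    return total

# Two reads, two haplotypes; entries are hypothetical log p(r | h_i).
loglik = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
value = mixture_log_likelihood(loglik, phi=[0.5, 0.5])
```

Evaluating the likelihood for one $\phi_s$ is cheap; the difficulty discussed on the following slide comes from integrating it over all $\phi_s$.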

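Method: Subclone genotype model
Problem of computing
This posterior is intractable: with $\phi_s \sim \mathrm{Dir}(\alpha_s)$, the integrand over $\phi_s$ is a product over reads of sums over haplotypes, which expands into exponentially many terms. $\phi$ is a latent variable.
Using Variational Bayes (VB): approximate the intractable posterior $p^{*}(x) = p(x \mid D)$ with $q(x)$. Maximize $L(q) = -D_{\mathrm{KL}}(q \,\|\, \tilde{p})$ (rather than a divergence against $p^{*}$ itself, which would require the intractable $p(D)$), where $\tilde{p}(x) = p(x, D) = p^{*}(x)\, p(D)$.
$$L(q) = -\mathbb{E}_q\!\left[\log \frac{q(x)}{\tilde{p}(x)}\right] = -\int q(x) \log \frac{q(x)}{p^{*}(x)\, p(D)}\, d\mu(x)$$
$$= -\int q(x) \left[\log \frac{q(x)}{p^{*}(x)} - \log p(D)\right] d\mu(x) = -D_{\mathrm{KL}}(q \,\|\, p^{*}) + \log p(D) \le \log p(D)$$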
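The identity $L(q) = \log p(D) - D_{\mathrm{KL}}(q \,\|\, p^{*})$ can be checked numerically on a small discrete example. The joint values below are arbitrary numbers chosen for the check, not quantities from the paper.

```python
from math import log

joint = [0.10, 0.25, 0.05]                  # p~(x) = p(x, D) for three values of x
evidence = sum(joint)                       # p(D)
posterior = [j / evidence for j in joint]   # p*(x) = p(x | D)

def elbo(q):
    """L(q) = -sum_x q(x) log(q(x) / p~(x))."""
    return -sum(qx * log(qx / jx) for qx, jx in zip(q, joint) if qx > 0)

def kl(q, p):
    return sum(qx * log(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.5, 0.3, 0.2]
gap = log(evidence) - elbo(q)   # equals KL(q || p*), hence non-negative
```

Setting `q` equal to `posterior` closes the gap, so the bound is attained exactly at the true posterior.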

Method: Subclone genotype model
ELBO
[Figure: textbook excerpt, section 9.4 "The EM Algorithm in General", p. 451, illustrating the decomposition $\ln p(X \mid \theta) = L(q, \theta) + \mathrm{KL}(q \,\|\, p)$, which holds for any choice of distribution $q(Z)$. Because the Kullback-Leibler divergence satisfies $\mathrm{KL}(q \,\|\, p) \ge 0$, the quantity $L(q, \theta)$ is a lower bound on the log likelihood function $\ln p(X \mid \theta)$.]
$L(q)$ is the evidence lower bound (ELBO). The maximizer is $q = p^{*}$.

Method: Subclone genotype model
VB cont'
Bayes:
$$p(x \mid D) = \frac{p(D \mid x)\, p(x)}{p(D)} \quad \text{(posterior = likelihood · prior / evidence)}$$
Assume the factorization
$$q(g, Z, \phi) = q(g) \prod_{s=1}^{S} q(Z_s)\, q(\phi_s)$$
where we introduce the latent binary matrix $Z_s$; $q(Z_{snk})$ are the so-called responsibilities of haplotype $k$ for read $n$ in sample $s$. With this (mean-field) factorization we can optimize the factors alternately. Moreover, if we assume the priors are Dirichlet, then prior and posterior are conjugate: the categorical (likelihood) and Dirichlet distributions are a conjugate pair.
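A minimal sketch of the alternating updates, under strong simplifying assumptions: a single sample, the genotype held fixed (so the $q(g)$ factor is omitted), and read-haplotype likelihoods given as plain numbers. The update rules are the standard mean-field ones for a mixture with known components and a Dirichlet prior on the proportions; the function names are hypothetical and the digamma function is approximated by a recurrence plus an asymptotic series to keep the example dependency-free.

```python
from math import exp, log

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vb_mixture(read_hap_lik, alpha, iterations=50):
    """Alternate q(Z) and q(phi) updates; returns responsibilities and posterior alpha."""
    post_alpha = list(alpha)
    resp = []
    for _ in range(iterations):
        # q(Z) update: responsibilities use E_q[log phi_k] under the current Dirichlet.
        total = digamma(sum(post_alpha))
        e_log_phi = [digamma(a) - total for a in post_alpha]
        resp = []
        for row in read_hap_lik:
            weights = [exp(e) * lik for e, lik in zip(e_log_phi, row)]
            norm = sum(weights)
            resp.append([w / norm for w in weights])
        # q(phi) update: conjugacy gives a Dirichlet with prior plus expected counts.
        post_alpha = [a + sum(r[k] for r in resp) for k, a in enumerate(alpha)]
    return resp, post_alpha

# Ten hypothetical reads: eight favour haplotype 0, two favour haplotype 1.
liks = [[0.9, 0.1]] * 8 + [[0.1, 0.9]] * 2
resp, post_alpha = vb_mixture(liks, alpha=[1.0, 1.0])
```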

Method: Cancer calling model
Calling Model
Assume 3 possible cases:
1 No somatic mutations, clean germline: the individual model with any germline prior (merge), $M_{\mathrm{ind}}$
2 Copy number changes, but no somatic mutations: the subclone model with a germline prior (e.g. Coalescent-HWE), $M_{\mathrm{CNV}}$
3 Somatic mutations occur, possibly with CNA: the subclone model with the cancer genotype prior, $M_{\mathrm{somatic}}$

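Method: Cancer calling model
Calling Model
Germline genotype posterior
$$p(g \mid R) = \sum_x p(g, M_x \mid R) = \sum_x p(g \mid M_x, R)\, p(M_x \mid R)$$
$$= p(g \mid M_{\mathrm{ind}}, R)\, p(M_{\mathrm{ind}} \mid R) + p(g \mid M_{\mathrm{CNV}}, R)\, p(M_{\mathrm{CNV}} \mid R) + p(g \mid M_{\mathrm{somatic}}, R)\, p(M_{\mathrm{somatic}} \mid R)$$
where $p(M_x \mid R) \propto p(M_x)\, p(R \mid M_x)$, and $p(R \mid M_x)$ is the "evidence"; and
$$p(g \mid M_{\mathrm{somatic}}, R) = \sum_{\tilde{g}\,:\,g \in \tilde{g}} p(\tilde{g} \mid M_{\mathrm{somatic}}, R), \quad \tilde{g} = (g_{\mathrm{germ}}, g_{\mathrm{som}}), \text{ from the cancer prior}$$
Germline allele posterior
$$p(a \mid R) = \sum_{g\,:\,a \in g} p(g \mid R)$$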
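The model averaging on this slide amounts to weighting each model's genotype posterior by the model posterior and then summing over genotypes that contain the allele. A small sketch with made-up priors, log evidences and per-model genotype posteriors (all names and numbers here are assumptions for illustration):

```python
from math import exp, log

def model_posteriors(log_priors, log_evidences):
    """p(M_x | R) proportional to p(M_x) * p(R | M_x), normalised in log space."""
    scores = {m: log_priors[m] + log_evidences[m] for m in log_priors}
    top = max(scores.values())
    weights = {m: exp(s - top) for m, s in scores.items()}
    norm = sum(weights.values())
    return {m: w / norm for m, w in weights.items()}

def germline_genotype_posterior(model_post, genotype_post_by_model):
    """p(g | R) = sum_x p(g | M_x, R) * p(M_x | R)."""
    combined = {}
    for model, weight in model_post.items():
        for genotype, p in genotype_post_by_model[model].items():
            combined[genotype] = combined.get(genotype, 0.0) + weight * p
    return combined

def allele_posterior(allele, genotype_post):
    """p(a | R) = sum of p(g | R) over genotypes g containing allele a."""
    return sum(p for g, p in genotype_post.items() if allele in g)

log_priors = {m: log(1 / 3) for m in ("ind", "cnv", "somatic")}
log_evidences = {"ind": -10.0, "cnv": -11.0, "somatic": -12.0}
per_model = {
    "ind": {("A", "A"): 0.9, ("A", "C"): 0.1},
    "cnv": {("A", "A"): 0.6, ("A", "C"): 0.4},
    "somatic": {("A", "A"): 0.5, ("A", "C"): 0.5},
}
m_post = model_posteriors(log_priors, log_evidences)
g_post = germline_genotype_posterior(m_post, per_model)
p_c = allele_posterior("C", g_post)
```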

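Method: Cancer calling model
Credible somatic mass
$p_{\mathrm{somatic}}(a \mid R)$ ? $p(\tilde{g} \mid M_{\mathrm{somatic}}, \text{credible})$
There are $K$ somatic haplotypes; the credible somatic frequencies then satisfy
$$p(\phi_{sk} \ge \tau \mid M_{\mathrm{somatic}}) = \int_{\tau}^{1} \mathrm{Beta}\!\left(\theta;\ \alpha_k,\ \sum_{i=1}^{K} \alpha_i - \alpha_k\right) d\theta$$
where $\phi_{sk} \sim \mathrm{Beta}\!\left(\alpha_k, \sum_{i} \alpha_i - \alpha_k\right)$ since $\phi_s \sim \mathrm{Dir}(\alpha_s)$, i.e.
$$p(\phi_s) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \phi_{sk}^{\alpha_k - 1}$$
The credible somatic mass is
$$\lambda_s = 1 - \prod_{k} \left(1 - p(\phi_{sk} \ge \tau \mid M_{\mathrm{somatic}})\right)$$
which is the probability mass of at least 1 of the $K$ somatic haplotypes being credible.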
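The tail probability and $\lambda_s$ can be evaluated numerically. The sketch below integrates the Beta density with a midpoint rule instead of calling a library incomplete-beta routine, and combines the per-haplotype tails with the product formula from the slide. The function names, the `somatic_indices` argument and the example concentration parameters are assumptions.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def beta_tail(tau, a, b, steps=20000):
    """P(theta >= tau) for theta ~ Beta(a, b), by midpoint-rule integration."""
    width = (1.0 - tau) / steps
    return width * sum(beta_pdf(tau + (i + 0.5) * width, a, b) for i in range(steps))

def credible_somatic_mass(alphas, somatic_indices, tau):
    """lambda_s = 1 - prod_k (1 - P(phi_sk >= tau)) over the somatic components."""
    total = sum(alphas)
    none_credible = 1.0
    for k in somatic_indices:
        # Dirichlet marginal: phi_k ~ Beta(alpha_k, sum(alpha) - alpha_k).
        none_credible *= 1.0 - beta_tail(tau, alphas[k], total - alphas[k])
    return 1.0 - none_credible

# Hypothetical posterior concentrations: one germline and two somatic components.
lam = credible_somatic_mass([1.0, 1.0, 1.0], somatic_indices=[1, 2], tau=0.5)
```

The midpoint rule avoids evaluating the density at the endpoints, where it can diverge for concentration parameters below 1; accuracy in that regime would need a finer grid or a proper incomplete-beta implementation.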

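Method: Cancer calling model
Calling allele
The per-sample credible somatic masses $\lambda_s$, $s \in \{1 \cdots S\}$, are combined into a single factor $\lambda$, and then
$$p_{\mathrm{somatic}}(a \mid R) = \lambda \sum_{\tilde{g}} \mathbb{1}\{a \notin \tilde{g}.\mathrm{germ} \wedge a \in \tilde{g}.\mathrm{som}\}\, p(\tilde{g} \mid R, M_{\mathrm{somatic}})$$
Note on the expression for $\lambda$ in terms of the $\lambda_s$: Might be a typo?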

Result: Synthetic Tumors
Evaluation of somatic mutation calling is challenging. Options:
    Real tumor with manually inspected mutations
    Mix reads from unrelated individuals
    Spike mutations directly into raw sequencing reads from healthy tissue

Result: Synthetic Tumors
[Figure 3 panels, with example tracks for NA12878, VAF 15% and VAF 30%]
1. Select sample with known germline
2. Assign reads to germline haplotypes
3. Realign reads to germline haplotypes
4. Sample PCAWG tumour-specific calls
5. Spike PCAWG mutations onto reads
6. Remap spiked reads
Notes:
    Reads from NA12878: 300 {30, 35, 60, 65}
    Assign and realign to make sure spiked mutations fall on the same haplotype and are position-consistent
    Sample mutations from the pan-cancer analysis of whole genomes (PCAWG) uniformly
    Produce raw unmapped read (FASTQ) files
Figure 3 Overview of synthetic tumour creation.
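Step 5 can be illustrated with a toy spike-in routine. This is a sketch under assumptions that are not taken from the paper: reads are dictionaries with hypothetical `start`, `seq` and `hap` fields, only SNVs are handled, and the fraction of reads to modify on the chosen haplotype is set to twice the target VAF on the assumption that each of two germline haplotypes carries about half the reads.

```python
import random

def spike_snv(reads, haplotype, position, alt_base, fraction, rng):
    """Write alt_base at `position` into a random `fraction` of the reads that
    are assigned to `haplotype` and cover the position; other reads are copied."""
    spiked = []
    for read in reads:
        seq = read["seq"]
        offset = position - read["start"]
        eligible = read["hap"] == haplotype and 0 <= offset < len(seq)
        if eligible and rng.random() < fraction:
            seq = seq[:offset] + alt_base + seq[offset + 1:]
        spiked.append({**read, "seq": seq})
    return spiked

reads = [
    {"start": 0, "seq": "ACGT", "hap": 0},
    {"start": 2, "seq": "GTAC", "hap": 0},
    {"start": 0, "seq": "ACGT", "hap": 1},    # other haplotype: never modified
    {"start": 10, "seq": "AAAA", "hap": 0},   # does not cover position 3
]
target_vaf = 0.15
rng = random.Random(0)
tumour_reads = spike_snv(reads, haplotype=0, position=3, alt_base="C",
                         fraction=2 * target_vaf, rng=rng)
```

Restricting the edit to reads already assigned to one germline haplotype is what keeps a spiked mutation consistently phased with nearby germline variants.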

Result: Somatic Mutations Calling Accuracy
[Figure 4, panels a and b]
Figure 4 Somatic mutation calling accuracy for synthetic skin and breast tumours with a paired normal sample. a Precision-recall curves. Scoring metrics used to generate curves were RFQUAL (Octopus), TLOD (Mutect2), SomaticEVS (Strelka2), QUAL (Lancet), QUAL (LoFreq), SSF (VarDict), and QUAL (Platypus). Only PASS calls are used. VarDict is not visible as it is outside the axis limits due to low precision. Precisions on the two tests are substantially different as the skin set has almost 50 times as many true mutations as the breast set.
1 Precision-recall curve: top-right is optimal
2 Recalls for each VAF, using PASS variants
3 Most differences in recall are due to low frequencies
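A precision-recall curve of the kind shown in panel a is obtained by sweeping a threshold over a per-call score (RFQUAL, TLOD and so on) and recomputing precision and recall at each step. The sketch below uses made-up scores and truth labels; the function name and input layout are assumptions, and tied scores are not merged into a single point.

```python
def precision_recall_curve(calls, n_true):
    """calls: (score, is_true) pairs; n_true: number of true mutations in the
    truth set. Returns (threshold, precision, recall) from strictest threshold
    to most permissive."""
    points = []
    tp = fp = 0
    for score, is_true in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((score, tp / (tp + fp), tp / n_true))
    return points

# Four hypothetical PASS calls; the truth set holds 4 mutations, one is missed.
calls = [(50, True), (40, True), (30, False), (20, True)]
curve = precision_recall_curve(calls, n_true=4)
```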

Result: Low coverage
[Figure, panel (a)]

Result: Somatic Mutations Calling Accuracy without paired normal
[Figure: cropped excerpt from the preprint, panels a and b. The visible text fragments state that, unlike Pisces, Octopus provides a measure of somatic classification, and that both callers were also evaluated on combined germline sets, ignoring somatic classification, with little difference in sensitivity between the callers (Supplementary Fig. 9). A cropped heading reads "Phasing somatic mutations".]
1 Octopus is able to discover mutations even without a paired normal sample

sented in the haplotype-tree explain all surrounding reads sufficiently well. Haplotype likelihoods are computed for each read and haplotype using a hidden Markov model (HMM) with context-aware single nucleotide variant (SNV) and indel penalties. These likelihoods are the input to a polymorphic genotype call-