Molecular Barcodes - FAQ - Agilent Technologies

Transcription

Molecular Barcodes - FAQDetect low allele frequency variants with HaloPlexHSAuthorsQ&A IntroductionAnniek De Witte1, Katie L. Zobeck1,Linus Forsmark , Christian LeCocq ,11Heather Tao1, Bahram Arezi2 , MagnusIsaksson11Agilent Technologies, Inc. Santa Clara,CA, USA. 2 Agilent Technologies, Inc.La Jolla, CA, USAWhat is NGS?Next generation sequencing (NGS), massively parallel sequencing, and high-throughputsequencing are related terms that describe an innovation in DNA sequencing technologythat allows for the sequencing of millions of small fragments of DNA at the same time.The result of this modern technology is a massive increase in the number of base pairssequenced compared to the standard Sanger sequencing method.Among the most popular applications using NGS are genomic DNA variant analysis andRNA expression analysis. The extent of these analyses can be as wide as the wholegenome (WGS or whole genome sequencing) and whole exome (WES or whole exomesequencing), or as focused as specific regions and gene panels.While WGS may capture all possible mutations, WES and in particular targeted sequencing,remains advantageous for achieving very high coverage of the regions of interest whilestaying cost-effective and keeping the complexity of data interpretation manageable.Having very high sequencing coverage is especially important for discovering cancermutations present at low fractions (1).How can a target region be enriched before NGS?There are several ways to enrich regions of interest before NGS. The two most commonlyused approaches are: 1) hybridization capture from sequencing libraries using targetspecific probes and 2) PCR amplification directly from sample DNA using target-specificprimers (2, 3).With hybrid capture methods, DNA fragments are hybridized in solution to sequencespecific capture probes corresponding to targeted regions of the genome. Examples ofhybridization capture technology include Agilent SureSelect, NimbleGen SeqCap, andIllumina TruSeq (4).Amplicon sequencing enriches target genes by PCR with a set of primers for the exonsof selected genes prior to NGS. Examples of amplicon capture technology include purePCR-based methods such as Ion Torrent AmpliSeq, RainDance ThunderBolts or IlluminaTruSeq amplicon, and hybridization and extension methods such as Agilent HaloPlexHS (5).

What are potential sources oferrors in NGS?Advancements in NGS and targetenrichment have made it possible todetect low-occurrence mutations in aheterogeneous mixture. However, furtheradvancements are hindered by systemicerrors in PCR and sequencing methods.Library preparation, target enrichment,and sequencing all use DNA polymeraseand amplification steps. These processesintroduce bias in the form of duplicatesand associated non-uniform amplificationand artifacts defined as polymerase errorsthat generate sequence changes notpresent in the original sample (1). To putit in numbers, with the commonly usedIllumina sequencing instruments, theerror rate can vary from 1% to 0.05%,depending on factors such as the readlength, base-calling algorithms used, andthe type of variants detected (6). Molecularbarcodes can enhance confidence,particularly in clinical research sampleswherein the mutation prevalence can be inthe same range as the error rate.Q&A Molecular BarcodesWhat are molecular barcodes?The concept of molecular barcodes isthat each original DNA fragment, withinthe same sample, is attached to a uniquesequence barcode. Molecular barcodesare usually designed as a string of totallyrandom nucleotides (such as NNNNNNN),partially degenerate nucleotides (such asNNNRNYN), or defined nucleotides (whentemplate molecules are limited). Molecularbarcodes allow for the accountingof sequencer and PCR errors in highcoverage NGS data.What is the difference betweenmolecular barcodes and samplebarcodes?It is common to have both samplebarcodes and molecular barcodes inthe same sequencing reads. Molecularbarcodes can be used to reduce falsepositives. Whereas sample barcodes, alsocalled indexed adaptors, are customarilyused in most current NGS workflowsand allow the mixing of samples priorto sequencing. These sample barcodesact like identifiers or tags that let youdetermine from which sample the readoriginated, enabling the assaying ofmultiple samples in a single run. Samplebarcodes are typically short specificsequences that are incorporated prior toDNA sequencing. The barcodes will besequenced together with the unknownsample DNA. After sequencing thereads are sorted by barcode and groupedtogether (demultiplexing).Why do molecular barcodes matter?The concept of molecular barcodesis that each fragment is attached to aunique molecular barcode. Sequencereads that have different molecularbarcodes represent different original DNAmolecules, while reads that have the samebarcodes are the result of PCR duplicationfrom the same original molecule.Molecular barcodes do not prevent PCRduplication from happening but theyallow users to track the duplicates andremove them from the downstreamanalysis. Additionally, by using molecularbarcodes PCR artifacts can be separatedfrom real variants present in the originalmolecules, allowing for variant detectionat a much lower variant allele frequency(VAF). Peng et al. (1) developed an NGStarget enrichment process that integratesmolecular barcodes into high multiplexPCR amplicon sequencing. They compareddata with and without molecular barcodeand showed that with high sensitivitysettings, using molecular barcodessignificantly reduced false positives.Do hybridization capture methodsbenefit from molecular barcodes?Although random shearing beforehybridization capture creates random anddiverse fragment ends that can be usedas unique identifiers for each startingmolecule, molecular barcodes are stilluseful in hybridization-based targetenrichment. First, at very high sequencingread depths (10,000 - 50,000X raw reads)it becomes clear that the fragmentscreated by random shearing are actuallynot completely random (7). Secondly dueto concerns about cost, throughput, andyield, the field of NGS is moving towardsreplacing random shearing with enzymaticfragmentation before hybridization2based target enrichment. This enzymaticshearing leads to much less randomfragment ends and also hampersthe ability to track different startingmolecules and to remove PCR duplicatesand associated amplification artifacts.What is the history of molecularbarcodes?Tagging individual templates with amolecular barcode has been proposedand reported since 2007 to alleviate theproblems of PCR duplication and biasedamplification in NGS. In the past, molecularbarcodes or molecular indexes have beengiven various names such as uniqueidentifiers (UID), unique molecular identifiers(UMI), primer ID, or duplex barcodes.Molecular barcodes can be addedto a template using several differenttechniques. One approach is to incorporatethem into the sequencing adaptors duringthe library construction step for genomesequencing. This “duplex sequencing”greatly reduces errors by independentlytagging and sequencing each of the twostrands of a DNA duplex (8). Anotherapproach is to incorporate the barcodesinto molecular inversion probes fortargeted somatic mutation detection. WithsmMIP (for single molecule molecularinversion probes) single-molecule taggingis combined with multiplex targetedcapture to enable practical and highlysensitive detection of low variant allelefrequency (9). Finally, molecular barcodescan also be incorporated into targetspecific PCR primers, usually in the form ofa short stretch of random bases.How are HaloPlexHS molecularbarcodes created?HaloPlexHS is a combination of ampliconbased and target enrichment sequencing.The protocol utilizes specificity gainedfrom restriction enzyme recognition,hybridization and DNA ligation to capturemolecules originating from the targetregion to be sequenced. To enableidentification of duplicate reads fromlibraries prepared with HaloPlexHS, wehave added a molecular barcode to theintroduced primer vector (Figure 1). TheHaloPlexHS workflow tags one barcode perfragment. In the case when HaloPlexHS

A1Digest and denature sample DNATarget RegionLigate and captureuniquely barcoded targets3Streptavidin2Hybridize probe library to DNA targets4Amplify enriched fragments by PCRPCR primersBiotinSequencing primer motifIndexBridge or emulsionPCR primerTarget & complementaryprobe sequenceMolecular barcodeB IlluminaI2Molecularbarcode, 10 ntR1Vectorsequence, 1 ntR2Ion PGMI1Samplebarcode, 8 ntMolecularbarcode, 10 ntIonXpressbarcodeVectorsequence, 15 ntFigure 1 A) Overview of the HaloPlexHS workflow. B) Schematic of HaloPlexHS libraries. For librariessequenced on the Illumina platform, the molecular barcodes are introduced as part of Illumina's dualindexing system. For libraries sequenced on the Ion PGM, the barcodes are incorporated at the beginningof the reads. Libraries are compatible with standard sequencing protocols.3captures both strands (i.e. when using theFFPE design option in SureDesign), bothdouble-stranded molecules are taggedindependently. The molecular barcodeconsists of ten degenerate bases allowingfor over one million unique sequences tobe present for tagging of molecules. Thebarcodes allow you to differentiate truevariants from PCR or sequencing errors.What level of coverage do I need totarget for the sensitivity I require?Combining high-fidelity DNA polymeraseswith the power of molecular barcodes todistinguish errors from true mutationsmakes it possible in theory to detectmutations at 1% or lower variant allelefrequency. In reality, however, undersampling and sampling bias challenges thepractical application of molecular barcodesand deep sequencing (10). To accuratelyassess the mutant representation in a poolof templates, all the templates within thepool have to be uniformly tagged, with noerror occurring during the tagging process,and the tags have to be immune tomutations in subsequent amplification.In an online article, Jon Armstrong (11)discussed the balance between theamount (or copies) of DNA going intothe library process, sequence coveragegenerated and the lower theoreticalthreshold of minor allele frequencydetection. If we want to detect variantsconfidently to 0.1% sensitivity whilerequiring 4 reads to show the alternateallele, we would require 4,000X coverageafter deduplication (4 reads / 0.001). Ina perfect world, we would need at least4,000 haploid genome copies, or 2000cells, or 12 ng of DNA. In other words,no matter how many sequencing readsyou generate, you can never attain agreater deduplicated coverage than thenumber of genomics copies going ontothe sequencer. However, these numbersincrease even further due to process noise,variability, PCR amplifications, and lossduring the molecular procedures.

Has HaloPlexHS with molecularbarcodes been used in clinicalresearch?A recent paper (12) outlines an exampleof the use of molecular barcoding incombination with next-generationsequencing to detect somatic mosaicismin cancer. In this paper children withmultiple primary tumors known to beassociated with the DICER1 syndromewere studied. The different cases wereeach found to carry a specific DICER1“hotspot” RNase IIIb mutation in multipletumor biopsies from different sites.However, the mutations were not detectedin their germline DNA using conventionaltechnology, leading to the suspicion ofsomatic mosaicism. A HaloPlexHS panelincorporating molecular barcodes wasused prior to next generation sequencingon the Illumina HiSeq at high coverage.Using the molecular barcoding technology,the relative abundance of the previouslyidentified mutations was assessedbetween tumor and non-tumor sampleswith the aim of confirming or refutingthe hypothesis of a mosaic origin. Usingthis approach, the team confirmed itshypothesis that DICER1 RNase IIIbmosaicism was the cause of the rareDICER1-associated tumors in thesechildren. In the researchers’ hands, theHaloPlexHS target enrichment systemcontaining molecular barcodes providedthe sensitivity required for detection ofmutant allele fractions as low as 0.24%.Q&A AnalysisHow are molecular barcodesanalyzed?Typically, molecular barcode analysisconsists of five steps (Figure 2). First,the individual DNA molecules are taggedwith molecular barcodes during librarypreparation. Reads are aligned ignoring themolecular barcodes. Then the read pairsaligning to the same genomic coordinatesare grouped into molecular barcodefamilies. Finally, the base sequences for amolecular barcode are consolidated to oneread per molecule, and PCR duplicatesare removed.Can I use Agilent SureCallsoftware to analyze HaloPlexHSmolecular barcodes?Agilent SureCall software has beenoptimized for the analysis of NGS datagenerated using HaloPlexHS with theincorporation of molecular barcodes,allowing for the identification of duplicatereads, hence significantly improvingbase calling accuracy even at low allelicfrequencies compared to conventionalNGS methods. Agilent SureCall softwarecan be downloaded free of charge from theAgilent web site http://www.genomics.agilent.com/article.jsp?pageId 3341Figure 2. A schematic showing the generation of consensus reads using molecular barcode sequence information.4

How does Agilent SureCallsoftware handle read pairs withonly one read pair per barcode?A) Histogram of read pairs per barcode(sufficient sequencing)25% Reads20151050123456789101112131415Reads per molecular barcode family% ReadsB) Histogram of read pairs per barcode(insufficient sequencing)454035302520151050Depending on the amount of sequencingperformed, Agilent SureCall softwaredecides whether to discard or keep readpairs that only have one read pair perbarcode. Figure 3 shows two theoreticalhistograms of the number of read pairs perbarcode. When using the default settings,an average of 4 read pairs or more arerequired for error correction (green verticallines). Figure 3A shows that sufficientsequencing was performed, the read pairsthat are only supported by one read perbarcode can be discarded and the readsthat have 4 or more read pairs can be usedfor error correction. Figure 3B shows thata much lower amount of sequencing wasperformed. Discarding read pairs supportedby only one barcode would result in asignificant (30-40% or more) reduction ofthe data that can be analyzed, although alarge fraction of the discarded data couldactually be valid data.Can I use different software toanalyze HaloPlexHS molecularbarcodes?12345678910111213Reads per molecular barcode familyFigure 3. Theoretical histograms of the number of read pairs per barcode.51415For users who do not wish to installAgilent SureCall software there is theAgilent Genomics NextGen Toolkit(AGeNT). AGeNT is a Java-based softwaremodule that processes the read sequencesfrom targeted high-throughput sequencingdata generated by sequencing HaloPlexHSlibraries. AGeNT trims low qualitybases from the ends, removes adaptorsequences, and masks enzyme footprints.Properly preparing read sequences inthis manner prior to alignment improvesalignment efficiency and decreases therate of false positive variant calls. AGeNTalso processes the Molecular Barcode(MBC) information in HaloPlexHS Illuminadata after alignment, where it tags readpairs in a BAM/SAM file with their MBCsequences from the index 2 FASTQ file(s)and can flag or remove MBC duplicatesfrom that BAM/SAM file. AGeNT canbe downloaded free of charge fromthe Agilent web site at s-Software/AGeNT/?cid AG-PT154&tabId prod2570007

References1. Peng et al. Reducing amplificationartifacts in high multiplex ampliconsequencing by using molecularbarcodes. BMC Genomics. 2015Aug 7;16:589.2. Gnirke et al. Solution hybrid selectionwith ultra-long oligonucleotidesfor massively parallel targetedsequencing. Nat Biotechnol. 2009Feb;27(2):182-9.3. Taylor et al. Ultradeep bisulfitesequencing analysis of DNAmethylation patterns in multiple genepromoters by 454 sequencing. CancerRes. 2007 Sep 15;67(18):8511-8.4. Bodi et al. Comparison ofcommercially available targetenrichment methods for nextgeneration sequencing. J. Biomol.Tech. 2013 Jul;24(2):73-86.5. Chang et al. Clinical application ofamplicon-based next generationsequencing in cancer. Cancer Genet.2013 Dec;206(12):413-9.6. Kinde et al. Detection andquantification of rare mutationswith massively parallel sequencing.Proc Natl Acad Sci U S A. 2011 Jun7;108(23):9530-5.7. Poptsova et al. Non-random DNAfragmentation in next-generationsequencing. Sci Rep. 2014; 4: 4532.8. Schmitt et al. Detection of ultrarare mutations by next-generationsequencing. Proc Natl Acad Sci U S A.2012 Sep 4;109(36):14508-13.9. Hiatt et al. Single molecule molecularinversion probes for targeted, highaccuracy detection of low-frequencyvariation. Genome Res. 2013May;23(5):843-54.610. Kou et al. Benefits and challengeswith applying unique molecularidentifiers in next generationsequencing to detect low frequencymutations. PLoS One. 2016 Jan11;11(1):e0146638.11. ty/12. de Kock et al. High-sensitivitysequencing reveals multi-organsomatic mosaicism causing DICER1syndrome. J. Med. Genet. 2016;53:43-52.

www.agilent.com/genomics/HaloPlexHSThis information is subject to change without notice.For Research Use Only.Not for use in diagnostic procedures.PR7000-0254 Agilent Technologies, Inc., 2017Published in the USA, May 3, 20175991-7421EN

Figure 2. A schematic showing the generation of consensus reads using molecular barcode sequence information. are grouped into molecular barcode families. Finally, the base sequences for a molecular barcode are consolidated to one read per molecule, and PCR duplicates are removed. Can I use Agilent SureCall software to analyze HaloPlexHS