Whole Exome Sequencing Benchmark - QIAGEN Digital Insights


Application Note
Whole Exome Sequencing Benchmark
Francesco Lescai, Ph.D., Leif Schauser, Ph.D. and Jonathan Jacobs, Ph.D.
QIAGEN Bioinformatics, Aarhus, Denmark
07/2019

Summary

We have benchmarked QIAGEN Bioinformatics' CLC Genomics Server (GxS) against GATK 4.0 (Broad Institute, MA) (1), with and without the use of Apache Spark, in different computing environments, and compared runtime and performance when analyzing 100X coverage of human whole-exome data. We also compared the sensitivity, precision and overall accuracy of variant calling performed by a workflow developed with CLC to both GATK and Strelka2 (2).

When comparing runtime, CLC GxS performs competitively with GATK with Apache Spark on a 40-core server with Intel-Optane SSD technology (60 versus 58 minutes, respectively). The CLC Genomics Server performs better overall than GATK without Apache Spark.

In terms of sensitivity, CLC GxS outperforms the other platforms in both SNP and INDEL calling (99.6% and 91.9%, respectively, on PASS variants). GATK, however, scores highest for SNP calling precision (99.9% versus 99.1% by CLC) and Strelka2 scores highest for INDEL precision (95.2% versus 92.3% by CLC). In terms of accuracy, CLC GxS performs best with SNPs (99.4% versus 97.4% and 97.3% by GATK and Strelka2, respectively), while Strelka2 performs best for INDELs (93.3% versus 92.3% by CLC).

Introduction

Researchers frequently question the speed and accuracy of the bioinformatics analysis pipelines they depend on. The data and results presented in this benchmark originate from a customer request to compare the speed and accuracy of the CLC Genomics Server (CLC GxS) and CLC Genomics Workbench (GxWB) to that of GATK, a popular pipeline for variant calling in human biomedical genomics data.

In this Application Note, we present a comparative analysis of both runtime and performance of the CLC Genomics Server and GATK v.4, both with and without the use of an Apache Spark framework. Apache Spark is an environment for in-memory distributed computing designed to accelerate "big data" analytics and processing. Since the most recent release of GATK replaced multi-threaded processing with the use of Apache Spark, it seemed necessary to conduct this benchmark on a GATK Apache Spark combination as well. We also tested the software packages in three different computing environments, in order to describe the most common situations a user could face.

We also compared the sensitivity, precision and resulting accuracy (F1 score) of the variants called by each pipeline. For this specific benchmark, we have also included Strelka2, a recently released high-performance variant caller.

To execute this test, we used data from the Genome in a Bottle (GiAB) Consortium (3) sample NA12878, which is widely used for accuracy testing and pipeline comparisons (4, 5). For the study presented here, the GiAB sample was sequenced in-house at QIAGEN on an Illumina sequencer, utilizing capture technology from Twist Bioscience (San Francisco, CA, USA). This capture technology was chosen because uniformity of coverage is needed to ensure higher quality results (6).

In general, it is essential to remember that speed is not the only important factor when designing an analytical pipeline, and that speed is not determined by runtime alone. Kawalia and colleagues offer a series of general considerations when describing their workflow (7).

Computing environments

For this benchmark we used three different computing environments and a standard reference sample throughout, and repeated comparable workflows in all combinations three times each to account for variance in runtime.

Table 1 is a summary of the three environments. The Intel-Optane system is a high-performance server with 40 cores (80 threads) and dedicated high-performance Solid State Disk (SSD) technology. This configuration is likely to accelerate algorithms capable of multi-threaded analysis, which scale with the number of threads, while minimizing any bottleneck due to disk input/output (I/O).

In contrast, and to illustrate the benefit of high-performance I/O while maintaining the same computing environment, we also used the Intel-Optane server but moved the data to a network file system, thereby intentionally creating an I/O bottleneck.

In addition, to test the performance in a portable environment, we conducted the benchmark on a MacBook Pro laptop.

Table 1. Computing environments

         Intel-Optane                        Intel-network                       Laptop
CPUs     2 x Intel Xeon Gold 6148 2.40GHz    2 x Intel Xeon Gold 6148 2.40GHz    Intel Core i7
Cores    40 cores/80 threads                 40 cores/80 threads                 4 cores
RAM      192 GB RAM                          192 GB RAM                          16 GB RAM
Disk     Intel-Optane technology SSD         Network Storage                     500 GB Mac SSD

The data

We used exome sequencing of an NA12878 reference sample from the Genome in a Bottle Consortium. The sequencing data were generated in-house with Illumina HiSeq, and the library was captured with Twist Bioscience technology. The resulting dataset consisted of 77.5 M reads, 76 M of which were mapped at an average coverage of 107.6x with CLC.

Software

The following software versions were used for the comparison: The Genome Analysis Toolkit (GATK) v4.0.1.1; Apache Spark v.2.2.3; Strelka v.2.9; CLC Genomics Server v11.0 and CLC Command Line Tools v6.0.

To evaluate the accuracy of each pipeline, Hap.py v0.3.10 (8) was used. The analyses and plotting of these benchmarking results were primarily carried out using RStudio with R version 3.5.1 (2018-07-02).

Runtime benchmark

The analysis was repeated three times consecutively with each software package to account for the variance in execution time. It is important to highlight that the CLC workflow includes steps not present in the GATK workflow (Figure 1), namely a structural variant analysis step and an optimized local realignment step. A local realignment is performed by HaplotypeCaller as part of haplotype-based genotype calling in GATK, but an entirely locally realigned mapping is not provided as a result, and no structural variant calling is included in the workflow. These elements are essential when considering the differences in runtime.

Figure 1. Comparison of the workflows.

Despite the additional steps in the CLC workflow, GxS can run in a time comparable to the most advanced GATK Apache Spark installation when executed on a high-end server. CLC GxS completes the end-to-end analysis, including the structural variant and local realignment steps, in about 60 minutes. GATK using Apache Spark completes the analysis in about 58 minutes on a high-performance server coupled with SSD storage (Figure 2). It is also worth highlighting that the CLC pipeline can be efficiently executed from a graphical interface, using a workflow editor, and is accessible to non-bioinformaticians, which may be an important feature for many users. The use of an advanced distributed computing solution like the Apache Spark environment adds extra effort to the installation of GATK, and requires expertise in Linux system administration to be appropriately configured on a server or computing cluster.

Figure 2. Comparison of pipeline runtimes (whole pipeline runtime).

When the data are processed on a network drive, the CLC Genomics Server reduces analysis time by about 20% compared to GATK alone (102 minutes versus 126 minutes), while GATK with Apache Spark is in turn about 25% faster than CLC GxS (76 minutes). Most likely, Apache Spark is able to compensate for part of the I/O overhead incurred when reading from network drives.

CLC analysis on a laptop runs in about three and a half hours, compared to about three hours for GATK. This is a very acceptable time, considering the additional data produced by CLC and the QC step included in the workflow. Additionally, it is worth noting that we used a dataset with higher coverage (100X) than typical exome applications.

The analysis performed on a laptop does not include a comparison with Apache Spark, because on a 4-core machine the overhead of partitioning the data would outweigh the advantages in the computation.

Figure 3 illustrates the execution on a high-end server, with a breakdown of each pipeline by analysis step. Figure 4 shows the performance on a laptop in a similar way.

Table 2. Runtime benchmarking results (in minutes)

Environment            Analysis name    GATK      GATK Apache Spark    GxS
Intel-Optane SSD       ALIGN            11.23     11.23                16.34
Intel-network drive    ALIGN            17.27     11.65                21.97
Laptop SSD             ALIGN            81.23     NA                   90.26
Intel-Optane SSD       DEDUP            18.29     16.06                8.69
Intel-network drive    DEDUP            38.68     22.77                18.40
Laptop SSD             DEDUP            14.67     NA                   10.73
Intel-Optane SSD       SV               NA        NA                   6.78
Intel-network drive    SV               NA        NA                   13.26
Laptop SSD             SV               NA        NA                   24.14
Intel-Optane SSD       LOCALREALIGN     NA        NA                   7.71
Intel-network drive    LOCALREALIGN     NA        NA                   10.66
Laptop SSD             LOCALREALIGN     NA        NA                   25.06
Intel-Optane SSD       RECAL            29.71     24.12                NA
Intel-network drive    RECAL            33.77     33.36                NA
Laptop SSD             RECAL            36.14     NA                   NA
Intel-Optane SSD       QC               NA        NA                   5.40
Intel-network drive    QC               NA        NA                   10.78
Laptop SSD             QC               NA        NA                   8.95
Intel-Optane SSD       CALLING          30.06     5.01                 12.28
Intel-network drive    CALLING          34.13     5.97                 17.21
Laptop SSD             CALLING          47.61     NA                   52.71
Intel-Optane SSD       FILTER           0.24      0.21                 0.35
Intel-network drive    FILTER           0.31      0.26                 0.78
Laptop SSD             FILTER           0.36      NA                   0.65
Intel-Optane SSD       ANNO             1.48      1.29                 2.42
Intel-network drive    ANNO             1.66      1.52                 15.39
Laptop SSD             ANNO             2.19      NA                   3.44
Intel-Optane SSD       PIPELINE         91.02     57.91                60.14
Intel-network drive    PIPELINE         125.81    75.53                101.91
Laptop SSD             PIPELINE         182.21    NA                   216.10
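The relative differences quoted in the runtime discussion can be recomputed from the PIPELINE rows of Table 2. The following is a minimal sketch in Python, not part of the original benchmark (whose analysis was done in R). The names `PIPELINE_MINUTES`, `percent_reduction` and `compare` are hypothetical and introduced only for illustration. It assumes "X% faster" means the percent reduction in wall-clock minutes relative to a stated baseline.

```python
# Whole-pipeline runtimes in minutes, copied from the PIPELINE rows of Table 2.
# None marks combinations that were not run (NA in the table).
# All names in this sketch are illustrative, not part of the original study.
PIPELINE_MINUTES = {
    "Intel-Optane SSD": {"GATK": 91.02, "GATK Apache Spark": 57.91, "GxS": 60.14},
    "Intel-network drive": {"GATK": 125.81, "GATK Apache Spark": 75.53, "GxS": 101.91},
    "Laptop SSD": {"GATK": 182.21, "GATK Apache Spark": None, "GxS": 216.10},
}


def percent_reduction(baseline, candidate):
    """Percent by which `candidate` is shorter than `baseline` (negative if longer)."""
    return 100.0 * (baseline - candidate) / baseline


def compare(environment, baseline, candidate):
    """Runtime reduction of `candidate` relative to `baseline`, or None if either is NA."""
    runs = PIPELINE_MINUTES[environment]
    if runs[baseline] is None or runs[candidate] is None:
        return None
    return percent_reduction(runs[baseline], runs[candidate])


if __name__ == "__main__":
    pairs = [("GATK", "GxS"), ("GxS", "GATK Apache Spark")]
    for env in PIPELINE_MINUTES:
        for baseline, candidate in pairs:
            result = compare(env, baseline, candidate)
            label = "NA" if result is None else f"{result:.1f}%"
            print(f"{env}: {candidate} relative to {baseline}: {label}")
```

Under these assumptions, the network-drive rows give a reduction of roughly 19% for GxS relative to GATK alone and roughly 26% for GATK with Apache Spark relative to GxS. This is in line with the rounded 20% and 25% quoted in the text. A negative value, as for GxS relative to GATK on the laptop, indicates a longer runtime than the baseline.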

[Figure: "All jobs, by analysis, by start time" panels for the GATK, GATK-SPARK and GxS runs. x-axis: start time, relative to first job (minutes); y-axis: job number; legend: analysis name.]
