Intel Select Solutions For Genomics Analytics


Solution brief
Intel Select Solutions
High Performance Computing
April 2019

Intel Select Solutions for Genomics Analytics

Access performance, scale, and ease of deployment for genomics insight and discovery.

Advancements in genomics are opening new doors for understanding human diseases, and they are increasingly informing innovative precision treatment plans. Discoveries are dependent on processing, storing, and analyzing a growing amount of genomic sequencing data. In 2015, worldwide sequencing storage capacity approached a petabyte per year, and it continues to double every seven months [1, 2]. At this rate, genomics sequencing will generate hundreds of petabytes per year in the next five years, and could require nearly a zettabyte of storage per year by 2025 [1, 2].

The Broad Institute of MIT and Harvard (broadinstitute.org) is one of the world’s largest producers of human genomic data, creating about 24 TB of new data per day. Currently, Broad Institute manages more than 50 PB of data.

Researchers require tools to analyze these enormous volumes of data in a timely manner to gain insights into disease and possible treatments. They need tools like the Genome Analysis Toolkit* (GATK*), a set of leading software methods created by the Broad Institute and trusted by the majority of genomics centers worldwide.

Intel Select Solutions for Genomics Analytics are based on the BIGstack 2.0* reference architecture.

Broad Institute has released GATK 4.1 as its latest major version, under an open source license for all users, including for commercial purposes. An open source license makes GATK available to a wider audience of scientists and researchers and helps accelerate and advance genomics analytics worldwide.

Intel-Broad Center for Genomic Data Engineering

Intel and Broad Institute have collaborated on computing infrastructure and software optimization for years.
In 2017, they launched a new effort: the Intel-Broad Center for Genomic Data Engineering, a five-year collaboration between the two organizations to simplify and accelerate genomics workflow execution using GATK, Burrows-Wheeler Aligner (BWA), Cromwell, Intel Genomics Kernel Library (Intel GKL), GenomicsDB*, and other tools and techniques. Together, experts from Broad Institute and Intel build, optimize, and widely share tools and infrastructure to help scientists integrate and process genomic data. The result is a growing set of optimized best practices in hardware and software for genomics analytics on Intel architecture–based platforms that can be applied to research datasets stored in private data centers and that extend to private, public, and hybrid clouds.

With the massive growth of genomics data, the collaboration makes use of technology to enable genomics analytics at scale. It has already resulted in Intel Select Solutions for Genomics Analytics, a suite of optimized software, along with reference architectures for turnkey configuration, setup, and deployment to run genomics analysis that is qualified for GATK pipelines, Cromwell, and GenomicsDB.

Intel Select Solutions for Genomics Analytics

The Intel-Broad Center for Genomic Data Engineering works to optimize GATK on Intel architecture and technologies and to define a reference architecture for genomics analytics. The result is Intel Select Solutions for Genomics Analytics, developed by Intel and the Broad Institute, based on the BIGstack 2.0* reference architecture, and delivered by Intel solution providers. The solutions demonstrated a fivefold overall performance improvement running GATK 4.0 compared to previous versions of the genomics software, and they reduce setup time for deploying an infrastructure to accelerate genomics workflows [3]. Performance gains include a 75 percent speedup for the BWA using Intel Solid-State Drives (SSDs) [3]. The validated performance and quality results have been certified by Broad Institute.

What Are Intel Select Solutions?

Intel Select Solutions are verified hardware and software stacks that are optimized for specific software workloads across compute, storage, and network. The solutions are developed from deep Intel experience with industry solution providers, in addition to extensive collaboration with the world’s leading data center and service providers.

To qualify as an Intel Select Solution, solution providers must:
1. Follow the software and hardware stack requirements outlined by Intel
2. Replicate or exceed Intel’s reference benchmark performance threshold
3. Publish a detailed implementation guide to facilitate customer deployment

Solution providers can develop their own optimizations to add further value to their solutions.

[Figure 1. A high-level overview of the solution configuration. Layers shown: Application, Platform, Infrastructure, Hardware. Labeled components: BWA and pre-packaged genomics applications; Cromwell workflow execution; optimized Genomics Kernel Library*; GenomicsDB* large-scale analysis; job scheduler; container management; data parallel processing; storage management; Intel Omni-Path Architecture.]

Photo credit: Len Rubenstein & Broad Institute

High-performance data analytics computing clusters and optimized workflows for genomics analytics are complicated hardware and software systems. Intel Select Solutions for Genomics Analytics are end-to-end optimized hardware and open source software configurations designed specifically to accelerate genomics analytics, both the deployment of systems and the software that runs on them, by providing verified stacks for setup and configuration of these complicated genomics pipelines.

Intel Select Solutions for Genomics Analytics are designed to scale from small to very large clustered supercomputers. The customized systems can quickly and dynamically be configured to meet specific needs. Organizations can scale as they grow their workloads. And Intel Select Solutions for Genomics Analytics include tools to discover, compose, and monitor resources with powerful, modern API-based software.

Intel Select Solutions for Genomics Analytics: Software, Firmware, and Technology Configuration

Intel Select Solutions for Genomics Analytics take advantage of the high-performance capabilities of Intel architecture, including Intel Xeon Scalable processors, Intel SSD Data Center Family drives, and Intel Omni-Path Architecture high-performance fabric. Solutions incorporating the latest Intel Xeon Gold 6252 processors, Intel Xeon Gold 6226 processors, and Intel Xeon Platinum 8280 processors deliver the same performance or incremental performance gains as compared to similarly configured solutions based on previous-generation Intel Xeon Scalable processors.

Table 1 shows hardware and software for both the “Base” and “Plus” configurations of Intel Select Solutions for Genomics Analytics. To refer to a solution as an Intel Select Solution, a server vendor or data center solution provider must use these or better configurations.

These solutions can be tailored with 2, 4, 16, 24, 36, or 48 of the specified compute devices and, when applicable, local and shared storage devices in order to meet the needs of individual environments.
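As a rough illustration of what the supported cluster sizes mean in aggregate, the sketch below multiplies out per-node figures for each node count. It assumes the Plus compute-node values given in Table 1 (two 28-core processors and 12 x 16 GB of memory per node) and treats the 192 GB figure as a minimum. The names `NodeSpec`, `PLUS_COMPUTE_NODE`, and `cluster_totals` are hypothetical and are not part of any Intel tool.

```python
from dataclasses import dataclass

# Node counts the brief lists as supported cluster sizes.
SUPPORTED_NODE_COUNTS = (2, 4, 16, 24, 36, 48)


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware figures (hypothetical helper type)."""
    sockets: int
    cores_per_socket: int
    memory_gb: int


# Assumption: values read from the Plus compute-node rows of Table 1
# (2 x 28-core processors, 12 x 16 GB DIMMs = 192 GB minimum).
PLUS_COMPUTE_NODE = NodeSpec(sockets=2, cores_per_socket=28, memory_gb=12 * 16)


def cluster_totals(node_count, spec=PLUS_COMPUTE_NODE):
    """Return aggregate physical cores and memory for a cluster of node_count nodes."""
    if node_count not in SUPPORTED_NODE_COUNTS:
        raise ValueError(f"{node_count} is not one of {SUPPORTED_NODE_COUNTS}")
    return {
        "nodes": node_count,
        "cores": node_count * spec.sockets * spec.cores_per_socket,
        "memory_gb": node_count * spec.memory_gb,
    }


if __name__ == "__main__":
    for n in SUPPORTED_NODE_COUNTS:
        print(cluster_totals(n))
```

Because the configuration disables Intel Hyper-Threading Technology, the core count here is also the number of hardware threads available to the job scheduler.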

Table 1. The Base and Plus configurations for Intel Select Solutions for Genomics Analytics

APPLICATION NODES
  Base: 1 x application node (also the compute node for the Base configuration)
  Plus: 1 x application node
PLATFORM
  Base: 1 x Intel Server Board S2600WFT**
  Plus: 1 x Intel Server Board S2600WFT**
PROCESSOR
  Base: 2 x Intel Xeon Gold 6152 processor (22 cores, 2.10 GHz), Intel Xeon Gold 6252 processor (24 cores, 2.10 GHz), or a higher number Intel Xeon Scalable processor
  Plus: 2 x Intel Xeon Gold 6136 processor (12 cores, 3.0 GHz), Intel Xeon Gold 6226 processor (12 cores, 2.70 GHz), or a higher number Intel Xeon Scalable processor
MEMORY
  Base: 12 x 16 GB DDR4-2666MHz 1DC (total capacity 192 GB or higher)
  Plus: 12 x 16 GB DDR4-2666MHz 1DC (total capacity 192 GB or higher)
HOST ADAPTERS: 1 x 100HFA016LS Intel Omni-Path Host Fabric Interface Adapter, PCIe* x16

COMPUTE NODES: 4 x compute nodes
PLATFORM: 1 x Intel Server Board S2600WFT
PROCESSOR: 2 x Intel Xeon Platinum 8180 processor (28 cores, 2.50 GHz), Intel Xeon Platinum 8280 processor (28 cores, 2.70 GHz), or a higher number Intel Xeon Scalable processor
MEMORY: 12 x 16 GB DDR4-2666MHz 1DC (total capacity 192 GB or higher)
LOCAL STORAGE
  Base: 2 x 480 GB Intel SSD DC S4510 (mirrored)
  Plus: 2 x 480 GB Intel SSD DC S4510 (mirrored)
STORAGE
  Base: 4 x 4 TB Intel SSD DC P4600/P4610, PCIe HHHL
  Plus: 4 x 4 TB Intel SSD DC P4600/P4610, PCIe HHHL
CAPACITY STORAGE
  Base: Capacity of 10 TB (minimum)
  Plus: Capacity of 10 TB (minimum)
HOST ADAPTERS: 1 x 100HFA016LS Intel Omni-Path Host Fabric Interface Adapter, PCIe x16
DATA NETWORK ADAPTER: Intel Ethernet Connection X722 with Intel Ethernet Converged Network Connection X527-DA2/DA4, or 10 gigabit (Gb) Intel Ethernet Converged Network Adapter X710
HOST NETWORK ADAPTER: Integrated 1 gigabit Ethernet (GbE)

NETWORK INFRASTRUCTURE
DATA NETWORK: 1 x 100 Gb per second (Gbps) Intel Omni-Path 24x port switch
MANAGEMENT NETWORK: 1 x 1 Gbps 24x port switch
INTEL OMNI-PATH ARCHITECTURE: 1 x Intel Omni-Path Edge Switch 100 Series or better

LUSTRE* STORAGE INFRASTRUCTURE
LUSTRE MDS*: 1 x metadata server
PLATFORM: 1 x Intel Server Board S2600WFT**
PROCESSOR: Intel Xeon Gold 6136 processor (12 cores, 3.0 GHz), Intel Xeon Gold 6226 processor (12 cores, 2.70 GHz), or a higher number Intel Xeon Scalable processor
MEMORY: 192 GB or higher (12 x 16 GB DDR4-2666)
DISKS: 2 x 480 GB Intel SSD DC S3520 (mirrored operating system)
HOST ADAPTERS: 1 x 12 Gbps Intel RAID Controller RS3SC008, JBOD mode; 1 x 100HFA016LS Intel Omni-Path Host Fabric Interface Adapter, PCIe x16
LUSTRE OSS: 2 x object storage server
PLATFORM: 1 x Intel Server Board S2600WFT**
PROCESSOR: Intel Xeon Gold 6136 processor (12 cores, 3.0 GHz), Intel Xeon Gold 6226 processor (12 cores, 2.70 GHz), or a higher number Intel Xeon Scalable processor
MEMORY: 192 GB or higher (12 x 16 GB DDR4-2666)
DISKS: 2 x 480 GB Intel SSD DC S3520 (mirrored OS)
HOST ADAPTERS: 1 x 12 Gbps Intel RAID Controller RS3SC008, JBOD mode; 1 x 100HFA016LS Intel Omni-Path Host Fabric Interface Adapter, PCIe x16
LUSTRE OST: Object storage target
JBOD: 1 x JBOD drive enclosure
DRIVES: Capacity 6–10 TB
LUSTRE MDT: Metadata target
JBOD: Intel Storage System JBOD2224S2DP, 2U, dual-port**
SAS SSDS: 4 x HGST SAS 400 GB*

SOFTWARE (Base and Plus)
• GATK*, BWA, and GATK workflows optimized for Intel technologies
• Optimized Cromwell workflow
• Intel GKL with optimized routines for accelerating developer codes
• GenomicsDB*, specializing in large-scale variant analysis
• HTCondor* or Slurm* job scheduler for running clustered analytics jobs
• Docker* for running multiple jobs in isolated containers across a cluster
• Apache Spark* for big data analytics processing
• Lustre, the open source parallel file system, for high-performance storage

FIRMWARE AND SOFTWARE OPTIMIZATIONS (Base and Plus)
• Intel Advanced Vector Extensions 512 (Intel AVX-512)
• Linux* frequency governor: performance mode
• Intel Hyper-Threading Technology (Intel HT Technology): disabled
• CPU power and performance policy: balanced performance
• Workload configuration: balanced
• Package C-state: C6 (Retention) state
• Processor C6: enabled
• Hardware P-states: native mode
• Hardware PM interrupt: disabled
• Enhanced Intel SpeedStep technology: enabled
• Intel Turbo Boost Technology: enabled
• Energy-efficient turbo: enabled
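Two of the optimizations listed in Table 1, the Linux frequency governor in performance mode and Intel HT Technology disabled, can be spot-checked from the operating system. The sketch below is one possible way to do that, assuming a Linux node that exposes the standard sysfs files `cpufreq/scaling_governor` and `cpu/smt/active`; the function names and the `sysfs_root` parameter are hypothetical, and the remaining settings in the list are firmware options that this sketch does not inspect.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys"):
    """Map each CPU name (e.g. 'cpu0') to its current cpufreq scaling governor."""
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    governors = {}
    for gov_file in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        # .../cpuN/cpufreq/scaling_governor -> 'cpuN'
        governors[gov_file.parent.parent.name] = gov_file.read_text().strip()
    return governors


def smt_active(sysfs_root="/sys"):
    """Return True/False for SMT state, or None if the kernel does not expose it."""
    smt_file = Path(sysfs_root) / "devices/system/cpu/smt/active"
    if not smt_file.exists():
        return None
    return smt_file.read_text().strip() == "1"


def check_node(sysfs_root="/sys"):
    """Return a list of human-readable deviations from the Table 1 settings."""
    problems = []
    for cpu, governor in read_governors(sysfs_root).items():
        if governor != "performance":
            problems.append(f"{cpu}: governor is {governor!r}, expected 'performance'")
    if smt_active(sysfs_root):
        problems.append("SMT (Hyper-Threading) is active, expected disabled")
    return problems


if __name__ == "__main__":
    for problem in check_node():
        print(problem)
```

Passing `sysfs_root` explicitly keeps the check testable against a directory tree that mimics sysfs, which is also how it could be exercised on a machine that is not a cluster node.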

Simplified Code Development with Intel Genomics Kernel Library (Intel GKL)

Intel GKL provides code used in genomics that is optimized for Intel architecture. Developers can call these routines to help accelerate their code performance. The library enables developers to focus on the function and operation of their code (instead of specific optimizations), while letting Intel GKL make use of the capabilities of Intel architecture.

Scalability Improvements with GenomicsDB*

GenomicsDB is a variant store capable of supporting variant data from up to hundreds of thousands of genomes. It was first developed by Intel Labs and customized for Broad Institute’s use cases.

GenomicsDB is packaged with GATK 4.1, and it helps significantly accelerate workflows. It offers a large scaling benefit to the HaplotypeCaller Genomic VCF (GVCF) workflow. For example, without using GenomicsDB, Broad Institute took six weeks to generate a database from 2,300 whole genomes. With GenomicsDB, it was able to generate databases with five times more information in only two weeks [4]. That successfully enabled the Broad Institute–hosted Genome Aggregation Database* (gnomAD*) project, which includes 15,000 whole genomes and is one of the largest genomic data aggregations in the world [4].

In addition to being integrated into GATK 4.1, GenomicsDB is available open source through Omics Data Automation* (genomicsdb.org/).

Continuing Development

There are large genomic databases around the world that can bring great benefits to worldwide research efforts. The ongoing work of the Intel-Broad Center for Genomic Data Engineering continues to develop Intel Select Solutions for Genomics Analytics to efficiently access those databases for analysis. In the future, incorporated technologies will provide the connectivity, performance, privacy, and security necessary for genomics in the cloud and shared environments.

Benefits of the Intel and Broad Institute Collaboration

The work of Intel and Broad Institute offers many benefits to the genomics community and to the technologists and business managers that support it, including:

Scientists Who Enjoy:
• Support for optimized and efficient pipelines
• Optimized, turnkey solutions
• Prepackaged Workflow Description Language (WDL) scripts
• Peer application support
• Low-touch IT support
• Access to more in-house genomics data
• Increased statistical power
• Open source software
• Flexible application architecture

IT Departments Who Need:
• Ease of implementation
• Scalability
• Reduced setup time
• Known reference architecture
• Vendor and solution support
• Optimal use of hardware versus workload (for example, prepackaged WDLs)

Business Owners Who Enjoy:
• Open source software with no licensing costs
• The ability to scale the solution to fit a budget
• Low price/watt
• Preconfigured solutions to reduce setup time and support costs
• Maximized value for in-house genomic data
• No license fees
• Open source application software
• Extendability to other applications

Intel Xeon Scalable Processors

Intel Xeon Scalable processors:
• Offer high scalability for enterprise data centers
• Deliver performance gains for virtualized infrastructure compared to previous-generation processors
• Achieve exceptional resource utilization and agility
• Enable improved data and workload integrity and regulatory compliance for data center solutions

The family includes Intel Xeon Bronze processors, Intel Xeon Silver processors, Intel Xeon Gold processors, and Intel Xeon Platinum processors.

OEM Partners: Simplifying Genomics Analytics Cluster Deployment

The introduction of Intel Select Solutions for Genomics Analytics makes it easier to run genomics workloads. It also enables accelerated deployment of predictable clusters designed for genomics analytics. Thus, many integrators of high-performance systems have partnered with Intel and are offering design and deployment of solutions that will meet the needs of their customers in the genomics community.

“Our goal is to reduce the challenges that researchers face to generate ever-more meaningful insights from ever-larger sets of genomics data. For us, running GATK 4 on version 1.0 of the Intel Select Solutions for Genomics Analytics delivered a 5x performance gain right away. We're working with Intel to make the GATK Best Practices pipelines run even faster, at even greater scale, and with easier deployment for genomic research worldwide.”
Geraldine Van der Auwera, Associate Director of Outreach and Communications, Data Sciences Platform Group, Broad Institute

Access Performance, Scale, and Ease of Deployment for Genomics Analytics

The work of genomics science is critical to the understanding of disease and the creation of diagnostic tools and safe and effective therapies. Genomics data and analytics are quickly advancing as researchers use technology to build massive genomics data repositories and come to understand the power of that data. Broad Institute is one of the largest contributors of genomics data in the world, and its GATK software is the world’s leading genome analysis tool for analytics and variant call research. The Intel-Broad Center for Genomic Data Engineering brings together science and technology to optimize genomics analytics codes and workflows and to define an optimized infrastructure, Intel Select Solutions for Genomics Analytics, to run those workloads. The results enable faster analysis and quicker times to deploy hardware solutions that are customized for genetics analysis. Several system integrators already offer services to install such systems that will continue to enable further discoveries through genetics.

Learn More

The Intel-Broad Center for Genomic Data Engineering: intel.com/broadinstitute
“Big Data Genomics and Optimized Genomics Code”: s/genomicscode.html
Intel and Broad Institute white paper, “Infrastructure for Deploying GATK Best Practices Pipeline”: f
Intel Select Solutions: intel.com/selectsolutions
Intel Xeon Scalable processors: intel.com/xeonscalable

Intel Select Solutions are supported by Intel Builders: http://builders.intel.com. Follow us on Twitter: #IntelBuilders

[1] Stephens, Zachary D., et al. “Big Data: Astronomical or Genomical?” PLOS Biology. July 2015.
[2] son, Reid J. “How Big Is the Human Genome?” Precision Medicine. January 2014. he-human-genome-e90caa3409b0.
[3] Intel. “Infrastructure for Deploying GATK Best Practices Pipeline.” November er.pdf.
[4] Geraldine Van der Auwera, Ph.D. Broad Institute. Bio-IT World. May 2017.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit intel.com/benchmarks.

Benchmark results were obtained prior to implementation of recent software and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown.” Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to intel.com/benchmarks.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel, the Intel logo, Intel Inside, Intel Arria, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation. Printed in USA. 0419/JS/PRW/PDF. Please Recycle. 336731-004US
