Applying DDN To Machine Learning - VI4IO

Transcription

Applying DDN to Machine Learning
Jean-Thomas Acquaviva, DDN Storage, jacquaviva@ddn.com
© 2018 DDN Storage

Learning from What?
- Multivariate data
- Image data: facial recognition, action recognition, object detection and recognition, handwriting and character recognition, aerial images
- Text data: reviews, news articles, messages, Twitter and tweets, social networks
- Anomaly data
- Time series
- Biological data: human, animal, plant, microbe, drugs, other multivariate
- Signal data: electrical, motion-tracking, other signals
- Sound data: music, speech
- Physical data: high-energy physics, systems, astronomy, Earth science

| Type of data | Supported operations |
|---|---|
| Discrete quantitative data | Calculations, equality/difference, inferiority/superiority |
| Continuous quantitative data | Calculations, equality/difference, inferiority/superiority |
| Nominal qualitative data | Equality/difference |
| Ordinal qualitative data | Equality/difference, inferiority/superiority |

Machine Learning / Analytics / Big Data
IO characteristics: read, random, high throughput per client, file and IO sizes between a few KB and a few MB.
Training sets are typically larger than local caches.
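The IO pattern above can be illustrated with plain POSIX calls. The following is a minimal sketch, not code from the slides; the file name "training.dat" and the constants are assumptions. It issues random reads of between 4 KB and 1 MB, the way a training job samples a data set that is too large to cache locally:

    /* Illustrative sketch of the ML read pattern described above:
     * random offsets, request sizes between 4 KB and 1 MB.
     * "training.dat" is a placeholder, not DDN code. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MIN_IO  (4 * 1024)       /* 4 KB */
    #define MAX_IO  (1024 * 1024)    /* 1 MB */
    #define N_READS 1000

    int main(void)
    {
        int fd = open("training.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size <= (off_t)MAX_IO) {
            fprintf(stderr, "need a data set larger than 1 MB\n");
            return EXIT_FAILURE;
        }

        char *buf = malloc(MAX_IO);
        if (buf == NULL) { perror("malloc"); return EXIT_FAILURE; }

        srand(42);
        for (int i = 0; i < N_READS; i++) {
            /* Pick a random request size in [4 KB, 1 MB] and a random offset. */
            size_t len = MIN_IO + (size_t)rand() % (MAX_IO - MIN_IO + 1);
            off_t  off = (off_t)(((double)rand() / RAND_MAX) * (st.st_size - (off_t)len));
            if (pread(fd, buf, len, off) < 0) { perror("pread"); break; }
        }

        free(buf);
        close(fd);
        return EXIT_SUCCESS;
    }

Timing a loop like this against the shared file system (with the page cache dropped) gives a first-order view of whether the storage can keep a single client busy under this access pattern.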

Diversity of Load: IO500 (detailed write and detailed read results)
- KAUST BurstBuffer and Lustre at DKRZ show massive falls in IO performance.
- A small DDN Lustre system based on the 12K at PNNL shows a similar pattern.
- IME has an order of magnitude better ratio between the easy and hard benchmarks.

Acknowledging multi-criteria performance metrics

| I/O granularity | I/O control-plane pattern | I/O data-plane pattern |
|---|---|---|
| Large (1 MB) | File per process (share nothing) | Sequential |
| Large | File per process | Random |
| Large | Single shared file | Sequential |
| Large | Single shared file | Random |
| Medium (47008 bytes) | File per process | Sequential |
| Medium | File per process | Random |
| Medium | Single shared file | Sequential |
| Medium | Single shared file | Random |
| Small (4 KB) | File per process | Sequential |
| Small | File per process | Random |
| Small | Single shared file | Sequential |
| Small | Single shared file | Random |

The slide's "IO500 easy!" and "IO500 hard!" labels mark opposite ends of this matrix.
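The control-plane and data-plane columns above map directly onto how file descriptors and offsets are chosen. The sketch below is not the IO500 benchmark code; rank, nprocs, the transfer size and the file names are hypothetical. It contrasts file-per-process with single-shared-file writes; switching from sequential to random would amount to shuffling the offset index:

    /* Illustrative only: file-per-process vs. single-shared-file writes for
     * one writer out of nprocs.  Names and sizes are assumptions. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* File per process ("share nothing"): each rank writes its own file,
     * sequentially from offset 0. */
    static void write_fpp(int rank, size_t xfer, int nxfers, const char *buf)
    {
        char path[64];
        snprintf(path, sizeof(path), "data.%d", rank);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        for (int i = 0; i < nxfers; i++)
            pwrite(fd, buf, xfer, (off_t)i * xfer);
        close(fd);
    }

    /* Single shared file: all ranks write into one file at interleaved
     * offsets, a much less friendly pattern for the file system. */
    static void write_shared(int rank, int nprocs, size_t xfer, int nxfers,
                             const char *buf)
    {
        int fd = open("data.shared", O_CREAT | O_WRONLY, 0644);
        for (int i = 0; i < nxfers; i++)
            pwrite(fd, buf, xfer, ((off_t)i * nprocs + rank) * xfer);
        close(fd);
    }

    int main(void)
    {
        static char buf[4096] = {0};            /* 4 KB transfers */
        write_fpp(0, sizeof(buf), 4, buf);      /* as rank 0 of a job */
        write_shared(0, 4, sizeof(buf), 4, buf);
        return 0;
    }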

IO500 to a comprehensive picture: DDN Flash native vs Lustre
[Radar chart: write-pattern efficiency (0-100%) across the Large/Medium/Small, file-per-process/shared-file, sequential/random axes, comparing Lustre, Lustre Flash, and IME.]

IO500 to a comprehensive picture: DDN Flash native vs GridScaler
Efficiency of GS Full Flash and IME using MPI-IO
[Radar chart: write-pattern efficiency (0-100%) across the Large/Medium/Small, file-per-process/shared-file, sequential/random axes, comparing GPFS Flash and IME.]

Example: EXAScaler DGX Solution (hardware view)
- IB EDR
- High availability: 2 VMs (OSS) + 2 VMs (OSS)
- ES14KXE with 72 x 1.2 TB NVMe SSDs

Platform DDN ES14KXE Full Flash: 1M IOPS, 40 GB/s
Random read 4K IOPS on ES14KX (all flash vs. HDD):
- ES14KX all-flash active-active controllers deliver 1M file IOPS, the equivalent of 4000 HDDs (measured performance for SSDs).
- Scale IOPS further in the namespace with additional controllers.
- Augment flash with HDD at scale, with up to 1680 HDDs per controller (projected performance for HDD).
[Chart: 4K random read IOPS (log scale) versus number of devices, comparing the SSD-based system against HDD, HDD 40 pools, and projected HDD configurations.]
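As a rough sanity check on the "1M file IOPS is the equivalent of 4000 HDDs" figure: 1,000,000 IOPS divided by 4000 drives is 250 random 4K IOPS per drive, which is around the upper end of what a single fast (15K RPM) spinning disk delivers, so the stated equivalence is internally consistent.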

WHAT PFS FOR AI APPLICATIONS? (1/2)

| Feature | Importance for AI | GPFS | Lustre |
|---|---|---|---|
| Shared metadata operations | High - training data are usually curated into a single directory | Lower than 10K (minimal improvements with v5) | Up to 200K |
| Support for high-performance mmap() I/O calls | High - many AI applications use mmap() calls | Extremely poor | Strong |
| Container support | High - most AI applications are containerized | Poor (network complexity & root issues) | Available |
| Data isolation for containers | Medium/High - important for shared environments | Not available today | Available |
| Data-on-Metadata (small-file support) | Medium/High - depends on data set | DOM only for files smaller than 3.4K | DOM is highly tunable |
| Unique metadata operations | Medium - depends on installation size and application workflow | Highly scalable | Highly scalable with DNE |

Example: EXAScaler DGX Solution (host part)
- Tesla P100 NVLink (170 TFLOPS)
- Integrated flash parallel file system access via TCP or IB (four ES clients)
- Extreme data access rates for concurrent DGX containers
Software stack: containers, binaries/libraries, EXAScaler DGX Docker volume, host OS.
DGX-1 host: Ubuntu 16.04, kernel 4.4.0-97-generic, OFED-internal-4.0-1.0.1.
EXAScaler DGX Docker volume:
- Lustre ES3.2 kernel modules compiled for the Ubuntu kernel and the host's OFED
- Lustre userspace tools
- Scripts for Lustre mount/umount
Resource isolation (IOs/GPUs/NIC/memory/namespaces) for the application/SW suite.

EXASCALER DGX-1 CONTAINER PINNING
DDN's EXAScaler for DGX manages I/O paths optimally through the DGX-1 to maximize performance for your AI application and to keep IO traffic from consuming internal data paths.
mmap() support: LMDB is key to file management in several frameworks (Caffe), and LMDB relies on mmap().
[Diagram: I/O directed straight to the GPUs over PCIe, with the NVLink and CPU/RAM internal data paths kept free from I/O.]
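The slide does not spell out how the pinning is implemented. As an illustration of the underlying Linux primitive only, the sketch below pins the calling process to a hypothetical set of cores (0-3) assumed to sit on the socket closest to the NIC and GPUs, which is the general idea behind keeping container I/O off the other socket's internal paths; it is not a description of DDN's actual mechanism:

    /* Illustrative only: CPU-affinity pinning with the standard Linux API.
     * The core list (0-3) is a hypothetical "close to the NIC/GPUs" set,
     * not DDN's EXAScaler-for-DGX implementation. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_io_socket(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &mask);

        /* pid 0 means "the calling process". */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

    int main(void) { return pin_to_io_socket() == 0 ? 0 : 1; }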

SIDE NOTE ON LMDB MMAP() 1/2
Lightning Memory-Mapped Database (LMDB):
- Ubiquitous in deep learning frameworks; at the core of file management for Caffe and TensorFlow.
- Declares a file as loaded in memory using mmap().
mmap() declares the file as already in memory, similar to page swap:

    int fd = open("my_file", O_RDONLY, 0);
    void *mmappedData = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
    MD5_Update(&mdContext, mmappedData, filesize);

Memory pages are provisioned to potentially host the whole file size. The first access to a page triggers a page fault, and the page-fault mechanism asks the kernel to emit the I/O requests.
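For completeness, here is a self-contained version of the snippet above. It is a sketch rather than DDN's or LMDB's code: the file name is a placeholder, and MD5_Update() is replaced by a trivial byte sum so that it compiles without OpenSSL; the page-fault behaviour it exercises is the same.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("my_file", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }
        if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return EXIT_FAILURE; }
        size_t filesize = (size_t)st.st_size;

        /* Address space is provisioned for the whole file; nothing is read yet. */
        unsigned char *data = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        /* Touching each page triggers the page faults and hence the I/O.
         * The slide feeds the mapping to MD5_Update(); a plain byte sum
         * keeps this sketch free of the OpenSSL dependency. */
        unsigned long sum = 0;
        for (size_t i = 0; i < filesize; i++)
            sum += data[i];
        printf("bytes=%zu checksum=%lu\n", filesize, sum);

        munmap(data, filesize);
        close(fd);
        return EXIT_SUCCESS;
    }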

SIDE NOTE ON LMDB MMAP() 2/2
The page-fault mechanism asks the kernel to emit the I/O requests:
- No POSIX read() or write() at the application level.
- I/O accesses are handled internally by the kernel / page cache (readahead).
mmap() is nice for data scientists:
- Abstraction of the storage layer.
- Unified API.
mmap() is tough for computer scientists:
- Kernel activity is more complicated to monitor than application activity.
- Tracing tools (ftrace) have significant overhead.
- Work with Julian and Eugen on this topic.
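Because the kernel, not the application, decides when pages are actually fetched, access hints are one of the few knobs the application keeps. The sketch below is an assumption on my part rather than something shown in the deck; it uses posix_madvise() on a mapping obtained as in the previous slide ('data' and 'filesize' are assumed to come from that mmap() call):

    /* Illustrative only: hint the kernel's readahead for an mmap()ed region. */
    #include <stdio.h>
    #include <sys/mman.h>

    void hint_sequential_scan(void *data, size_t filesize)
    {
        /* Ask the page cache to read ahead aggressively for a sequential scan. */
        int rc = posix_madvise(data, filesize, POSIX_MADV_SEQUENTIAL);
        if (rc != 0)
            fprintf(stderr, "posix_madvise: error %d\n", rc);
    }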

CONTAINER PINNING OPTIMIZATION
[Chart: random read (4K) IOPS versus container count (1, 2, 4, 8), comparing directed I/O against the unoptimized configuration; the IOPS axis runs from 0 to 300,000.]

FROM IOPS TO BANDWIDTH
Single container performance (Lustre/SSD):
- 4 KB random read is IOPS bound.
- 1 MB random read is bandwidth bound.
[Chart: IOPS and bandwidth (MB/s) versus IO size, from 4 KB to 1024 KB, for a single container.]
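The two regimes follow from bandwidth = IOPS x IO size. Using the DGX-1 figures quoted later in the deck (about 250 KIOPS and 38 GB/s) as rough reference points: 250,000 requests/s at 4 KB each is only about 1 GB/s, far below the available bandwidth, so the 4 KB case is limited by IOPS; at 1 MB per request, 38 GB/s corresponds to only about 38,000 requests/s, so the 1 MB case is limited by bandwidth.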

DGX CONTAINER THROUGHPUT: SCALING UP WORKLOADS IN DGX-1
[Chart: container throughput (MB/s, up to about 40,000) for 1 to 4 containers, showing the contribution of each container and the total.]

EXASCALER ALL FLASH
Single DGX-1: 38 GB/s, 250 KIOPS.
Remove the I/O burden from data scientists' shoulders:
- I/O is no longer the limiting factor (saturation of the network).
- 250 KIOPS on a DGX-1.
- 1 million IOPS with 4 DGX-1 systems.

x12 Analytics: bringing HPC technologies and know-how to analytics.
Scale speed, scale volume, ultimate flexibility: 40 GB/s in 4RU; 1 TB/s & 10M IOPS; 1 PB in 4RU; over 20 PB.
12x more answers per minute for your AI at any scale.
© 2018 DataDirect Networks, Inc. Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change. ddn.com

Speaker Notes

Slide 1 (Applying DDN to Machine Learning): Machine learning means systems that learn automatically without being explicitly programmed; it is easy to understand but complicated to put in place. The goal is to get value from large amounts of data and to tackle the resolution of complex compute problems. Analytics lets you understand the meaning behind your data; ML is there to predict and act, training a model that learns to take decisions. For this you need powerful compute and efficient storage to feed that compute with data.

Slide 2 (Learning from What?): In our case, intelligent-car data come from images, text (information on traffic), electric vehicle sensors, weather, and why not anomaly data (not an exhaustive list).

Slide 3 (Machine Learning / Analytics / Big Data): How do ML, Big Data and NoSQL interact with such a flow of data? Machine learning uses different kinds of mathematical algorithms (applied-math methods). Different software packages are available to manage that amount of data, and the market now offers several frameworks as well. IO profile: in ML we mainly read, randomly, with high throughput and the file/IO sizes noted on the slide. Image-based deep learning for classification, object detection and segmentation benefits from high streaming bandwidth and random access.


Slide 11 (EXAScaler DGX Solution, host part): Docker is a tool that can package an application and its dependencies into an isolated container, which can then run on any server. This extends the flexibility and portability of running an application, whether on the local machine, a private or public cloud, bare metal, etc. It relies on kernel features and uses resource isolation (CPU, memory, I/O, network connections) as well as separate namespaces to isolate the operating system as seen by the application. The DGX runs Ubuntu, whose kernel changes very quickly (unlike RHEL or CentOS). DDN has developed scripts to compile, install and use a Lustre client on the fly per container, so one container per ML application/suite is possible.
