GPUfs: Integrating A File System With GPUs - Dedis.cs.yale.edu

Transcription

GPUfs: Integrating a File System with GPUs
Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

Traditional System Architecture
[Diagram: applications running on the OS, on the CPU.]

Modern System Architecture
[Diagram: accelerated applications and the OS on the CPU, alongside manycore processors, hybrid CPU-GPUs, GPUs, and FPGAs.]

Software-hardware gap is widening
[Same diagram: accelerated applications, OS, CPU; manycore processors, hybrid CPU-GPUs, GPUs, FPGAs.]

Software-hardware gap is widening
Ad-hoc abstractions and management mechanisms sit between accelerated applications and the accelerator hardware.

On-accelerator OS support closes the programmability gap
[Diagram: accelerated applications and native accelerator applications run over on-accelerator OS support, which coordinates with the OS on the CPU.]

GPUfs: File I/O support for GPUs
- Motivation
- Goals
- Understanding the hardware
- Design
- Implementation
- Evaluation

Building systems with GPUs is hard. Why?

Goal of GPU programming frameworks
[Diagram: the CPU handles data transfers, GPU invocation, and memory management; the GPU runs the parallel algorithm.]

Headache for GPU programmers
Data transfers, invocation, and memory management all fall on the CPU side. Half of the CUDA SDK 4.1 samples contain at least 9 CPU LOC per 1 GPU LOC.

GPU kernels are isolated
[Diagram: the parallel algorithm on the GPU is cut off from data transfers, invocation, and memory management on the CPU.]

Example: accelerating a photo collage

While (Unhappy()) {
    Read next image file()
    Decide placement()
    Remove outliers()
}

CPU Implementation
The application runs the whole loop on the CPU:

While (Unhappy()) {
    Read next image file()
    Decide placement()
    Remove outliers()
}

Offloading computations to GPU
[Diagram: the same loop, with the computational steps moved to the GPU.]

Offloading computations to GPU
Co-processor programming model: the CPU performs the data transfer, starts the kernel on the GPU, and waits for kernel termination.

Kernel start/stop overheads
[Diagram: copy to GPU, invoke kernel, copy to CPU on every iteration.] Costs: cache flush, invocation latency, synchronization.

Hiding the overheads
- Asynchronous invocation
- Manual data reuse management
- Double buffering
[Diagram: overlapped copy-to-GPU / invoke-kernel / copy-to-CPU stages.]

Implementation complexity
Management overhead: asynchronous invocation, manual data reuse management, double buffering.

Implementation complexity
Management overhead: asynchronous invocation, manual data reuse management, double buffering. Why do we need to deal with low-level system details?
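To make the management overhead concrete, here is a minimal sketch of the kind of host-side double-buffering code the slides refer to, using standard CUDA streams. The chunk size, stream count, and the `process` kernel are illustrative, not from the talk:

```cuda
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // elements per chunk (illustrative)
#define NSTREAMS 2        // two streams = double buffering

__global__ void process(float *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;   // placeholder computation
}

// h_data should be pinned (cudaHostAlloc) for the async copies
// to actually overlap with kernel execution.
void pipeline(float *h_data, size_t n) {
    cudaStream_t s[NSTREAMS];
    float *d_buf[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&d_buf[i], CHUNK * sizeof(float));
    }
    // Each chunk's copy-in, kernel, and copy-out are queued on one
    // stream; alternating streams overlaps transfers with compute.
    for (size_t off = 0, c = 0; off < n; off += CHUNK, c++) {
        int i = (int)(c % NSTREAMS);
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        cudaMemcpyAsync(d_buf[i], h_data + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(unsigned)((len + 255) / 256), 256, 0, s[i]>>>(d_buf[i], len);
        cudaMemcpyAsync(h_data + off, d_buf[i], len * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(s[i]);
        cudaFree(d_buf[i]);
    }
}
```

None of this logic is about the application; it exists only to hide transfer and invocation overheads, which is exactly the low-level detail the slide questions.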

The reason is...
GPUs are peer processors. They need I/O OS services.

GPUfs: application view
[Diagram: GPU1, GPU2, and GPU3 call open("shared file"), read("shared file"), write(), and mmap(); GPUfs connects them to the host file system on the CPUs.]

GPUfs: application view
GPUs get a (CPU)-like file API; GPUfs maps it onto the host file system and persistent storage.

Accelerating the collage app with GPUfs
No CPU management code: the GPU open()s and read()s files directly through GPUfs.

Accelerating the collage app with GPUfs
Read-ahead into the GPUfs buffer cache overlaps computations and transfers.

Accelerating the collage app with GPUfs
The buffer cache also provides data reuse and random data access.

Challenge
GPU ≠ CPU

Massive parallelism
Parallelism is essential for performance in deeply multi-threaded, wide-vector hardware: the AMD HD5870 and NVIDIA Fermi run 23,000-31,000 active threads. (From M. Houston / A. Lefohn / K. Fatahalian, "A trip through the architecture of modern GPUs.")

Heterogeneous memory
GPUs inherently impose high bandwidth demands on memory: CPU to its memory, 10-32 GB/s; GPU to its memory, 288-360 GB/s (~20x); CPU-GPU interconnect, 6-16 GB/s.

How to build an FS layer on this hardware?

GPUfs: principled redesign of the whole file system stack
- Relaxed FS API semantics for parallelism
- Relaxed FS consistency for heterogeneous memory
- GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ...

GPUfs high-level design
[Diagram: on the CPU, unchanged applications use the OS File API over the OS file system interface; on the GPU, applications use the GPUfs File API over the GPUfs GPU file I/O library (massive parallelism). GPUfs hooks in the OS and the GPUfs distributed buffer cache (page cache) span CPU memory and GPU memory (heterogeneous memory), backed by the host file system and disk.]

Buffer cache semantics
Local or distributed file system data consistency?

GPUfs buffer cache
Weak data consistency model: close(sync)-to-open semantics, as in AFS. Example: GPU2 issues write(1), fsync(), write(2); GPU1's subsequent open()/read(1) sees write(1), while write(2) is not yet visible to the CPU or other GPUs. This fits the hardware: the remote-to-local memory performance ratio is similar to that of a distributed system.
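The writer/reader timeline above can be sketched with GPUfs-style calls. The g* call signatures here are assumed for illustration (the deck only attests gopen/gread/gwrite/gclose and the O_GRDWR flag; see the paper for the actual API), as is PAGE_SIZE:

```cuda
// GPU2 (writer): updates become visible to other processors only
// at gfsync()/gclose(), per AFS-style close-to-open semantics.
__device__ void writer(const char *name, float *page) {
    int fd = gopen(name, O_GRDWR);
    gwrite(fd, page, PAGE_SIZE, /*offset=*/0);      // write(1): local only
    gfsync(fd);                                     // now visible on open()
    gwrite(fd, page, PAGE_SIZE, /*offset=*/PAGE_SIZE); // write(2): local only
    gclose(fd);
}

// GPU1 (reader): an open() after the writer's gfsync() observes
// write(1); write(2) stays invisible until the writer syncs again.
__device__ void reader(const char *name, float *page) {
    int fd = gopen(name, O_GRDWR);
    gread(fd, page, PAGE_SIZE, /*offset=*/0);
    gclose(fd);
}
```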

In the paper
The on-GPU file I/O API (gopen, gread, gwrite, gclose, gftrunc, ...). Changes in the semantics are crucial.

Implementation bits (in the paper)
- Paging support
- Dynamic data structures and memory allocators
- Lock-free radix tree
- Inter-processor communications (IPC)
- Hybrid H/W-S/W barriers
- Consistency module in the OS kernel
1,500 GPU LOC, 600 CPU LOC.

Evaluation
All benchmarks are written as a GPU kernel: no CPU-side development.

Matrix-vector product (inputs/outputs in files)
Vector of 1x128K elements, page size 2MB, GPU: TESLA C2075. [Plot: throughput vs. input matrix size (MB) for CUDA pipelined, CUDA optimized, and GPU file I/O.]

Word frequency count in text
Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges:
- Dynamic working set
- Small files
- Lots of file I/O (33,000 files, 1-5KB each)
- Unpredictable output size

Results
- Shakespeare (1 file, 6MB): 8 CPUs: 292s; GPU-vanilla: 40s (7.3X); GPU-GPUfs: 40s (7.3X), an 8% overhead over GPU-vanilla.
- Linux source (33,000 files, 524MB): 8 CPUs: 6h50m (7.2X); GPU-GPUfs: 53m (6.8X).
GPUfs adds unbounded input/output size support.

GPUfs is the first system to provide native access to host OS services from GPU programs.
Code is available for download: http://goo.gl/ofJ6J

Our life would have been easier with
- PCI atomics
- Preemptive background daemons
- GPU-CPU signaling support
- In-GPU exceptions
- GPU virtual memory API (host-based or device)
- Compiler optimizations for register-heavy libraries
This seems to have been accomplished in 5.0.

Sequential access to file: 3 versions
- CUDA whole-file transfer: the CPU reads the file, then transfers it to the GPU.
- CUDA pipelined transfer: the CPU reads a chunk and transfers it to the GPU, chunk after chunk.
- GPU file I/O: the GPU calls gmmap().

Sequential read: throughput vs. page size
[Plot: throughput (MB/s) for GPU file I/O, CUDA whole file, and CUDA pipeline across page sizes.]

Sequential read: throughput vs. page size
[Plot: throughput (MB/s) for GPU file I/O, CUDA whole file, and CUDA pipeline across page sizes 16K-2M.] Benefit: decouple performance constraints from application logic.

Yesterday: accelerators as coprocessors. With on-accelerator OS support: accelerators as peers.

What about accelerators as peers, rather than accelerators as coprocessors?

Set GPUs free!

Parallel square root on GPU

gpu_thread(thread_id i) {
    float buffer;
    int fd = gopen(filename, O_GRDWR);
    size_t offset = sizeof(float) * i;
    gread(fd, &buffer, sizeof(float), offset);
    buffer = sqrt(buffer);
    gwrite(fd, &buffer, sizeof(float), offset);
    gclose(fd);
}

The same code runs in all the GPU threads.

GPUfs impact on GPU programs
- Memory overhead
- Register pressure
- Very little CPU coding
- Makes exitless GPU kernels possible
Pay-as-you-go design.

Preserve CPU semantics?
What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU threads are different from CPU threads. [Diagram: a CPU thread vs. a SIMD vector of GPU threads.]

Preserve CPU semantics?
What does it mean to open/read/write/close/mmap a file in thousands of threads? A GPU kernel is a single data-parallel application; GPU threads are different from CPU threads. [Diagram: a CPU thread vs. a SIMD vector of GPU threads.]

GPUfs semantics (see more discussion in the paper)
int fd = gopen("filename", O_GRDWR);
- One call per SIMD vector: bulk-synchronous, cooperative execution.
- One file descriptor per file: open()/close() cached on a GPU.
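A sketch of what "one call per SIMD vector" looks like from a kernel's point of view. The collective-call convention shown here (all threads of a group reach gopen() together and receive the same descriptor) follows the slide's description; the kernel body and signature are illustrative assumptions:

```cuda
__global__ void process_file(const char *filename) {
    // Bulk-synchronous, cooperative execution: all threads of the
    // SIMD group call gopen() together; GPUfs performs one open and
    // returns the same cached file descriptor to every caller.
    int fd = gopen(filename, O_GRDWR);

    // ... the threads then cooperate on the file via the shared fd,
    // each issuing collective gread()/gwrite() calls as a group ...

    gclose(fd);   // likewise one logical close per SIMD group
}
```

This is why open()/close() can be cached per GPU: there is one descriptor per file, not one per thread.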

GPU hardware characteristics
- Parallelism
- Heterogeneous memory

API semantics
int fd = gopen("filename", O_GRDWR);

API semantics
int fd = gopen("filename", O_GRDWR);
[Shown on the CPU.]

This code runs in 100,000 GPU threads
int fd = gopen("filename", O_GRDWR);
[CPU → GPU]
