Modern C++, Heterogeneous Computing & OpenCL SYCL - IWOCL

Transcription

Modern C++, heterogeneous computing & OpenCL SYCL
Ronan Keryell
Khronos OpenCL SYCL committee
05/12/2015, IWOCL 2015 SYCL Tutorial

Outline
1. C++14
2. C++ dialects for OpenCL (and heterogeneous computing)
3. OpenCL SYCL 1.2
   - C++... putting everything altogether
4. OpenCL SYCL 2.1...
5. Conclusion

Modern C++, heterogeneous computing & OpenCL SYCL, IWOCL 2015

C++14: C++14
- 2 Open Source compilers available before ratification (GCC & Clang/LLVM)
- Confirm new momentum & pace: 1 major (C++11) and 1 minor (C++14) version on a 6-year cycle
- Next big version expected in 2017 (C++1z)
  - Already being implemented!
- Monolithic committee replaced by many smaller parallel task forces
  - Parallelism TS (Technical Specification) with Parallel STL
  - Concurrency TS (threads, mutex...)
  - Array TS (multidimensional arrays à la Fortran)
  - Transactional Memory TS...
  Race to parallelism! Definitely matters for HPC and heterogeneous computing!
- C++ is a complete new language
  - Forget about C++98, C++03...
  - Send your proposals and get involved in C++ committee (pushing heterogeneous computing)!

C++14: Modern C++ & HPC (I)
- Huge library improvements
  - <thread> library and multithread memory model, <atomic>: relevant to HPC
  - Hash-map
  - Algorithms
  - Random numbers
  - ...
- Uniform initialization and range-based for loops

    std::vector<int> my_vector { 1, 2, 3, 4, 5 };
    for (int &e : my_vector)
      e += 1;

- Easy functional programming style with lambda (anonymous) functions

    // The third argument is the output iterator: transform v in place
    std::transform(std::begin(v), std::end(v), std::begin(v),
                   [] (int e) { return 2*e; });

C++14: Modern C++ & HPC (II)
- Lot of meta-programming improvements to make meta-programming easier: variadic templates, type traits <type_traits>...
- Make simple things simpler to be able to write generic numerical libraries, etc.
- Automatic type inference for terse programming
  - Python 3.x (interpreted):

      def add(x, y):
          return x + y
      print(add(2, 3))      # 5
      print(add("2", "3"))  # 23

  - Same in C++14 but compiled + static compile-time type-checking:

      auto add = [] (auto x, auto y) { return x + y; };
      std::cout << add(2, 3) << std::endl;       // 5
      std::cout << add("2"s, "3"s) << std::endl; // 23

    Without using templated code (no explicit template <typename> syntax)!
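The slide names variadic templates and type traits without showing them. A minimal sketch of both, assuming a hypothetical helper called `sum` (not taken from the talk) that adds any number of arithmetic values and lets `std::common_type_t` choose the result type:

```cpp
#include <cassert>
#include <type_traits>

// Base case: the sum of a single value is the value itself.
template <typename T>
constexpr T sum(T value) {
    static_assert(std::is_arithmetic<T>::value, "sum() expects arithmetic arguments");
    return value;
}

// Recursive case: peel off the first argument and add the sum of the rest.
// std::common_type_t picks a result type able to represent every argument.
template <typename T, typename... Rest>
constexpr std::common_type_t<T, Rest...> sum(T first, Rest... rest) {
    static_assert(std::is_arithmetic<T>::value, "sum() expects arithmetic arguments");
    return first + sum(rest...);
}
```

Calling `sum(1, 2.5, 3)` mixes `int` and `double`; the type trait makes the result a `double`, and a non-arithmetic argument is rejected at compile time by the `static_assert`.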

C++14: Modern C++ & HPC (III)
- R-value references & std::move semantics
  - matrix_A = matrix_B + matrix_C
    Avoid copying (TB, PB, EB...) when assigning or function return
- Avoid raw pointers, malloc()/free()/delete[]: use references and smart pointers instead

    // Allocate a double with new() and wrap it in a smart pointer
    auto gen() { return std::make_shared<double>(3.14); }
    [...]
    {
      auto p = gen(), q = p;
      *q = 2.718;
      // Out of scope, no longer use of the memory: deallocation happens here
    }

- Lot of other amazing stuff...
- Allow both low-level & high-level programming... Useful for heterogeneous computing
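The matrix assignment on this slide can be made concrete. The following is a sketch only: `Matrix` is a hypothetical type invented for illustration, with a counter of deep copies so that the effect of move semantics on `a = b + c` is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical matrix type that counts deep copies.
struct Matrix {
    static int copies;          // number of deep copies performed so far
    std::vector<double> data;

    explicit Matrix(std::size_t n, double v = 0.0) : data(n, v) {}
    Matrix(const Matrix& other) : data(other.data) { ++copies; }
    Matrix(Matrix&& other) noexcept : data(std::move(other.data)) {}
    Matrix& operator=(const Matrix& other) {
        data = other.data;
        ++copies;
        return *this;
    }
    Matrix& operator=(Matrix&& other) noexcept {
        data = std::move(other.data);
        return *this;
    }
};

int Matrix::copies = 0;

// Element-wise sum (assumes both operands have the same size).
// The result is returned by value: it is moved or elided, never deep-copied.
Matrix operator+(const Matrix& lhs, const Matrix& rhs) {
    Matrix result(lhs.data.size());
    for (std::size_t i = 0; i < lhs.data.size(); ++i)
        result.data[i] = lhs.data[i] + rhs.data[i];
    return result;
}
```

With these definitions, `a = b + c` steals the storage of the temporary through the move assignment operator, so `Matrix::copies` stays at zero; only an explicit copy such as `Matrix d = a;` increments it.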


C++ dialects for OpenCL (and heterogeneous computing): OpenCL 2.1 C++ kernel language (I)
- Announced at GDC, March 2015
- Move from C99-based kernel language to C++14-based

    // Template classes to express OpenCL address spaces
    local_array<int, N> array;
    local<float> v;
    constant_ptr<double> p;
    // Use C++11 generalized attributes, to ignore vector dependencies
    [[safelen(8), ivdep]]
    for (int i = 0; i < N; i++)
      // Can infer that offset >= 8
      array[i+offset] = array[i] + 42;

C++ dialects for OpenCL (and heterogeneous computing): OpenCL 2.1 C++ kernel language (II)
- Kernel side enqueue
  - Replace OpenCL 2 infamous Apple GCD block syntax by C++11 lambda

    kernel void main_kernel(int N, int *array) {
      // Only work-item 0 will launch a new kernel
      if (get_global_id(0) == 0)
        // Wait for the end of this work-group before starting the new kernel
        get_default_queue().enqueue_kernel(CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP,
                                           ndrange { N },
                                           [=] kernel {
                                             array[get_global_id(0)] = 7;
                                           });
    }

- C++14 memory model and atomic operations
- Newer SPIR-V binary IR format

C++ dialects for OpenCL (and heterogeneous computing): OpenCL 2.1 C++ kernel language (III)
- Amazing progress but no single source solution à la CUDA yet
  - Still need to play with OpenCL host API to deal with buffers, etc.

C++ dialects for OpenCL (and heterogeneous computing): Bolt C++ (I)
- Parallel STL + map-reduce https://github.com/HSA-Libraries/Bolt
- Developed by AMD on top of OpenCL, C++ AMP or TBB

    #include <bolt/cl/sort.h>
    #include <vector>
    #include <algorithm>
    #include <cstdlib>   // for rand()

    int main() {
      // generate random data (on host)
      std::vector<int> a(8192);
      std::generate(a.begin(), a.end(), rand);
      // sort, run on best device in the platform
      bolt::cl::sort(a.begin(), a.end());
      return 0;
    }

- Simple!

C++ dialects for OpenCL (and heterogeneous computing): Bolt C++ (II)
- But...
  - No direct interoperability with OpenCL world
  - No specific compiler required with OpenCL; some special syntax to define operation on device
    - OpenCL kernel source strings for complex operations with macros BOLT_FUNCTOR(), BOLT_CREATE_TYPENAME(), BOLT_CREATE_CLCODE()...
  - Work better with AMD Static C++ Kernel Language Extension (now in OpenCL 2.1) & best with C++ AMP (but no OpenCL interoperability...)

C++ dialects for OpenCL (and heterogeneous computing): Boost.Compute (I)
- Boost library accepted in 2015 https://github.com/boostorg/compute
- Provide 2 levels of abstraction
  - High-level parallel STL
  - Low-level C++ wrapping of OpenCL concepts

C++ dialects for OpenCL (and heterogeneous computing): Boost.Compute (II)

    // Get a default command queue on the default accelerator
    auto queue = boost::compute::system::default_queue();
    // Allocate a vector in a buffer on the device
    boost::compute::vector<float> device_vector { N, queue.get_context() };
    boost::compute::iota(device_vector.begin(), device_vector.end(), 0);
    // Create an equivalent OpenCL kernel
    BOOST_COMPUTE_FUNCTION(float, add_four, (float x), { return x + 4; });
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(), add_four, queue);
    boost::compute::sort(device_vector.begin(), device_vector.end(), queue);
    // Lambda expression equivalent
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(),
                              boost::compute::lambda::_1 * 3 - 4, queue);

C++ dialects for OpenCL (and heterogeneous computing): Boost.Compute (III)
- Elegant implicit C++ conversions between OpenCL and Boost.Compute types for finer control and optimizations

    auto command_queue = boost::compute::system::default_queue();
    auto context = command_queue.get_context();
    auto program =
      boost::compute::program::create_with_source_file(kernel_file_name,
                                                       context);
    program.build();
    boost::compute::kernel im2col_kernel { program, "im2col" };
    boost::compute::buffer im_buffer { context, image_size*sizeof(float),
                                       CL_MEM_READ_ONLY };
    command_queue.enqueue_write_buffer(im_buffer, 0 /* Offset */,
                                       im_data.size()*sizeof(decltype(im_data)::value_type),
                                       im_data.data());

C++ dialects for OpenCL (and heterogeneous computing): Boost.Compute (IV)

    im2col_kernel.set_args(im_buffer,
                           height, width,
                           ksize_h, ksize_w,
                           pad_h, pad_w,
                           stride_h, stride_w,
                           height_col, width_col,
                           data_col);
    command_queue.enqueue_nd_range_kernel(im2col_kernel,
      boost::compute::extents<1> { 0 } /* global work offset */,
      boost::compute::extents<1> { workitems } /* global work-item */,
      boost::compute::extents<1> { workgroup_size } /* Work group size */);

- Provide program caching
- Direct OpenCL interoperability for extreme performance
- No specific compiler required; some special syntax to define operation on device
- Probably the right tool to use to translate CUDA & Thrust to OpenCL world

C++ dialects for OpenCL (and heterogeneous computing): VexCL (I)
- Parallel STL similar to Boost.Compute, plus mathematical libraries
  - Random generators (Random123)
  - FFT
  - Tensor operations
  - Sparse matrix-vector products
  - Stencil convolutions
  - ...
- OpenCL (CL.hpp or Boost.Compute) & CUDA back-end
- Allow device vectors & operations to span different accelerators from different vendors in a same context

    vex::Context ctx { vex::Filter::Type { CL_DEVICE_TYPE_GPU }
                       && vex::Filter::DoublePrecision };
    vex::vector<double> A { ctx, N }, B { ctx, N }, C { ctx, N };
    A = 2*B - sin(C);

C++ dialects for OpenCL (and heterogeneous computing): VexCL (II)
- Allow easy interoperability with back-end

    // Get the cl_buffer storing A on the device 2
    auto clBuffer = A(2);

- Use heroic meta-programming to generate kernels without using specific compiler with deep embedded DSL
  - Use symbolic types (prototypal arguments) to extract function structure

    // Set recorder for expression sequence
    std::ostringstream body;
    vex::generator::set_recorder(body);
    vex::symbolic<double> sym_x { vex::symbolic<double>::VectorParameter };
    sym_x = sin(sym_x) + 3;
    sym_x = cos(2*sym_x) + 5;
    // Build kernel from the recorded sequence
    auto foobar = vex::generator::build_kernel(ctx, "foobar",
                                               body.str(), sym_x);

C++ dialects for OpenCL (and heterogeneous computing): VexCL (III)

    // Now use the kernel
    foobar(A);

- VexCL is probably the most advanced tool to generate OpenCL without requiring a specific compiler...
- Interoperable with OpenCL, Boost.Compute for extreme performance & ViennaCL
- Kernel caching to avoid useless compiling
- Probably the right tool to use to translate CUDA & Thrust to OpenCL world
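The idea of symbolic types recording the structure of a function can be illustrated without any library. The sketch below is not VexCL's implementation: `Symbolic`, `formula` and `generate_kernel` are hypothetical names, and the generated string only imitates an OpenCL kernel. It shows one generic function evaluated either numerically or symbolically, the symbolic run producing source text.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical symbolic value: carries the text of an expression, not a number.
struct Symbolic {
    std::string expr;
};

// Overloaded operations build source text instead of computing.
Symbolic operator+(const Symbolic& a, const Symbolic& b) {
    return {"(" + a.expr + " + " + b.expr + ")"};
}
Symbolic operator*(const Symbolic& a, const Symbolic& b) {
    return {"(" + a.expr + " * " + b.expr + ")"};
}
Symbolic sin(const Symbolic& a) {
    return {"sin(" + a.expr + ")"};
}

// Written once; works with double (numeric) or Symbolic (recording).
template <typename T>
T formula(T x) {
    using std::sin;  // numeric overloads; Symbolic is found by argument-dependent lookup
    return sin(x) * x + x;
}

// Run the formula on a symbolic argument and wrap the recorded text in a kernel.
std::string generate_kernel(const std::string& name) {
    const Symbolic body = formula(Symbolic{"x[i]"});
    std::ostringstream src;
    src << "kernel void " << name << "(global double *x) {\n"
        << "  size_t i = get_global_id(0);\n"
        << "  x[i] = " << body.expr << ";\n"
        << "}\n";
    return src.str();
}
```

The design point is that no special compiler is involved: ordinary overload resolution turns the call `formula(Symbolic{"x[i]"})` into the string `((sin(x[i]) * x[i]) + x[i])`, which a runtime could then hand to an OpenCL compiler.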

C++ dialects for OpenCL (and heterogeneous computing): ViennaCL
- viennacl-dev
- OpenCL/CUDA/OpenMP back-end
- Similar to VexCL for sharing context between various platforms
- Linear algebra (dense & sparse)
- Iterative solvers
- FFT
- OpenCL kernel generator from high-level expressions
- Some interoperability with Matlab

C++ dialects for OpenCL (and heterogeneous computing): C++ AMP

    // Use iota algorithm in C++ AMP
    #include <amp.h>
    #include <iostream>

    enum { NWITEMS = 512 };
    int data[NWITEMS];

    // To avoid writing Concurrency:: everywhere
    using namespace Concurrency;

    void iota_n(size_t n, int dst[]) {
      // Select the first true accelerator found as the default one
      for (auto const & acc : accelerator::get_all())
        if (!acc.get_is_emulated()) {
          accelerator::set_default(acc.get_device_path());
          break;
        }
      // Define the iteration space
      extent<1> e(n);
      // Create a buffer from the given array memory
      array_view<int, 1> a(e, dst);
      // Is there a better way to express write-only data?
      a.discard_data();
      // Execute a kernel in parallel
      parallel_for_each(e,
        // Define the kernel to execute
        [=] (Concurrency::index<1> i) restrict(amp) {
          a[i] = i[0];
        });
      // In the destruction of array_view "a" happening here,
      // the data are copied back before iota_n() returns
    }

- Developed by Microsoft, AMD & MultiCoreWare
- Single source: easy to write kernels
- Require specific compiler
- Not pure C++ (restrict, tile_static)
- No OpenCL interoperability
- Difficult to optimize the data transfers

C++ dialects for OpenCL (and heterogeneous computing): OpenMP 4

    #include <stdio.h>

    enum { NWITEMS = 512 };
    int array[NWITEMS];

    void iota_n(size_t n, int dst[n]) {
    /* An OpenMP array section is [lower-bound : length], so n elements are mapped */
    #pragma omp target map(from: dst[0:n])
    #pragma omp parallel for
      for (int i = 0; i < n; i++)
        dst[i] = i;
    }

    int main(int argc, const char *argv[]) {
      iota_n(NWITEMS, array);
      // Display results
      for (int i = 0; i < NWITEMS; i++)
        printf("%d %d\n", i, array[i]);
      return 0;
    }

- Old HPC standard from the 90's
- Use #pragma to express parallelism
- OpenMP 4 extends it to accelerators
  - Work-group parallelism
  - Work-item parallelism
- Deal with CPU & heterogeneous computing parallelism
- No LDS support
- No OpenCL interoperability
- But quite simple! Single source...

C++ dialects for OpenCL (and heterogeneous computing): Other (non-)OpenCL C++ framework (I)
- ArrayFire, Aura, CLOGS, hemi, HPL, Kokkos, MTL4, SkelCL, SkePU, EasyCL...
- nVidia CUDA 7 now C++11-based
  - Single source: simpler for the programmer
  - nVidia Thrust: parallel STL + map-reduce on top of CUDA, OpenMP or TBB https://github.com/thrust/thrust
    - Not very clean because device pointers returned by cudaMalloc() do not have a special type: use some ugly casts
- OpenACC: OpenMP 4 restricted to accelerators + LDS finer control

C++ dialects for OpenCL (and heterogeneous computing): Missing link...
- No tool providing
  - OpenCL interoperability
  - Modern C++ environment
  - Single source for programming productivity


OpenCL SYCL 1.2: Puns and pronunciation explained
- OpenCL SYCL: sickle [ˈsi-kəl]
- OpenCL SPIR: spear [ˈspir]

OpenCL SYCL 1.2: OpenCL SYCL goals
- Ease of use
  - Single source programming model
    - Take advantage of CUDA & C++ AMP simplicity and power
    - Compiled for host and device(s)
  - Easy development/debugging on host: host fall-back target
  - Programming interface based on abstraction of OpenCL components (data management, error handling...)
- Most modern C++ features available for OpenCL
  - Enabling the creation of higher level programming models
  - C++ templated libraries based on OpenCL
  - Exceptions for error handling
- Portability across platforms and compilers
- Providing the full OpenCL feature set and seamless integration with existing OpenCL code
- Task graph programming model with interface à la TBB/Cilk (C++17)
- High performance

http://www.khronos.org/opencl/sycl

OpenCL SYCL 1.2: Complete example of matrix addition in OpenCL SYCL

    #include <CL/sycl.hpp>
    #include <iostream>

    using namespace cl::sycl;

    constexpr size_t N = 2;
    constexpr size_t M = 3;
    using Matrix = float[N][M];

    int main() {
      Matrix a = { { 1, 2, 3 }, { 4, 5, 6 } };
      Matrix b = { { 2, 3, 4 }, { 5, 6, 7 } };

      Matrix c;

      { // Create a queue to work on
        queue myQueue;
        // Wrap some buffers around our data
        buffer<float, 2> A { a, range<2> { N, M } };
        buffer<float, 2> B { b, range<2> { N, M } };
        buffer<float, 2> C { c, range<2> { N, M } };
        // Enqueue some computation kernel task
        myQueue.submit([&](handler& cgh) {
          // Define the data used/produced
          auto ka = A.get_access<access::read>(cgh);
          auto kb = B.get_access<access::read>(cgh);
          auto kc = C.get_access<access::write>(cgh);
          // Create & call OpenCL kernel named "mat_add"
          cgh.parallel_for<class mat_add>(range<2> { N, M },
            [=](id<2> i) { kc[i] = ka[i] + kb[i]; }
          );
        }); // End of our commands for this queue
      } // End scope, so wait for the queue to complete.
      // Copy back the buffer data with RAII behaviour.
      return 0;
    }

OpenCL SYCL 1.2: Asynchronous task graph model
- Theoretical graph of an application described implicitly with kernel tasks using buffers through accessors

  [Figure: task graph. Kernels init_a and init_b each use a cl::sycl::accessor<write> on cl::sycl::buffer a and cl::sycl::buffer b; kernel matrix_add uses cl::sycl::accessor<read> on a and b and a cl::sycl::accessor<write> on cl::sycl::buffer c; Display uses a cl::sycl::accessor<read> on c.]

- Possible schedule by SYCL runtime: init_b, init_a, matrix_add, Display
  - Automatic overlap of kernels & communications
    - Even better when looping around in an application
    - Assume it will be translated into pure OpenCL event graph
    - Runtime uses as many threads & OpenCL queues as necessary (AMD synchronous queues, AMD compute rings, AMD DMA rings...)
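How a runtime could derive such a graph from accessor declarations can be sketched in plain C++. This is an illustration under stated assumptions, not how any SYCL implementation is written: `Task`, `intersects` and `dependencies` are hypothetical names, and buffers are identified by strings. Each task declares the buffers it reads and writes, and a later task depends on an earlier one whenever they conflict on a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of a kernel task: the buffers it reads and writes.
struct Task {
    std::string name;
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two sets share at least one buffer name.
bool intersects(const std::set<std::string>& a, const std::set<std::string>& b) {
    for (const auto& x : a)
        if (b.count(x)) return true;
    return false;
}

// For each task, in submission order, list the earlier tasks it must wait for:
// read-after-write, write-after-read and write-after-write conflicts.
std::vector<std::vector<std::size_t>> dependencies(const std::vector<Task>& tasks) {
    std::vector<std::vector<std::size_t>> deps(tasks.size());
    for (std::size_t j = 0; j < tasks.size(); ++j)
        for (std::size_t i = 0; i < j; ++i) {
            const bool raw = intersects(tasks[i].writes, tasks[j].reads);
            const bool war = intersects(tasks[i].reads, tasks[j].writes);
            const bool waw = intersects(tasks[i].writes, tasks[j].writes);
            if (raw || war || waw) deps[j].push_back(i);
        }
    return deps;
}
```

Applied to the four tasks of the figure, init_a and init_b come out with no dependencies (so they may run in either order or concurrently), matrix_add waits for both, and Display waits for matrix_add.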

OpenCL SYCL 1.2: Task graph programming, the code

    #include <CL/sycl.hpp>
    #include <cstdlib>
    #include <iostream>
    using namespace cl::sycl;
    // Size of the matrices
    const size_t N = 2000;
    const size_t M = 3000;
    int main() {
      { // By sticking all the SYCL work in a {} block, we ensure
        // all SYCL tasks must complete before exiting the block

        // Create a queue to work on
        queue myQueue;
        // Create some 2D buffers of double for our matrices
        buffer<double, 2> a({ N, M });
        buffer<double, 2> b({ N, M });
        buffer<double, 2> c({ N, M });
        // Launch a first asynchronous kernel to initialize a
        myQueue.submit([&](auto &cgh) {
            // The kernel writes a, so get a write accessor on it
            auto A = a.get_access<access::write>(cgh);

            // Enqueue parallel kernel on a N*M 2D iteration space
            cgh.parallel_for<class init_a>({ N, M },
                               [=] (auto index) {
                                 A[index] = index[0]*2 + index[1];
                               });
          });
        // Launch an asynchronous kernel to initialize b
        myQueue.submit([&](auto &cgh) {
            // The kernel writes b, so get a write accessor on it
            auto B = b.get_access<access::write>(cgh);
            /* From the access pattern above, the SYCL runtime detects
               this command_group is independent from the first one
               and can be scheduled independently */

            // Enqueue a parallel kernel on a N*M 2D iteration space
            cgh.parallel_for<class init_b>({ N, M },
                               [=] (auto index) {
                                 B[index] = index[0]*2014 + index[1]*42;
                               });
          });
        // Launch an asynchronous kernel to compute matrix addition c = a + b
        myQueue.submit([&](auto &cgh) {
            // In the kernel a and b are read, but c is written
            auto A = a.get_access<access::read>(cgh);
            auto B = b.get_access<access::read>(cgh);
            auto C = c.get_access<access::write>(cgh);
            // From these accessors, the SYCL runtime will ensure that when
            // this kernel is run, the kernels computing a and b completed

            // Enqueue a parallel kernel on a N*M 2D iteration space
            cgh.parallel_for<class matrix_add>({ N, M },
                                           [=] (auto index) {
                                             C[index] = A[index] + B[index];
                                           });
          });
        /* Request an access to read c from the host side. The SYCL runtime
           ensures that c is ready when the accessor is returned */
        auto C = c.get_access<access::read, access::host_buffer>();
        std::cout << std::endl << "Result:" << std::endl;
        for (size_t i = 0; i < N; i++)
          for (size_t j = 0; j < M; j++)
            // Compare the result to the analytic value
            if (C[i][j] != i*(2 + 2014) + j*(1 + 42)) {
              std::cout << "Wrong value " << C[i][j] << " on element "
                        << i << ' ' << j << std::endl;
              exit(-1);
            }
      } /* End scope of myQueue, this waits for any remaining operations on the
           queue to complete */
      std::cout << "Good computation!" << std::endl;
      return 0;
    }

OpenCL SYCL 1.2: From work-groups & work-items to hierarchical parallelism

    const int size = 10;
    int data[size];
    const int gsize = 2;
    buffer<int> my_buffer { data, size };

    my_queue.submit([&](auto &cgh) {
        auto in = my_buffer.get_access<access::read>(cgh);
        auto out = my_buffer.get_access<access::write>(cgh);
        // Iterate on the work-group
        cgh.parallel_for_workgroup<class hierarchical>({ size, gsize },
          [=](group<> grp) {
            // Code executed only once per work-group
            std::cerr << "Gid=" << grp[0] << std::endl;
            // Iterate on the work-items of a work-group
            cgh.parallel_for_workitem(grp, [=](item<1> tile) {
                std::cerr << "id = " << tile.get_local()[0]
                          << " " << tile.get_global()[0] << std::endl;
                out[tile] = in[tile] * 2;
              });
            // Can have other cgh.parallel_for_workitem() here...
          });
      });

- Very close to OpenMP 4 style!
- Easy to understand the concept of work-groups
- Easy to write work-group only code
- Replace code + barriers with several parallel_for_workitem()
  - Performance-portable between CPU and GPU
  - No need to think about barriers (automatically deduced)
  - Easier to compose components & algorithms
  - Ready for future GPU with non-uniform work-group size
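Why several work-item loops can stand in for explicit barriers is easiest to see on a sequential host model. The sketch below is plain C++14, not the SYCL API: `for_each_work_group` and `double_then_reverse_in_group` are hypothetical names, and the global size is assumed to be a multiple of the group size. Each work-item loop finishes for the whole group before the next one starts, which is exactly the guarantee a barrier would give.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host model of hierarchical parallelism (hypothetical helper).
// per_group runs once per work-group and receives the group id plus a function
// that runs a body once for every work-item of that group.
template <typename GroupBody>
void for_each_work_group(std::size_t global_size, std::size_t group_size,
                         GroupBody per_group) {
    for (std::size_t g = 0; g < global_size / group_size; ++g) {
        auto for_each_work_item = [=](auto per_item) {
            for (std::size_t local = 0; local < group_size; ++local)
                per_item(local, g * group_size + local);  // (local id, global id)
        };
        per_group(g, for_each_work_item);
    }
}

// Phase 1 doubles every element; phase 2 reverses each group's slice.
// Phase 2 reads values written by other work-items of the same group, which is
// safe because phase 1 has completed for the whole group when phase 2 begins.
std::vector<int> double_then_reverse_in_group(const std::vector<int>& in,
                                              std::size_t group_size) {
    std::vector<int> doubled(in.size()), out(in.size());
    for_each_work_group(in.size(), group_size,
        [&](std::size_t g, auto for_each_work_item) {
            for_each_work_item([&](std::size_t, std::size_t global) {
                doubled[global] = 2 * in[global];
            });
            // Implicit barrier: the loop above is finished for this group.
            for_each_work_item([&](std::size_t local, std::size_t global) {
                out[global] = doubled[g * group_size + (group_size - 1 - local)];
            });
        });
    return out;
}
```

For the input `{1, 2, 3, 4}` with groups of 2, phase 1 yields `{2, 4, 6, 8}` and phase 2 swaps within each pair.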

OpenCL SYCL 1.2: C++11 allocators
- C++11 allocators to control the way objects are allocated in memory
  - For example to allocate some vectors on some storage
  - Concept of scoped_allocator to control storage of nested data structures
  - Example: vector of strings, with vector data and string data allocated in different memory areas (speed, power consumption, caching, read-only...)
- SYCL reuses allocator to specify how buffer and image are allocated on the host side
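A minimal C++11 allocator shows how little is needed to redirect where a container gets its memory. This is a sketch: `CountingAllocator` and `bytes_in_use` are hypothetical names, and plain `operator new` with a byte counter stands in for "some special storage".

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Bytes currently handed out by CountingAllocator (hypothetical bookkeeping
// standing in for a special memory area).
std::size_t& bytes_in_use() {
    static std::size_t count = 0;
    return count;
}

// Minimal C++11 allocator: value_type, allocate, deallocate, rebinding
// constructor and equality are all that std::allocator_traits requires.
template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        bytes_in_use() += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        bytes_in_use() -= n * sizeof(T);
        ::operator delete(p);
    }
};

// All instances are interchangeable: memory from one can be freed by another.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }
```

A `std::vector<int, CountingAllocator<int>>` then routes every allocation through `allocate()`, and the counter returns to zero once the vector is destroyed.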


OpenCL SYCL 1.2, C++... putting everything altogether: Exascale-ready
- Use your own C++ compiler
  - Only kernel outlining needs SYCL compiler
- SYCL with C++ can address most of the hierarchy levels
  - MPI
  - OpenMP
  - C++-based PGAS (Partitioned Global Address Space) DSeL (Domain-Specific embedded Language, such as Coarray C++...)
  - Remote accelerators in clusters
  - Use SYCL buffer allocator for
    - RDMA
    - Out-of-core, mapping to a file
    - PiM (Processor in Memory)
    - ...

OpenCL SYCL 1.2, C++... putting everything altogether: Debugging
- Difficult to debug code or detect precondition violation on GPU and at large...
- Rely on C++ to help debugging
  - Overload some operations and functions to verify preconditions
  - Hide tracing/verification code in constructors/destructors
  - Can use pure-C++ host implementation for bug-tracking with favorite debugger
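Both debugging techniques fit in a few lines of host C++. The sketch below uses hypothetical names (`CheckedView`, `ScopeTrace`, `scale`) and is not part of SYCL: an overloaded `operator[]` verifies its precondition, and a constructor/destructor pair records entry to and exit from a scope.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical accessor-like wrapper: operator[] checks the index first.
template <typename T>
class CheckedView {
public:
    CheckedView(T* data, std::size_t size) : data_(data), size_(size) {}

    T& operator[](std::size_t i) const {
        if (i >= size_)
            throw std::out_of_range("CheckedView: index " + std::to_string(i) +
                                    " out of range");
        return data_[i];
    }
    std::size_t size() const { return size_; }

private:
    T* data_;
    std::size_t size_;
};

// RAII tracer: logs scope entry in the constructor and exit in the destructor.
class ScopeTrace {
public:
    ScopeTrace(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) {
        log_.push_back("enter " + name_);
    }
    ~ScopeTrace() { log_.push_back("leave " + name_); }
    ScopeTrace(const ScopeTrace&) = delete;
    ScopeTrace& operator=(const ScopeTrace&) = delete;

private:
    std::vector<std::string>& log_;
    std::string name_;
};

// Example "kernel body" written against the checked view.
void scale(CheckedView<int> v, int factor, std::vector<std::string>& log) {
    ScopeTrace trace(log, "scale");
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= factor;
}
```

Because the kernel body only sees the view type, the checked version can be swapped in on the host for bug-tracking and replaced by an unchecked one for production runs.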

OpenCL SYCL 1.2, C++... putting everything altogether: Poor-man SVM with C++11 + SYCL
- For complex data structures
  - Objects need to be in buffers to be shipped between CPU and devices
  - Do not want marshaling/unmarshaling objects...
  - Use C++11 allocator to allocate some objects in 1 SYCL buffer
    - Useful to send efficiently data through MPI and RDMA too!
  - But since no SVM, not same address on CPU and GPU side...
    - How to deal with pointers?
    - Override all pointer accesses (for example use std::pointer_traits) to do address translation on kernel side
    - Cost: 1 addition per *p
- When no or inefficient SVM...
  - Also useful optimization when need to work on a copy only on the GPU
    - Only allocation on GPU side
    - Spare some TLB thrashing on the CPU
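The address translation described here can be sketched with offsets instead of raw addresses. Everything in this example is an assumption made for illustration (`OffsetPtr`, `Node`, `build_list`, `sum_list`); it does not use `std::pointer_traits` and a second host vector plays the role of the device copy. A linked list is laid out inside one byte buffer and its links survive a copy of that buffer to a different address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical "pointer" stored as a byte offset from the start of the buffer
// holding the objects, so it stays valid wherever the buffer is copied.
template <typename T>
struct OffsetPtr {
    std::ptrdiff_t offset;  // a negative value plays the role of nullptr

    bool is_null() const { return offset < 0; }

    // Address translation: one addition (base + offset) per dereference.
    T load(const unsigned char* base) const {
        T object;
        std::memcpy(&object, base + offset, sizeof(T));
        return object;
    }
};

struct Node {
    int value;
    OffsetPtr<Node> next;
};

// Lay out a linked list of the given values inside one raw byte buffer.
std::vector<unsigned char> build_list(const std::vector<int>& values) {
    std::vector<unsigned char> buffer(values.size() * sizeof(Node));
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::ptrdiff_t next =
            (i + 1 < values.size())
                ? static_cast<std::ptrdiff_t>((i + 1) * sizeof(Node))
                : -1;
        const Node n{values[i], OffsetPtr<Node>{next}};
        std::memcpy(buffer.data() + i * sizeof(Node), &n, sizeof(Node));
    }
    return buffer;
}

// Walk the list from offset 0 and add up the values.
// Precondition: the buffer holds at least one node.
int sum_list(const unsigned char* base) {
    int total = 0;
    for (OffsetPtr<Node> p{0}; !p.is_null();) {
        const Node n = p.load(base);
        total += n.value;
        p = n.next;
    }
    return total;
}
```

Copying the buffer byte for byte, as a transfer to a device would, leaves every link usable because only the base address changes.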

OpenCL SYCL 1.2, C++... putting everything altogether: ¿¿¿Fortran???
- Fortran 2003 introduces C-interoperability that can be used for C++ interoperability... SYCL
- C++ boost::multi_array & others provide à la Fortran arrays
  - Allows triplet notation
  - Can be used from inside SYCL to deal with Fortran-like arrays
- Perhaps the right time to switch your application to modern C++?

OpenCL SYCL 1.2, C++... putting everything altogether: Using SYCL-like models in other areas
- SYCL: generic heterogeneous computing model beyond OpenCL
  - queue expresses where computations happen
  - parallel_for launches computations
  - accessor defines the way we access data
  - buffer for storing data
  - allocator for defining how data are allocated/backed
- Example for HSA: almost direct mapping à la OpenCL
- Example in PiM world
  - Use queue to run on some PiM chips
  - Use allocator to distribute data structures or to allocate buffer in special memory (memory page, chip...)
  - Use accessor to use alternative data access (split address from computation, streaming only, PGAS...)
  - Use pointer_traits to use specific way to interact with memory
  - ...


OpenCL SYCL 2.1...: SYCL 2.1 is coming!
- Skip directly to OpenCL 2.1 and C++14
- Kernel side enqueue
- Shared memory between host and accelerator
- Parallel STL C++17
- Array TS

OpenCL SYCL 2.1...: SYCL and fine-grain system shared memory (OpenCL 2)

    #include <CL/sycl.hpp>
    #include <iostream>
    #include <vector>
    using namespace cl::sycl;
    int main() {
      std::vector<int> a { 1, 2, 3 };
      std::vector<int> b { 5, 6, 8 };
      std::vector<int> c(a.size());
      // Enqueue a parallel kernel
      parallel_for(a.size(), [&](int index) {
          c[index] = a[index] + b[index];
        });
      // Since there is no queue or no accessor, we assume parallel_for are blocking kernels
      std::cout << std::endl << "Result:" << std::endl;
      for (auto e : c)
        std::cout << e << " ";
      std::cout << std::endl;
      return 0;
    }

- Very close to OpenMP simplicity
- Can still use buffers & accessors for compatibility & finer control (task graph, optimizations...)
  - SYCL can remove the copy when possible


Conclusion
- Many C++ frameworks to leverage OpenCL
  - None of them provides seamless single source
    - Require some kind of macros & weird syntax
  - But they should be preferred to plain OpenCL C for productivity
- SYCL provides seamless single source with OpenCL
