Habanero-Java: Multicore Programming For The Masses

Transcription

Habanero-Java: Multicore Programming for the Masses!
PPoPP 2014 Tutorial
Shams Imam, Vivek Sarkar
Department of Computer Science, Rice University
{shams,vsarkar}@rice.edu
15 February 2014

Motivation: With Multicore Processors and Cloud Computing, all Computers are Parallel Computers!

PPoPP 2014 Tutorial (S. Imam, V. Sarkar)

Habanero-Java library
- Pure Java library without any other dependencies
- Uses Java 8 Lambda Expressions
  - Required for terse syntax
  - Early access release downloads of Java 8 available
  - Target release of Java 8 is March 2014

Context: Rice Habanero Extreme Scale Software Project
Portable execution model for parallel applications:
1) Lightweight asynchronous tasks and data transfers
   - Creation: async tasks, future tasks, data-driven tasks
   - Termination: finish, future get, await
   - Data Transfers: asyncPut, asyncGet, asyncISend, asyncIRecv
2) Locality control for task and data distribution
   - Task Distributions: hierarchical places
   - Data Distributions: hierarchical places, global name space
3) Inter-task synchronization operations
   - Mutual exclusion: isolated, actors
   - Collective and point-to-point synchronization
Two-level programming model:
- Declarative Coordination Languages for Domain Experts: CnC-HC, CnC-Java, CnC-Python, CnC-Matlab
- Task-Parallel Languages for Parallelism-aware Developers: Habanero-C, Habanero-Java, Habanero-Scala
Supporting components: Habanero Static Compiler & Parallel Intermediate Representation, Habanero Runtime System, Extreme Scale Platforms
http://habanero.rice.edu

Outline
Part 1 (Parallelism)
- Task and Loop-level Parallelism (Async, Finish, Forall)
- Futures
- Data-driven Tasks
- Data Races and Determinism
BREAK
Part 2 (Concurrency)
- Global and Object-based Isolation
- Actors
- HJ-Lib Implementation

What is Parallel Programming?
- Specification of operations that can be executed in parallel
- A parallel program is decomposed into sequential subcomputations called tasks
- Parallel programming constructs define task creation, termination, and interaction
(figure: schematic of a dual-core processor — Task A and Task B on Core 0 and Core 1, each with an L1 cache, sharing a bus and an L2 cache)

Example of a Sequential Program: Computing the sum of array elements

  int sum = 0;
  for (int i = 0; i < X.length; i++)
    sum += X[i];

Observations:
- The decision to sum up the elements from left to right was arbitrary
- The computation graph shows that all operations must be executed sequentially

Parallelization Strategy for two cores (Two-way Parallel Array Sum)
- Task 0: Compute sum of lower half of array
- Task 1: Compute sum of upper half of array
- Compute total sum
Basic idea:
- Decompose problem into two tasks for partial sums
- Combine results to obtain final answer
- Parallel divide-and-conquer pattern

Async and Finish Constructs for Task Creation and Termination

async S: Creates a new child task that executes statement S.
finish S: Execute S, but wait until all asyncs in S's scope have terminated.

  // T0 (Parent task)
  STMT0;
  finish {        // Begin finish
    async {
      STMT1;      // T1 (Child task)
    }
    STMT2;        // Continue in T0
  }               // End finish: wait for T1
  STMT3;          // Continue in T0

(figure: T0 executes STMT0, forks T1 at the async, continues with STMT2, joins with T1 at end-finish, then executes STMT3)
Acknowledgments: X10 and Habanero projects

Two-way Parallel Array Sum using HJ-Lib's finish & async APIs

  // Start of Task T0 (main program)
  sum1 = 0; sum2 = 0; // sum1 & sum2 are static fields
  finish(() -> {
    async(() -> {
      // Child task computes sum of lower half of array
      for (int i = 0; i < X.length/2; i++) sum1 += X[i];
    });
    // Parent task computes sum of upper half of array
    for (int i = X.length/2; i < X.length; i++) sum2 += X[i];
  });
  // Parent task waits for child task to complete (join)
  return sum1 + sum2;
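The finish/async pattern above needs the HJ-lib jar. As a rough standard-Java sketch (our own analogue, not part of HJ-lib), the same structure can be expressed with a Thread for the child task and join() playing the role of the end-finish:

```java
import java.util.stream.IntStream;

public class TwoWaySum {
    static long sum1 = 0, sum2 = 0; // static fields, as in the HJ example

    static long parallelSum(int[] X) {
        sum1 = 0; sum2 = 0;
        // "async": child task computes the sum of the lower half
        Thread child = new Thread(() -> {
            for (int i = 0; i < X.length / 2; i++) sum1 += X[i];
        });
        child.start();
        // Parent task computes the sum of the upper half
        for (int i = X.length / 2; i < X.length; i++) sum2 += X[i];
        // "end-finish": parent waits for the child before reading sum1
        try { child.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum1 + sum2;
    }

    public static void main(String[] args) {
        int[] X = IntStream.rangeClosed(1, 100).toArray();
        System.out.println(parallelSum(X)); // 5050
    }
}
```

There is no data race here because each static field is written by exactly one task, and the join() orders the child's writes before the parent's reads.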

Java 8 Lambda Expressions
- Behave like anonymous classes; lambda expressions can capture local variables
- Rely on Functional Interfaces: one abstract method, and you can omit the name of that method when you implement it
Example:

  public interface Runnable { void run(); }
  void invoke(Runnable r) {
    r.run();
  }
  invoke(() -> {
    System.out.println("Inside Runnable.run!");
  });

Computation Graphs
A Computation Graph (CG) captures the dynamic execution of a parallel program, for a specific input.
- CG nodes are "steps" in the program's execution
  - A step is a sequential subcomputation without any async, begin-finish, or end-finish operations
- CG edges represent ordering constraints
  - "Continue" edges define sequencing of steps within a task
  - "Spawn" edges connect parent tasks to child async tasks
  - "Join" edges connect the end of each async task to its IEF's end-finish operation
- All computation graphs must be acyclic
  - It is not possible for a node to depend on itself
- Computation graphs are examples of "directed acyclic graphs" (dags)

Which statements can potentially be executed in parallel with each other?

  finish { // F1
    async A;
    finish { // F2
      async B1;
      async B2;
    } // F2
    B3;
  } // F1

Computation Graph: F1-start spawns A and continues to F2-start; F2-start spawns B1 and B2, which join at F2-end; B3 follows F2-end; A and B3 join at F1-end.

Complexity Measures for Computation Graphs
Define:
- TIME(N) = execution time of node N
- WORK(G) = sum of TIME(N), for all nodes N in CG G
  - WORK(G) is the total work to be performed in G
- CPL(G) = length of a longest path in CG G, when adding up the execution times of all nodes in the path
  - Such paths are called critical paths
  - CPL(G) is the length of these paths (critical path length)
  - CPL(G) is also the smallest possible execution time for the computation graph

What is the critical path length of this parallel computation?

  finish { // F1
    async A;    // Boil pasta
    finish { // F2
      async B1; // Chop veggies
      async B2; // Brown meat
    } // F2
    B3;         // Make pasta sauce
  } // F1

Ideal Parallelism
- Define the ideal parallelism of Computation Graph G as the ratio WORK(G)/CPL(G)
- Ideal parallelism is independent of the number of processors that the program executes on, and only depends on the computation graph
Example (figure: a dag whose node execution times sum to 26, with a longest path of length 11):
  WORK(G) = 26
  CPL(G) = 11
  Ideal Parallelism = WORK(G)/CPL(G) = 26/11 ≈ 2.36
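Both metrics can be computed mechanically from a dag. A small sketch (our own illustration — the dag encoding as a predecessor list with per-node times is an assumption, not an HJ-lib API): WORK is the sum of node times, and CPL is a longest-path dynamic program over nodes visited in topological order.

```java
import java.util.List;

public class CgMetrics {
    // WORK(G): total work = sum of all node execution times
    static long work(int[] time) {
        long w = 0;
        for (int t : time) w += t;
        return w;
    }

    // CPL(G): longest weighted path; nodes are assumed numbered in
    // topological order, preds.get(i) lists the predecessors of node i
    static long cpl(int[] time, List<List<Integer>> preds) {
        int n = time.length;
        long[] finish = new long[n]; // longest path ending at node i, inclusive
        long best = 0;
        for (int i = 0; i < n; i++) {
            long start = 0;
            for (int p : preds.get(i)) start = Math.max(start, finish[p]);
            finish[i] = start + time[i];
            best = Math.max(best, finish[i]);
        }
        return best;
    }

    public static void main(String[] args) {
        // Diamond dag: 0 -> {1, 2} -> 3, with times 1, 4, 2, 1
        int[] time = {1, 4, 2, 1};
        List<List<Integer>> preds =
            List.of(List.of(), List.of(0), List.of(0), List.of(1, 2));
        System.out.println("WORK = " + work(time)); // 8
        System.out.println("CPL  = " + cpl(time, preds)); // 1 + 4 + 1 = 6
    }
}
```

For this diamond the ideal parallelism is 8/6 ≈ 1.33: the long node on one branch dominates the critical path.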

HJ-Lib Abstract Performance Metrics
- Basic Idea
  - Count operations of interest, as in big-O analysis
  - Abstraction ignores overheads that occur on real systems
- Calls to doWork()
  - Programmer inserts calls of the form doWork(N) within a step to indicate abstract execution of N application-specific abstract operations, e.g., adds, compares, stencil ops, data structure ops
  - Multiple calls add to the execution time of the step
- Enabling abstract metrics via a system property: ….propertyKey(), "true");
- If an HJ program is executed with this option, abstract metrics are printed at end of program execution with WORK(G), CPL(G), Ideal Speedup = WORK(G)/CPL(G)

seq clause for async statements (pseudocode)

  // Async task
  async seq(size < thresholdSize) computeSum(X, lo, mid);

  // Future example
  final future<int> sum1 = future seq(size < thresholdSize)
    { return computeSum(X, lo, mid); };

The "seq" clause specifies the condition under which the async should be executed sequentially.

Parallel Quicksort using asyncSeq

  protected static void quicksort(
      final Comparable[] A, final int M, final int N) {
    if (M < N) {
      // A point in HJ is an integer tuple
      HjPoint p = partition(A, M, N);
      int I = p.get(0);
      int J = p.get(1);
      asyncSeq(I - M < 100, () -> quicksort(A, M, I));
      asyncSeq(N - J < 100, () -> quicksort(A, J, N));
    }
  }
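The same "run small subproblems sequentially" pattern can be sketched in standard Java with the fork/join framework (our own analogue of asyncSeq, not HJ-lib code; the threshold of 100 mirrors the slide, and the partition helper is a plain Lomuto partition assumed for illustration):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class PQuicksort {
    static final int THRESHOLD = 100; // below this size, sort sequentially

    static class SortTask extends RecursiveAction {
        final int[] A; final int lo, hi;
        SortTask(int[] A, int lo, int hi) { this.A = A; this.lo = lo; this.hi = hi; }

        protected void compute() {
            if (hi - lo < THRESHOLD) { // asyncSeq condition: small range
                Arrays.sort(A, lo, hi + 1);
                return;
            }
            int p = partition(A, lo, hi);
            // Fork both halves; small halves hit the threshold above
            invokeAll(new SortTask(A, lo, p - 1), new SortTask(A, p + 1, hi));
        }
    }

    // Lomuto partition: returns final pivot index
    static int partition(int[] A, int lo, int hi) {
        int pivot = A[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (A[j] < pivot) { int t = A[i]; A[i] = A[j]; A[j] = t; i++; }
        int t = A[i]; A[i] = A[hi]; A[hi] = t;
        return i;
    }

    public static void sort(int[] A) {
        ForkJoinPool.commonPool().invoke(new SortTask(A, 0, A.length - 1));
    }
}
```

Usage: `PQuicksort.sort(myArray);` sorts the array in place, spawning tasks only for ranges of at least 100 elements.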

Sequential Algorithm for Matrix Multiplication

  c[i][j] = sum of a[i][k] * b[k][j] over 0 <= k < n

  // Sequential version
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      c[i][j] = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];
  // Print first element of output matrix
  System.out.println(c[0][0]);

Parallelizing the loops in Matrix Multiplication example using finish & async

  c[i][j] = sum of a[i][k] * b[k][j] over 0 <= k < n

  // Parallel version using finish & async
  finish(() -> {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        async(() -> { c[i][j] = 0; });
  });
  finish(() -> {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        async(() -> {
          for (int k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
        });
  });
  // Print first element of output matrix

Observations on finish-for-async version
- finish and async are general constructs, and are not specific to loops
  - Not easy to discern from a quick glance which loops are sequential vs. parallel
- Loops in sequential version of matrix multiplication are "perfectly nested"
  - e.g., no intervening statement between "for (i ...)" and "for (j ...)"
- The ordering of loops nested between finish and async is arbitrary
  - They are parallel loops and their iterations can be executed in any order

Parallelizing the loops in Matrix Multiplication example using forall

  c[i][j] = sum of a[i][k] * b[k][j] over 0 <= k < n

  // Parallel version using forall
  forall(0, n-1, 0, n-1, (i, j) -> {
    c[i][j] = 0;
  });
  forall(0, n-1, 0, n-1, (i, j) -> {
    forseq(0, n-1, (k) -> {
      c[i][j] += a[i][k] * b[k][j];
    });
  });
  // Print first element of output matrix
  System.out.println(c[0][0]);
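The key property the forall version exploits is that every (i, j) cell is independent while the k reduction must stay sequential. A plain-Java sketch of the same idea (our own analogue using parallel streams, not the HJ-lib forall API):

```java
import java.util.stream.IntStream;

public class MatMul {
    // Each row i is computed by an independent parallel-stream iteration
    // (the "forall" part); the k loop stays sequential (the "forseq" part).
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int k = 0; k < n; k++) s += a[i][k] * b[k][j];
                c[i][j] = s; // each (i, j) cell is written by exactly one task
            }
        });
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[0][1]); // 19.0 22.0
        System.out.println(c[1][0] + " " + c[1][1]); // 43.0 50.0
    }
}
```

This version is race-free for the same reason as the forall version: no two parallel iterations write the same location.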

forall APIs in HJlib (http://www.cs.rice.edu/~vs3/hjlib/doc/edu/rice/hj/Module1.html)

  static void forall(edu.rice.hj.api.HjRegion.HjRegion1D hjRegion, edu.rice.hj.api.HjProcedureInt1D body)
  static void forall(edu.rice.hj.api.HjRegion.HjRegion2D hjRegion, edu.rice.hj.api.HjProcedureInt2D body)
  static void forall(edu.rice.hj.api.HjRegion.HjRegion3D hjRegion, edu.rice.hj.api.HjProcedureInt3D body)
  static void forall(int s0, int e0, edu.rice.hj.api.HjProcedure<java.lang.Integer> body)
  static void forall(int s0, int e0, int s1, int e1, edu.rice.hj.api.HjProcedureInt2D body)
  static <T> void forall(java.lang.Iterable<T> iterable, edu.rice.hj.api.HjProcedure<T> body)

NOTE: all forall APIs include an implicit finish. forasync is like forall, but without the finish.

Observations on forall version
- The combination of perfectly nested for-for-async constructs is replaced by a single API, forall
- Multiple loops can be collapsed into a single forall with a multidimensional iteration space (can be 1D, 2D, 3D, ...)
- The iteration variable for a forall is an HjPoint (integer tuple), e.g., (i, j)
- The loop bounds can be specified as a rectangular HjRegion (product of dimension ranges), e.g., (0:n-1) x (0:n-1)
- HJlib also provides a sequential for API that can be used to iterate sequentially over a rectangular region
  - Simplifies conversion between for and forall

forall examples: updates to a two-dimensional Java array

  // Case 1: loops i, j can run in parallel
  forall(0, m-1, 0, n-1, (i, j) -> { A[i][j] = F(A[i][j]); });

  // Case 2: only loop i can run in parallel
  forall(1, m-1, (i) -> {
    forseq(1, n-1, (j) -> { // Equivalent to "for (j = 1; j < n; j++)"
      A[i][j] = F(A[i][j-1]);
    });
  });

  // Case 3: only loop j can run in parallel
  forseq(1, m-1, (i) -> { // Equivalent to "for (i = 1; i < m; i++)"
    forall(1, n-1, (j) -> {
      A[i][j] = F(A[i-1][j]);
    });
  });

What about overheads?
- We learned in Lecture 10 that it is inefficient to create async tasks that do little work
- The "seq" clause doesn't help in this case because it will just sequentialize the entire forasync loop
- An alternate approach is "loop chunking"
  - e.g., replace

      forall(0, 99, (i) -> BODY(i)); // 100 tasks

    by

      forall(0, 3, (ii) -> { // 4 tasks
        // Each task executes a "chunk" of 25 iterations
        forseq(25*ii, 25*(ii+1)-1, (i) -> BODY(i));
      });

forallChunked APIs

  forallChunked(int s0, int e0, int chunkSize, edu.rice.hj.api.HjProcedure<java.lang.Integer> body)

- Like forall(int s0, int e0, edu.rice.hj.api.HjProcedure<java.lang.Integer> body), but forallChunked includes chunkSize as the third parameter
  - e.g., replace

      forall(0, 99, (i) -> BODY(i)); // 100 tasks

    by

      forallChunked(0, 99, 100/4, (i) -> BODY(i));
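The chunking transformation itself is easy to write out. The sketch below is a hypothetical re-implementation modeled on the HJ API shape (not the hjlib forallChunked), using a parallel stream over chunks and a sequential inner loop within each chunk:

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class Chunking {
    // Run body.accept(i) for i in [0, nIters) using nChunks parallel tasks,
    // each owning one contiguous block of iterations.
    static void forallChunked(int nIters, int nChunks, IntConsumer body) {
        int chunk = (nIters + nChunks - 1) / nChunks; // ceiling division
        IntStream.range(0, nChunks).parallel().forEach(ii -> {
            int lo = ii * chunk, hi = Math.min(nIters, lo + chunk);
            for (int i = lo; i < hi; i++) body.accept(i); // sequential in-chunk
        });
    }

    public static void main(String[] args) {
        AtomicLongArray hits = new AtomicLongArray(100);
        forallChunked(100, 4, i -> hits.incrementAndGet(i));
        // Every iteration executed exactly once, despite only 4 tasks
        boolean ok = true;
        for (int i = 0; i < 100; i++) if (hits.get(i) != 1) ok = false;
        System.out.println(ok ? "all iterations covered once" : "error");
    }
}
```

The point of the check in main is the correctness condition chunking must preserve: the union of the chunks covers the original iteration space with no overlaps.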

Outline
Part 1 (Parallelism)
- Task and Loop-level Parallelism (Async, Finish, Forall)
- Futures
- Data-driven Tasks
- Data Races and Determinism
BREAK
Part 2 (Concurrency)
- Global and Object-based Isolation
- Actors
- HJ-Lib Implementation

Extending Async Tasks with Return Values
Example Scenario in Pseudocode:

  // Parent task creates child async task
  final future<int> container =
    async { return computeSum(X, low, mid); };
  . . .
  // Later, parent examines the return value
  int sum = container.get();

Two issues to be addressed:
1) Distinction between container and value in container (box)
2) Synchronization to avoid race condition in container accesses
(figure: the parent task assigns container = async {...} and later calls container.get(); the child task runs computeSum(...) and its return value is placed in the container)

HJ Futures: Tasks with Return Values

async<T> { Stmt-Block }
- Creates a new child task that executes Stmt-Block, which must terminate with a return statement and return value
- Async expression returns a reference to a container of type future<T>

Expr.get()
- Evaluates Expr, and blocks if Expr's value is unavailable
- Unlike finish, which waits for all tasks in the finish scope, a get() operation only waits for the specified async expression

Example: Two-way Parallel Array Sum using Future Tasks in HJ-Lib

  // Parent Task T1 (main program)
  // Compute sum1 (lower half) and sum2 (upper half) in parallel
  final HjFuture<Integer> sum1 = future(() -> { // Future Task T2
    int sum = 0;
    for (int i = 0; i < X.length/2; i++) sum += X[i];
    return sum;
  });
  final HjFuture<Integer> sum2 = future(() -> { // Future Task T3
    int sum = 0;
    for (int i = X.length/2; i < X.length; i++) sum += X[i];
    return sum;
  });
  // Task T1 waits for Tasks T2 and T3 to complete
  int total = sum1.get() + sum2.get();
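Standard Java's CompletableFuture gives a close analogue of HjFuture (our sketch, not HJ-lib): supplyAsync creates the future task, and join() is the blocking read that plays the role of HjFuture.get().

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class FutureSum {
    static int parallelSum(int[] X) {
        // Future Task "T2": sum of lower half
        CompletableFuture<Integer> sum1 = CompletableFuture.supplyAsync(() -> {
            int s = 0;
            for (int i = 0; i < X.length / 2; i++) s += X[i];
            return s;
        });
        // Future Task "T3": sum of upper half
        CompletableFuture<Integer> sum2 = CompletableFuture.supplyAsync(() -> {
            int s = 0;
            for (int i = X.length / 2; i < X.length; i++) s += X[i];
            return s;
        });
        // Parent blocks until both child tasks complete (no explicit finish)
        return sum1.join() + sum2.join();
    }

    public static void main(String[] args) {
        int[] X = IntStream.rangeClosed(1, 100).toArray();
        System.out.println(parallelSum(X)); // 5050
    }
}
```

As in the HJ version, each partial sum is a local variable of its task and the futures are read only through blocking operations, so no data race is possible.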

Future Task Declarations and Uses
- A variable of type future<T> is a reference to a future object
  - Container for the return value from the future task
  - The reference to the container is also known as a "handle"
- Two operations that can be performed on a variable V1 of type future<T>:
  - Assignment: V1 can be assigned a value of type future<T>
  - Blocking read: V1.get() waits until the future task referred to by V1 has completed, and then propagates the return value

Comparison of Future Task and Regular Async Versions of Two-Way Array Sum
- Future task version initializes two references to future objects, sum1 and sum2, and both are declared as final
- No finish construct needed in this example
  - Instead, the parent task waits for the child tasks by performing sum1.get() and sum2.get()
- Easier to guarantee absence of race conditions in the future task version
  - No race on sum because it is a local variable in tasks T2 and T3
  - No race on future variables sum1 and sum2, because of blocking-read semantics

Reduction Tree Schema for computing Array Sum in parallel (beyond two-way parallelism)
Question: How can we implement this schema using future tasks?

Array Sum using Future Tasks (Sequential version)
Recursive divide-and-conquer pattern:

  static int computeSum(int[] X, int lo, int hi) {
    if (lo > hi) return 0;
    else if (lo == hi) return X[lo];
    else {
      int mid = (lo + hi) / 2;
      final int sum1 = computeSum(X, lo, mid);
      final int sum2 = computeSum(X, mid + 1, hi);
      return sum1 + sum2;
    }
  } // computeSum
  int sum = computeSum(X, 0, X.length - 1); // main program

Array Sum using Future Tasks (two futures per method call)
Recursive divide-and-conquer pattern:

  static int computeSum(int[] X, int lo, int hi) {
    if (lo > hi) return 0;
    else if (lo == hi) return X[lo];
    else {
      int mid = (lo + hi) / 2;
      final HjFuture<Integer> sum1 =
        future(() -> { return computeSum(X, lo, mid); });
      final HjFuture<Integer> sum2 =
        future(() -> { return computeSum(X, mid + 1, hi); });
      // Parent now waits for the container values
      return sum1.get() + sum2.get();
    }
  } // computeSum
  int sum = computeSum(X, 0, X.length - 1); // main program
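The same recursive reduction tree can be sketched with standard Java's fork/join framework (our own analogue, not HJ-lib): fork() plays the role of creating a future task, and join() plays the role of the blocking get().

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class RecSum {
    static class SumTask extends RecursiveTask<Long> {
        final int[] X; final int lo, hi;
        SumTask(int[] X, int lo, int hi) { this.X = X; this.lo = lo; this.hi = hi; }

        protected Long compute() {
            if (lo > hi) return 0L;
            if (lo == hi) return (long) X[lo];
            int mid = (lo + hi) / 2;
            SumTask left = new SumTask(X, lo, mid);
            left.fork();                               // like future(...)
            long right = new SumTask(X, mid + 1, hi).compute();
            return left.join() + right;                // like sum1.get() + sum2.get()
        }
    }

    static long sum(int[] X) {
        return ForkJoinPool.commonPool().invoke(new SumTask(X, 0, X.length - 1));
    }
}
```

One common fork/join refinement over the slide's two-futures pattern is visible here: the parent computes the second half itself instead of forking both halves, which halves the number of spawned tasks.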

Exercise: Why must Future References be declared as final?
Consider the pseudocode below with futures declared as non-final static fields. Is there a possible execution in which a deadlock situation may occur between tasks T1 and T2 with this code (with each task waiting on the other due to get() operations)? Explain why or why not.

  static HjFuture<Integer> f1 = null;
  static HjFuture<Integer> f2 = null;

  void main(String[] args) {
    f1 = future(() -> { return a1(); });
    f2 = future(() -> { return a2(); });
  }

  int a1() { // Task T1
    while (f2 == null); // spin loop
    return f2.get(); // T1 waits for T2
  }

  int a2() { // Task T2
    while (f1 == null); // spin loop
    return f1.get(); // T2 waits for T1
  }

Yes, a deadlock can occur when task T1 performs f2.get() and task T2 performs f1.get().
WARNING: such "spin" loops are an example of bad parallel programming practice in application code. Their semantics depends on the "memory model". In the Java memory model, there is no guarantee that the above spin loops will ever terminate.

Why should Future References be declared as final (contd.)?
Now consider a modified version of the above code in which futures are declared as final local variables (which is permitted in HJ). Can you add get() operations to methods a1() and a2() to create a deadlock between tasks T1 and T2 with this code? Explain why or why not.

  final future f1 =
    async { return a1(); };
  final future f2 =
    async { return a2(f1); };

  int a1() {
    // Task T1 cannot receive a
    // reference to f2
    . . .
  }

  int a2(HjFuture f1) {
    // Task T2 can refer to f1,
    // but that won't cause a deadlock
    . . . f1.get() . . .
  }

No, the final declarations make it impossible for future f1's task (T1) to receive a reference to f2.
Will your answer be different if f1 and f2 are final fields in objects or final static fields? No.

Outline
Part 1 (Parallelism)
- Task and Loop-level Parallelism (Async, Finish, Forall)
- Futures
- Data-driven Tasks
- Data Races and Determinism
BREAK
Part 2 (Concurrency)
- Global and Object-based Isolation
- Actors
- HJ-Lib Implementation

Dataflow Computing
- Original idea: replace machine instructions by a small set of dataflow operators
(figure: fork, primitive ops, switch, and merge dataflow operators, with true/false control inputs)

Example instruction sequence and its dataflow graph

  x = a + b;
  y = b * 7;
  z = (x - y) * (x + y);

An operator executes when all its input values are present; copies of the result value are distributed to the destination operators. No separate control flow.

Macro-Dataflow: Communication via single-assignment variables
- "Macro-dataflow" extension of the dataflow model from instruction-level to task-level operations
- General idea: build an arbitrary task graph, but restrict all inter-task communications to single-assignment variables
- Static dataflow graph: fixed when program execution starts
- Dynamic dataflow graph: can grow dynamically
- Semantic guarantees: race-freedom, determinism
- Deadlocks are possible due to unavailable inputs (but they are deterministic)

Extending HJ Futures for Macro-Dataflow: Data-Driven Futures (DDFs) and Data-Driven Tasks (DDTs)

  HjDataDrivenFuture<T1> ddfA = newDataDrivenFuture();

- Allocate an instance of a data-driven-future object (container)
- Object in container must be of type T1

  asyncAwait(ddfA, ddfB, ..., () -> Stmt);

- Create a new data-driven-task to start executing Stmt after all of ddfA, ddfB, ... become available (i.e., after the task becomes "enabled")

  ddfA.put(V);

- Store object V (of type T1) in ddfA, thereby making ddfA available
- Single-assignment rule: at most one put is permitted on a given DDF

  ddfA.get()

- Return value (of type T1) stored in ddfA
- Can only be performed by asyncs that contain ddfA in their await clause (hence no blocking is necessary for DDF gets)

Implementing Future Tasks using DDFs

  // Future version
  final HjFuture<T> f = future(() -> { return g(); });
  S1;
  async(() -> {
    ... = f.get();
    S2;
    S3;
  });

  // DDF version
  HjDataDrivenFuture<T> f = newDataDrivenFuture();
  async(() -> { f.put(g()); });
  S1;
  asyncAwait(f, () -> {
    ... = f.get();
    S2;
    S3;
  });

Use of DDFs with dummy objects (like future<Void>)

  finish(() -> {
    HjDataDrivenFuture<Void> ddfA = newDataDrivenFuture();
    HjDataDrivenFuture<Void> ddfB = newDataDrivenFuture();
    HjDataDrivenFuture<Void> ddfC = newDataDrivenFuture();
    HjDataDrivenFuture<Void> ddfD = newDataDrivenFuture();
    HjDataDrivenFuture<Void> ddfE = newDataDrivenFuture();
    async(() -> { ... ; ddfA.put(null); }); // Task A
    asyncAwait(ddfA, () -> { ... ; ddfB.put(null); }); // Task B
    asyncAwait(ddfA, () -> { ... ; ddfC.put(null); }); // Task C
    asyncAwait(ddfB, ddfC, () -> { ... ; ddfD.put(null); }); // Task D
    asyncAwait(ddfC, () -> { ... ; ddfE.put(null); }); // Task E
    asyncAwait(ddfD, ddfE, () -> { ... }); // Task F
  }); // finish

This example uses null as a dummy value: the put carries no data, it only signals availability.
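The same A..F dependence graph can be sketched with standard Java's CompletableFuture (our own analogue, not HJ-lib): completing a future plays the role of the dummy put, and a dependent stage starts only when all its input futures are available, like asyncAwait.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class DdfDag {
    static List<String> run() {
        List<String> log = new CopyOnWriteArrayList<>(); // records task order
        // A has no inputs; B, C await A; D awaits B and C; E awaits C;
        // F awaits D and E — exactly the slide's dependence graph.
        CompletableFuture<Void> a = CompletableFuture.runAsync(() -> log.add("A"));
        CompletableFuture<Void> b = a.thenRunAsync(() -> log.add("B"));
        CompletableFuture<Void> c = a.thenRunAsync(() -> log.add("C"));
        CompletableFuture<Void> d = b.runAfterBoth(c, () -> log.add("D"));
        CompletableFuture<Void> e = c.thenRunAsync(() -> log.add("E"));
        CompletableFuture<Void> f = d.runAfterBoth(e, () -> log.add("F"));
        f.join(); // plays the role of the enclosing finish
        return log;
    }

    public static void main(String[] args) {
        List<String> log = run();
        System.out.println(log);
        // A always runs first and F always runs last; B/C and D/E may interleave
    }
}
```

Whatever interleaving occurs between the independent tasks (B vs. C, D vs. E), the dependence edges guarantee A appears first and F last in the log.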

Differences between Futures and DDFs/DDTs
- A consumer task blocks on get() for each future that it reads, whereas an async-await task does not start execution till all its DDFs are available
- Future tasks cannot deadlock, but it is possible for a DDT to block indefinitely ("deadlock") if one of its input DDFs never becomes available
- DDTs and DDFs are more general than futures
  - A producer task can only write to a single future object, whereas a DDT can write to multiple DDF objects
  - The choice of which future object to write to is tied to a future task at creation time, whereas the choice of output DDF can be deferred to any point with a DDT
- DDTs and DDFs can be implemented more efficiently than futures
  - An "asyncAwait" statement does not block the worker, unlike a future.get()
  - You will never see the following message with "asyncAwait": "ERROR: Maximum number of hj threads per place reached"

Two Exception (error) cases for DDFs that do not occur in futures
- Case 1: If two puts are attempted on the same DDF, an exception is thrown because of the violation of the single-assignment rule
  - There can be at most one value provided for a future object (since it comes from the producer task's return statement)
- Case 2: If a get is attempted by a task on a DDF that was not in the task's await list, then an exception is thrown because DDFs do not support blocking gets
  - Futures support blocking gets

Deadlock example with DDTs

  HjDataDrivenFuture left = newDataDrivenFuture();
  HjDataDrivenFuture right = newDataDrivenFuture();
  finish(() -> {
    asyncAwait(left, () -> {
      right.put(rightWriter()); });
    asyncAwait(right, () -> {
      left.put(leftWriter()); });
  });

- HJ-Lib has a deadlock detection mode
- Enabled using: ….propertyKey(), "true");
- Reports an edu.rice.hj.runtime.util.DeadlockException when deadlock detected

Outline
Part 1 (Parallelism)
- Task and Loop-level Parallelism (Async, Finish, Forall)
- Futures
- Data-driven Tasks
- Data Races and Determinism
BREAK
Part 2 (Concurrency)
- Global and Object-based Isolation
- Actors
- HJ-Lib Implementation

Parallel Programming Challenges
- Correctness
  - New classes of bugs can arise in parallel programming, relative to sequential programming: data races, deadlock, nondeterminism
- Performance
  - Performance of a parallel program depends on the underlying parallel system: language compiler and runtime system; processor structure and memory hierarchy; degree of parallelism in program vs. hardware
- Portability
  - A buggy program that runs correctly on one system may not run correctly on another (or even when re-executed on the same system)
  - A parallel program that performs well on one system may perform poorly on another

What happens if we forget a finish?

  // Start of Task T0 (main program)
  sum1 = 0; sum2 = 0; // sum1 & sum2 are static fields
  async(() -> { // Task T1 computes sum of lower half of array
    for (int i = 0; i < X.length/2; i++)
      sum1 += X[i];
  });
  async(() -> { // Task T2 computes sum of upper half of array
    for (int i = X.length/2; i < X.length; i++)
      sum2 += X[i];
  });
  // No finish: Task T0 does not wait for Tasks T1 and T2 (no join)
  return sum1 + sum2;

Data race between accesses of sum1 (and sum2) in the async tasks and in the main program.

Formal Definition of Data Races
A data race occurs on location L in a program execution with computation graph CG if there exist steps (nodes) S1 and S2 in CG such that:
1. S1 does not depend on S2 and S2 does not depend on S1, i.e., there is no path of dependence edges from S1 to S2 or from S2 to S1 in CG, and
2. Both S1 and S2 read or write L, and at least one of the accesses is a write. (L must be a shared location, i.e., a static field, instance field, or array element.)
A program is data-race-free if it cannot exhibit a data race for any input.
The above definition includes all "potential" data races, i.e., it is considered a data race even if S1 and S2 execute on the same processor.

Four Observations related to Data Races
1. Immutability property: there cannot be a data race on shared immutable data.
  - A location, L, is immutable if it is only written during initialization, and can only be read after initialization. In this case, no read can potentially execute in parallel with the write.
Parallel programming tip: use immutable objects and arrays to avoid data races.
  - Will require making copies of objects and arrays for updates
  - Copying overhead may be prohibitive in some cases, but acceptable in others
  - NOTE: future values are also immutable
Example with java.lang.String:

  finish(() -> {
    String s1 = "XYZ";
    async(() -> { String s2 = s1.toLowerCase(); ... });
    System.out.println(s1);
  });

Example of a Mutable Object
If an object is modified, all references to the object see the new value.

  StringBuilder sb = new StringBuilder("hi");
  StringBuilder tb = sb;
  tb.append("gh");

(figure: stack frames hold the references sb and tb, both pointing to the same StringBuilder object in the heap)
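The aliasing can be observed directly: after the append through tb, the new contents are visible through sb as well, because both variables are handles to one heap object.

```java
public class Aliasing {
    static String demo() {
        StringBuilder sb = new StringBuilder("hi");
        StringBuilder tb = sb;   // tb and sb refer to the SAME heap object
        tb.append("gh");         // mutation through one reference...
        return sb.toString();    // ...is visible through the other: "high"
    }

    public static void main(String[] args) {
        System.out.println(demo());      // high
        StringBuilder sb = new StringBuilder("hi");
        StringBuilder tb = sb;
        System.out.println(sb == tb);    // true: one object, two references
    }
}
```

This is exactly why sharing a mutable object between tasks can create a data race, while sharing an immutable String cannot.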

Observations (contd.)
2. Single-task ownership property: there cannot be a data race on a location that is only read or written by a single task.
  - Define: step S in computation graph CG "owns" location L if S performs a read or write access on L. If step S belongs to Task T, we can also say that Task T owns L when executing S.
  - Consider a location L that is only owned by steps that belong to the same task, T. Since all steps in Task T must be connected by continue edges in CG, all reads and writes to L must be ordered by the dependences in CG. Therefore, no data race is possible on location L.
Parallel programming tip: if an object or array needs to be written multiple times after initialization, then try and restrict its ownership to a single task.
  - Will require making copies when sharing the object or array with other tasks

Example of Single-task ownership with Copying
If an object or array needs to be written multiple times after initialization, then try and restrict its ownership to a single task.
- Entails making copies when sharing the object with other tasks
- As with immutability, copying overhead may be prohibitive in some cases, but acceptable in others
Example:

  finish(() -> { // Task T1 owns A
    int[] A = new int[n]; // ... initialize array A ...
    // create a copy of array A in B
    int[] B = new int[A.length];
    System.arraycopy(A, 0, B, 0, A.length);
    async(() -> { // Task T2 owns B
      int sum = computeSum(B, 0, B.length - 1); // Modifies B as in ArraySum1
      System.out.println("sum = " + sum);
    });
    // ... update Array A ...
    System.out.println(Arrays.toString(A)); // printed by task T1
  });

Observations (contd.)
3. Ownership-transfer property: there cannot be a data race on a location if all steps that read or write it are totally ordered in CG (i.e., if the steps belong to a single directed path)
  - Think of the ownership of L as being "transferred" from one step to another, even across task boundaries, as execution follows the path of dependence edges in the total order.
