Programming On Parallel Machines - UC Davis


Programming on Parallel Machines

Norm Matloff
University of California, Davis

GPU, Multicore, Clusters and More

See Creative Commons license at http://heather.cs.ucdavis.edu/~matloff/probstatbook.html

This book is often revised and updated; the latest edition is always available at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf.

CUDA and NVIDIA are registered trademarks.

The author has striven to minimize the number of errors, but no guarantee is made as to the accuracy of the contents of this book.

Author's Biographical Sketch

Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and was formerly a professor of statistics at that university. He is a former database software developer in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente Health Plan.

Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley. He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He has published numerous papers in computer science and statistics, with current research interests in parallel processing, statistical computing, and regression methodology.

Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international committee concerned with database software security, established under UNESCO. He was a founding member of the UC Davis Department of Statistics, and participated in the formation of the UCD Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching Award and Distinguished Public Service Award at UC Davis.

Dr. Matloff is the author of two published textbooks, and of a number of widely-used Web tutorials on computer topics, such as the Linux operating system and the Python programming language. He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse. Prof. Matloff's book on the R programming language, The Art of R Programming, was published in 2011. His book, Parallel Computation for Data Science, came out in 2015. His current book project, From Linear Models to Machine Learning: Predictive Insights through R, will be published in 2016. He is also the author of several open-source textbooks, including From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook) and Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.pdf).

About This Book

Why is this book different from all other parallel programming books? It is aimed more at the practical end of things, in that:

- There is very little theoretical content, such as O() analysis, maximum theoretical speedup, PRAMs, directed acyclic graphs (DAGs) and so on.
- Real code is featured throughout.
- We use the main parallel platforms (OpenMP, CUDA and MPI), rather than languages that at this stage are largely experimental or arcane.
- The running performance themes (communications latency, memory/network contention, load balancing and so on) are interleaved throughout the book, discussed in the context of specific platforms or applications.
- Considerable attention is paid to techniques for debugging.

The main programming language used is C/C++, but some of the code is in R, which has become the pre-eminent language for data analysis. As a scripting language, R can be used for rapid prototyping. In our case here, it enables me to write examples which are much less cluttered than they would be in C/C++, and thus easier for students to discern the fundamental parallelization principles involved. For the same reason, it makes it easier for students to write their own parallel code, focusing on those principles. And R has a rich set of parallel libraries (a brief illustration appears at the end of this section).

It is assumed that the student is reasonably adept in programming, and has math background through linear algebra. An appendix reviews the parts of the latter needed for this book. Another appendix presents an overview of various systems issues that arise, such as process scheduling and virtual memory.

It should be noted that most of the code examples in the book are NOT optimized. The primary emphasis is on simplicity and clarity of the techniques and languages used. However, there is plenty of discussion of factors that affect speed, such as cache coherency issues, network delays, GPU memory structures and so on.

Here's how to get the code files you'll see in this book: The book is set in LaTeX, and the raw .tex files are available at http://heather.cs.ucdavis.edu/~matloff/158/PLN. Simply download the relevant file (the file names should be clear), then use a text editor to trim to the program code of interest.

In order to illustrate for students the fact that research and teaching (should) enhance each other, I occasionally will make brief references here to some of my research work.
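As a small illustration of the conciseness mentioned above, here is a minimal sketch of parallel R (an illustrative snippet, not one of the book's own examples), using only the standard parallel package that ships with R:

    library(parallel)
    cl <- makeCluster(2)                            # launch 2 worker processes
    squares <- parLapply(cl, 1:8, function(i) i^2)  # parallel version of lapply()
    stopCluster(cl)                                 # shut the workers down
    print(unlist(squares))                          # 1 4 9 16 25 36 49 64

In just a few lines, the work is split among the workers and the results are collected, with no explicit thread management; comparable C/C++ code would be considerably longer.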

Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples and so on, and of course fix bugs and improve the exposition. For that reason, it is better to link to the latest version, which will always be at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf, rather than to copy it.

For the same reason, feedback is highly appreciated. I wish to thank Stuart Ambler, Matt Butner, Stuart Hansen, Bill Hsu, Sameer Khan, Mikel McDaniel, Richard Minner, Lars Seeman, Marc Sosnick, and Johan Wikström for their comments. I'm also very grateful to Professor Hsu for making advanced GPU-equipped machines available to me.

You may also be interested in my open source textbook on probability and statistics, at http://heather.cs.ucdavis.edu/probstatbook.

This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States License. Copyright is retained by N. Matloff in all non-U.S. jurisdictions, but permission to use these materials in teaching is still granted, provided the authorship and licensing information here is displayed in each unit. I would appreciate being notified if you use this book for teaching, just so that I know the materials are being put to use, but this is not required.

Contents

1 Introduction to Parallel Processing
  1.1 Platforms
    1.1.1 Why R?
  1.2 Why Use Parallel Systems?
    1.2.1 Execution Speed
    1.2.2 Memory
    1.2.3 Distributed Processing
    1.2.4 Our Focus Here
  1.3 Parallel Processing Hardware
    1.3.1 Shared-Memory Systems
      1.3.1.1 Basic Architecture
      1.3.1.2 Multiprocessor Topologies
      1.3.1.3 Memory Issues Etc.
    1.3.2 Message-Passing Systems
      1.3.2.1 Basic Architecture
      1.3.2.2 Example: Clusters
    1.3.3 SIMD
  1.4 Programmer World Views
    1.4.1 Example: Matrix-Vector Multiply
    1.4.2 Shared-Memory
      1.4.2.1 Programmer View
  1.5 Shared Memory Programming
    1.5.1 Low-Level Threads Systems: Pthreads
      1.5.1.1 Pthreads Example: Finding Primes
    1.5.2 Compiling and Linking
    1.5.3 The Notion of a "Critical Section"
    1.5.4 Role of the OS
    1.5.5 Debugging Threads Programs
  1.6 A Note on Global Variables
    1.6.1 Higher-Level Threads Programming: OpenMP
      1.6.1.1 Example: Sampling Bucket Sort
      1.6.1.2 Debugging OpenMP
    1.6.2 Message Passing
      1.6.2.1 Programmer View
      1.6.2.2 Example: MPI Prime Numbers Finder
    1.6.3 Scatter/Gather
      1.6.3.1 R snow Package
  1.7 Threads Programming in R: Rdsm
    1.7.1 Example: Matrix Multiplication
    1.7.2 Example: Maximal Burst in a Time Series

2 Recurring Performance Issues
  2.1 Communication Bottlenecks
  2.2 Load Balancing
  2.3 "Embarrassingly Parallel" Applications
    2.3.1 What People Mean by "Embarrassingly Parallel"
    2.3.2 Iterative Algorithms
  2.4 Static (But Possibly Random) Task Assignment Typically Better Than Dynamic
    2.4.1 Example: Matrix-Vector Multiply
    2.4.2 Load Balance, Revisited
    2.4.3 Example: Mutual Web Outlinks
    2.4.4 Work Stealing
    2.4.5 Timing Example
    2.4.6 Example: Extracting a Submatrix
  2.5 Latency and Bandwidth
  2.6 Relative Merits: Performance of Shared-Memory Vs. Message-Passing
  2.7 Memory Allocation Issues
  2.8 Issues Particular to Shared-Memory Systems

3 Shared Memory Parallelism
  3.1 What Is Shared?
  3.2 Memory Modules
    3.2.1 Interleaving
    3.2.2 Bank Conflicts and Solutions
    3.2.3 Example: Code to Implement Padding
  3.3 Interconnection Topologies
    3.3.1 SMP Systems
    3.3.2 NUMA Systems
    3.3.3 NUMA Interconnect Topologies
      3.3.3.1 Crossbar Interconnects
      3.3.3.2 Omega (or Delta) Interconnects
    3.3.4 Comparative Analysis
    3.3.5 Why Have Memory in Modules?
  3.4 Synchronization Hardware
    3.4.1 Test-and-Set Instructions
      3.4.1.1 LOCK Prefix on Intel Processors
      3.4.1.2 Locks with More Complex Interconnects
    3.4.2 May Not Need the Latest
    3.4.3 Fetch-and-Add Instructions
  3.5 Cache Issues
    3.5.1 Cache Coherency
    3.5.2 Example: the MESI Cache Coherency Protocol
    3.5.3 The Problem of "False Sharing"
  3.6 Memory-Access Consistency Policies
  3.7 Fetch-and-Add Combining within Interconnects
  3.8 Multicore Chips
  3.9 Optimal Number of Threads
  3.10 Processor Affinity
  3.11 Illusion of Shared-Memory through Software
    3.11.1 Software Distributed Shared Memory
    3.11.2 Case Study: JIAJIA
  3.12 Barrier Implementation
    3.12.1 A Use-Once Version
    3.12.2 An Attempt to Write a Reusable Version
    3.12.3 A Correct Version
    3.12.4 Refinements
      3.12.4.1 Use of Wait Operations
      3.12.4.2 Parallelizing the Barrier Operation
        3.12.4.2.1 Tree Barriers
        3.12.4.2.2 Butterfly Barriers

4 Introduction to OpenMP
  4.1 Overview
  4.2 Example: Dijkstra Shortest-Path Algorithm
    4.2.1 The Algorithm
    4.2.2 The OpenMP parallel Pragma
    4.2.3 Scope Issues
    4.2.4 The OpenMP single Pragma
    4.2.5 The OpenMP barrier Pragma
    4.2.6 Implicit Barriers
    4.2.7 The OpenMP critical Pragma
  4.3 The OpenMP for Pragma
    4.3.1 Example: Dijkstra with Parallel for Loops
    4.3.2 Nested Loops
    4.3.3 Controlling the Partitioning of Work to Threads: the schedule Clause
    4.3.4 Example: In-Place Matrix Transpose
    4.3.5 The OpenMP reduction Clause
  4.4 Example: Mandelbrot Set
  4.5 The Task Directive
    4.5.1 Example: Quicksort
  4.6 Other OpenMP Synchronization Issues
    4.6.1 The OpenMP atomic Clause
    4.6.2 Memory Consistency and the flush Pragma
  4.7 Combining Work-Sharing Constructs
  4.8 The Rest of OpenMP
  4.9 Compiling, Running and Debugging OpenMP Code
    4.9.1 Compiling
    4.9.2 Running
    4.9.3 Debugging
  4.10 Performance
    4.10.1 The Effect of Problem Size
    4.10.2 Some Fine Tuning
    4.10.3 OpenMP Internals
  4.11 Example: Root Finding
  4.12 Example: Mutual Outlinks
  4.13 Example: Transforming an Adjacency Matrix
  4.14 Example: Finding the Maximal Burst in a Time Series
  4.15 Locks with OpenMP
  4.16 Other Examples of OpenMP Code in This Book

5 Introduction to GPU Programming with CUDA
  5.1 Overview
  5.2 Some Terminology
  5.3 Example: Calculate Row Sums
  5.4 Understanding the Hardware Structure
    5.4.1 Processing Units
    5.4.2 Thread Operation
      5.4.2.1 SIMT Architecture
      5.4.2.2 The Problem of Thread Divergence
      5.4.2.3 "OS in Hardware"
    5.4.3 Memory Structure
      5.4.3.1 Shared and Global Memory
      5.4.3.2 Global-Memory Performance Issues
      5.4.3.3 Shared-Memory Performance Issues
      5.4.3.4 Host/Device Memory Transfer Performance Issues
      5.4.3.5 Other Types of Memory
    5.4.4 Threads Hierarchy
    5.4.5 What's NOT There
  5.5 Synchronization, Within and Between Blocks
  5.6 More on the Blocks/Threads Tradeoff
  5.7 Hardware Requirements, Installation, Compilation, Debugging
  5.8 Example: Improving the Row Sums Program
  5.9 Example: Finding the Mean Number of Mutual Outlinks
  5.10 Example: Finding Prime Numbers
  5.11 Example: Finding Cumulative Sums
  5.12 When Is It Advantageous to Use Shared Memory
  5.13 Example: Transforming an Adjacency Matrix
  5.14 Error Checking
  5.15 Loop Unrolling
  5.16 Short Vectors
  5.17 Newer Generations
    5.17.1 True Caching
    5.17.2 Unified Memory
  5.18 CUDA from a Higher Level
    5.18.1 CUBLAS
      5.18.1.1 Example: Row Sums Once Again
    5.18.2 Thrust
    5.18.3 CUFFT
  5.19 Other CUDA Examples in This Book

6 Introduction to Thrust Programming
  6.1 Compiling Thrust Code
    6.1.1 Compiling to CUDA
    6.1.2 Compiling to OpenMP
  6.2 Example: Counting the Number of Unique Values in an Array
  6.3 Example: A Plain-C Wrapper for Thrust sort()
  6.4 Example: Calculating Percentiles in an Array
  6.5 Mixing Thrust and CUDA Code
  6.6 Example: Doubling Every kth Element of an Array
  6.7 Scatter and Gather Operations
    6.7.1 Example: Matrix Transpose
  6.8 Advanced ("Fancy") Iterators
    6.8.1 Example: Matrix Transpose Again
  6.9 A Timing Comparison
  6.10 Example: Transforming an Adjacency Matrix
  6.11 Prefix Scan
  6.12 More on Use of Thrust for a CUDA Backend
    6.12.1 Synchronicity
  6.13 Error Messages
  6.14 Other Examples of Thrust Code in This Book

7 Message Passing Systems
  7.1 Overview
  7.2 A Historical Example: Hypercubes
    7.2.1 Definitions
  7.3 Networks of Workstations (NOWs)
    7.3.1 The Network Is Literally the Weakest Link
    7.3.2 Other Issues
  7.4 Scatter/Gather Operations

8 Introduction to MPI
  8.1 Overview
    8.1.1 History
    8.1.2 Structure and Execution
    8.1.3 Implementations
    8.1.4 Performance Issues
  8.2 Review of Earlier Example
  8.3 Example: Dijkstra Algorithm
    8.3.1 The Algorithm
    8.3.2 The MPI Code
    8.3.3 Introduction to MPI APIs
      8.3.3.1 MPI_Init() and MPI_Finalize()
      8.3.3.2 MPI_Comm_size() and MPI_Comm_rank()
      8.3.3.3 MPI_Send()
      8.3.3.4 MPI_Recv()
  8.4 Example: Removing 0s from an Array
  8.5 Debugging MPI Code
  8.6 Collective Communications
    8.6.1 Example: Refined Dijkstra Code
    8.6.2 MPI_Bcast()
    8.6.3 MPI_Reduce()/MPI_Allreduce()
    8.6.4 MPI_Gather()/MPI_Allgather()
    8.6.5 The MPI_Scatter()
    8.6.6 Example: Count the Number of Edges in a Directed Graph
    8.6.7 Example: Cumulative Sums
    8.6.8 Example: an MPI Solution to the Mutual Outlinks Problem
    8.6.9 The MPI_Barrier()
    8.6.10 Creating Communicators
  8.7 Buffering, Synchrony and Related Issues
    8.7.1 Buffering, Etc.
    8.7.2 Safety
    8.7.3 Living Dangerously
    8.7.4 Safe Exchange Operations
  8.8 Use of MPI from Other Languages
  8.9 Other MPI Examples in This Book

9 MapReduce Computation
  9.1 Apache Hadoop
    9.1.1 Hadoop Streaming
    9.1.2 Example: Word Count
    9.1.3 Running the Code
    9.1.4 Analysis of the Code
    9.1.5 Role of Disk Files
  9.2 Other MapReduce Systems
  9.3 R Interfaces to MapReduce Systems
  9.4 An Alternative: "Snowdoop"
    9.4.1 Example: Snowdoop Word Count
    9.4.2 Example: Snowdoop k-Means Clustering

10 The Parallel Prefix Problem
  10.1 Example: Permutations
  10.2 General Strategies for Parallel Scan Computation
  10.3 Implementations
  10.4 Example: Parallel Prefix Summing in OpenMP
  10.5 Example: Run-Length Coding Decompression in OpenMP
  10.6 Example: Run-Length Coding Decompression in Thrust
  10.7 Example: Moving Average
    10.7.1 Rth Code
    10.7.2 Algorithm
    10.7.3 Use of Lambda Functions

11 Introduction to Parallel Matrix Operations
  11.1 "We're Not in Physicsland Anymore, Toto"
  11.2 Partitioned Matrices
  11.3 Parallel Matrix Multiplication
    11.3.1 Message-Passing Case
      11.3.1.1 Fox's Algorithm
      11.3.1.2 Performance Issues
