Spin Locks And Contention - Brown University

Transcription

Spin Locks and ContentionCompanion slides forThe Art of Multiprocessor Programmingby Maurice Herlihy & Nir Shavit

Focus so far: Correctness andProgress Models– Accurate (we never lied to you)– But idealized (so we forgot to mention a few things) Protocols– Elegant– Important– But naïveArt of Multiprocessor Programming2

New Focus: Performance Models– More complicated (not the same as complex!)– Still focus on principles (not soon obsolete) Protocols– Elegant (in their fashion)– Important (why else would we pay attention)– And realistic (your mileage may vary)Art of Multiprocessor Programming3

Kinds of Architectures SISD (Uniprocessor)– Single instruction stream– Single data stream SIMD (Vector)– Single instruction– Multiple data MIMD (Multiprocessors)– Multiple instruction– Multiple data.Art of Multiprocessor Programming4

Kinds of Architectures SISD (Uniprocessor)– Single instruction stream– Single data stream SIMD (Vector)Our space– Single instruction– Multiple data MIMD (Multiprocessors)– Multiple instruction– Multiple data.(1)Art of Multiprocessor Programming5

MIMD ArchitecturesmemoryShared BusDistributed Memory Contention Communication Contention Communication LatencyArt of Multiprocessor Programming6

Today: Revisit Mutual Exclusion Performance, not just correctness Proper use of multiprocessorarchitectures A collection of locking algorithms Art of Multiprocessor Programming7 (1)

What Should you do if you can’tget a lock? Keep trying– “spin” or “busy-wait”– Good if delays are short Give up the processor– Good if delays are long– Always good on uniprocessorArt of Multiprocessor Programming8 (1)

What Should you do if you can’tget a lock? Keep trying– “spin” or “busy-wait”– Good if delays are short Give up the processor– Good if delays are long– Always good on uniprocessorour focusArt of Multiprocessor Programming9

Basic Spin-Lock.CSspinlockcriticalsectionArt of Multiprocessor ProgrammingResets lockupon exit10

Basic Spin-Lock lock introducessequential bottleneck.CSspinlockcriticalsectionArt of Multiprocessor ProgrammingResets lockupon exit11

Basic Spin-Lock lock suffers from contention.CSspinlockcriticalsectionArt of Multiprocessor ProgrammingResets lockupon exit12

Basic Spin-Lock lock suffers from contention.CSspinlockcriticalsectionResets lockupon exitNotice: these are distinctphenomenaArt of Multiprocessor Programming13

Basic Spin-Lock lock suffers from contention.CSspinlockcriticalsectionResets lockupon exitSeq Bottleneck no parallelismArt of Multiprocessor Programming14

Basic Spin-Lock lock suffers from contention.CSspinlockcriticalsectionResets lockupon exitContention ?Art of Multiprocessor Programming15

Review: Test-and-Set Boolean value Test-and-set (TAS)– Swap true with current value– Return value tells if prior value was true orfalse Can reset just by writing false TAS aka “getAndSet”Art of Multiprocessor Programming16

Review: Test-and-Setpublic class AtomicBoolean {boolean value;public synchronized booleangetAndSet(boolean newValue) {boolean prior value;value newValue;return prior;}}Art of Multiprocessor Programming17 (5)

Review: Test-and-Setpublic class AtomicBoolean {boolean value;}public synchronized booleangetAndSet(boolean newValue) {boolean prior value;value newValue;return prior;}Packagejava.util.concurrent.atomicArt of Multiprocessor Programming18

Review: Test-and-Setpublic class AtomicBoolean {boolean value;public synchronized booleangetAndSet(boolean newValue) {boolean prior value;value newValue;return prior;}}Swap old and newvaluesArt of Multiprocessor Programming19

Review: Test-and-SetAtomicBoolean lock new AtomicBoolean(false) boolean prior lock.getAndSet(true)Art of Multiprocessor Programming20

Review: Test-and-SetAtomicBoolean lock new AtomicBoolean(false) boolean prior lock.getAndSet(true)Swapping in true is called“test-and-set” or TASArt of Multiprocessor Programming21 (5)

Test-and-Set Locks Locking– Lock is free: value is false– Lock is taken: value is true Acquire lock by calling TAS– If result is false, you win– If result is true, you lose Release lock by writing falseArt of Multiprocessor Programming22

Test-and-set Lockclass TASlock {AtomicBoolean state new AtomicBoolean(false);void lock() {while (state.getAndSet(true)) {}}void unlock() {state.set(false);}}Art of Multiprocessor Programming23

Test-and-set Lockclass TASlock {AtomicBoolean state new AtomicBoolean(false);void lock() {while (state.getAndSet(true)) {}}void unlock() {state.set(false);Lock state is}}AtomicBooleanArt of Multiprocessor Programming24

Test-and-set Lockclass TASlock {AtomicBoolean state new AtomicBoolean(false);void lock() {while (state.getAndSet(true)) {}}void unlock() {state.set(false);Keep trying until}}lock acquiredArt of Multiprocessor Programming25

Test-and-set Lockclass TASlock {ReleaselockAtomicBooleanstate by resettingnew AtomicBoolean(false);state to falsevoid lock() {while (state.getAndSet(true)) {}}void unlock() {state.set(false);}}Art of Multiprocessor Programming26

Space Complexity TAS spin-lock has small “footprint”N thread spin-lock uses O(1) spaceAs opposed to O(n) Peterson/BakeryHow did we overcome the W(n) lowerbound? We used a RMW operation Art of Multiprocessor Programming27

Performance Experiment– n threads– Increment shared counter 1 million times How long should it take? How long does it take?Art of Multiprocessor Programming28

timeGraphno speedupbecause ofsequentialbottleneckidealthreadsArt of Multiprocessor Programming29

Mystery #1timeTAS lockIdealthreadsArt of Multiprocessor ProgrammingWhat isgoingon?30

Test-and-Test-and-Set Locks Lurking stage– Wait until lock “looks” free– Spin while read returns true (lock taken) Pouncing state– As soon as lock “looks” available– Read returns false (lock free)– Call TAS to acquire lock– If TAS loses, back to lurkingArt of Multiprocessor Programming31

Test-and-test-and-set Lockclass TTASlock {AtomicBoolean state new AtomicBoolean(false);void lock() {while (true) {while (state.get()) {}if (!state.getAndSet(true))return;}}Art of Multiprocessor Programming32

Test-and-test-and-set Lockclass TTASlock {AtomicBoolean state new AtomicBoolean(false);void lock() {while (true) {while (state.get()) {}if (!state.getAndSet(true))return;}}Wait until lock looks freeArt of Multiprocessor Programming33

Test-and-test-and-set Lockclass TTASlock {AtomicBoolean state new AtomicBoolean(false);Then try toacquire itvoid lock() {while (true) {while (state.get()) {}if (!state.getAndSet(true))return;}}Art of Multiprocessor Programming34

Mystery #2TAS locktimeTTAS lockIdealthreadsArt of Multiprocessor Programming35

Mystery Both– TAS and TTAS– Do the same thing (in our model) Except that– TTAS performs much better than TAS– Neither approaches idealArt of Multiprocessor Programming36

Opinion Our memory abstraction is broken TAS & TTAS methods– Are provably the same (in our model)– Except they aren’t (in field tests) Need a more detailed model Art of Multiprocessor Programming37

Bus-Based ArchitecturescachecachecacheBusmemoryArt of Multiprocessor Programming38

Bus-Based ArchitecturesRandom access memory(10s of cycles)cachecachecacheBusmemoryArt of Multiprocessor Programming39

Bus-Based ArchitecturesShared Bus Broadcast medium One broadcaster at a time Processors and memory all“snoop”cachecachecacheBusmemoryArt of Multiprocessor Programming40

Per-Processor CachesBus-Based Small Architectures Fast: 1 or 2 cycles Address & state informationcachecachecacheBusmemoryArt of Multiprocessor Programming41

Granularity Caches operate at a larger granularitythan a word Cache line: fixed-size block containingthe address (today 64 or 128 bytes)Art of Multiprocessor Programming42

Locality If you use an address now, you willprobably use it again soon– Fetch from cache, not memory If you use an address now, you willprobably use a nearby address soon– In the same cache lineArt of Multiprocessor Programming43

L1 and L2 CachesL2L1Art of Multiprocessor Programming44

L1 and L2 CachesL2L1Art of Multiprocessor ProgrammingSmall & fast1 or 2 cycles45

L1 and L2 CachesLarger and slower10s of cycles 128 byte lineL2L1Art of Multiprocessor Programming46

Jargon Watch Cache hit– “I found what I wanted in my cache”– Good Thing Art of Multiprocessor Programming47

Jargon Watch Cache hit– “I found what I wanted in my cache”– Good Thing Cache miss– “I had to shlep all the way to memory forthat data”– Bad Thing Art of Multiprocessor Programming48

Cave Canem This model is still a simplification– But not in any essential way– Illustrates basic principles Will discuss complexities laterArt of Multiprocessor Programming49

When a Cache BecomesFull Need to make room for new entry By evicting an existing entry Need a replacement policy– Usually some kind of least recently usedheuristicArt of Multiprocessor Programming50

Fully Associative Cache Any line can be anywhere in the cache– Advantage: can replace any line– Disadvantage: hard to find linesArt of Multiprocessor Programming51

Direct Mapped Cache Every address has exactly 1 slot– Advantage: easy to find a line– Disadvantage: must replace fixed lineArt of Multiprocessor Programming52

K-way Set Associative Cache Each slot holds k lines– Advantage: pretty easy to find a line– Advantage: some choice in replacing lineArt of Multiprocessor Programming53

Multicore Set Associativity k is 8 or even 16 and growing – Why? Because cores share sets– Threads cut effective size if accessingdifferent dataArt of Multiprocessor Programming54

Cache Coherence A and B both cache address x A writes to x– Updates cache How does B find out? Many cache coherence protocols inliteratureArt of Multiprocessor Programming55

MESI Modified– Have modified cached data, must writeback to memoryArt of Multiprocessor Programming56

MESI Modified– Have modified cached data, must writeback to memory Exclusive– Not modified, I have only copyArt of Multiprocessor Programming57

MESI Modified– Have modified cached data, must writeback to memory Exclusive– Not modified, I have only copy Shared– Not modified, may be cached elsewhereArt of Multiprocessor Programming58

MESI Modified– Have modified cached data, must write back tomemory Exclusive– Not modified, I have only copy Shared– Not modified, may be cached elsewhere Invalid– Cache contents not meaningfulArt of Multiprocessor Programming59

Processor Issues Load Requestload xcachecachecacheBusmemoryArt of Multiprocessor Programmingdata60

Memory RespondsEcachecachecacheBusBusGot it!memoryArt of Multiprocessor Programmingdata61

Processor Issues Load RequestLoad xEdatacachecacheBusmemoryArt of Multiprocessor Programmingdata62

Other Processor RespondsGot itSEdataScachecacheBusmemoryArt of Multiprocessor ProgrammingBusdata63

Modify Cached DataSdataSdatadatacacheBusmemoryArt of Multiprocessor Programmingdata64

Write-Through CacheWrite x!SdataSdatacacheBusmemoryArt of Multiprocessor Programmingdata65

Write-Through Caches Immediately broadcast changes Good– Memory, caches always agree– More read hits, maybe Bad– Bus traffic on all writes– Most writes to unshared data– For example, loop indexes Art of Multiprocessor Programming66

Write-Through Caches Immediately broadcast changes Good“show stoppers”– Memory, caches always agree– More read hits, maybe Bad– Bus traffic on all writes– Most writes to unshared data– For example, loop indexes Art of Multiprocessor Programming67

Write-Back Caches Accumulate changes in cache Write back when line evicted– Need the cache for something else– Another processor wants itArt of Multiprocessor Programming68

ryArt of Multiprocessor Programmingdata69

InvalidatecachedatacacheBusThis cache acquires write permissionmemoryArt of Multiprocessor Programmingdata70

InvalidateOther caches lose read permissioncachedatacacheBusThis cache acquires write permissionmemoryArt of Multiprocessor Programmingdata71

InvalidateMemory provides data only if not presentin any cache, so no need to change itdatacachecache now (expensive)BusmemoryArt of Multiprocessor Programmingdata72

Mutual Exclusion What do we want to optimize?– Bus bandwidth used by spinning threads– Release/Acquire latency– Acquire latency for idle lockArt of Multiprocessor Programming73

Simple TASLock TAS invalidates cache lines Spinners– Miss in cache– Go to bus Thread wants to release lock– delayed behind spinnersArt of Multiprocessor Programming74

Test-and-test-and-set Wait until lock “looks” free– Spin on local cache– No bus use while lock busy Problem: when lock is released– Invalidation storm Art of Multiprocessor Programming75

Local Spinning while Lock isBusybusybusybusyBusmemoryArt of Multiprocessor Programmingbusy76

On ReleaseinvalidinvalidfreeBusmemoryArt of Multiprocessor Programmingfree77

On ReleaseEveryone rt of Multiprocessor Programmingfree78 (1)

On ReleaseEveryone tries TASTAS( )invalidTAS( )invalidfreeBusmemoryArt of Multiprocessor Programmingfree79 (1)

Problems Everyone misses– Reads satisfied sequentially Everyone does TAS– Invalidates others’ caches Eventually quiesces after lock acquired– How long does this take?Art of Multiprocessor Programming80

Measuring Quiescence Time Acquire lock Pause without using bus Use bus heavilyP1P2PnIf pause quiescence time,critical section duration independent of number of threadsIf pause quiescence time,critical section duration slower with more threadsArt of Multiprocessor Programming81

Quiescence TimetimeIncreseslinearly withthe number ofprocessors forbus architecturethreadsArt of Multiprocessor Programming82

Mystery ExplainedTAS locktimeTTAS lockIdealBetter than TASthreads but still not asgood as idealArt of Multiprocessor Programming83

Solution: Introduce Delay If the lock looks free But I fail to get it There must be contention Better to back off than to collide againtimer2dr1dArt of Multiprocessor Programmingdspin lock84

Dynamic Example:Exponential Backofftime4d2ddspin lockIf I fail to get lock– Wait random duration before retry– Each subsequent failure doublesexpected waitArt of Multiprocessor Programming85

Exponential Backoff Lockpublic class Backoff implements lock {public void lock() {int delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;}}}Art of Multiprocessor Programming86

Exponential Backoff Lockpublic class Backoff implements lock {public void lock() {int delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;Fix minimum delay}}}Art of Multiprocessor Programming87

Exponential Backoff Lockpublic class Backoff implements lock {public void lock() {int delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;Wait until lock looks free}}}Art of Multiprocessor Programming88

Exponential Backoff Lockpublic class Backoff implements lock {public void lock() {int delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;If we win, return}}}Art of Multiprocessor Programming89

Exponential Backoff Lockpublic class Backoff implements lock {Backlock()off for{ random durationpublic voidint delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;}}}Art of Multiprocessor Programming90

Exponential Backoff Lockpublic class Backoff implements lock {Doubledelay,within reasonpublicvoid maxlock(){int delay MIN DELAY;while (true) {while (state.get()) {}if (!lock.getAndSet(true))return;sleep(random() % delay);if (delay MAX DELAY)delay 2 * delay;}}}Art of Multiprocessor Programming91

Spin-Waiting OverheadtimeTTAS LockBackoff lockthreadsArt of Multiprocessor Programming92

Backoff: Other Issues Good– Easy to implement– Beats TTAS lock Bad– Must choose parameters carefully– Not portable across platformsArt of Multiprocessor Programming93

Idea Avoid useless invalidations– By keeping a queue of threads Each thread– Notifies next in line– Without bothering the othersArt of Multiprocessor Programming95

Anderson Queue LockidlenextflagsTFFFFArt of Multiprocessor ProgrammingFFF96

Anderson Queue LockacquiringnextgetAndIncrementflagsTFFFFArt of Multiprocessor ProgrammingFFF97

Anderson Queue LockacquiringnextgetAndIncrementflagsTFFFFArt of Multiprocessor ProgrammingFFF98

Anderson Queue LockacquirednextMine!flagsTFFFFArt of Multiprocessor ProgrammingFFF99

Anderson Queue LockacquiringacquirednextflagsTFFFFArt of Multiprocessor ProgrammingFFF100

Anderson Queue Art of Multiprocessor ProgrammingFFF101

Anderson Queue Art of Multiprocessor ProgrammingFFF102

Anderson Queue LockacquiringacquirednextflagsTFFFFArt of Multiprocessor ProgrammingFFF103

Anderson Queue LockacquiredreleasednextflagsTTFFFArt of Multiprocessor ProgrammingFFF104

Anderson Queue LockacquiredreleasednextYow!flagsTTFFFArt of Multiprocessor ProgrammingFFF105

Anderson Queue Lockclass ALock implements Lock {boolean[] flags {true,false, ,false};AtomicInteger next new AtomicInteger(0);ThreadLocal Integer mySlot;Art of Multiprocessor Programming106

Anderson Queue Lockclass ALock implements Lock {boolean[] flags {true,false, ,false};AtomicInteger next new AtomicInteger(0);ThreadLocal Integer mySlot;One flag per threadArt of Multiprocessor Programming107

Anderson Queue Lockclass ALock implements Lock {boolean[] flags {true,false, ,false};AtomicInteger next new AtomicInteger(0);ThreadLocal Integer mySlot;Next flag to useArt of Multiprocessor Programming108

Anderson Queue Lockclass ALock implements Lock {boolean[] flags {true,false, ,false};AtomicInteger next new AtomicInteger(0);ThreadLocal Integer mySlot;Thread-local variableArt of Multiprocessor Programming109

Anderson Queue Lockpublic lock() {mySlot next.getAndIncrement();while (!flags[mySlot % n]) {};flags[mySlot % n] false;}public unlock() {flags[(mySlot 1) % n] true;}Art of Multiprocessor Programming110

Anderson Queue Lockpublic lock() {mySlot next.getAndIncrement();while (!flags[mySlot % n]) {};flags[mySlot % n] false;}public unlock() {flags[(mySlot 1) % n] true;Take next}Art of Multiprocessor Programmingslot111

Anderson Queue Lockpublic lock() {mySlot next.getAndIncrement();while (!flags[mySlot % n]) {};flags[mySlot % n] false;}public unlock() {flags[(mySlot 1) % n] true;Spin until told}Art of Multiprocessor Programmingto go112

Anderson Queue Lockpublic lock() {myslot next.getAndIncrement();while (!flags[myslot % n]) {};flags[myslot % n] false;}public unlock() {flags[(myslot 1) % n] true;}Prepare slot forArt of Multiprocessor Programmingre-use113

Anderson Queue Lockpublic lock() {Tell next thread tomySlot next.getAndIncrement();while (!flags[mySlot % n]) {};flags[mySlot % n] false;}gopublic unlock() {flags[(mySlot 1) % n] true;}Art of Multiprocessor Programming114

Local FFUnfortunately many bits share cache lineArt of Multiprocessor Programming115

False SharingacquiredreleasednextSpinonmybitSpinning threadResult:contentionflagsTFFFFgets cacheinvalidation onFF of storeFaccountby threads it isnot waiting forLine 1Art of Multiprocessor Programming Line 2116

The Solution: Line 1Art of Multiprocessor Programming Line 2//117

PerformanceTTAS Shorter handover thanbackoffqueue Curve is practically flat Scalable performanceArt of Multiprocessor Programming118

Anderson Queue LockGood– First truly scalable lock– Simple, easy to implement– Back to FCFS order (like Bakery)Art of Multiprocessor Programming119

Anderson Queue LockBad– Space hog – One bit per thread one cache lineper thread What if unknown number of threads? What if small number of actualcontenders?Art of Multiprocessor Programming120

CLH Lock FCFS order Small, constant-size overhead perthreadArt of Multiprocessor Programming121

InitiallyidletailfalseArt of Multiprocessor Programming122

InitiallyidletailfalseArt of Multiprocessor ProgrammingQueue tail123

InitiallyidleLock is freetailfalseArt of Multiprocessor Programming124

InitiallyidletailfalseArt of Multiprocessor Programming125

Purple Wants the LockacquiringtailfalseArt of Multiprocessor Programming126

Purple Wants the LockacquiringtailfalsetrueArt of Multiprocessor Programming127

Purple Wants the LockacquiringSwaptailfalsetrueArt of Multiprocessor Programming128

Purple Has the LockacquiredtailfalsetrueArt of Multiprocessor Programming129

Red Wants the LockacquiredacquiringtailfalsetrueArt of Multiprocessor Programmingtrue130

Red Wants the LockacquiredacquiringSwaptailfalsetrueArt of Multiprocessor Programmingtrue131

Red Wants the LockacquiredacquiringtailfalsetrueArt of Multiprocessor Programmingtrue132

Red Wants the LockacquiredacquiringtailfalsetrueArt of Multiprocessor Programmingtrue133

Red Wants the LockacquiredacquiringImplicitLinked listtailfalsetrueArt of Multiprocessor Programmingtrue134

Red Wants the LockacquiredacquiringtailfalsetrueArt of Multiprocessor Programmingtrue135

Red Wants the LockacquiredacquiringActually, itspins oncached copytruetailfalsetrueArt of Multiprocessor Programmingtrue136

Purple rt of Multiprocessor Programmingtrue137

Purple ReleasesreleasedacquiredtailtrueArt of Multiprocessor Programming138

Space Usage Let– L number of locks– N number of threads ALock– O(LN) CLH lock– O(L N)Art of Multiprocessor Programming139

CLH Queue Lockclass QNode {AtomicBoolean locked new AtomicBoolean(true);}Art of Multiprocessor Programming140

CLH Queue Lockclass QNode {AtomicBoolean locked new AtomicBoolean(true);}Not released yetArt of Multiprocessor Programming141

CLH Queue Lockclass CLHLock implements Lock {AtomicReference QNode tail;ThreadLocal QNode myNode new QNode();public void lock() {QNode pred tail.getAndSet(myNode);while (pred.locked) {}}}Art of Multiprocessor Programming142

CLH Queue Lockclass CLHLock implements Lock {AtomicReference QNode tail;ThreadLocal QNode myNode new QNode();public void lock() {QNode pred tail.getAndSet(myNode);while (pred.locked) {}}}Queue tailArt of Multiprocessor Programming143

CLH Queue Lockclass CLHLock implements Lock {AtomicReference QNode tail;ThreadLocal QNode myNode new QNode();public void lock() {QNode pred tail.getAndSet(myNode);while (pred.locked) {}}}Thread-local QNodeArt of Multiprocessor Programming144

CLH Queue Lockclass CLHLock implements Lock {AtomicReference QNode tail;ThreadLocal QNode myNode new QNode();Swap in my nodepublic void lock() {QNode pred tail.getAndSet(myNode);while (pred.locked) {}}}Art of Multiprocessor Programming145

CLH Queue Lockclass CLHLock implements Lock {AtomicReference QNode tail;ThreadLocal QNode myNode new QNode();Spin until predecessorreleases lockpublic void lock() {QNode pred tail.getAndSet(myNode);while (pred.locked) {}}}Art of Multiprocessor Programming146

CLH Queue LockClass CLHLock implements Lock { public void unlock() {myNode.locked.set(false);myNode pred;}}Art of Multiprocessor Programming147

CLH Queue LockClass CLHLock implements Lock { public void unlock() {myNode.locked.set(false);myNode pred;}}Notify successorArt of Multiprocessor Programming148

CLH Queue LockClass CLHLock implements Lock { public void unlock() {myNode.locked.set(false);myNode pred;}}Recyclepredecessor’s nodeArt of Multiprocessor Programming149

CLH Queue LockClass CLHLock implements Lock { public void unlock() {myNode.locked.set(false);myNode pred;}}(we don’t actually reuse myNode.Code in book shows how it’s done.)Art of Multiprocessor Programming150

CLH Lock Good– Lock release affects predecessor only– Small, constant-sized space Bad– Doesn’t work for uncached NUMAarchitecturesArt of Multiprocessor Programming151

NUMA and cc-NUMAArchitectures Acronym:– Non-Uniform Memory Architecture– ccNUMA cache coherent NUMA Illusion:– Flat shared memory Truth:– No caches (sometimes)– Some memory regions faster than othersArt of Multiprocessor Programming152

NUMA MachinesSpinning on localmemory is fastArt of Multiprocessor Programming153

NUMA MachinesSpinning on remotememory is slowArt of Multiprocessor Programming154

CLH Lock Each thread spins on predecessor’smemory Could be far away Art of Multiprocessor Programming155

MCS Lock FCFS order Spin on local memory only Small, Constant-size overheadArt of Multiprocessor Programming156

InitiallyidletailfalseArt of Multiprocessor Programming157

Acquiringacquiring(allocate QNode)truetailfalseArt of Multiprocessor Programming158

AcquiringacquiredswaptruetailfalseArt of Multiprocessor Programming159

AcquiringacquiredtruetailfalseArt of Multiprocessor Programming160

AcquiredacquiredtruetailfalseArt of Multiprocessor Programming161

AcquiringacquiredacquiringfalsetailswapArt of Multiprocessor Programmingtrue162

AcquiringacquiredacquiringfalsetailtrueArt of Multiprocessor Programming163

AcquiringacquiredacquiringfalsetailtrueArt of Multiprocessor Programming164

AcquiringacquiredacquiringfalsetailtrueArt of Multiprocessor Programming165

AcquiringacquiredacquiringtruetailtruefalseArt of Multiprocessor Programming166

AcquiringacquiredacquiringYes!truetailfalsetrueArt of Multiprocessor Programming167

MCS Queue Lockclass QNode {volatile boolean locked false;volatile qnodenext null;}Art of Multiprocessor Programming168

MCS Queue Lockclass MCSLock implements Lock {AtomicReference tail;public void lock() {QNode qnode new QNode();QNode pred tail.getAndSet(qnode);if (pred ! null) {qnode.locked true;pred.next qnode;while (qnode.locked) {}}}}Art of Multiprocessor Programming169

MCS Queue Lockclass MCSLock implements Lock {Make aAtomicReference tail;QNodepublic void lock() {QNode qnode new QNode();QNode pred tail.getAndSet(qnode);if (pred ! null) {qnode.locked true;pred.next qnode;while (qnode.locked) {}}}}Art of Multiprocessor Programming170

MCS Queue Lockclass MCSLock implements Lock {AtomicReference tail;public void lock() {QNode qnode new QNode();QNode pred tail.getAndSet(qnode);if (pred ! null) {qnode.locked true;add my Node topred.next qnode;the tail ofwhile (qnode.locked) {}queue}}}Art of Multiprocessor Programming171

MCS Queue Lockclass MCSLock implements Lock {Fix if queue wasAtomicReference tail;non-emptypublic void lock() {QNode qnode new QNode();QNode pred tail.getAndSet(qnode);if (pred ! null) {qnode.locked true;pred.next qnode;while (qnode.locked) {}}}}Art of Multiprocessor Programming172

MCS Queue Lockclass MCSLock implements Lock {AtomicReference tail;Wait untilpublic void lock() {unlockedQNode qnode new QNode();QNode pred tail.getAndSet(qnode);if (pred ! null) {qnode.locked true;pred.next qnode;while (qnode.locked) {}}}}Art of Multiprocessor Programming173

Purple ReleasereleasingswapfalsefalseArt of Multiprocessor Programming174

I don’t Releasesee a successor. But byPurplelooking at the queue, I seeanother thread is activereleasingswapfalsefalseArt of Multiprocessor Programming175

I don’t Releasesee a successor. But byPurplelooking at the queue, I seeanother thread is activereleasingswapfalsefalseI have to release thatthread so must wait for itArt of Multiprocessor Programming to identify its node176

Purple Releasereleasingprepare to spintruefalseArt of Multiprocessor Programming177

Purple ReleasereleasingspinningtruefalseArt of Multiprocessor Programming178

Purple ReleasereleasingspinningfalsetruefalseArt of Multiprocessor Programming179

Purple ReleasereleasingAcquired lockfalsetruefalseArt of Multiprocessor Programming180

MCS Queue Unlockclass MCSLock implements Lock {AtomicReference tail;public void unlock() {if (qnode.next null) {if (tail.CAS(qnode, null)return;while (qnode.next null) {}}qnode.next.locked false;}}Art of Multiprocessor Programming181

MCS Queue Lockclass MCSLock implements Lock {AtomicReference tail;public void unlock() {if (qnode.next null) {if (tail.CAS(qnode, null)return;while (qnode.next null) {}}qnode.next.locked false; Missingsuccessor}}Art of Multiprocessor Programming?182

MCS Queue Lockclass MCSLock implements Lock {If really no successor,AtomicReference tail;returnpublic void unlock() {if (qnode.next null) {if (tail.CAS(qnode, null)return;while (qnode.next null) {}}qnode.next.locked false;}}Art of Multiprocessor Programming183

MCS Queue Lockclass MCSLock implements Lock {Otherwise wait forAtomicReference tail;successortocatchuppublic void unlock() {if (qnode.next null) {if (tail.CAS(qnode, null)return;while (qnode.next null) {}}qnode.next.locked false;}}Art of Multiprocessor Programming184

MCS Queue Lockclass MCSLock implements Lock {AtomicReference queue;lock{to successorpublic voidPassunlock()if (qn

(we never lied to you) -But idealized (so we forgot to mention a few things) Protocols -Elegant -Important -But naïve. Art of Multiprocessor Programming 3 New Focus: Performance . probably use it again soon -Fetch from cache, not memory If you use an address now, you will