Chapter 2 Memory Hierarchy Design

Transcription

Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 2: Memory Hierarchy Design
Copyright 2012, Elsevier Inc. All rights reserved.

Introduction
- Programmers want unlimited amounts of memory with low latency.
- Fast memory technology is more expensive per bit than slower memory.
- Solution: organize the memory system into a hierarchy.
  - The entire addressable memory space is available in the largest, slowest memory.
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
  - Gives the illusion of a large, fast memory being presented to the processor.

Memory Hierarchy (figure: the levels of the memory hierarchy).

Memory Performance Gap (figure: growth in processor memory requests/sec vs. growth in DRAM accesses/sec).

Memory Hierarchy Design
- Memory hierarchy design becomes more crucial with recent multi-core processors:
  - Aggregate peak bandwidth grows with # cores:
    - Intel Core i7 can generate two references per core per clock.
    - Four cores and a 3.2 GHz clock:
      - 25.6 billion 64-bit data references/second, plus
      - 12.8 billion 128-bit instruction references/second
      - = 409.6 GB/s!
    - DRAM bandwidth is only 6% of this (25 GB/s).
  - Requires:
    - Multi-port, pipelined caches
    - Two levels of cache per core
    - Shared third-level cache on chip

Performance and Power
- High-end microprocessors have 10 MB of on-chip cache.
  - Consumes a large amount of the area and power budget.
  - Static power due to leakage when not operating, and dynamic power when performing reads/writes.
- Power consumption in PMDs (personal mobile devices) due to caches can be between 25% and 50% of the total power consumption.

Memory Hierarchy Basics
- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference.
  - The lower level may be another cache or the main memory.
  - Also fetch the other words contained within the block.
    - Takes advantage of spatial locality.
  - Place the block into the cache in any location within its set, determined by the address: block address MOD number of sets in the cache.

Memory Hierarchy Basics (figure, animated over several slides)
The CPU issues Read 007, which misses in the cache; the block containing words 005-008 is brought in from memory. A subsequent Read 005 hits, since 005 is in the same block. Read 018 then misses, and the block containing it is brought in as well.
Copyright 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics
- Placement of blocks in a cache:
  - Set associative: a block is mapped into a set, and the block can be placed anywhere within that set.
  - Finding a block: map the block address to a set, then search the set (usually in parallel) to find the block.
  - n blocks in a set: n-way set associative cache.
  - One block per set (n = 1): direct-mapped cache.
  - One set per cache: fully associative cache.

Fully Associative Cache (figure): memory block 3 can go anywhere in the cache.

2-Way Set Associative Cache (figure): memory block 3 can only go into cache set 1 (3 mod 2 = 1).

Direct-Mapped Cache (figure): memory block 5 can only go into cache block 1 (5 mod 4 = 1).

Processor/Cache Addressing (figure, four slides): a 2-way set associative cache.
The processor's address is divided into a TAG and an INDEX. The INDEX selects the set; the TAG is then compared with the tags stored in that set, and it is used to check for a cache hit only if the valid bit (VB) is 1. The slides step through a cache hit (valid bit set, matching tag), a cache miss caused by a cleared valid bit, and a cache miss caused by a tag mismatch.

Cache Miss: Block Replacement
Which block to replace?
- Direct-mapped cache: only one option: replace the block in the location where the incoming block has to go.
- Fully associative or set associative:
  - Random: spreads allocation uniformly.
  - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time. Can be expensive to implement in hardware; pseudo-LRU provides an approximation, using bits associated with each set to record when blocks in the set were accessed.
  - First In, First Out (FIFO): selects the oldest rather than the LRU block.

Memory Hierarchy Basics
- Writing to the cache: two strategies.
  - Write-through: immediately update the lower levels of the hierarchy.
  - Write-back: only update the lower levels of the hierarchy when an updated block is replaced.
- Both strategies use a write buffer to make writes asynchronous.

Write-through (figure): the CPU writes word 018, which hits in the cache; the word is updated both in the cache and in memory.
Write-back (figure, three slides): the CPU writes word 018, which hits; only the cached copy is updated. A later Read 024 misses, forcing a block replacement; the modified block is written back to memory before the block containing 024 is brought into the cache.

Memory Hierarchy Basics
- Miss rate: fraction of cache accesses that result in a miss.
- Causes of misses (the 3 C's):
  - Compulsory: first reference to a block. Would happen even for an infinite cache.
  - Capacity: blocks discarded because the cache is full, and later retrieved.
  - Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.

Memory Hierarchy Basics
Memory stall cycles = Number of misses × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty
                    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
- Note that speculative and multithreaded processors may execute other instructions during a miss.
  - Reduces the performance impact of misses.

Memory Performance
Example: Assume that the CPI of a computer is 1.0 when all memory accesses hit in the cache. Data access instructions (loads and stores) account for 50% of all instructions. The miss penalty is 25 cycles and the data miss rate is 4%. Assume that 70% of instruction fetches are cache hits. What is the execution time in terms of clock cycle time (CC) and instruction count (IC)?

Path (instruction fetch / data access)    Probability                    Clock cycles
I-fetch hit, data miss                    0.7 × 0.5 × 0.04 = 0.014       1 + 25 = 26
I-fetch hit, data hit                     0.7 × 0.5 × 0.96 = 0.336       1
I-fetch hit, no memory access             0.7 × 0.5 = 0.35               1
I-fetch miss, data miss                   0.3 × 0.5 × 0.04 = 0.006       25 + 25 + 1 = 51
I-fetch miss, data hit                    0.3 × 0.5 × 0.96 = 0.144       25 + 1 = 26
I-fetch miss, no memory access            0.3 × 0.5 = 0.15               25 + 1 = 26

Execution time = IC × (0.014×26 + (0.336 + 0.35)×1 + 0.006×51 + (0.144 + 0.15)×26) × CC = 9 × IC × CC
Copyright 2013, Daniel A. Menasce. All rights reserved.

Memory Performance
Example: How would the execution time in the previous example change if the miss rate were reduced to 2%?

The same tree applies, with data hit/miss probabilities 0.98/0.02:
I-fetch hit, data miss:        0.7 × 0.5 × 0.02 = 0.007    26 cycles
I-fetch hit, data hit:         0.7 × 0.5 × 0.98 = 0.343    1 cycle
I-fetch hit, no memory access: 0.35                        1 cycle
I-fetch miss, data miss:       0.3 × 0.5 × 0.02 = 0.003    51 cycles
I-fetch miss, data hit:        0.3 × 0.5 × 0.98 = 0.147    26 cycles
I-fetch miss, no memory access: 0.15                       26 cycles

Execution time = IC × (0.007×26 + (0.343 + 0.35)×1 + 0.003×51 + (0.147 + 0.15)×26) × CC = 8.75 × IC × CC

Memory Performance
Example: What if the miss rate were kept at 4% but memory access time were improved so that the miss penalty were reduced to 10 cycles?

I-fetch hit, data miss:        0.014    1 + 10 = 11 cycles
I-fetch hit, data hit:         0.336    1 cycle
I-fetch hit, no memory access: 0.35     1 cycle
I-fetch miss, data miss:       0.006    10 + 10 + 1 = 21 cycles
I-fetch miss, data hit:        0.144    10 + 1 = 11 cycles
I-fetch miss, no memory access: 0.15    10 + 1 = 11 cycles

Execution time = IC × (0.014×11 + (0.336 + 0.35)×1 + 0.006×21 + (0.144 + 0.15)×11) × CC = 4.2 × IC × CC
An average of 3.2 stall cycles per instruction.

Memory Hierarchy Basics
Six basic cache optimizations:
- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels (multi-level caches)
  - Reduces overall memory access time:
    AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2)
- Giving priority to read misses over writes
  - Reduces miss penalty
- Avoiding address translation during cache indexing
  - Reduces hit time

Ten Advanced Optimizations
Based on improvements in:
- Hit time
- Miss rate
- Miss penalty
- Cache bandwidth
- Power consumption

Small and Simple First-Level Caches (1)
- Critical timing path: addressing tag memory, then comparing tags, then selecting the correct set.
- Direct-mapped caches can overlap tag comparison and transmission of data.
- Lower associativity reduces power because fewer cache lines are accessed.

L1 Size and Associativity (figure): access time vs. size and associativity.

L1 Size and Associativity (figure): energy per read vs. size and associativity.

Memory Performance Example
Determine whether a 32-KB 4-way set associative L1 cache has a faster memory access time than a 32-KB 2-way set associative L1 cache. Assume that the miss penalty to L2 is 15 clock cycles and that the hit time for the faster L1 cache is 1 clock cycle. Ignore misses beyond the L2 cache and assume the L1 miss rates below.
Miss rate for 2-way associative = 0.038
Miss rate for 4-way associative = 0.037

Memory Performance Example
Determine whether a 32-KB 4-way set associative L1 cache has a faster memory access time than a 32-KB 2-way set associative L1 cache. Assume that the miss penalty to L2 is 15 clock cycles and that the hit time for the faster L1 cache is 1 clock cycle. Ignore misses beyond the L2 cache and assume the L1 miss rates below.
Miss rate for 2-way associative = 0.038
Miss rate for 4-way associative = 0.037

Hit time for 2-way associative = 1 cycle.
Avg. access time for 2-way associative = HitTime + MissRate × MissPenalty = 1 + 0.038 × 15 = 1.57
The figure on slide 36 shows that the access time of the 4-way L1 cache is approximately 1.4 times that of the 2-way cache.
Avg. access time for 4-way associative = 1.4 × 1 + 0.037 × 15 = 1.955
So the 2-way set associative cache provides the faster average memory access time.

Way Prediction (2)
- To improve hit time, predict the way (i.e., the block within the set) of the next cache access, to pre-set the mux.
  - A mis-prediction gives a longer hit time.
- Prediction accuracy:
  - 90% for two-way
  - 80% for four-way
  - The I-cache has better accuracy than the D-cache.
- First used on the MIPS R10000 in the mid-90s.
- Used on the ARM Cortex-A8.

Pipelining Cache (3)
- Pipeline cache access to improve bandwidth.
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro through Pentium III: 2 cycles
  - Pentium 4 through Core i7: 4 cycles
- Increases branch mis-prediction penalty.
- Makes it easier to increase associativity.

Nonblocking Caches (4)
- Allow hits before previous misses complete:
  - "Hit under miss"
  - "Hit under multiple miss"
- L2 must support this.
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty.

Multibanked Caches (5)
- Organize the cache as independent banks to support simultaneous access.
  - ARM Cortex-A8 supports 1-4 banks for L2.
  - Intel i7 supports 4 banks for L1 and 8 banks for L2.
- Interleave banks according to block address.

Critical Word First, Early Restart (6)
- Critical word first:
  - Request the missed word from memory first.
  - Send it to the processor as soon as it arrives.
- Early restart:
  - Request words in normal order.
  - Send the missed word to the processor as soon as it arrives.
- The effectiveness of these strategies depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched.

Merging Write Buffer (7)
- When storing to a block that is already pending in the write buffer, update the write buffer entry.
- Reduces stalls due to a full write buffer.
- Do not apply to I/O addresses.
(Figure: write buffer contents without and with write merging.)

Compiler Optimizations (8)
- Loop interchange: swap nested loops to access memory in sequential order.
- Blocking (see examples on pages 89-90): instead of accessing entire rows or columns, subdivide matrices into blocks.
  - Requires more memory accesses but improves the locality of the accesses.

Loop Interchange
Assume a [5000][100] array stored in row-major order (i.e., laid out by rows).

Before: accesses x[0][0], x[1][0], x[2][0], ...
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

After: accesses x[0][0], x[0][1], x[0][2], ...
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

Which one produces fewer misses?

Hardware Prefetching (9)
- Fetch two blocks on a miss (the requested block and the next sequential block).
(Figure: performance improvement from Pentium 4 pre-fetching on a subset of the SPEC 2000 programs.)

Compiler Prefetching (10)
- Insert prefetch instructions before the data is needed.
- Non-faulting: the prefetch doesn't cause exceptions.
- Register prefetch: loads data into a register.
- Cache prefetch: loads data into the cache.
- Combine with loop unrolling and software pipelining.

Summary (figure: the ten advanced optimizations and their impact).

Main Memory Technology
- Performance metrics:
  - Latency is the concern of the cache.
  - Bandwidth is the concern of multiprocessors and I/O.
  - Access time: time between a read request and when the desired word arrives.
  - Cycle time: minimum time between unrelated requests to memory.
- DRAM is used for main memory; SRAM is used for caches.

Memory Technology
- SRAM:
  - Does not require refresh.
  - Requires low power to retain its bits.
  - Requires 6 transistors/bit.
- DRAM (used for main memory):
  - Must be re-written after being read.
  - Must also be periodically refreshed (every 8 ms); each row can be refreshed simultaneously by reading the row.
  - One transistor/bit.
  - Address lines are multiplexed:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)

Memory Technology
- Access time: time between a memory request and when the desired word arrives.
- Cycle time: minimum time between unrelated requests to memory.
- SRAMs don't need to refresh, so cycle time and access time are very close.

Memory Technology
- Amdahl: memory capacity should grow linearly with processor speed.
  - Unfortunately, memory capacity and speed have not kept pace with processors.
- Some optimizations:
  - Multiple accesses to the same row (using the row buffer).
  - Synchronous DRAM (SDRAM): added a clock to the DRAM interface; burst mode with critical word first.
  - Wider interfaces (DDR2 and DDR3).
  - Double data rate (DDR): transfer data on both the rising and falling edges of the clock.
  - Multiple independent banks on each DRAM device: address = (bank #, row #, column #).

Memory Optimizations (table of DRAM/SDRAM generations).

Memory Optimizations (table, continued).

Memory Optimizations
- DDR2:
  - Lower power (2.5 V → 1.8 V)
  - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
- DDR3:
  - 1.5 V
  - Maximum clock frequency: 800 MHz
- DDR4:
  - 1-1.2 V
  - Maximum clock frequency: 1600 MHz
- GDDR5 is graphics memory based on DDR3; used in GPUs.

Memory Optimizations
- Graphics memory:
  - Achieves 2-5× the bandwidth per DRAM vs. DDR3:
    - Wider interfaces (32 vs. 16 bits)
    - Higher clock rate
    - Possible because the chips are attached via soldering instead of socketed DIMM modules.
- Reducing power in SDRAMs:
  - Lower voltage (1.35 V or 1.5 V)
  - Low-power mode (ignores the clock, but continues to refresh)

Memory Power Consumption (figure: power breakdown for a Micron 1.5 V 2 Gb DDR3-1066 SDRAM).

Flash Memory
- A type of EEPROM (Electronically Erasable Programmable Read-Only Memory).
- Must be erased (in blocks) before being overwritten.
- Non-volatile.
- Limited number of write cycles.
- Cheaper than SDRAM, more expensive than disk.
- Slower than SRAM, faster than disk.

Memory Dependability
- Memory is susceptible to cosmic rays.
- Soft errors (dynamic errors): detected and fixed by error-correcting codes (ECC).
- Hard errors (permanent errors): use spare rows to replace defective rows.
- Chipkill: a RAID-like error recovery technique. Data and ECC are distributed so that the loss of a single memory chip can be tolerated and its contents reconstructed from the others.

Virtual Memory Basics
- The address space of a process is divided into equal-sized pages (e.g., 4 KB in size).
- Main memory is divided into page frames that can each hold one page.
- The size of a process' address space can be larger than main memory.
- At any time, some pages of a process may not be in main memory. All pages are on the paging disk.
- When a page that is not in memory is referenced, it must be brought from disk to memory. This is a page fault.
Copyright 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics
- If all page frames are occupied when a page fault occurs, a page has to be evicted from memory using a page replacement algorithm (PRA). Least Recently Used (LRU) is an example of a PRA.
- There is a page table per process, maintained by the operating system.
- Translation from a virtual address to a physical address is done by the hardware.
- A page fault invokes the operating system.

Virtual Memory Basics (figure, four slides)
A virtual address is the pair (VPN, page offset). The VPN indexes the per-process page table, which yields a page frame number; the physical address is (frame number, page offset). The slides step through a translation that succeeds (the VPN's entry is valid and maps to a frame of main memory) and a reference to a page that is not in memory, which raises a page fault; the page is then brought in from the paging disk, the page table is updated, and the translation is retried.

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
- How many pages are there in the virtual address space?
- How many page frames are there in memory?
- How many bits are needed for the VPN field of an address?
- How many bits are needed for the page frame field of an address?
- How many bits are needed for the page offset field?

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
How many pages are there in the virtual address space?
64 × 2^30 / (8 × 2^10) = 8 × 2^20 = 2^23 pages

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
How many page frames are there in memory?
4 × 2^30 / (8 × 2^10) = 4 × 2^17 = 2^19 page frames

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
How many bits are needed for the VPN field of an address?
2^23 pages → 23 bits

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
How many bits are needed for the page frame field of an address?
2^19 page frames → 19 bits

Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
How many bits are needed for the page offset field?
If a word is 8 bytes long, then each page has (2^3 × 2^10) / 2^3 = 2^10 words, so 10 bits are needed for the offset.

Virtual Memory
- Protection via virtual memory: keeps processes in their own memory space.
- Role of the architecture:
  - Provide user mode and supervisor mode.
  - Protect certain aspects of CPU state.
  - Provide mechanisms for switching between user mode and supervisor mode.
  - Provide mechanisms to limit memory accesses.
  - Provide a TLB to translate addresses.

Virtual Machines
- Support isolation and security.
- Allow sharing a computer among many unrelated users.
- Enabled by the raw speed of processors, which makes the overhead more acceptable.
- "System virtual machines" allow different ISAs and operating systems to be presented to user programs.
  - SVM software is called a "virtual machine monitor" or "hypervisor".
  - Individual virtual machines running under the monitor are called "guest VMs".

Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables.
- The VMM adds a level of memory between physical and virtual memory, called "real memory".
- The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses.
  - Requires the VMM to detect the guest's changes to its own page table.
  - Occurs naturally if accessing the page table pointer is a privileged operation.
