
REVISITING VIRTUAL MEMORY

By

Arkaprava Basu

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN-MADISON, 2013.

Date of final oral examination: 2nd December 2013.

The dissertation is approved by the following members of the Final Oral Committee:
Prof. Remzi H. Arpaci-Dusseau, Professor, Computer Sciences
Prof. Mark D. Hill (Advisor), Professor, Computer Sciences
Prof. Mikko H. Lipasti, Professor, Electrical and Computer Engineering
Prof. Michael M. Swift (Advisor), Associate Professor, Computer Sciences
Prof. David A. Wood, Professor, Computer Sciences

Copyright by Arkaprava Basu 2013
All Rights Reserved

Dedicated to my parents Susmita and Paritosh Basu for their selfless and unconditional love and support.

Abstract

Page-based virtual memory (paging) is a crucial piece of memory management in today's computing systems. However, I find that the needs, purpose and design constraints of virtual memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently used address translations: (a) physical memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every memory reference. As a result, large workloads waste considerable execution time on TLB misses and all workloads spend energy on frequent TLB accesses.

In this thesis I argue that it is now time to reevaluate virtual memory management. I reexamine the virtual memory subsystem, considering both the ever-growing latency overhead of address translation and energy dissipation, and develop three results.

First, I proposed direct segments to reduce the latency overhead of address translation for emerging big-memory workloads. Many big-memory workloads allocate most of their memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB misses for many of these workloads.

Second, I proposed opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each memory reference burns significant energy, and virtual memory's page size constrains L1-cache designs to be highly associative -- burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads.

Third, large pages are likely to be more appropriate than direct segments for reducing TLB misses under frequent memory allocations and deallocations. Unfortunately, prevalent chip designs like Intel's statically partition TLB resources among multiple page sizes, which can lead to performance pathologies when using large pages. I proposed the merged-associative TLB to avoid such pathologies and reduce the TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.

Acknowledgements

It is unimaginable for me to have come this far, to write the acknowledgements for my PhD thesis, without the guidance and support of my wonderful advisors, Prof. Mark Hill and Prof. Mike Swift.

I am deeply indebted to Mark not only for his astute technical advice, but also for his sage life advice. He taught me how to conduct research, how to communicate research ideas to others and how to ask relevant research questions. He has been a pillar of support for me during the tough times that I had to endure in the course of my graduate studies. It would not have been possible for me to earn my PhD without Mark's patient support. Beyond academics, Mark has always been a caring guardian to me for the past five years. I fondly remember how Mark took me to my first football game at the Camp Randall stadium a few weeks before my thesis defense so that I would not miss out on an important part of the Wisconsin Experience. Thanks Mark for being my advisor!

I express my deep gratitude to Mike. I have immense admiration for his deep technical knowledge across the breadth of computer science. His patience, support and guidance have been instrumental in my learning. My interest in OS-hardware coordinated design is in many ways shaped by Mike's influence. I am indebted to Mike for his diligence in helping me shape research ideas and present them to a wider audience. It is hard to imagine doing my PhD in virtual memory management without Mike's help. Thanks Mike for being my advisor!

I consider myself lucky to have been able to interact with the great faculty of this department. I always found my discussions with Prof. David Wood to be great learning experiences. I will have fond memories of interactions with Prof. Remzi Arpaci-Dusseau, Prof. Shan Lu, Prof. Karu Sankaralingam, and Prof. Guri Sohi.

I thank my student co-authors with whom I had the opportunity to do research. I learnt a lot from my long and deep technical discussions with Jayaram Bobba, Derek Hower and Jayneel Gandhi. I have greatly benefited from bouncing ideas off them. In particular, I acknowledge Jayneel's help with a part of my thesis.

I would like to thank former and current students of the department with whom I had many interactions, including Mathew Allen, Shoaib Altaf, Newsha Ardalani, Raghu Balasubramanian, Yasuko Eckert, Dan Gibson, Polina Dudnik, Venkataram Govindaraju, Gagan Gupta, Asim Kadav, Jai Menon, Lena Olson, Sankaralingam Panneerselvam, Jason Power, Somayeh Sardashti, Mohit Saxena, Rathijit Sen, Srinath Sridharan, Nilay Vaish, Venkatanathan Varadarajan, and James Wang. They made my time at Wisconsin enjoyable.

I would like to extend special thanks to Haris Volos, with whom I shared an office for more than four years. We shared many ups and downs of graduate student life. Haris helped me take my first steps in hacking the Linux kernel during my PhD. I am also thankful to Haris for gifting his car to me when he graduated and left Madison!

I thank AMD Research for my internship, which enabled me to learn a great deal about research in an industrial setting. In particular, I want to thank Brad Beckmann and Steve Reinhardt for making the internship both an enjoyable and a learning experience. I would like to thank the Wisconsin Computer Architecture Affiliates for their feedback and suggestions on my research. I want to extend special thanks to Jichuan Chang, with whom I had the opportunity to collaborate on a part of my thesis work. Jichuan has also been a great mentor to me.

I want to thank my personal friends Rahul Chatterjee, Moitree Laskar, Uttam Manna, Tumpa MannaJana, Anamitra RayChoudhury, and Subarna Tripathi for their support during my graduate studies.

This work was supported in part by the US National Science Foundation (CNS-0720565, CNS-0834473, CNS-0916725, CNS-1117280, CCF-1218323, and CNS-1302260), Sandia/DOE (#MSN 123960/DOE890426), and donations from AMD and Google.

And finally, I want to thank my dear parents Paritosh and Susmita Basu -- I cannot imagine a life without their selfless love and support.

Table of Contents

Chapter 1  Introduction
Chapter 2  Virtual Memory Basics
  2.1  Before Memory Was Virtual
  2.2  Inception of Virtual Memory
  2.3  Virtual Memory Usage
  2.4  Virtual Memory Internals
    2.4.1  Paging
    2.4.2  Segmentation
    2.4.3  Virtual Memory for other ISAs
  2.5  In this Thesis
Chapter 3  Reducing Address Translation Latency
  3.1  Introduction
  3.2  Big Memory Workload Analysis
    3.2.1  Actual Use of Virtual Memory
    3.2.2  Cost of Virtual Memory
    3.2.3  Application Execution Environment
  3.3  Efficient Virtual Memory Design
    3.3.1  Hardware Support: Direct Segment
    3.3.2  Software Support: Primary Region
  3.4  Software Prototype Implementation
    3.4.1  Architecture-Independent Implementation
    3.4.2  Architecture-Dependent Implementation
  3.5  Evaluation
    3.5.1  Methodology
    3.5.2  Results
  3.6  Discussion
  3.7  Limitations
  3.8  Related Work
Chapter 4  Reducing Address Translation Energy
  4.1  Introduction
  4.2  Motivation: Physical Caching vs. Virtual Caching
    4.2.1  Physically Addressed Caches
    4.2.2  Virtually Addressed Caches
  4.3  Analysis: Opportunity for Virtual Caching
    4.3.1  Synonym Usage
    4.3.2  Page Mapping and Protection Changes
  4.4  Opportunistic Virtual Caching: Design and Implementation
    4.4.1  OVC Hardware
    4.4.2  OVC Software
  4.5  Evaluation
    4.5.1  Baseline Architecture
    4.5.2  Methodology and Workloads
    4.5.3  Results
  4.6  OVC and Direct Segments: Putting it Together
  4.7  Related Work
Chapter 5  TLB Resource Aggregation
  5.1  Introduction
  5.2  Problem Description and Analysis
    5.2.1  Recap: Large Pages in x86-64
    5.2.2  TLB Designs for Multiple Page Sizes
    5.2.3  Problem Statement
  5.3  Design and Implementation
    5.3.1  Hardware: Merged-Associative TLB
    5.3.2  Software
  5.4  Dynamic Page Size Promotion and Demotion
  5.5  Evaluation
    5.5.1  Baseline
    5.5.2  Workloads
    5.5.3  Methodology
  5.6  Results
    5.6.1  Enhancing TLB Reach
    5.6.2  TLB Performance Unpredictability with Large Pages
    5.6.3  Performance Benefits of Merged TLB
  5.7  Related Work
  5.8  Conclusion
Chapter 6  Summary, Future Work, and Lessons Learned
  6.1  Summary
  6.2  Future Research Directions
    6.2.1  Virtual Machines and IOMMU
    6.2.2  Non-Volatile Memory
    6.2.3  Heterogeneous Computing
  6.3  Lessons Learned
Bibliography
Appendix: Raw Data Numbers

Chapter 1
Introduction

"Virtual memory was invented at the time of scarcity. Is it still a good idea?"
-- Charles Thacker, 2010 ACM Turing award lecture.

Page-based virtual memory (paging) is a crucial piece of memory management in today's computing systems. Software accesses memory using a virtual address that must be translated to a physical address before the memory access can be completed. This virtual-to-physical address translation goes through the page-based virtual memory subsystem in every current commercial general-purpose processor that I am aware of. Thus, an efficient address translation mechanism is a prerequisite for efficient memory access and ultimately for efficient computing. Notably, though, the basic formulation of the virtual address translation mechanism remains largely unchanged since the late 1960s, when translation lookaside buffers (TLBs) were introduced to efficiently cache recently used address translations. However, the purpose, usage and design constraints of virtual memory have witnessed a sea change in the last decade.

In this thesis I argue that it is now time to reevaluate virtual memory management. There are at least two key motivations behind the need to revisit virtual memory management techniques. First, there has been significant change in the needs and purpose of virtual memory. For example, the amount of memory that needs address translation is a few orders of magnitude larger than a decade ago. Second, there are new key constraints on how one designs computing systems today. For example, today's systems are most often power limited, unlike those from a decade ago.

Evolved Needs and Purposes: The steady decline in the cost of physical memory has enabled a million-times larger physical memory in today's systems than at the inception of page-based virtual memory. Figure 1-1 shows the amount of physical memory (DRAM) that could be purchased for 10,000 inflation-adjusted US dollars since 1980. One can observe that physical memory has become exponentially cheaper over the years. This has enabled the installed physical memory in a system to grow from a few megabytes to a few gigabytes and now even to a few terabytes. Indeed, HP's DL980 server currently ships with up to 4TB of physical memory, and Windows Server 2012 supports 4TB memories, up from 64GB a decade ago. Not only can modern computer systems have terabytes of physical memory, but emerging big-memory workloads also need to access terabytes of memory at low latency. In the enterprise space, the size of the largest data warehouse has been increasing at a cumulative annual growth rate of 173 percent -- significantly faster than Moore's law [77]. Thus modern systems need to efficiently translate addresses for terabytes of memory. These ever-growing memory sizes stretch current address-translation mechanisms to new limits.

Figure 1-1. Growth of physical memory.

Unfortunately, unlike the exponential growth in installed physical memory capacity, the size of the TLB has hardly scaled over the decades. The TLB plays a critical role in enabling efficient address translation by caching recently used address translation entries. A miss in the TLB can take several memory accesses (e.g., up to 4 memory accesses in x86-64) and may incur hundreds of cycles to service. Table 1-1 shows the number of L1-DTLB (level 1 data TLB) entries per core in different Intel processors over the years. The number of TLB entries has grown only from 72 entries in Intel's Pentium III (1999) processors to 100 entries in Ivy Bridge processors (2012). L1-DTLB sizes are hard to scale since L1 TLBs are accessed on each memory reference and thus need to abide by strict latency and power budgets. While modern processors have added second-level TLBs (L2-TLBs) to reduce the performance penalty of L1-TLB misses, recent research suggests that there is still considerable overhead due to misses in the L1-TLB that hit in the L2-TLB [59].
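To illustrate why a TLB miss can require up to four memory accesses on x86-64, the sketch below is my own simplification (not code from the dissertation): the hardware page-table walker traverses a four-level radix tree, reading one 8-byte entry per level; read_phys is a hypothetical stand-in for a memory read issued by the walker, and flags, large pages and faults are omitted.

#include <stdint.h>

/* Simplified sketch of an x86-64 4-level page-table walk (4KB pages).
 * A TLB miss triggers up to four dependent memory reads, one per level,
 * before the physical address is known. */

#define ENTRY_PRESENT 0x1ULL

/* Hypothetical stand-in for a memory access issued by the page walker. */
extern uint64_t read_phys(uint64_t paddr);

uint64_t walk_page_table(uint64_t cr3, uint64_t vaddr) {
    uint64_t table = cr3 & ~0xFFFULL;          /* root of the radix tree  */
    int shifts[4] = {39, 30, 21, 12};          /* PML4, PDPT, PD, PT      */

    for (int level = 0; level < 4; level++) {
        uint64_t index = (vaddr >> shifts[level]) & 0x1FF;  /* 9 bits    */
        uint64_t entry = read_phys(table + index * 8);      /* mem read  */
        if (!(entry & ENTRY_PRESENT))
            return 0;                          /* would raise a page fault */
        table = entry & 0x000FFFFFFFFFF000ULL; /* next level (or frame)   */
    }
    return table | (vaddr & 0xFFF);            /* frame base + page offset */
}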

Table 1-1. L1-Data-TLB sizes in Intel processors over the years.

Year             1999               2001             2008           2012
L1-DTLB entries  72 (Pentium III)   64 (Pentium 4)   96 (Nehalem)   100 (Ivy Bridge)

Large pages, which map larger amounts of memory with a single TLB entry, can help reduce the number of TLB misses. However, efficient use of large pages remains challenging [69, 87]. Furthermore, my experiments show that even with the use of large pages, a double-digit percentage of execution cycles can still be wasted in address translation. Further, like any cache design, the TLB needs access locality to be effective. However, many emerging big-data workloads like graph analytics or data-streaming applications demonstrate low access locality, and thus current TLB mechanisms may be less suitable for many future workloads [77].

In summary, the ever-increasing size of memory, growing data footprints of workloads, slow scaling of TLBs and low access locality of emerging workloads lead to an ever-increasing address translation overhead for page-based virtual memory. For example, my experiments on an Intel Sandy Bridge machine showed that up to 51% of the execution cycles could be wasted in address translation for the graph-analytics workload graph500 [36].
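To give a sense of scale, the small calculation below is my own back-of-the-envelope sketch (not from the dissertation), using the 100-entry Ivy Bridge L1-DTLB from Table 1-1 and the 96GB test machine mentioned above, to show how little memory the L1-DTLB can cover even with 2MB pages.

#include <stdio.h>

/* Back-of-the-envelope sketch of L1-DTLB "reach": memory covered by 100
 * entries (Ivy Bridge, Table 1-1) with 4KB versus 2MB pages, compared
 * against a 96GB machine. */
int main(void) {
    const unsigned long long entries   = 100;
    const unsigned long long page_4k   = 4ULL << 10;    /* 4KB  */
    const unsigned long long page_2m   = 2ULL << 20;    /* 2MB  */
    const unsigned long long mem_total = 96ULL << 30;   /* 96GB */

    unsigned long long reach_4k = entries * page_4k;     /* 400KB */
    unsigned long long reach_2m = entries * page_2m;     /* 200MB */

    printf("4KB-page reach: %llu KB (%.6f%% of 96GB)\n",
           reach_4k >> 10, 100.0 * reach_4k / mem_total);
    printf("2MB-page reach: %llu MB (%.3f%% of 96GB)\n",
           reach_2m >> 20, 100.0 * reach_2m / mem_total);
    return 0;
}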

Figure 1-2. TLB power contribution to on-chip power budget. Data from Avinash Sodani's (Intel) MICRO 2011 keynote talk.

New Design Constraint: Power dissipation is a first-class design constraint today. That was hardly the case when virtual memory subsystems were first designed. The current focus on energy efficiency motivates reexamining processor design decisions from the previous performance-first era, including the crucial virtual memory subsystem.

Virtual memory's address translation mechanism, especially TLB accesses, can contribute significantly to the power usage of a processor. Figure 1-2 shows the breakdown of power dissipation of a core (including caches) as reported by Intel [83]. TLBs can account for up to 13% of a core's power budget. My own experiments find that 6.6-13% of the on-chip cache hierarchy's dynamic energy is attributed to TLB accesses. Further, TLBs also show up as a hotspot due to their high energy density [75]. The primary reason behind the TLB's substantial energy budget is that it is accessed so frequently. Most, if not all, commercial general-purpose processors today access caches using physical addresses. Thus every memory reference needs to complete a TLB access before the memory access can be completed. Furthermore, since the TLB is on the critical path of every memory access, fast and thus often energy-hungry transistors are used to build TLBs.

The energy dissipation is further exacerbated by designs from the performance-first era that hide TLB lookup latency from the critical path of memory accesses. They do so by accessing the TLB in parallel with indexing into a set-associative L1 cache using the page offset of the virtual address. The TLB output is used only during the tag comparison at the L1 cache. However, such a virtually indexed, physically tagged cache design requires that the L1 cache index bits fall entirely within the page offset -- forcing the L1 cache to be more highly associative than required for low cache miss rates. For example, a typical 32KB L1 cache needs to be at least 8-way set-associative to satisfy this constraint with 4KB pages. Each access to a higher-associativity structure burns more energy and thus ultimately adds to the power consumption.
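To make the associativity constraint concrete, here is a small arithmetic sketch of my own (not from the dissertation): if the set-index bits must fit inside the page offset, each cache way can be at most one page in size, so the minimum associativity is the cache size divided by the page size.

#include <stdio.h>

/* Illustrative sketch: minimum associativity for a virtually indexed,
 * physically tagged (VIPT) L1 cache whose set-index bits must fall
 * entirely within the page offset. Each way can then cover at most
 * one page, so ways >= cache_size / page_size. */
int main(void) {
    const unsigned cache_size = 32 * 1024;  /* 32KB L1 cache  */
    const unsigned page_size  = 4 * 1024;   /* 4KB base pages */

    unsigned min_ways = cache_size / page_size;   /* = 8 ways */
    printf("Minimum associativity: %u-way\n", min_ways);

    /* With 8 ways, each way is 4KB: 64 sets of 64-byte lines, so the
     * 6 index bits + 6 offset bits fit inside the 12-bit page offset. */
    return 0;
}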

In summary, many aspects of current virtual memory's address translation mechanism warrant a fresh cost-benefit analysis that treats energy dissipation as a first-class design constraint.

Proposals: In this thesis I aim to reduce the latency and energy overheads of virtual memory's address translation primarily through three pieces of work. First, I propose direct segments [8] to reduce TLB miss overheads for big-memory workloads. Second, I propose opportunistic virtual caching [9] to reduce address translation energy. Finally, I propose a merged-associative TLB, which aims to improve TLB designs for large page sizes by eliminating the performance unpredictability that arises with the use of large pages in commercially prevalent TLB designs. In the following, I briefly describe these three pieces of work.

1. Direct Segments (Chapter 3): I find that emerging big-memory workloads like in-memory object caches, graph analytics, databases and some HPC workloads incur high address translation overheads under conventional page-based virtual memory (paging), and that this overhead primarily stems from TLB misses. For example, on a test machine with 96GB of physical memory, graph500 [36] spends 51% of execution cycles servicing TLB misses with 4KB pages and 10% of execution cycles with 2MB large pages. Future big-memory-workload trends, like ever-growing memory footprints and low access locality, are likely to worsen this further.

My memory-usage analysis of a few representative big-memory workloads revealed that, despite the cost of address translation, many key features of paging, such as swapping, fine-grain page protection, and external-fragmentation minimization, are not necessary for most of their memory usage. For example, databases carefully size their buffer pool according to the installed physical memory and thus rarely swap it. I find that only a small fraction of memory allocations, like those for memory-mapped files and executable code, benefit from page-based virtual memory. Unfortunately, current systems enforce page-based virtual memory for all memory, irrespective of its usage, and incur page-based address translation costs for all memory accesses.

To address this mismatch between what big-memory workloads need, what current systems support, and the high cost of address translation, I propose that processors support two types of address translation for non-overlapping regions of a process's virtual address space: 1) conventional paging using the TLB, page-table walker, etc.; and 2) a new, fast translation mechanism that uses a simple form of segmentation (without paging), called a direct segment. Direct-segment hardware can map an arbitrarily large contiguous range of virtual addresses having uniform access permissions to a contiguous physical address range with a small, fixed amount of hardware: base, limit and offset registers for each core (or context). If a virtual address falls between the base and limit register values, then the corresponding physical address is calculated by adding the value of the offset register to the virtual address. Since addresses translated using a direct segment need no TLB lookup, no TLB miss is possible. Virtual addresses outside the direct segment's range are mapped using conventional paging through TLBs, which remains useful for memory allocations that benefit from page-based virtual memory. The OS then provides a software abstraction for the direct segment to applications, called a primary region. The primary region captures memory usage that may not benefit from paging in a contiguous virtual address range, which can thus be mapped using the direct segment.

My results show that direct segments can often eliminate 99% of TLB misses across most of the big-memory workloads, reducing the time wasted on TLB misses to 0.5% of execution cycles.
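The following sketch is my own illustration of the per-reference direct-segment check described above; the structure and function names are hypothetical, and the real mechanism is implemented in hardware registers rather than software.

#include <stdint.h>

/* Hypothetical per-core direct-segment registers (names for illustration). */
struct direct_segment {
    uint64_t base;    /* start of direct-segment virtual range (inclusive) */
    uint64_t limit;   /* end of direct-segment virtual range (exclusive)   */
    uint64_t offset;  /* added to the virtual address to form the physical */
};

/* Fallback path: conventional TLB lookup / page-table walk. */
extern uint64_t translate_via_paging(uint64_t vaddr);

/* Sketch of per-reference translation with a direct segment. */
uint64_t translate(const struct direct_segment *ds, uint64_t vaddr) {
    if (vaddr >= ds->base && vaddr < ds->limit) {
        /* Inside the direct segment: no TLB lookup, so no TLB miss. */
        return vaddr + ds->offset;
    }
    /* Outside the direct segment: use conventional paging as usual. */
    return translate_via_paging(vaddr);
}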

2. Opportunistic Virtual Caching (Chapter 4): I propose opportunistic virtual caching (OVC) to reduce the energy dissipated by address translation. I find that looking up the TLB on each memory access can account for 7-13% of the dynamic energy dissipation of the whole on-chip memory hierarchy. Further, the L1 cache's energy dissipation is exacerbated by designs that hide TLB lookup latency from the critical path.

OVC addresses this energy wastage by enabling energy-efficient virtual caching as a dynamic optimization under software control. The OVC hardware allows some memory blocks to be cached in the L1 cache with virtual addresses (virtual caching) to avoid energy-hungry TLB lookups on L1 cache hits and to lower the associativity of L1 cache lookups. The rest of the blocks can be cached using conventional physical addressing, if needed.

The OS, with optional hints from applications, determines which memory regions are conducive to virtual caching and uses the virtual caching or conventional physical caching hardware accordingly. My analysis shows that many of the challenges to efficient virtual caching, like inconsistencies due to read-write synonyms (different virtual addresses mapping to the same physical address), occur rarely in practice. Thus, the vast majority of memory allocations can make use of energy-efficient virtual caching, while falling back to physical caching for the rest, as needed.

My evaluation shows that OVC can eliminate 94-99% of TLB lookup energy and 23% of L1 cache dynamic lookup energy.
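As a rough sketch of the lookup decision OVC enables (my own illustration: all helper names are hypothetical, and the real decision is made by hardware using OS-managed state), an access to a region the OS marks as virtually cacheable can hit in the L1 cache under its virtual address without touching the TLB, while other accesses translate first and look up physically.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for hardware structures and OS state. */
extern bool region_is_virtually_cached(uint64_t vaddr); /* OS-provided policy  */
extern bool l1_lookup_virtual(uint64_t vaddr, void *line_out);
extern bool l1_lookup_physical(uint64_t paddr, void *line_out);
extern uint64_t tlb_translate(uint64_t vaddr);           /* TLB lookup / walk  */

/* Sketch of an L1 access under opportunistic virtual caching. */
bool l1_access(uint64_t vaddr, void *line_out) {
    if (region_is_virtually_cached(vaddr)) {
        /* Virtual-caching path: an L1 hit needs no TLB access at all. */
        if (l1_lookup_virtual(vaddr, line_out))
            return true;
        /* Miss: translate once and fill the line from lower levels (omitted). */
        (void)tlb_translate(vaddr);
        return false;
    }
    /* Conventional path: translate first, then look up with the physical address. */
    uint64_t paddr = tlb_translate(vaddr);
    return l1_lookup_physical(paddr, line_out);
}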

3. Merged-Associative TLB (Chapter 5): While the proposed direct segments can eliminate most DTLB misses for big-memory workloads, which often have fairly predictable memory usage and allocate most memory early in execution, they may be less suitable when there are frequent memory allocations and deallocations. In contrast, support for large pages is currently the most widely employed mechanism to reduce TLB misses and could be more flexible under frequent memory allocation and deallocation by enabling better mitigation of memory fragmentation.

In this third piece of work I try to improve large-page support in commercially prevalent chip designs like those from Intel (e.g., Ivy Bridge, Sandy Bridge), which support multiple page sizes by providing a separate sub-TLB for each distinct page size (called a split-TLB design). Such a static allocation of TLB resources based on page size can lead to performance pathologies where the use of larger pages increases TLB miss rates, and it disallows TLB resource aggregation. A few competing commercial designs, like AMD's, instead employ a single fully associative TLB, which can hold entries for any page size. However, fully associative designs are often slower and more power-hungry than set-associative ones. Thus, a single set-associative TLB that can hold translations for any page size is desirable. Unfortunately, such a design is challenging because the correct index into a set-associative TLB for a given virtual address depends upon the page size of the translation, which is unknown until the TLB lookup itself completes.

I propose a merged-associative TLB to address this challenge by partitioning the abundant virtual address space of a 64-bit system among the page sizes, instead of partitioning scarce hardware TLB resources. The OS divides a process's virtual address space into a fixed number of non-overlapping regions. Each of these regions contains memory mapped using a single page size. This allows the hardware to decipher the page size by examining a few high-order bits of the virtual address.
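The sketch below illustrates this indexing idea with made-up parameters (the region-selector bits, page sizes and set count are my own illustrative choices, not the actual design): because the page size is implied by the virtual-address region, a single set-associative TLB can compute its set index before the lookup, for any page size.

#include <stdint.h>

/* Illustrative sketch of merged-associative TLB indexing. Region layout,
 * bit positions and set count are made-up values, not the actual design. */

#define TLB_SETS 64   /* single set-associative TLB shared by all page sizes */

/* Suppose the OS reserves virtual-address regions so that two high-order
 * bits (an arbitrary choice here) encode the page size of everything
 * mapped in that region. */
static unsigned page_shift_for(uint64_t vaddr) {
    switch ((vaddr >> 46) & 0x3) {   /* hypothetical region-selector bits */
    case 1:  return 21;              /* region mapped with 2MB pages   */
    case 2:  return 30;              /* region mapped with 1GB pages   */
    default: return 12;              /* everything else uses 4KB pages */
    }
}

/* Because the page size is known from the address alone, the set index
 * can be computed before the lookup, regardless of page size. */
unsigned tlb_set_index(uint64_t vaddr) {
    unsigned shift = page_shift_for(vaddr);
    uint64_t vpn = vaddr >> shift;         /* virtual page number */
    return (unsigned)(vpn % TLB_SETS);
}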
