Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
DOI: 10.1145/3373376.3378468

Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Abstract: Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance. […]
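
The abstract describes transparently replicating page-tables across sockets so that hardware page walks stay NUMA-local. As a minimal illustrative sketch of that idea, assuming hypothetical structures and function names (this is not the Mitosis implementation), each page-table page could be mirrored into per-socket replicas, with every update broadcast to all copies:

```c
/* Illustrative sketch only: not the Mitosis implementation. All names here
 * (pte_replica_set, write_pte_replicated, ...) are hypothetical. The idea
 * from the abstract: keep one page-table copy per socket and mirror every
 * update, so a page walk on any socket reads local memory instead of a
 * remote NUMA node. */
#include <stdint.h>
#include <stddef.h>

#define MAX_SOCKETS 8

/* One page-table page (512 8-byte entries on x86-64), replicated per socket. */
struct pte_replica_set {
    uint64_t *replica[MAX_SOCKETS];  /* each allocated from that socket's local memory */
    int       nr_sockets;
};

/* Mirror a PTE update into every per-socket replica so all copies stay
 * byte-identical and any socket can walk its own local copy. */
static void write_pte_replicated(struct pte_replica_set *set,
                                 size_t index, uint64_t pte_val)
{
    for (int s = 0; s < set->nr_sockets; s++)
        set->replica[s][index] = pte_val;
}

/* On a TLB miss, the walker on socket s would be pointed at set->replica[s],
 * keeping the walk's memory accesses NUMA-local without application changes. */
```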

Cited by 39 publications (31 citation statements)
References 51 publications
“…There has been a tremendous amount of work aimed at improving translation range and efficiency (and thereby reducing the number of page walks) [8,16,21,23,25,27,33,37,38,42-46,53]. Other works have focused on reducing the TLB miss penalty by improving the page table walk caches [14,17,18], using speculation to hide latency [5,8,15,47], optimizing hash page tables [52], and replicating page tables across NUMA nodes [3]. For virtualized systems, Gandhi et al. proposed merging the 2D page table into a single dimension where possible [26].…”
Section: Related Work
confidence: 99%
“…Biasing the replacement policy to favor page table entries means evicting more data, but we find that applications with high TLB miss rates also exhibit high data miss rates (L2 and L3 data miss ratios of 95% and 80%). This, combined with the page table access being on the critical path to the data access, suggests that allocating more cache space to the (much smaller) page table over the data itself is likely to be more beneficial than caching the data.…”
Section: Cache Prioritization
confidence: 99%
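
The statement above argues for biasing cache replacement toward keeping page-table entries resident, since the page-table access sits on the critical path of the data access. A minimal simulator-style sketch of that idea, using hypothetical structures rather than the citing paper's actual mechanism, is:

```c
/* Hedged sketch of the cited idea: when choosing an eviction victim in a
 * set-associative cache, prefer to evict data lines over lines holding
 * page-table entries. All structures are hypothetical simulator state,
 * not a real hardware interface. */
#include <stdbool.h>

#define WAYS 16

struct cache_line {
    bool     valid;
    bool     is_page_table;  /* tagged when the line was filled by a page-walk request */
    unsigned lru_age;        /* higher = older */
};

/* Pick a victim way: the oldest data line if any data line exists,
 * falling back to the oldest page-table line only when the set is
 * entirely page-table entries. */
static int pick_victim(struct cache_line set[WAYS])
{
    int victim = -1, victim_pt = -1;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                          /* free way: no eviction needed */
        if (!set[w].is_page_table) {
            if (victim < 0 || set[w].lru_age > set[victim].lru_age)
                victim = w;
        } else if (victim_pt < 0 || set[w].lru_age > set[victim_pt].lru_age) {
            victim_pt = w;
        }
    }
    return victim >= 0 ? victim : victim_pt;
}
```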
“…Migration and replication of data pages and page-tables are commonly used to ameliorate the performance impact of NUMA effects [2,23,87,94], but policies depend critically on access frequency metadata. When a single access bit is read periodically to determine the hotness of an entire 2MB region, pages can easily appear artificially hot.…”
Section: Motivation
confidence: 99%
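
The concern above is about sampling granularity: one accessed bit covering a 2MB region cannot distinguish a single touched 4KB page from 512 of them. A hedged sketch of such a periodic access-bit scan, with hypothetical names, shows why the whole region gets credited as hot:

```c
/* Hedged sketch of the sampling problem the citing paper describes: if
 * hotness is inferred from the single accessed bit of a 2MB huge-page
 * mapping, one touch anywhere in its 512 constituent 4KB pages marks the
 * whole region hot. Names (region_meta, scan_region) are hypothetical. */
#include <stdint.h>

struct region_meta {
    uint64_t hot_score;   /* incremented each scan interval the bit was found set */
};

/* Periodic scan: test-and-clear the accessed bit of a 2MB mapping.
 * A single 4KB access during the interval sets the bit, so the entire
 * 2MB region is credited as hot even if 511 of its pages were cold.
 * (Non-atomic clear; a real kernel would use an atomic bit operation.) */
static void scan_region(volatile uint64_t *pmd_entry, struct region_meta *m)
{
    const uint64_t ACCESSED = 1ull << 5;   /* x86-64 accessed (A) bit */
    if (*pmd_entry & ACCESSED) {
        m->hot_score++;
        *pmd_entry &= ~ACCESSED;           /* clear so the next interval re-samples */
    }
}
```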
“…We periodically pause and capture memory metadata at points in the execution that are the same in both execution cases. For GPUs, capturing metadata and establishing correspondence is considerably simpler, as the prototype runs in a simulator: a single trace of memory references and instruction counts is captured and post-processed to produce a set of epochs and snapshots of per-page metadata.…”
Section: Metadata Fidelity
confidence: 99%