Po-An Tsai scite author profile

Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management.We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of waypartitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism.We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.

show abstract

Scaling distributed cache hierarchies through computation and data co-scheduling

Beckmann

Tsai

Sánchez

2015

View full text Add to dashboard Cite

Abstract-Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint.We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.

show abstract

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Tsai

Chen

Sánchez

2018

View full text Add to dashboard Cite

Conventional multicores rely on deep cache hierarchies to reduce data movement. Recent advances in die stacking have enabled near-data processing (NDP) systems that reduce data movement by placing cores close to memory. NDP cores enjoy cheaper memory accesses and are more area-constrained, so they use shallow cache hierarchies instead. Since neither shallow nor deep hierarchies work well for all applications, prior work has proposed systems that incorporate both. These asymmetric memory hierarchies can be highly beneficial, but they require scheduling computation to the right hierarchy. We present AMS, an adaptive scheduler that automatically finds high-quality thread-to-hierarchy mappings. AMS monitors threads, accurately models their performance under different hierarchies and core types, and adapts algorithms first proposed for cache partitioning to produce high-quality schedules. AMS is cheap enough to use online, so it adapts to program phases, and performs within 1% of an exhaustive-search scheduler. As a result, AMS outperforms asymmetry-oblivious schedulers by up to 37% and by 18% on average.

show abstract

Mind mappings: enabling efficient algorithm-accelerator mapping space search

Hegde

Tsai

Huang

et al. 2021

View full text Add to dashboard Cite

Modern day computing increasingly relies on specialization to satiate growing performance and efficiency requirements. A core challenge in designing such specialized hardware architectures is how to perform mapping space search, i.e., search for an optimal mapping from algorithm to hardware. Prior work shows that choosing an inefficient mapping can lead to multiplicative-factor efficiency overheads. Additionally, the search space is not only large but also non-convex and non-smooth, precluding advanced search techniques. As a result, previous works are forced to implement mapping space search using expert choices or sub-optimal search heuristics.This work proposes Mind Mappings, a novel gradient-based search method for algorithm-accelerator mapping space search. The key idea is to derive a smooth, differentiable approximation to the otherwise non-smooth, non-convex search space. With a smooth, differentiable approximation, we can leverage efficient gradient-based search algorithms to find high-quality mappings. We extensively compare Mind Mappings to black-box optimization schemes used in prior work. When tasked to find mappings for two important workloads (CNN and MTTKRP), the proposed search finds mappings that achieve an average 1.40×, 1.76×, and 1.29× (when run for a fixed number of steps) and 3.16×, 4.19×, and 2.90× (when run for a fixed amount of time) better energy-delay product (EDP) relative to Simulated Annealing, Genetic Algorithms and Reinforcement Learning, respectively. Meanwhile, Mind Mappings returns mappings with only 5.32× higher EDP than a possibly unachievable theoretical lower-bound, indicating proximity to the global optima. CCS CONCEPTS• Computer systems organization → Special purpose systems; • Software and its engineering → Compilers.

show abstract

Rethinking the Memory Hierarchy for Modern Languages

Tsai

Gan

Sánchez

2018

View full text Add to dashboard Cite

We present Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust. Memory-safe languages hide the memory layout from the programmer. This prevents memory corruption bugs and enables automatic memory management. Hotpads extends the same insight to the memory hierarchy: it hides the memory layout from software and takes control over it, dispensing with the conventional flat address space abstraction. This avoids the need for associative caches. Instead, Hotpads moves objects across a hierarchy of directly addressed memories. It rewrites pointers to avoid most associative lookups, provides hardware support for memory allocation, and unifies hierarchical garbage collection and data placement. As a result, Hotpads improves memory performance and efficiency substantially, and unlocks many new optimizations.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Po-An Tsai

KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores

Scaling distributed cache hierarchies through computation and data co-scheduling

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Mind mappings: enabling efficient algorithm-accelerator mapping space search

Rethinking the Memory Hierarchy for Modern Languages

Contact Info

Product

Resources

About