Programming on a GPU has been made considerably easier by the introduction of virtual memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with additional costs and overhead, the largest of which arises from address translation. Because a massive number of threads run concurrently on a GPU, the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB miss rates can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores following similar memory access patterns. To exploit the inherent sharing present in GPU workloads, we propose Valkyrie, an integrated cooperative TLB prefetching mechanism and inter-L1-TLB probing scheme that can efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie achieves an average speedup of 1.95×, while adding modest hardware overhead.

CCS CONCEPTS
• Computing methodologies → Graphics processors; • Software and its engineering → Virtual memory.