Trinayan Baruah scite author profile

The rapidly growing popularity and scale of dataparallel workloads demand a corresponding increase in raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy the performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries and associated programming models. The research community currently lacks a publically available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions.In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with in-built support for multithreaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs 5.5% on avarage when compared against actual GPU hardware. We also achieve a 3.5× and a 2.5× average speedup in function emulation and architectural simulation with 4 CPU cores, while delivering the same accuracy as the serial simulation.We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system (U-MGPU) and a discrete multi-GPU system (D-MGPU) that both utilize unified memory space and cross-GPU memory access. We evaluate the design implications from our case study, suggesting that D-MGPU is an attractive programming model for future multi-GPU systems.

show abstract

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

Mojumder¹,

Delshadtehrani²,

Ma³

et al. 2020

Preprint

View full text Add to dashboard Cite

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUsâ Ȃ Ź memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestampbased coherence protocol, HALCONE , for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HAL-CONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6× and 3× better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks. Broadly, HALCONE scales well with both GPU count and CU count. Furthermore, we stress test our HALCONE protocol using a custom synthetic benchmark suite to evaluate its impact on the overall performance. When running our synthetic benchmark suite, the HALCONE protocol slows down the execution time by only 16.8% in the worst case.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Trinayan Baruah

Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems

GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs

Evaluating Performance Tradeoffs on the Radeon Open Compute Platform

Characterizing the Microarchitectural Implications of a Convolutional Neural Network (CNN) Execution on GPUs

MGPUSim

Energy efficient execution of heterogeneous applications

MGSim + MGMark: A Framework for Multi-GPU System Research

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

Contact Info

Product

Resources

About