2019 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/hpec.2019.8916466

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs

Abstract: The last decade has seen a shift in the computer systems industry, with heterogeneous computing becoming prevalent. Graphics Processing Units (GPUs) are now present in systems ranging from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPU) to boost the performance of compute-intensive applications. However, a significant share of their characteristics remains undisclosed beyond what vendors provide. In this paper, we introduce a very low overhead and portable…

Cited by 15 publications (10 citation statements). References 25 publications.
“…The layer-wise compression overhead of compression algorithms is non-negligible. There are some fixed overheads to launch and execute kernels in CUDA (Arafa et al., 2019) and we observe that the encoding and decoding overheads remain quite stable across a wide range of tensor sizes. For many algorithms, the compression overhead increases by less than 50% from the tensor size of 2^6 to 2^20 elements.…”
Section: An Opportunity and a Challenge
confidence: 84%
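To make the fixed launch overhead cited above concrete, here is a minimal sketch of how such overhead can be measured with the CUDA runtime API: it times many launches of an empty kernel and reports the average cost per launch. The kernel name, iteration count, and timing method are illustrative assumptions, not taken from the cited paper.

```cuda
// Minimal sketch (assumed setup): averaging the fixed launch/execution
// overhead of an empty CUDA kernel over many launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}  // does no work, so timing isolates launch cost

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();      // warm-up: triggers context/module setup
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();  // back-to-back launches of the empty kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch+execute overhead: %.3f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```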
“…We model the router latency and energy consumption using BookSim2's model [29], and the TSV and on/off-chip buses adopt parameters from previous studies [15], [59], [63]. For the ALU, we use the measured results from PTX instructions [8], [9]. For area evaluation, we use Design Compiler [19] to analyse the pre-layout area of the vector ALU and the SIMT core pipeline [31].…”
Section: Discussion
confidence: 99%
“…This lab provides insights into how to measure latency, throughput, and data and memory dependency stalls at the instruction level. We recommend the references [22], [23] for interested readers who wish to further explore GPU latency/performance at the instruction level by writing microbenchmark codes.…”
Section: B. Instruction Latency
confidence: 99%
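As a starting point for such microbenchmark codes, the following is a minimal sketch of the clock-based approach those references describe: a single thread times a chain of dependent FMA instructions with clock64(), so the per-iteration cycle count approximates one instruction's latency. The instruction choice, chain length, and kernel name are illustrative assumptions, not the paper's exact benchmark.

```cuda
// Minimal sketch (assumed setup): clock64()-based latency microbenchmark.
// Each FMA depends on the previous result, forcing serialized execution.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles) {
    float a = 1.000001f, b = 1.000002f, c = 0.0f;
    long long t0 = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        c = fmaf(a, b, c);   // dependent chain: next FMA waits on this result
    long long t1 = clock64();
    *out = c;                // keep the chain live so it is not optimized away
    *cycles = t1 - t0;
}

int main() {
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    fma_latency<<<1, 1>>>(d_out, d_cycles);  // one thread: no scheduling noise
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("approx. FMA latency: %.2f cycles\n", cycles / 256.0);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The exact cycle count will vary with GPU architecture; the point of the dependent chain is that loop and timing overhead amortize away as the chain grows, which is what keeps the measurement overhead low.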