2019 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/hpec.2019.8916466

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs

Abstract: The last decade has seen a shift in the computer systems industry, with heterogeneous computing becoming prevalent. Graphics Processing Units (GPUs) are now present in systems ranging from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPU) to boost the performance of compute-intensive applications. However, a significant share of their characteristics remains undisclosed beyond what vendors provide. In this paper, we introduce a very low overhead and portable…

Cited by 15 publications (10 citation statements). References 25 publications.
“…The layer-wise compression overhead of compression algorithms is non-negligible. There are some fixed overheads to launch and execute kernels in CUDA (Arafa et al., 2019) and we observe that the encoding and decoding overheads remain quite stable across a wide range of tensor sizes. For many algorithms, the compression overhead increases by less than 50% from the tensor size of 2^6 to 2^20 elements.…”
Section: An Opportunity and a Challenge
confidence: 84%
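To make the fixed launch overhead cited above concrete, here is a minimal sketch of how such overhead can be measured with the CUDA runtime API: it times many launches of an empty kernel and reports the average cost per launch. The kernel name, iteration count, and timing method are illustrative assumptions, not taken from the cited paper.

```cuda
// Minimal sketch (assumed setup): averaging the fixed launch/execution
// overhead of an empty CUDA kernel over many launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}  // does no work, so timing isolates launch cost

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();      // warm-up: triggers context/module setup
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();  // back-to-back launches of the empty kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch+execute overhead: %.3f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```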
“…We model the router latency and energy consumption using BookSim2's model [29], and the TSV and on/off-chip buses adopt parameters from previous studies [15], [59], [63]. For the ALU, we use the measured results from PTX instructions [8], [9]. For area evaluation, we use Design Compiler [19] to analyse the pre-layout area of the vector ALU and the SIMT core pipeline [31].…”
Section: Discussion
confidence: 99%
“…This lab provides insights into how to measure latency, throughput, and data and memory dependency stalls at the instruction level. We recommend the references [22], [23] for interested readers who wish to further explore GPU latency/performance at the instruction level by writing microbenchmark codes.…”
Section: B. Instruction Latency
confidence: 99%
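As a starting point for such microbenchmark codes, the following is a minimal sketch of the clock-based approach those references describe: a single thread times a chain of dependent FMA instructions with clock64(), so the per-iteration cycle count approximates one instruction's latency. The instruction choice, chain length, and kernel name are illustrative assumptions, not the paper's exact benchmark.

```cuda
// Minimal sketch (assumed setup): clock64()-based latency microbenchmark.
// Each FMA depends on the previous result, forcing serialized execution.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles) {
    float a = 1.000001f, b = 1.000002f, c = 0.0f;
    long long t0 = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        c = fmaf(a, b, c);   // dependent chain: next FMA waits on this result
    long long t1 = clock64();
    *out = c;                // keep the chain live so it is not optimized away
    *cycles = t1 - t0;
}

int main() {
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    fma_latency<<<1, 1>>>(d_out, d_cycles);  // one thread: no scheduling noise
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("approx. FMA latency: %.2f cycles\n", cycles / 256.0);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The exact cycle count will vary with GPU architecture; the point of the dependent chain is that loop and timing overhead amortize away as the chain grows, which is what keeps the measurement overhead low.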