2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2018.00027
Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

Cited by 33 publications (33 citation statements); References 36 publications
“…We first consider when a kernel is in the stalled state; p(k, M std ) is the probability that the kernel is stalled. In this case, the interconnect of the GPU memory system is busy fetching or storing data, and the memory pipeline may be stalled by cache-miss-related resource saturation [6,23]. This delays the memory operations of all other co-running kernels.…”
Section: Slowdown Caused By Conflicts
confidence: 99%
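The stall probability in the quote above lends itself to a toy expected-delay calculation. This is only an illustrative sketch, not the cited paper's model: `p_stall` stands in for p(k, M std ), and the per-stall penalty is an assumed constant.

```python
# Toy sketch (assumptions mine, not the cited model): the expected extra
# latency a co-running kernel's memory operation sees when kernel k stalls
# the shared memory pipeline with probability p_stall.
def expected_memory_delay(p_stall: float, stall_penalty_cycles: float) -> float:
    """Expected extra cycles per memory operation of a co-running kernel."""
    assert 0.0 <= p_stall <= 1.0
    return p_stall * stall_penalty_cycles

# If a kernel stalls the pipeline 25% of the time at an assumed 40-cycle
# penalty, a co-runner pays 10 extra cycles per memory operation on average.
print(expected_memory_delay(0.25, 40.0))  # 10.0
```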
“…Recently, contention in the GPU memory system has drawn extensive attention. The work in [6] shows performance improvements by reducing memory pipeline stalls: it balances the memory accesses of concurrent kernels and limits the number of in-flight memory instructions issued by each kernel.…”
Section: Related Work
confidence: 99%
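The per-kernel limit on in-flight memory instructions described above can be sketched as a simple quota check at issue time. This is a minimal illustration with semantics I am assuming, not the paper's hardware mechanism.

```python
# Minimal sketch (assumed semantics, not the cited implementation) of
# per-kernel throttling of in-flight memory instructions: a kernel may
# issue a new memory instruction only while it is below its quota.
class MemoryThrottle:
    def __init__(self, quota_per_kernel: int):
        self.quota = quota_per_kernel
        self.in_flight = {}  # kernel id -> outstanding memory instructions

    def try_issue(self, kernel: str) -> bool:
        """Return True and count the instruction if the kernel has quota left."""
        count = self.in_flight.get(kernel, 0)
        if count >= self.quota:
            return False  # kernel throttled: quota exhausted
        self.in_flight[kernel] = count + 1
        return True

    def complete(self, kernel: str) -> None:
        """A memory instruction finished; free one quota slot."""
        self.in_flight[kernel] -= 1

t = MemoryThrottle(quota_per_kernel=2)
print(t.try_issue("A"), t.try_issue("A"), t.try_issue("A"))  # True True False
t.complete("A")
print(t.try_issue("A"))  # True
```

Bounding each kernel's outstanding memory instructions keeps one memory-intensive kernel from saturating cache-miss resources and stalling the pipeline for its co-runners.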
“…the assigned TLP resources per SM; hence decreasing the assigned TLP resources leads to a severe per-SM performance drop. In addition, co-executing two kernels on the same SM unavoidably leads to intra-SM contention for various resources, including the L1 cache and/or the load/store units [13]. Intra-SM contention may slow down one kernel or, in some cases, both kernels.…”
Section: Why Existing Solutions Fail
confidence: 99%
“…As mentioned in the introduction, simultaneous multikernel (SMK) execution [9,10] is another approach to improving resource utilization in a fine-grained way within an SM. Although previous work showed that SMK works well for mixes of applications with different execution characteristics [9,10,13], Hongwen et al. [31,32] more recently pointed out that even under a state-of-the-art intra-SM sharing scheme, performance still suffers due to interference among concurrent applications. Here we implement SMK following [10], which improves performance by dynamically partitioning SM resources.…”
Section: Comparison Against SMK
confidence: 99%
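The dynamic SM-resource partitioning mentioned in the quote above can be sketched as a proportional split. The policy here (shares proportional to measured IPC) and all parameter names are my assumptions for illustration, not the scheme of [10].

```python
# Illustrative sketch (assumptions mine): repartition an SM's thread slots
# between two co-resident kernels in proportion to each kernel's measured
# useful progress (e.g., instructions per cycle), so the kernel making
# better use of the SM receives the larger share.
def partition_thread_slots(total_slots: int, ipc_a: float, ipc_b: float):
    """Split SM thread slots between kernels A and B proportionally to IPC."""
    assert ipc_a + ipc_b > 0
    share_a = round(total_slots * ipc_a / (ipc_a + ipc_b))
    return share_a, total_slots - share_a

# With an assumed 2048 thread slots per SM, a kernel running at 3x the
# co-runner's IPC receives 3/4 of the slots.
print(partition_thread_slots(2048, ipc_a=1.5, ipc_b=0.5))  # (1536, 512)
```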