2019
DOI: 10.1109/tc.2018.2878671
Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs

Cited by 9 publications (3 citation statements)
References 23 publications
“…However, for GPGPU applications with irregular memory references, such as BFS and DG in the ISPASS 2009 benchmark suite, the accesses of the threads in a warp can hardly be coalesced. Such uncoalesced memory requests often lead to memory divergence: some threads in a warp experience low latency due to cache hits, while others must endure much longer latency due to cache misses [14,16,22], as shown in Fig. 1.…”
Section: Motivation (mentioning)
confidence: 99%
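The memory-divergence behavior quoted above can be sketched with a toy model. All constants (line size, hit/miss latencies) and the single-set cache abstraction are illustrative assumptions, not measured GPU parameters:

```python
# Toy model of memory divergence in a 32-thread warp (illustrative only;
# real GPU coalescing and cache behavior are far more complex).
CACHE_LINE = 128              # assumed bytes per cache line
WORD = 4                      # bytes per element
HIT_LAT, MISS_LAT = 20, 400   # assumed latencies in cycles

def warp_latency(addresses, cached_lines):
    """A warp finishes only when its slowest thread finishes."""
    lats = [HIT_LAT if a // CACHE_LINE in cached_lines else MISS_LAT
            for a in addresses]
    return max(lats), lats

# Coalesced: 32 consecutive words all fall into one cache line.
coalesced = [i * WORD for i in range(32)]
# Irregular (BFS-like): scattered addresses touch many distinct lines.
irregular = [i * 1024 for i in range(32)]

cached = {0}  # assume only cache line 0 is resident
print(warp_latency(coalesced, cached)[0])  # 20: every thread hits
print(warp_latency(irregular, cached)[0])  # 400: one miss stalls the warp
```

Even though 31 threads of the irregular warp could in principle proceed, the lockstep warp stalls at the latency of its slowest thread, which is exactly the divergence the citing papers describe.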
“…For example, the computing efficiency of GPUs may plummet. Such access behaviors do not match the design of the GPU on-chip storage hierarchy, which negates the advantages of GPU architectures and greatly degrades performance [14][15][16].…”
Section: Introduction (mentioning)
confidence: 99%
“…In the literature [8], the authors dynamically choose an LRR or GTO scheduling policy suited to a task based on the locality of the task's load. Oh et al. [9] proposed the adaptive prefetching and scheduling policy ARPES (Adaptive Prefetching and Scheduling), which groups the warps executing the same memory instruction and prioritizes that group's execution. Rogers et al. [10] prioritized warps by the degree of data locality within each warp and proposed CCWS (Cache-Conscious Wavefront Scheduling), a cache-aware warp-scheduling algorithm that tracks evictions from the L1 data cache, adjusts the number of active warps in time, and reduces cache contention to preserve access locality.…”
Section: Introduction (mentioning)
confidence: 99%
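The two baseline scheduling policies surveyed above can be sketched as warp-selection functions. This is a minimal illustrative model under assumed conventions (warps identified by integer ids, "oldest" approximated as lowest id), not the cited papers' implementations:

```python
# Minimal sketch of two warp-scheduling policies: loose round-robin (LRR)
# and greedy-then-oldest (GTO). Illustrative only.

def lrr_pick(ready_warps, last_issued):
    """LRR: rotate to the next ready warp after the last issuer."""
    if last_issued in ready_warps:
        start = (ready_warps.index(last_issued) + 1) % len(ready_warps)
        return ready_warps[start]
    return ready_warps[0]

def gto_pick(ready_warps, last_issued):
    """GTO: keep issuing the same warp while it stays ready; otherwise
    fall back to the oldest ready warp (lowest id in this sketch)."""
    if last_issued in ready_warps:
        return last_issued
    return min(ready_warps)

ready = [2, 0, 3]
print(lrr_pick(ready, 2))  # 0: next warp after warp 2
print(gto_pick(ready, 2))  # 2: greedy keeps issuing the current warp
print(gto_pick(ready, 1))  # 0: warp 1 stalled, so pick the oldest
```

GTO tends to preserve intra-warp cache locality by sticking with one warp, while LRR spreads issue slots evenly, which is why schemes like [8] switch between them based on a task's measured locality.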