Reshaping cache misses to improve row-buffer locality in multicore systems

Kayıran, Onur; Jog, Adwait; Kandemir, Mahmut; Das, Chita R.

doi:10.1109/pact.2013.6618813

Cited by 16 publications

(3 citation statements)

References 29 publications

(38 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Prior studies on the impact of TLP on throughput-oriented processors, including References [14,18,27], are discussed in Section 4.…”

Section: Related Workmentioning

confidence: 99%

“…The inclusion of multiple levels of caches complicates the relationship between TLP and the overall performance. As studied in prior works [14,18,27], high degrees of TLP cause the cache to suffer from the contention problem, which may lead to performance degradation. For example, Figure 1 shows the impact of cache thrashing with various degrees of TLP.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

GPU Performance vs. Thread-Level Parallelism

Lin

Mantor

Zhou

2018

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

show abstract

“…Prior studies on the impact of TLP on throughput-oriented processors, including References [14,18,27], are discussed in Section 4.…”

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

GPU Performance vs. Thread-Level Parallelism

Lin

Mantor

Zhou

2018

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

show abstract

“…Chen et al [18] proposed a novel warp scheduling algorithm that flexibly uses the time slice round-robin feature to utilize GPU parallelism. Jog et al [19] and Kayiran et al [20] proposed CTA-aware warp scheduling algorithms to reduce cache and memory contention or improve thread-level parallelism. Rogers et al [4] analyzed how hardware scheduler influences the management to GPU cache and proposed a cache sensitive warp scheduling policy.…”

Section: Related Workmentioning

confidence: 99%

CaLRS: A Critical‐Aware Shared LLC Request Scheduling Algorithm on GPGPU

Meng

Chen

et al. 2015

The Scientific World Journal

View full text Add to dashboard Cite

Ultra high thread-level parallelism in modern GPUs usually introduces numerous memory requests simultaneously. So there are always plenty of memory requests waiting at each bank of the shared LLC (L2 in this paper) and global memory. For global memory, various schedulers have already been developed to adjust the request sequence. But we find few work has ever focused on the service sequence on the shared LLC. We measured that a big number of GPU applications always queue at LLC bank for services, which provide opportunity to optimize the service order on LLC. Through adjusting the GPU memory request service order, we can improve the schedulability of SM. So we proposed a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. The priority representative of memory request is critical for CaLRS. We use the number of memory requests that originate from the same warp but have not been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can boost the SM schedulability effectively by promoting the scheduling priority of the memory requests with high criticality and improves the performance of GPU indirectly.

show abstract

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Yang

Zhou

2015

J. Comput. Sci. Technol.

View full text Add to dashboard Cite

Parallel programs consist of series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs.In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resource. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and 2.18 times on average.

show abstract

Reshaping cache misses to improve row-buffer locality in multicore systems

Cited by 16 publications

References 29 publications

GPU Performance vs. Thread-Level Parallelism

GPU Performance vs. Thread-Level Parallelism

CaLRS: A Critical‐Aware Shared LLC Request Scheduling Algorithm on GPGPU

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Contact Info

Product

Resources

About