An Alternative Memory Access Scheduling in Manycore Accelerators

Kim, Yonggon; Lee, Hyunseok; Kim, John

doi:10.1109/pact.2011.37

Cited by 6 publications

(6 citation statements)

References 2 publications


“…We compare our new instructions to two related works: cache-conscious wavefront scheduling (CCWS) (Rogers et al 2012) and alternative memory access scheduling which batch requests which map to the same DRAM row (BATCH) (Kim et al 2011;Yuan et al 2009). Though CCWS and BATCH do not affect the dynamic instruction stream of the application, both techniques can reduce memory request interference, similar to our new instructions.…”

Section: Comparison To Related Work (mentioning)

confidence: 92%

“…Other related DRAM memory scheduling research propose methods to better schedule and prioritize requests from the SMs in order to avoid the effects of memory request interference and better exploit DRAM row buffer locality. One prior work suggests batching an SM's L1 cache miss requests by DRAM row into network packets (Kim et al 2011). Another work focuses on exposing and prioritizing row buffer hits in the on-chip network (Yuan et al 2009).…”

Section: Related Work (mentioning)

confidence: 99%
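The row-batching idea mentioned in the statement above (grouping one core's L1 miss requests by DRAM row before they enter the network) can be illustrated with a minimal sketch. This is not the cited paper's implementation: the address-to-row mapping (`ROW_SHIFT`), the function names, and the decision to ignore banks and channels are all assumptions made for illustration.

```python
from collections import OrderedDict

# Hypothetical address layout: each DRAM row covers 2 KiB of address space.
# Real mappings also interleave banks and channels; that is ignored here.
ROW_SHIFT = 11

def dram_row(addr: int) -> int:
    """Return the (assumed) DRAM row index for a physical address."""
    return addr >> ROW_SHIFT

def batch_by_row(miss_addrs):
    """Group miss addresses that map to the same DRAM row.

    Rows are emitted in the order they were first seen, and addresses
    inside a batch keep their original relative order.
    """
    batches = OrderedDict()
    for addr in miss_addrs:
        batches.setdefault(dram_row(addr), []).append(addr)
    return list(batches.values())

if __name__ == "__main__":
    misses = [0x0000, 0x1000, 0x0040, 0x1040, 0x0080]
    for batch in batch_by_row(misses):
        print([hex(a) for a in batch])
```

Each returned batch stands in for one multi-request packet, so requests that hit the same row arrive at the memory controller back to back instead of interleaved with requests to other rows.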


Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs

Crago

Stephenson

Keckler

2018

ACM Trans. Archit. Code Optim.


Modern computing workloads often have high memory intensity, requiring high bandwidth access to memory. The memory request patterns of these workloads vary and include regular strided accesses and indirect (pointer-based) accesses. Such applications require a large number of address generation instructions and a high degree of memory-level parallelism. This article proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address calculation instructions by offloading addressing to dedicated hardware, and reduce destructive memory request interference by grouping related requests together. Our results show that we can eliminate 33% of dynamic instructions across 16 GPU benchmarks. These improvements result in an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%. CCS Concepts: • Computer systems organization → Parallel architectures;


“…Another issue is the head-of-line problem in the memory request queue that unavoidably arises when memory requests to the same rows are grouped together [10,12]. Modern DRAM chips are usually organized into banks, and memory requests to different banks can be serviced concurrently.…”

Section: Design Issues (mentioning)

confidence: 99%

“…Thus, they proposed a NoC arbitration scheme called Hold Grant to preserve the row buffer access locality of memory request streams. In [10], the idea of superpackets is proposed for the shader core to maintain row buffer locality for the memory requests out of the core. While these works focus on maintaining the row buffer locality from a single shader core, our work exploit the coalescing opportunity across the cores inside the NoC.…”

Section: Related Work (mentioning)

confidence: 99%

Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs

Chen

Huang

Chang

et al. 2014

Advanced Information Systems Engineering


The massive multithreading architecture of General Purpose Graphic Processors Units (GPGPU) makes them ideal for data parallel computing. However, designing efficient GPGPU chips poses many challenges. One major hurdle is the interface to the external DRAM, particularly the buffers in the memory controllers (MCs), which is stressed heavily by the many concurrent memory accesses from the GPGPU. Previous approaches considered scheduling the memory requests in the memory buffers to reduce switching of memory rows. The problem is that the window of requests that can be considered for scheduling is too narrow and the memory controller is very complex, affecting the critical path. In view of the massive multithreading architecture of GPGPUs that can hide memory access latencies, we exploit in this paper the novel idea of rearranging the memory requests in the network-on-chip (NoC), called packet coalescing. To study the feasibility of this idea, we have designed an expanded NoC router that supports packet coalescing and evaluated its performance extensively. Evaluation results show that this NoC-assisted design strategy can improve the row buffer hit rate in the memory controllers. A comprehensive investigation of factors affecting the performance of coalescing is also conducted and reported.


“…• stream-specific or locality-aware arbitration within GPU, as suggested in [15] [10], -this provides marginal benefit since there are multiple arbitration points for different streams and processing elements in the internal interconnection network. Maintaining locality when requests get merged at various locations before they reach the memory is challenging with internal-to-GPU arbitration mechanisms.…”

Section: Introduction (mentioning)

confidence: 99%

MARS: Memory Aware Reordered Source

Bhati

Dhawan

Gaur

et al. 2018

Preprint


Memory bandwidth is critical in today's high performance computing systems. The bandwidth is particularly paramount for GPU workloads such as 3D Gaming, Imaging and Perceptual Computing, GPGPU due to their data-intensive nature. As the number of threads and data streams in the GPUs increases with each generation, along with a high available memory bandwidth, memory efficiency is also crucial in order to achieve desired performance. In presence of multiple concurrent data streams, the inherent locality in a single data stream is often lost as these streams are interleaved while moving through multiple levels of memory system. In DRAM based main memory, the poor request locality reduces rowbuffer reuse resulting in underutilized and inefficient memory bandwidth. In this paper we propose Memory-Aware Reordered Source (MARS) architecture to address memory inefficiency arising from highly interleaved data streams. The key idea of MARS is that with a sufficiently large lookahead before the main memory, data streams can be reordered based on their rowbuffer address to regain the lost locality and improve memory efficiency. We show that MARS improves achieved memory bandwidth by 11% for a set of synthetic microbenchmarks. Moreover, MARS does so without any specific knowledge of the memory configuration.
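The abstract's key idea, reordering an interleaved request stream by row address within a lookahead window, can be sketched roughly as follows. The abstract does not describe the actual MARS hardware, so this is only a simplified model under stated assumptions: requests are represented by their row index alone, the lookahead is a fixed non-overlapping window, and both function names are hypothetical.

```python
def count_row_switches(rows):
    """Count how often consecutive requests target different rows."""
    return sum(1 for prev, cur in zip(rows, rows[1:]) if prev != cur)

def reorder_with_lookahead(rows, lookahead):
    """Regroup requests by row inside fixed windows of `lookahead` requests.

    Simplified model (an assumption, not the MARS design): within each
    window, requests to the same row are made adjacent, and rows are
    emitted in the order they first appear in that window.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    out = []
    for start in range(0, len(rows), lookahead):
        window = rows[start:start + lookahead]
        first_seen = {}
        for i, row in enumerate(window):
            first_seen.setdefault(row, i)
        # sorted() is stable, so requests to one row keep their order.
        out.extend(sorted(window, key=lambda row: first_seen[row]))
    return out

if __name__ == "__main__":
    stream = [1, 2, 1, 2, 1, 2, 1, 2]  # two interleaved streams
    reordered = reorder_with_lookahead(stream, lookahead=4)
    print(count_row_switches(stream), count_row_switches(reordered))
```

In this toy model a larger window removes more row switches, which matches the abstract's point that the lookahead must be sufficiently large for the lost locality to be regained.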
