Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
DOI: 10.1145/2967938.2967940

Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

Cited by 149 publications (98 citation statements)
References 90 publications
“…The logic considered varies in type, e.g., simple in-order cores [4], [29], [32], [37], [51], [52], [54], [55], [58], graphics processing units [34], [38], [46], [48], field-programmable gate arrays [41], [43], [49], and application-specific accelerators [30], [39], [40], [45], [47], [50], [53], [56]. The majority of the NMC proposals are targeted towards different types of data processing applications, e.g.…”
Section: Discussion
confidence: 99%
“…Each operation is issued by the main application running on the host and served by a control program loaded by the OS on each DRE engine. Similar to [48], the authors propose to invalidate the CPU caches after each fill and drain operation to maintain memory consistency between the near-memory processors and the main CPU. As pointed out earlier, this approach can introduce a significant overhead.…”
Section: Reconfigurable Unit
confidence: 99%
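The coarse-grained consistency scheme described in this citation can be sketched in a few lines: after each fill (host to near-memory) or drain (near-memory to host) operation, the host invalidates its cached copies of the touched buffer, so later CPU reads refetch from DRAM. This is a minimal toy model, assuming a simple address-keyed cache; the names (`Cache`, `pim_drain`) are illustrative, not any real API.

```python
class Cache:
    """Toy CPU cache: maps addresses to cached values."""
    def __init__(self):
        self.lines = {}

    def read(self, mem, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = mem[addr]
        return self.lines[addr]               # hit: may return a stale copy

    def invalidate_range(self, start, length):
        # Drop every cached line in [start, start + length).
        for addr in list(self.lines):
            if start <= addr < start + length:
                del self.lines[addr]


def pim_drain(mem, cache, start, length, kernel):
    """Run a near-memory kernel that updates DRAM directly, then
    invalidate the CPU cache over the touched range -- the per-operation
    overhead the citing paper points out."""
    for addr in range(start, start + length):
        mem[addr] = kernel(mem[addr])         # PIM writes bypass the CPU cache
    cache.invalidate_range(start, length)     # keep CPU and PIM consistent


mem = {i: i for i in range(8)}
cache = Cache()
stale = [cache.read(mem, a) for a in range(8)]   # warm the cache
pim_drain(mem, cache, 0, 8, lambda v: v * 10)
fresh = [cache.read(mem, a) for a in range(8)]   # refetch after invalidation
print(fresh)   # reflects the PIM update, not the stale cached copies
```

Without the `invalidate_range` call, the second round of reads would return the warmed (stale) values, which is exactly the inconsistency the invalidation is meant to prevent.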
“…These cores promise a 5× increase in Deep Learning performance compared to previous GPU generations [38]. Furthermore, GPUs have been shown to be amenable to near-memory processing as well [39].…”
Section: GPUs
confidence: 99%
“…The proposed method by Pattnaik et al. [34] also employs some metrics for the purpose of kernel classification into either the GPU-PIM or the GPU-PIC class, where PIM and PIC stand for processing-in-memory and processing-in-core, respectively. However, the four static metrics used in [34] relate to the underlying hardware and so are subject to change from hardware to hardware. Therefore, one hardware platform may require reading three metrics in category I, while another may require five.…”
Section: Related Work
confidence: 99%
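The classification idea this citation attributes to Pattnaik et al. [34] can be sketched as a threshold test over static kernel metrics, where the cutoffs are hardware-specific, which is the citing paper's objection. The metric names, the two-metric rule, and the threshold values below are illustrative assumptions, not the paper's actual metrics or numbers.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    mem_intensity: float      # memory instructions / total instructions
    compute_intensity: float  # arithmetic instructions / total instructions

@dataclass
class HardwareThresholds:
    """Per-platform cutoffs; the citing paper's point is that these
    must be re-derived for every target hardware."""
    mem_cutoff: float = 0.4
    compute_cutoff: float = 0.5

def classify(profile: KernelProfile, hw: HardwareThresholds) -> str:
    """Steer memory-bound kernels to near-memory units (GPU-PIM),
    everything else to the main GPU cores (GPU-PIC)."""
    if (profile.mem_intensity >= hw.mem_cutoff
            and profile.compute_intensity < hw.compute_cutoff):
        return "GPU-PIM"
    return "GPU-PIC"

streaming = KernelProfile(mem_intensity=0.6, compute_intensity=0.2)
dense_math = KernelProfile(mem_intensity=0.1, compute_intensity=0.8)
hw = HardwareThresholds()
print(classify(streaming, hw))    # GPU-PIM
print(classify(dense_math, hw))   # GPU-PIC
```

Keeping the thresholds in a separate `HardwareThresholds` object makes the hardware dependence explicit: porting the classifier to a new platform means re-tuning that object, not rewriting the decision rule.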