Proceedings of the 42nd Annual International Symposium on Computer Architecture 2015
DOI: 10.1145/2749469.2750399
A case for core-assisted bottleneck acceleration in GPUs

Abstract: Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Accel…

Cited by 76 publications (5 citation statements) · References 86 publications (127 reference statements)
“…After doubling the off-chip bandwidth, no application remains bandwidth limited, and therefore, increasing the off-chip bandwidth to 4× and 8× has little effect on performance. It may be possible to achieve the 2× extra bandwidth by using data compression [37] with little change to the architecture of existing GPUs. While technologies like 3D DRAM that offer significantly more bandwidth (and lower access latency) can be useful, they are not necessary for providing the off-chip bandwidth requirements of NGPU for the range of applications that we studied.…”
Section: Results
confidence: 99%
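The excerpt above argues that data compression can stand in for 2× more physical off-chip bandwidth. As a rough illustration (a toy base+delta scheme written for this report, not the actual compression algorithm of reference [37]), the sketch below compresses a 32-byte cache line of similar values into a base word plus one-byte deltas and computes the resulting effective-bandwidth multiplier:

```python
import struct

def compress_line(words):
    """Toy base+delta compression: store the first 4-byte word as a base
    and every word as a 1-byte signed delta from it, if all deltas fit."""
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return struct.pack("<I", base) + bytes(d & 0xFF for d in deltas)
    return None  # incompressible: caller falls back to the raw line

def decompress_line(blob, n_words):
    """Invert compress_line: recover each word from base + signed delta."""
    base = struct.unpack_from("<I", blob)[0]
    deltas = blob[4:4 + n_words]
    return [(base + (d - 256 if d >= 128 else d)) & 0xFFFFFFFF
            for d in deltas]

# One 8-word (32-byte) cache line of values clustered around 1000.
line = [1000, 1001, 1003, 1000, 999, 1007, 1002, 1005]
blob = compress_line(line)
assert decompress_line(blob, len(line)) == line

raw_bytes = 4 * len(line)            # 32 bytes uncompressed
ratio = raw_bytes / len(blob)        # bytes saved per transfer
print(f"compressed {raw_bytes} -> {len(blob)} bytes, "
      f"effective bandwidth x{ratio:.2f}")
```

With a 2.67× ratio on lines like this one, a bandwidth-bound kernel sees proportionally more useful data per unit of physical bandwidth, which is the effect the citing authors estimate would remove their remaining bandwidth bottleneck.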
“…The warped-compression architecture also supports compressed execution, that is, some instructions are processed without decompressing the operand values, to further save energy. Vijaykumar et al [46] propose a core-assisted bottleneck acceleration (CABA) framework for GPUs, in which assist warps are automatically generated to perform specific tasks that speed up application execution. Instead of a hardware-based implementation, CABA uses assist warps to enable flexible data compression in the memory hierarchy.…”
Section: Related Work
confidence: 99%
“…• GPU Partitioning: Although a GPU is viewed as a single accelerator device by application tasks, it consists of many GPU cores that execute a given parallel workload in an aggregate manner. Hence, depending on the characteristics of workloads, only some of the GPU cores may be utilized [179]. To address this GPU underutilization problem, some recent GPU architectures, e.g., NVIDIA Kepler [180], introduce a feature to execute multiple GPU functions concurrently.…”
Section: Architecture Support For Computational Accelerators
confidence: 99%
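The underutilization argument in the excerpt above is a scheduling one: if each workload occupies only part of the device, running them concurrently recovers the idle capacity. A minimal sketch of that effect in plain Python (standing in for concurrent kernel launches; the waiting here emulates a kernel that leaves most of the device idle, and is not a GPU API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel(name, duration):
    """Stand-in for a GPU function that occupies only part of the device:
    it mostly waits, leaving capacity for another kernel to run alongside."""
    time.sleep(duration)
    return name

# Serial execution: the second kernel waits for the first to finish.
start = time.perf_counter()
serial = [kernel("A", 0.2), kernel("B", 0.2)]
serial_time = time.perf_counter() - start

# Concurrent execution: both kernels occupy the device at once,
# analogous to the concurrent-kernel feature described above.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(kernel, "A", 0.2), pool.submit(kernel, "B", 0.2)]
    concurrent = [f.result() for f in futures]
concurrent_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Serial execution takes roughly the sum of the two durations, while the concurrent run takes roughly the maximum, which is the utilization gain the concurrent-kernel feature targets.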