2013
DOI: 10.1145/2508148.2485952
An energy-efficient and scalable eDRAM-based register file architecture for GPGPU

Abstract: The heavily threaded data-processing demands of the streaming multiprocessors (SMs) in a GPGPU require a large register file (RF). The rapidly increasing size of the RF makes its area cost and power consumption unaffordable for traditional SRAM designs in future technologies. In this paper, we propose to use embedded DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time of eDRAM poses new challenges. Per…
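The trade-off the abstract points to (lower leakage in eDRAM, but a refresh obligation bounded by its data retention time) can be sketched with a toy energy model. The Python snippet below is purely illustrative; every parameter value and function name is an assumption for the sketch, not a figure or method from the paper.

```python
# Back-of-the-envelope sketch (not from the paper) of the SRAM-vs-eDRAM trade-off:
# eDRAM trades SRAM's leakage power for periodic refresh cost, and every cell must
# be refreshed at least once per retention period. All numbers are illustrative.

def rf_energy_sram(capacity_kb, leak_nw_per_kb, seconds):
    """Static (leakage) energy of an SRAM register file over an interval, in joules."""
    return capacity_kb * leak_nw_per_kb * 1e-9 * seconds

def rf_energy_edram(capacity_kb, leak_nw_per_kb, refresh_nj_per_kb,
                    retention_ms, seconds):
    """Leakage plus refresh energy of an eDRAM register file, in joules."""
    leakage = capacity_kb * leak_nw_per_kb * 1e-9 * seconds
    refreshes = seconds / (retention_ms * 1e-3)      # refresh rounds in the interval
    refresh = capacity_kb * refresh_nj_per_kb * 1e-9 * refreshes
    return leakage + refresh

if __name__ == "__main__":
    # Hypothetical 256 KB per-SM register file observed over 1 ms of execution.
    sram = rf_energy_sram(256, leak_nw_per_kb=50.0, seconds=1e-3)
    edram = rf_energy_edram(256, leak_nw_per_kb=5.0, refresh_nj_per_kb=2.0,
                            retention_ms=0.05, seconds=1e-3)
    print(f"SRAM  RF energy: {sram:.3e} J")
    print(f"eDRAM RF energy: {edram:.3e} J")
```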

Cited by 17 publications (6 citation statements) | References 26 publications

Citation statements (ordered by relevance):
“…GPU architecture. Other related work optimizes various aspects of the GPU architecture, e.g., warp scheduling [33], [39], [43], [56], [66], L1 cache management [31], [59], [65], [68], register file design [3], [30], [32], NoC optimization [10], [35], [73], [77], [78], and SM resource virtualization [64], [72]. Recent work also provides approaches for efficient multitasking in GPUs [4], [52], [53], [62], [67], [71], [76], virtual memory management [9], and design considerations for multi-module GPUs [8], [45].…”
Section: Related Work
confidence: 99%
“…Therefore, the RAT can be used not only to determine whether a register is allocated in the register file or in scratchpad memory, but also to calculate the register's address in scratchpad memory. When a scratchpad memory region is allocated for a CTA (CTA_ID), the SBR value of that region is calculated using Equation (11), where S is the capacity of the scratchpad memory:…”
Section: Register Allocation
confidence: 99%
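As a rough illustration of the mechanism described in the excerpt above, the sketch below models a register allocation table (RAT) that records, per register, whether it lives in the register file or in scratchpad memory, plus a per-CTA scratchpad base register (SBR). Equation (11) is not reproduced in the excerpt, so the SBR computation here is a hypothetical placeholder; all class names, fields, and sizes are assumptions for illustration only.

```python
# Illustrative sketch of a RAT that decides register-file vs. scratchpad placement
# and derives scratchpad addresses. The SBR formula is a hypothetical stand-in for
# Equation (11), which is not shown in the excerpt.
from dataclasses import dataclass

@dataclass
class RatEntry:
    in_scratchpad: bool   # True -> the register was placed in scratchpad memory
    offset: int           # word offset within the CTA's scratchpad region

class RegisterAllocationTable:
    def __init__(self, scratchpad_capacity, region_size_per_cta):
        self.S = scratchpad_capacity        # total scratchpad capacity S (words)
        self.region = region_size_per_cta   # per-CTA region size (words), assumed fixed
        self.entries = {}                   # (cta_id, reg_id) -> RatEntry
        self.sbr = {}                       # cta_id -> scratchpad base register

    def allocate_region(self, cta_id):
        # Hypothetical stand-in for Equation (11): place CTA regions back to back,
        # wrapping within the scratchpad capacity S.
        self.sbr[cta_id] = (cta_id * self.region) % self.S

    def map_register(self, cta_id, reg_id, in_scratchpad, offset=0):
        self.entries[(cta_id, reg_id)] = RatEntry(in_scratchpad, offset)

    def resolve(self, cta_id, reg_id):
        """Return ('RF', reg_id) or ('SCRATCHPAD', absolute word address)."""
        e = self.entries[(cta_id, reg_id)]
        if not e.in_scratchpad:
            return ("RF", reg_id)
        return ("SCRATCHPAD", self.sbr[cta_id] + e.offset)

# Example: CTA 2 keeps r0 in the RF and places r5 in its scratchpad region.
rat = RegisterAllocationTable(scratchpad_capacity=4096, region_size_per_cta=256)
rat.allocate_region(cta_id=2)
rat.map_register(2, 0, in_scratchpad=False)
rat.map_register(2, 5, in_scratchpad=True, offset=16)
print(rat.resolve(2, 0))   # ('RF', 0)
print(rat.resolve(2, 5))   # ('SCRATCHPAD', 528)
```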
“…Although Gebhart et al. [7] propose a unified on-chip memory structure in which the capacity of the register file, scratchpad memory, and L1 cache can be partitioned at runtime, in a fine-grained way, according to application requirements, there are still two shortcomings. First, the unified structure lacks flexibility: the register file is one of the main contributors to GPU energy consumption, and various power-saving techniques [11,14,23,32-34] have been proposed for it, which can be hard to apply to the unified structure because the register file and the L1 cache have different access characteristics. Second, the unified structure increases bank conflicts among the register file, scratchpad memory, and L1 cache; the authors use a software-managed hierarchical register file [6] to reduce the bandwidth required of the main register file, but that technique focuses on energy efficiency and may lead to resource underutilization and suboptimal performance [29,35].…”
Section: Evaluation For Advanced Architecture
confidence: 99%
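To make the runtime-partitioning idea in the excerpt above concrete, the following toy sketch splits a fixed unified on-chip SRAM budget among register file, scratchpad memory, and L1 cache based on a kernel's declared needs. The budget, bank granularity, and leftover-to-L1 policy are assumptions for illustration only; they do not reproduce Gebhart et al.'s design.

```python
# Toy sketch of runtime partitioning of a unified on-chip memory budget among the
# register file, scratchpad memory, and L1 cache. The 64 KB budget, 8 KB banks, and
# leftover-to-L1 policy are illustrative assumptions, not the cited scheme.

BANK_KB = 8          # partitioning granularity (assumed)
TOTAL_KB = 64        # unified on-chip memory per SM (assumed)

def round_up_to_bank(kb):
    return ((kb + BANK_KB - 1) // BANK_KB) * BANK_KB

def partition(regfile_need_kb, scratchpad_need_kb):
    """Return (rf_kb, scratchpad_kb, l1_kb); leftover capacity goes to the L1 cache."""
    rf = round_up_to_bank(regfile_need_kb)
    sp = round_up_to_bank(scratchpad_need_kb)
    if rf + sp > TOTAL_KB:
        raise ValueError("kernel requirements exceed the unified memory budget")
    return rf, sp, TOTAL_KB - rf - sp

# Example: a kernel needing 28 KB of registers and 12 KB of scratchpad memory.
print(partition(28, 12))   # (32, 16, 16)
```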
“…(1) DVFS (dynamic voltage/frequency scaling)-based techniques [Jiao et al. 2010; Lee et al. 2011; Ma et al. 2012; Cebrian et al. 2012; Sheaffer et al. 2005b; Chang et al. 2008; Ren 2011; Anzt et al. 2011; Ren et al. 2012; Zhao et al. 2012; Huo et al. 2012; Keller and Gruber 2010; Abe et al. 2012; Park et al. 2006; Paul et al. 2013]; (2) CPU-GPU workload division-based techniques [Takizawa et al. 2008; Rofouei et al. 2008; Ma et al. 2012; Hamano et al. 2009] and GPU workload consolidation; (3) architectural techniques for saving energy in specific GPU components, such as caches [Lee et al. 2011; Lashgar et al. 2013; Arnau et al. 2012; Rogers et al. 2013; Lee and Kim 2012], global memory [Wang et al. 2013; Rhu et al. 2013], pixel shader [Pool et al. 2011], vertex shader [Pool et al. 2008], and the core data path, registers, pipeline, and thread scheduling [Chu et al. 2011; Gebhart et al. 2011; Jing et al. 2013]. We now discuss these techniques in detail. As seen through the previous classification, several techniques can be classified into more than one group.…”
Section: Overview
confidence: 99%