2015
DOI: 10.1145/2764905
Architecting the Last-Level Cache for GPUs using STT-RAM Technology

Abstract: Future GPUs should have larger L2 caches, based on current trends in VLSI technology and the shift of GPU architectures toward higher processing core counts. Larger L2 caches inevitably consume proportionally more power. In this article, having investigated the behavior of GPGPU applications, we present an efficient L2 cache architecture for GPUs based on STT-RAM technology. Due to its high-density and low-power characteristics, STT-RAM technology can be utilized in GPUs where numerous cores leave a limit…

Authors

Journals

Cited by 12 publications (8 citation statements)
References 47 publications (44 reference statements)
“…Our ILP model tries to place existing data blocks during different time frames in the proper positions in the hybrid cache. At different time frames, our scheme decides how the data blocks locate in the hybrid cache. At each time frame, for each memory block, the number of read and write operations is considered as the problem inputs.…” [Table residue from the citing paper: hybrid-cache schemes [7,11,17,19,21,22,24,25,27,28]; reducing data migration [12,18,29]; using compiler [12,18,24,29]; using prediction [11,17,25,27,28]]
Section: Block Placement Model
confidence: 99%
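The placement decision this statement describes — assigning each block to SRAM or STT-RAM ways based on its per-frame read/write counts — can be sketched with a simple greedy heuristic. The cited work solves an ILP; this greedy version, with invented per-access costs, is only an approximation for intuition:

```python
# Greedy sketch of hybrid-cache block placement (illustrative only;
# the cited scheme uses an ILP formulation, and these cost numbers
# are invented, not taken from the paper).

# Per-access cost (arbitrary units, assumed for illustration):
# STT-RAM has cheap reads but expensive writes; SRAM is balanced.
COST = {
    "sram":   {"read": 2.0, "write": 2.0},
    "sttram": {"read": 1.0, "write": 6.0},
}

def place_blocks(access_counts, sram_ways):
    """access_counts: {block: (reads, writes)} for one time frame.
    Blocks whose traffic benefits most from SRAM (i.e. the most
    write-dominated ones) get the few SRAM ways; the rest go to
    STT-RAM, where reads are cheap."""
    def sram_benefit(item):
        _, (r, w) = item
        stt = r * COST["sttram"]["read"] + w * COST["sttram"]["write"]
        sram = r * COST["sram"]["read"] + w * COST["sram"]["write"]
        return stt - sram  # cost saved by moving the block to SRAM
    ranked = sorted(access_counts.items(), key=sram_benefit, reverse=True)
    return {block: ("sram" if i < sram_ways else "sttram")
            for i, (block, _) in enumerate(ranked)}

frame = {"A": (100, 80), "B": (200, 5), "C": (50, 60)}
print(place_blocks(frame, sram_ways=1))
```

Here block A is the most write-heavy per unit of traffic, so it wins the single SRAM way; the read-dominated block B stays in STT-RAM.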
“…EDP cost of normal read/write operations and EDP cost of migration operations. The following equation represents the EDP cost of normal read/write operations during the entire time frames: (see (21)) . The cost of EDP during the entire migration operations is computed as the following equation:…”
Section: Minimising EDP
confidence: 99%
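The total EDP objective this statement describes — the EDP of normal read/write operations over all time frames plus the EDP of migrations — can be sketched as a simple sum. A minimal sketch, with invented per-operation energy and latency values (the cited paper's actual equation (21) is not reproduced here):

```python
# Sketch of the EDP objective described above: total cost = EDP of
# normal read/write operations across all time frames + EDP of
# migration operations. Energy/latency values are invented.

E = {"read": 0.5, "write": 2.0, "migrate": 3.0}   # energy per op (assumed units)
T = {"read": 1.0, "write": 5.0, "migrate": 8.0}   # latency per op (assumed units)

def edp(op, count):
    # EDP contribution of `count` operations of one type.
    return count * E[op] * T[op]

def total_edp(frames):
    """frames: list of per-frame op counts, e.g.
    {'read': n, 'write': n, 'migrate': n}."""
    normal = sum(edp("read", f["read"]) + edp("write", f["write"])
                 for f in frames)
    migration = sum(edp("migrate", f["migrate"]) for f in frames)
    return normal + migration

frames = [{"read": 100, "write": 20, "migrate": 2},
          {"read": 80,  "write": 40, "migrate": 1}]
print(total_edp(frames))  # 762.0
```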
“…It is important that improving the read/write latency does not come at the cost of increasing write energy. Moreover, write energy relates to cell wearout; that is, increasing the write energy leads to decreasing PCM lifetime [45]. Figure 16 shows the write energy of RWR, WT, RWR+FPC, and WT+FPC methods normalized to 2-bit MLC PCM baseline.…”
Section: Write Energy
confidence: 99%
“…Reducing write energy corresponds to enhancing PCM lifetime [4,45,60]. We evaluated the effect of our proposed scheme and other implemented methods on memory lifetime.…”
Section: Wearout
confidence: 99%
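The relation cited here — lower write energy corresponds to longer PCM lifetime — is often approximated by lifetime scaling inversely with write stress. A rough sketch under that inverse-proportionality assumption (a simplification; real PCM endurance also depends on programming current, pulse shape, and wear leveling):

```python
# Rough sketch of the "lower write energy -> longer lifetime" relation
# cited above, assuming lifetime scales inversely with write energy.
# This model and the example numbers are assumptions for illustration.

def normalized_lifetime(write_energy, baseline_energy):
    """Lifetime of a scheme relative to the baseline, under the
    inverse-proportionality assumption."""
    return baseline_energy / write_energy

# Invented example: a scheme that cuts write energy by 20%
print(normalized_lifetime(write_energy=0.8, baseline_energy=1.0))  # 1.25
```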
“…For NVIDIA Pascal [57], more than 60% of the on-chip storage area, amounting to 14.3 MB is dedicated to the register file. GPU register files face the difficult challenge of optimizing latency, bandwidth, and power consumption, while having maximal capacity [2,19,20,23,25,27,28,39,43,45,46,48,65,66,78,79,80]. Larger register files are slower, take up more silicon area and consume more power.…”
Section: Register File Scalability
confidence: 99%