2015 International Conference on Parallel Architecture and Compilation (PACT)
DOI: 10.1109/pact.2015.38

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Cited by 73 publications (53 citation statements)
References 60 publications
“…In a block group, the metadata block stores the sequence ID (SID), which is the unique number in the memory log area to represent a block group, and the metadata (BLK-1. Note that memory controllers are becoming increasingly more intelligent and complex to deal with various scheduling and performance management issues in multi-core and heterogeneous systems (e.g., [5], [6], [7], [8], [11], [12], [13], [14], [21], [25], [26], [27], [32], [33], [34], [35], [38], [39], [42], [45], [46], [49], [50], [51], [52], [53], [54], [61], [62], [64], [65], [66], [67], [68], [81], [84], [85], [86], [87], [88], [89], [97], [98], [108], [110], [112], [113],…”
Section: Eager Commitmentioning
confidence: 99%
“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,[33][34][35][36][37][38][39][40][41] Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”
Section: Preloading In the Shared Memorymentioning
confidence: 99%
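The preloading approach described in the statement above — staging data in shared memory rather than relying on the capacity-limited, contended L1 cache — can be sketched as a short CUDA kernel. This is a minimal illustration, not code from the cited work; the tile size, array names, and placeholder computation are assumptions.

```cuda
#include <cuda_runtime.h>

#define TILE 128  // illustrative tile size, not taken from the cited work

// Each thread block cooperatively preloads one tile of `in` into shared
// memory, then all threads read from the on-chip tile instead of
// competing for the small per-SM L1 cache.
__global__ void preload_then_use(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    __shared__ float tile[TILE];

    int base = blockIdx.x * TILE;
    int idx  = base + threadIdx.x;

    // Preload phase: one coalesced global load per thread.
    if (idx < n)
        tile[threadIdx.x] = in[idx];
    __syncthreads();  // ensure the tile is fully populated before reuse

    // Reuse phase: subsequent reads hit shared memory, sidestepping
    // L1 resource contention and potential eviction to off-chip memory.
    if (idx < n)
        out[idx] = tile[threadIdx.x] * 2.0f;  // placeholder computation
}
```

Because shared memory is explicitly managed, the preload decision can be made statically at compile time, which is the kind of static analysis the second statement argues should complement dynamic, hardware-based schemes.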
“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,[33][34][35][36][37][38][39][40][41]43 Previous studies of resolving the resource contention problems are based on dynamic analysis methods that require hardware modification. In addition to preloading in shared memory efficiently, it is necessary to combine static analysis to avoid the L1 cache from the resource contentions effectively.…”
Section: Impact Of Various Preload Factorsmentioning
confidence: 99%
“…In previous studies, many researchers proposed various ways to improve the performance of the parallel algorithm. The work in [40] mainly studies the effect of warp sizing and scheduling on performance, and the work in [41] also analyzes the impact of warp-level sizing and thread block-level resource management. Both these studies adjust the number of active warps to improve performance.…”
Section: Two-level Parallelism Optimization Modelmentioning
confidence: 99%
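Adjusting the number of active warps, as both cited studies do, is commonly exposed in CUDA through launch bounds, which constrain a kernel's register budget and thereby its resident-warp count per SM. This is a generic sketch of that mechanism under assumed bounds, not the specific scheme of either paper.

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells the compiler to allocate registers so that blocks of up to 256
// threads fit with at least 2 blocks resident per SM — a static knob
// over the active-warp count. The values 256 and 2 are illustrative.
__global__ void __launch_bounds__(256, 2)
capped_occupancy_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;  // placeholder per-thread work
}
```

Fewer resident warps can reduce cache and memory-bandwidth contention, while more warps improve latency hiding; the cited studies tune this trade-off per workload.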