2018
DOI: 10.1007/978-3-031-01759-9
General-Purpose Graphics Processor Architectures

Abstract: Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics pertaining to the science and art of designing, analyzing, selecting, and interconnecting hardware components to create computers that meet functional, performance, and cost goals. The scope will largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA, MICRO, and ASPLOS.

Cited by 21 publications (7 citation statements) · References 106 publications
“…A memory partition unit comprises L2 Cache Banks, one or more memory access schedulers, and a raster operation (ROP). Multiple memory partition units exist in a GPGPU, with L2 Cache banks serving as data caches, memory access schedulers reordering memory read and write operations and dispatching them to DRAM for enhanced access efficiency, and ROP handling graphic and atomic operations [13,14,15].…”
Section: C-prefetcher Designmentioning
confidence: 99%
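The memory access scheduler described above reorders pending reads and writes so that requests to an already-open DRAM row are serviced together. A minimal sketch of that idea, assuming a simplified first-ready, first-come first-served (FR-FCFS) policy over `(row, address)` request tuples (the request format and policy details are illustrative, not taken from the cited work):

```python
# Illustrative sketch: a simplified FR-FCFS-style memory access scheduler,
# the kind of reordering a GPU memory partition unit performs to improve
# DRAM row-buffer locality. Request tuples and policy are assumptions.
from collections import deque

def schedule_fr_fcfs(requests, open_row=None):
    """Reorder (row, address) requests: row-buffer hits first,
    then oldest-first among the rest."""
    pending = deque(requests)
    order = []
    while pending:
        # Prefer the oldest request that hits the currently open row.
        hit = next((r for r in pending if r[0] == open_row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        open_row = req[0]          # servicing a request opens its row
        order.append(req)
    return order

# Requests arrive interleaved across two rows; the scheduler
# groups the row hits so they are serviced back to back.
reqs = [("rowA", 0), ("rowB", 1), ("rowA", 2), ("rowB", 3)]
print(schedule_fr_fcfs(reqs))
# → [('rowA', 0), ('rowA', 2), ('rowB', 1), ('rowB', 3)]
```

The enhanced access efficiency the passage mentions comes precisely from this grouping: consecutive hits to an open row avoid the precharge/activate cost of switching rows.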
“…While the cores inside the same core cluster (streaming multiprocessor, SM) have access to the scratchpad memory (shared memory or L1 cache), all the cores can communicate through the L2 cache structure via the interconnect. DRAM-based global device memory maintains larger but relatively slower data access for all threads executing in the device [22]. Not only does a modern GPU device include general-purpose cores but also special function units (SFU) for fast transcendental function computations as well as tensor cores for efficient matrix multiplications.…”
Section: Gpu Architecturesmentioning
confidence: 99%
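The hierarchy quoted above (per-SM scratchpad/L1, a device-wide L2 reached over the interconnect, and DRAM-backed global memory) can be sketched as a simple lookup that walks the levels in order; the dictionary-based caches and fill behavior here are a toy model, not an implementation detail from the cited text:

```python
# Illustrative toy model of the GPU memory hierarchy described above:
# a load checks the per-SM L1, then the shared L2, then DRAM-backed
# global memory, filling caches on the way back (an assumption for
# simplicity; real fill policies vary).
def load(addr, l1, l2, dram):
    """Resolve a load through the hierarchy.
    Returns (value, level_that_serviced_it)."""
    if addr in l1:
        return l1[addr], "L1"
    if addr in l2:
        l1[addr] = l2[addr]      # fill the per-SM L1 on the way back
        return l2[addr], "L2"
    value = dram[addr]
    l2[addr] = value             # fill both cache levels
    l1[addr] = value
    return value, "DRAM"

dram = {0x10: 42}
l1, l2 = {}, {}
print(load(0x10, l1, l2, dram))  # first access must go to DRAM
print(load(0x10, l1, l2, dram))  # repeat access hits the per-SM L1
# → (42, 'DRAM') then (42, 'L1')
```

The point of the sketch is the locality argument in the passage: threads on the same SM see their repeats serviced by the fast local level, while cross-SM communication has to go through the shared L2.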
“…In this section, we provide an overview of these components involved in atomic execution. Note that, since the architecture of a GPU is a black box, we explicitly refer to the work of Aamodt et al. and Glasco et al. [15,16] for our work. We highly recommend these articles for more insights.…”
Section: Atomics In Gpumentioning
confidence: 99%
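One way to picture atomic execution on a GPU is as a read-modify-write serialized at a single point near the memory partition (the earlier citation statement notes that the ROP handles atomic operations). The sketch below is my own assumption-laden model, not a description from Aamodt et al. or Glasco et al.: it emulates an `atomicAdd`-style operation as a compare-and-swap retry loop, with a lock standing in for the hardware serialization point:

```python
# Illustrative sketch (an assumption, not from the cited works): atomic
# read-modify-write modeled as a compare-and-swap retry loop over a
# shared location, with a lock standing in for the hardware's single
# serialization point near the L2/ROP.
import threading

memory = {"counter": 0}
_serialize = threading.Lock()    # stands in for per-partition serialization

def compare_and_swap(addr, expected, new):
    """Atomically replace memory[addr] with `new` iff it equals `expected`."""
    with _serialize:
        if memory[addr] == expected:
            memory[addr] = new
            return True
        return False

def atomic_add(addr, val):
    """atomicAdd-style RMW: retry the CAS until no other thread intervened."""
    while True:
        old = memory[addr]
        if compare_and_swap(addr, old, old + val):
            return old           # like CUDA's atomicAdd, return the prior value

# Eight concurrent increments: none are lost, unlike a plain read+write.
threads = [threading.Thread(target=atomic_add, args=("counter", 1))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(memory["counter"])         # → 8
```

A plain `memory[addr] += 1` from several threads could lose updates between the read and the write; forcing every modification through one serialization point is what makes the operation atomic, and is why GPU atomics to the same address from many cores serialize.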