Proceedings of the 27th ACM International Conference on Supercomputing (ICS 2013)
DOI: 10.1145/2464996.2465023

Scaling large-data computations on multi-GPU accelerators

Abstract: Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes, and the overhead of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap the GPU computation with the memory copies, alleviating the data transfer overhead. Second, in …
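The first contribution, overlapping GPU computation with CPU-GPU copies, is the classic CUDA-streams pipelining pattern. Below is a minimal sketch of that pattern, not the paper's implementation (which generates such code automatically); the kernel, the function names, and the two-stream double buffering are illustrative assumptions, and the host buffer is assumed to be pinned (cudaMallocHost) so the copies are truly asynchronous.

// Minimal CUDA-streams pipelining sketch (illustrative, not the paper's
// generated code): the input is processed in fixed-size chunks, and the
// host-to-device copy of one chunk overlaps the kernel on the previous
// chunk because consecutive chunks are issued into alternating streams.
#include <cuda_runtime.h>

__global__ void process(float *buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;               // placeholder computation
}

// host: pinned host buffer (cudaMallocHost); dev: device buffer of size total
void pipelined_run(const float *host, float *dev, size_t total, size_t chunk) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (size_t off = 0, c = 0; off < total; off += chunk, ++c) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        cudaStream_t s = streams[c % 2];     // alternate streams per chunk
        cudaMemcpyAsync(dev + off, host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(dev + off, n);
    }
    cudaDeviceSynchronize();                 // wait for all chunks
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}

With pinned host memory, the copy engine and the compute engine work on different chunks at the same time, hiding much of the transfer time behind computation.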

Cited by 18 publications (10 citation statements); references 32 publications. Citing publications span 2014 to 2024.
“…UM simplifies both out-of-core processing between GPUs and CPUs as well as multi-GPU processing, and combinations of both. Previously, the applications focusing on large data processing on GPUs required algorithm-specific techniques for memory handling [Al-Saber and Kulkarni 2015; Gelado et al. 2010b; Huynh et al. 2012; Jablin et al. 2012b; Krizhevsky et al. 2012; Sabne et al. 2013; Seo et al. 2015; Shamoto et al. 2015].…”
Section: CUDA Unified Memory for Multi-GPU Systems
Citation type: mentioning; confidence: 99%
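As a point of contrast with the algorithm-specific techniques listed above, a hedged sketch of the Unified Memory style this snippet describes follows; the kernel, the sizes, and the even split across GPUs are illustrative assumptions, not code from any of the cited works.

// Minimal CUDA Unified Memory sketch (illustrative): one managed
// allocation is visible to the host and to every GPU, so neither
// out-of-core staging nor explicit cudaMemcpy calls are needed.
// On Pascal-class and newer GPUs the driver pages data on demand,
// which also allows oversubscribing a single GPU's memory.
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float f) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    size_t n = (size_t)1 << 28;              // assumed problem size
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly

    int ngpus;
    cudaGetDeviceCount(&ngpus);
    size_t per = n / ngpus;                  // assume n divides evenly
    for (int d = 0; d < ngpus; ++d) {        // each GPU scales its slice
        cudaSetDevice(d);
        scale<<<(unsigned)((per + 255) / 256), 256>>>(data + d * per, per, 2.0f);
    }
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    cudaFree(data);
    return 0;
}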
“…GPU researchers have exploited pipelining [29] to overlap data transfers with kernel computations. The distinguishing factor in the Pagoda pipelined task processing is that it overlaps spawning, which comprises the CPU finding a free task entry and performing a data copy, with GPU scheduling, which is only a sub-part of the overall task processing.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
“…The generated pipelined code can automatically support computations with out-of-GPU datasets. SuperMatrix is another runtime system that supports shared-memory systems with multiple GPUs. It uses several software cache schemes to maintain coherence between the host RAM and the GPU memories and to minimize communication.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
“…StarPU relies on a virtual shared memory to handle data transfers and reduce communications. Eigenmann et al. [25] proposed a new technique called computation splitting and used pipelining to translate OpenMP programs to run on a host system with multiple attached GPUs. The generated pipelined code can automatically support computations with out-of-GPU datasets.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
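To make the two ideas in this last snippet concrete, here is a hedged sketch of what computation splitting combined with chunked transfers could look like at runtime: the iteration space is split across the attached GPUs, and each slice is streamed through a bounded device buffer so the dataset may exceed GPU memory. All names are assumptions; this illustrates the general idea rather than the translator's actual generated code.

// Illustrative sketch of computation splitting with out-of-GPU data:
// the iteration space is split across GPUs, and each GPU streams its
// slice through a bounded device buffer, chunk by chunk.
#include <cuda_runtime.h>

__global__ void work(float *buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i];     // placeholder loop body
}

void split_and_stream(float *host, size_t total, size_t chunk) {
    int ngpus;
    cudaGetDeviceCount(&ngpus);
    size_t share = total / ngpus;            // assume total divides evenly
    float **dev = new float*[ngpus];
    cudaStream_t *str = new cudaStream_t[ngpus];

    // Issue every GPU's chunked slice asynchronously so the GPUs overlap.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&dev[d], chunk * sizeof(float));  // bounded device buffer
        cudaStreamCreate(&str[d]);
        for (size_t off = 0; off < share; off += chunk) {
            size_t n = (share - off < chunk) ? share - off : chunk;
            float *h = host + (size_t)d * share + off;
            cudaMemcpyAsync(dev[d], h, n * sizeof(float),
                            cudaMemcpyHostToDevice, str[d]);
            work<<<(unsigned)((n + 255) / 256), 256, 0, str[d]>>>(dev[d], n);
            cudaMemcpyAsync(h, dev[d], n * sizeof(float),
                            cudaMemcpyDeviceToHost, str[d]);
        }
    }
    for (int d = 0; d < ngpus; ++d) {        // drain and clean up
        cudaSetDevice(d);
        cudaStreamSynchronize(str[d]);
        cudaStreamDestroy(str[d]);
        cudaFree(dev[d]);
    }
    delete[] dev;
    delete[] str;
}

A single buffer and stream per GPU keeps the chunks correct but serial within each GPU; restoring copy/compute overlap inside a GPU would take double buffering with two streams, as in the first sketch above.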