Abstract: The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. Th…
“…However, the approach does not scale very well for large numbers of threads. Using mapped memory, Chen and Villa [6] have introduced a concept which uses non-blocking task queues to implement a master-worker pattern, where the main CPU is able to generate tasks after a kernel has been launched. This approach is very well suited for scenarios where multiple GPUs have to be supplied with tasks.…”
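The pattern described above can be illustrated with a host-side analogue. The sketch below is an assumption-laden simplification, not the cited implementation: worker threads stand in for GPU consumers, they start first ("kernel launch"), and the master keeps feeding a non-blocking queue while they run.

```python
import queue
import threading

def worker(tasks, results, stop):
    # Consume tasks until the master signals completion and the queue drains.
    while not stop.is_set() or not tasks.empty():
        try:
            item = tasks.get_nowait()   # non-blocking dequeue
        except queue.Empty:
            continue
        results.put(item * item)        # stand-in for real device work
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
stop = threading.Event()
workers = [threading.Thread(target=worker, args=(tasks, results, stop))
           for _ in range(4)]
for w in workers:
    w.start()                           # workers are already running...

for i in range(100):                    # ...while the master generates tasks
    tasks.put(i)
tasks.join()                            # wait until every task is processed
stop.set()
for w in workers:
    w.join()

total = sum(results.get() for _ in range(results.qsize()))
```

The key property mirrored here is decoupling: producers and consumers never block each other on the queue, which is what makes the pattern suitable for feeding multiple devices concurrently.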
GPU compute devices have become very popular for general-purpose computations. However, the SIMD-like hardware of graphics processors is currently not well suited for irregular workloads, like searching unbalanced trees. In order to mitigate this drawback, NVIDIA introduced an extension to GPU programming models called Dynamic Parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing the refinement of subsequent work items based on intermediate results without any involvement of the main CPU. This work investigates methods for employing Dynamic Parallelism with the goal of improved workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the N-Queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh high management overhead and runtime limitations due to the very fine level of granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of Dynamic Parallelism.
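The work-refinement idea behind Dynamic Parallelism can be sketched without CUDA: each partial N-Queens placement is a task, and expanding it enqueues child tasks for the next row, much as a parent kernel would launch child grids from intermediate results. This is a conceptual sketch only, not the paper's GPU code.

```python
from collections import deque

def safe(cols, row, col):
    # A queen at (row, col) conflicts with an earlier queen at (r, c)
    # if they share a column or a diagonal.
    return all(c != col and abs(c - col) != row - r
               for r, c in enumerate(cols))

def count_nqueens(n):
    worklist = deque([()])          # each task: tuple of chosen columns so far
    solutions = 0
    while worklist:
        cols = worklist.popleft()
        row = len(cols)
        if row == n:
            solutions += 1
            continue
        for col in range(n):        # "child launches" for the next row,
            if safe(cols, row, col):  # generated from intermediate results
                worklist.append(cols + (col,))
    return solutions

print(count_nqueens(8))             # 92 solutions for the 8-queens problem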
“…It divides parallel computing tasks according to execution speed to achieve the best overall system performance. In [2] a multi-GPU self-adaptive load balancing method was proposed. Each GPU can self-adaptively select tasks to execute according to its local free-busy state, by establishing a task-queue model between the CPU and the GPUs.…”
Abstract. With the development of GPU general-purpose computing, GPU heterogeneous clusters have become a widely used parallel data processing solution in modern data centers. Temperature management and control have become a new research hotspot in big data continuous computing. Temperature heat islands in a cluster have an important influence on computing reliability and energy efficiency. To prevent the occurrence of temperature heat islands in GPU clusters, a big data task scheduling model is proposed. In this model, temperature, reliability and computing performance are taken into account to reduce node performance differences and improve throughput per unit time in the cluster. Temperature heat islands caused by slow nodes are prevented by optimized scheduling. The experimental results show that the proposed scheme can control node temperature and prevent the occurrence of temperature heat islands while guaranteeing computing performance and reliability.
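The abstract does not specify the scheduling model, but the trade-off it names (temperature vs. performance) can be illustrated with a minimal policy sketch. All node names, temperatures and thresholds below are assumptions for illustration: schedule each task on the coolest node whose throughput still meets a minimum bound.

```python
def pick_node(nodes, min_throughput):
    """nodes: list of (name, temp_celsius, throughput).

    Hypothetical policy: among nodes meeting the performance bound,
    choose the coolest one to avoid forming a heat island."""
    eligible = [n for n in nodes if n[2] >= min_throughput]
    if not eligible:
        return None
    return min(eligible, key=lambda n: n[1])[0]   # coolest eligible node

# Assumed cluster state: (name, temperature, relative throughput).
cluster = [("node-a", 71.0, 9.5),
           ("node-b", 58.0, 8.1),
           ("node-c", 64.0, 7.0)]

print(pick_node(cluster, 8.0))   # node-b: coolest node meeting the bound
```

Raising the performance bound shifts load back toward hotter, faster nodes, which is exactly the tension such a scheduler has to balance.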
“…StarPU [15] is designed to be a platform for heterogeneous task scheduling. Along with StarPU, Qilin [16], Scout [17], the dynamic load balancing system created by Chen et al. [18], and the work by Jiménez et al. [19] form a solid foundation for both the need and the capability for a heterogeneous task scheduler. These solutions, however, require the user either to reimplement their application (in a new programming language in the case of StarPU or Scout, or against a new API in Qilin) or to manually create multiple copies of a function for multiple platforms to provide to the scheduler.…”
Abstract: Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also all available accelerators. In spite of this fact, the majority of programming models for heterogeneous computing focus on only one of these. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. The extensions are geared toward switching from CPU parallelism to GPU parallelism. However, they do not preserve the former while adding the latter. Thus computational potential is wasted, since either the CPU cores or the GPU cores are left idle. Our goal is to create a runtime system that can intelligently divide an accelerated OpenMP region across all available resources automatically. This paper presents our proof-of-concept runtime system for dynamic task scheduling across CPUs and GPUs. Further, we motivate the addition of this system into the proposed OpenMP for Accelerators standard. Finally, we show that this option can produce as much as a two-fold performance improvement over using either the CPU or GPU alone.
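One common way such a runtime divides a parallel region (a hedged sketch of the general idea, not this paper's scheduler) is to measure per-device throughput and split the iteration space in proportion to it. The device names and throughput numbers below are assumptions.

```python
def split_iterations(n_iters, throughputs):
    """Partition [0, n_iters) proportionally to measured throughputs."""
    total = sum(throughputs.values())
    bounds, start = {}, 0
    items = list(throughputs.items())
    for i, (device, rate) in enumerate(items):
        # The last device takes the remainder so the ranges cover [0, n_iters).
        if i == len(items) - 1:
            end = n_iters
        else:
            end = start + round(n_iters * rate / total)
        bounds[device] = (start, end)
        start = end
    return bounds

# Example: a GPU measured roughly 3x faster than the CPU cores combined.
parts = split_iterations(1000, {"cpu": 1.0, "gpu": 3.0})
print(parts)   # {'cpu': (0, 250), 'gpu': (250, 1000)}
```

A dynamic scheduler like the one described in the abstract would re-measure and re-split repeatedly rather than once, but the proportional-split step is the core of keeping both the CPU and GPU busy.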