2013 42nd International Conference on Parallel Processing
DOI: 10.1109/icpp.2013.16
Adaptive Runtime Selection for GPU

Abstract: It is often hard to predict the performance of statically generated code. Hardware availability, hardware specifications, and problem size may change from one execution context to another. The main contribution of this work is an entirely automatic method for predicting the execution times of semantically equivalent versions of affine loop nests on GPUs, and then running the best-performing one on the GPU or CPU. To make accurate predictions, our framework relies on three consecutive stages: static code generation, an…
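The abstract's idea, selecting among semantically equivalent code versions at runtime via per-version execution-time predictions, can be illustrated with a minimal sketch. All names here (`predict_time`, `VERSIONS`, the linear cost coefficients) are illustrative assumptions, not the paper's actual framework or API.

```python
# Hypothetical sketch of adaptive runtime selection: several
# semantically equivalent versions of a loop nest, a per-version
# execution-time predictor, and a runtime step that picks the
# predicted-fastest version for the current problem size.

def version_cpu(data):
    # naive sequential variant: sum of squares
    return sum(x * x for x in data)

def version_blocked(data, block=4):
    # blocked variant (same semantics, different schedule)
    total = 0
    for i in range(0, len(data), block):
        for x in data[i:i + block]:
            total += x * x
    return total

VERSIONS = {"cpu": version_cpu, "blocked": version_blocked}

def predict_time(name, n):
    # Stand-in for the paper's prediction stages: here, a simple
    # per-version linear cost model assumed to be calibrated offline.
    cost_per_element = {"cpu": 1.0, "blocked": 0.8}
    return cost_per_element[name] * n

def run_best(data):
    # pick the version with the lowest predicted time, then run it
    n = len(data)
    best = min(VERSIONS, key=lambda name: predict_time(name, n))
    return best, VERSIONS[best](data)

name, result = run_best(list(range(10)))
print(name, result)  # → blocked 285
```

Because every version computes the same result, the selection step only affects performance, never correctness, which is what makes this kind of transparent runtime choice safe.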

Cited by 11 publications (6 citation statements) · References 10 publications
“…Ref. [37] discusses the principles and methods of hybrid programming, where MATLAB integrates with other languages such as Visual C++. The results show that mixed programming with different tasks can be achieved by compiling different MATLAB programs, making the necessary settings and replacing the corresponding C++ code.…”
Section: Related Work
confidence: 99%
“…Other techniques improve application performance on GPUs through addressing the problems of data transfer [58,59], thread divergence [60], data placement [61], synchronization overhead [62] and configuration tuning [63,64]. GPU resource sharing has been studied at both system [65,66] and architecture levels [67,68] to address the resource contention and performance interference.…”
Section: Scheduling on Accelerators
confidence: 99%
“…Although StarPU has the capability to schedule tasks to run on multi-core CPUs and GPUs simultaneously, when a task is submitted to SkePU, the best performing device for the given input size is selected, but only one device will execute the job. The work presented in [12] also considers different devices without the programmer's intervention, selecting the best device to run the computation but never resorting to CPU/GPU wide computations. StreamIt [6] and Lime [7] provide linguistics constructions to express task and data-parallel computations.…”
Section: Related Work
confidence: 99%
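The statement above describes best-device selection for a given input size, where a submitted task runs on exactly one device rather than being split across CPU and GPU. A minimal sketch of that policy, assuming hypothetical per-device cost models (not SkePU's or StarPU's real API):

```python
# Input-size-based device selection: each backend has an assumed
# calibrated cost model (fixed launch overhead + per-element cost),
# and a task runs on exactly one device, the predicted-fastest one.

COST_MODELS = {
    "cpu": lambda n: 0.0 + 1.0 * n,     # no launch overhead, slow per element
    "gpu": lambda n: 500.0 + 0.1 * n,   # high launch overhead, fast per element
}

def select_device(n):
    # choose the single device with the lowest predicted cost
    return min(COST_MODELS, key=lambda d: COST_MODELS[d](n))

print(select_device(100))    # → cpu (small input: overhead dominates)
print(select_device(10000))  # → gpu (large input amortizes the overhead)
```

The crossover behavior shown here is exactly why such frameworks select per input size: below some problem size the GPU's launch and transfer overhead outweighs its throughput advantage.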
“…The work distinguishes itself from the current state of the art by supporting the execution of arbitrary multi-kernel compound computations, having in mind data locality requirements. The current state of the art either exposes the heterogeneity to the programmer [11,5] or selectively directs the computations exclusively to one of the available CPU or GPU back-ends [1,2,3,4,12,13]. In turn, the proposals that tackle the transparent conjoint use of both CPUs and GPUs either restrict their scope to the execution of single kernels [14,15,16] or require previous knowledge on the computation to run [17].…”
Section: Introduction
confidence: 99%