Dynamic load balancing on multi-GPUs system for big data processing

Zhang, Chaolong; Xu, Yuanping; Zhou, Jiliu; Xu, Zhijie; Lü, Li; Lu, Jun

doi:10.23919/iconac.2017.8082085

Cited by 4 publications

(4 citation statements)

References 12 publications


“…
    auto kernel = file_read("binomial.cl");
    auto samples = 16777216; auto steps = 254;
    auto steps1 = steps + 1; auto lws = steps1;
    auto samplesBy4 = samples / 4;
    auto gws = lws * samplesBy4;
    vector<cl_float4> in(samplesBy4);
    vector<cl_float4> out(samplesBy4);

    binomial_init_setup(samplesBy4, in, out);
    // … (part of the listing is missing from this excerpt)
    program.in(in);
    program.out(out);

    program.out_pattern(1, lws);

    program.kernel(kernel, "binomial_opts");
    program.arg(0, steps); // positional by index
    program.arg(in); // aggregate
    program.arg(out);
    program.arg(steps1 * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);
    program.arg(4, steps * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);

    engine.use(std::move(program));

    engine.run();

    // if (engine.has_errors()) // [Optional lines]
    //   for (auto& err : engine.get_errors())
    //     show or process errors

Listing 1: EngineCL API used in Binomial benchmark.…”

Section: Case 1: Using Only One Device (mentioning)

confidence: 99%
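The size arithmetic in Listing 1 can be checked with a short sketch. It assumes that `cl_float4` occupies 16 bytes (four 32-bit floats) and uses a stand-in struct so that it compiles without the OpenCL headers; `Float4`, `BinomialSizes`, and `binomial_sizes` are hypothetical names introduced here and are not part of EngineCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for OpenCL's cl_float4 (assumed: four 32-bit floats, 16 bytes).
struct Float4 { float s[4]; };

// Sizes derived from the constants in the quoted listing (hypothetical struct).
struct BinomialSizes {
    std::uint64_t samples_by_4;  // number of cl_float4 elements in each buffer
    std::uint64_t lws;           // local work size (steps + 1)
    std::uint64_t gws;           // global work size (lws * samples_by_4)
    std::size_t local_bytes_a;   // bytes for the first LocalAlloc argument
    std::size_t local_bytes_b;   // bytes for the second LocalAlloc argument
};

// Reproduces the arithmetic of the listing for a given sample and step count.
BinomialSizes binomial_sizes(std::uint64_t samples, std::uint64_t steps) {
    assert(samples % 4 == 0 && steps > 0);
    BinomialSizes s{};
    s.samples_by_4 = samples / 4;
    s.lws = steps + 1;
    s.gws = s.lws * s.samples_by_4;
    s.local_bytes_a = static_cast<std::size_t>(steps + 1) * sizeof(Float4);
    s.local_bytes_b = static_cast<std::size_t>(steps) * sizeof(Float4);
    return s;
}
```

With the values from the listing (16777216 samples, 254 steps), this gives a local work size of 255, a global work size of 1069547520 work-items, and two local-memory allocations of 4080 and 4064 bytes per work-group.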

“…The experiments have been carried out using two different machines to validate both code portability and performance of EngineCL.

    auto kernel = file_read("nbody.cl");
    auto gpu_kernel = file_read("nbody.gpu.cl");
    auto phi_kernel_bin =
        file_read_binary("nbody.phi.cl.bin");
    auto bodies = 512000; auto del_t = 0.005f;
    auto esp_sqr = 500.0f; auto lws = 64;
    auto gws = bodies;
    vector<cl_float4> in_pos(bodies);
    vector<cl_float4> in_vel(bodies);
    vector<cl_float4> out_pos(bodies);
    vector<cl_float4> out_vel(bodies);

    nbody_init_setup(bodies, del_t, esp_sqr, in_pos,
                     in_vel, out_pos, out_vel);

    ecl::EngineCL engine;
    engine.use(ecl::Device(0, 0),
               ecl::Device(0, 1, phi_kernel_bin),
               ecl::Device(1, 0, gpu_kernel));

    engine.work_items(gws, lws);

    auto props = { 0.08, 0.3 };
    engine.scheduler(ecl::Scheduler::Static(props));

    ecl::Program program;
    program.in(in_pos);
    program.in(in_vel);
    program.out(out_pos);
    program.out(out_vel);

    program.kernel(kernel, "nbody");
    program.args(in_pos, in_vel, bodies, del_t,
                 esp_sqr, out_pos, out_vel);

    engine.program(std::move(program));

    engine.run();

Listing 2: EngineCL API used in NBody benchmark.…”

Section: System Setup (mentioning)

confidence: 99%
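Listing 2 passes two proportions, 0.08 and 0.3, to a static scheduler that serves three devices. The excerpt does not say how EngineCL maps these values to work, so the following is only a sketch of one plausible static split. It assumes that each proportion applies to one of the first two devices, that shares are rounded to whole work-groups, and that the last device receives the remainder. `static_split` is a hypothetical helper, not an EngineCL function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of EngineCL): splits `gws` work-items among
// props.size() + 1 devices. Device i receives about props[i] of the work,
// rounded to whole work-groups of `lws` items; the last device takes the rest.
std::vector<std::size_t> static_split(std::size_t gws, std::size_t lws,
                                      const std::vector<double>& props) {
    assert(lws > 0 && gws % lws == 0);
    const std::size_t total_groups = gws / lws;
    std::vector<std::size_t> items;
    std::size_t used_groups = 0;
    for (double p : props) {
        assert(p >= 0.0 && p <= 1.0);
        // Round to the nearest whole work-group to avoid floating-point drift.
        std::size_t groups = static_cast<std::size_t>(
            std::llround(p * static_cast<double>(total_groups)));
        // Never hand out more work-groups than remain.
        if (groups > total_groups - used_groups) {
            groups = total_groups - used_groups;
        }
        used_groups += groups;
        items.push_back(groups * lws);
    }
    // The last device receives whatever is left, so the sizes sum to gws.
    items.push_back((total_groups - used_groups) * lws);
    return items;
}
```

Under these assumptions, the values from the listing (512000 work-items, work-groups of 64) yield 40960, 153600, and 317440 work-items for the three devices. Splitting on work-group boundaries keeps every chunk a multiple of the local work size.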

“…For instance, [26] proposes a lightweight runtime based on QUARK that uses a greedy heuristic. The authors of [27] apply fuzzy neural networks to the task distribution problem. MultiCL [28] is an OpenCL runtime that stores execution information for each kernel-device pair and reuses it for future kernel launches.…”

Section: Related Work (mentioning)

confidence: 99%


EngineCL: Usability and Performance in Heterogeneous Computing

Nozal, Bosque, Beivide. Future Generation Computer Systems, 2020.


Heterogeneous systems have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is even harder. This paper presents EngineCL, a new OpenCL-based runtime system that greatly simplifies the co-execution of a single massive data-parallel kernel on all the devices of a heterogeneous system. It performs a set of low-level tasks, namely managing the devices and their disjoint memory spaces and scheduling the workload among the system devices, while providing a layered API. EngineCL has been validated on two compute nodes (an HPC system and a commodity system) that combine six devices with different architectures. Experimental results show that it has excellent usability compared with OpenCL; a maximum overhead of 2.8% compared to the native version for loads that run for less than a second, tending towards zero for longer execution times; and an average efficiency of up to 0.89 when balancing the load.




“…To address the load balancing problem among multi-GPU nodes for large-scale applications with highly repetitive computational procedures or iterations, this paper presents a novel DLB model based on a fuzzy neural network (FNN) and data set division techniques for heterogeneous multi-GPU systems; this study extends our previous publication [24]. In this study, five real-time state feedback parameters closely related to the computational performance of every GPU node are defined.…”

Section: Introduction (mentioning)

confidence: 99%

A Fuzzy Neural Network Based Dynamic Data Allocation Model on Heterogeneous Multi-GPUs for Large-scale Computations

Zhang et al. Int. J. Autom. Comput., 2018.


The parallel computation capabilities of modern GPUs (graphics processing units) have attracted increasing attention from researchers and engineers conducting high-throughput computational studies. However, current single-GPU engineering solutions often struggle to meet real-time requirements. Thus, the multi-GPU approach has become a popular and cost-effective choice for meeting these demands. In those cases, balancing the computational load over multiple GPU "nodes" is often the key bottleneck affecting the quality and performance of the runtime system. Existing load balancing approaches mainly assume that all GPU nodes in the same computer framework have equal computational performance, which is often not the case due to cluster design and other legacy issues. This paper presents a novel dynamic load balancing (DLB) model for rapid data division and allocation on heterogeneous GPU nodes based on an innovative fuzzy neural network (FNN). In this research, a 5-state parameter feedback mechanism defining the overall cluster and node performances is proposed. The corresponding FNN-based DLB model is capable of monitoring and predicting individual node performance under different workload scenarios. A real-time adaptive scheduler has been devised to reorganize the data inputs to each node when necessary to maintain their runtime computational performance. The devised model has been implemented on two-dimensional (2D) discrete wavelet transform (DWT) tasks for evaluation. Experimental results show that this DLB model enables high computational throughput while meeting the real-time and precision requirements of complex computational tasks.
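The feedback idea in this abstract, measuring each node and then re-dividing the data, can be illustrated with a much simpler stand-in. The sketch below does not implement the paper's FNN or its five state parameters: it reallocates data in proportion to each node's measured throughput from the previous iteration. `NodeSample`, `rebalance`, and the use of "rows" as the unit of data are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-node measurement taken during the previous iteration.
struct NodeSample {
    std::size_t rows_processed;  // data rows the node handled
    double seconds;              // wall-clock time the node needed
};

// Divides `total_rows` among the nodes in proportion to measured throughput
// (rows per second). A plain proportional rule, used here as a stand-in for
// a learned performance predictor.
std::vector<std::size_t> rebalance(std::size_t total_rows,
                                   const std::vector<NodeSample>& samples) {
    assert(!samples.empty());
    std::vector<double> throughput;
    double sum = 0.0;
    for (const NodeSample& s : samples) {
        assert(s.seconds > 0.0);
        const double t = static_cast<double>(s.rows_processed) / s.seconds;
        throughput.push_back(t);
        sum += t;
    }
    assert(sum > 0.0);

    std::vector<std::size_t> rows(samples.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < samples.size(); ++i) {
        std::size_t share = static_cast<std::size_t>(
            static_cast<double>(total_rows) * throughput[i] / sum);
        // Guard against rounding pushing the running total past total_rows.
        if (share > total_rows - assigned) {
            share = total_rows - assigned;
        }
        rows[i] = share;
        assigned += share;
    }
    // The last node takes the remainder so that the shares sum exactly.
    rows.back() = total_rows - assigned;
    return rows;
}
```

For example, if three nodes each processed 1000 rows in 1, 2, and 4 seconds, dividing 7000 rows for the next iteration gives 4000, 2000, and 1000 rows respectively. A predictor such as the FNN described in the abstract would replace the throughput ratio with a forecast of each node's performance.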
