The emergence of heterogeneous systems has been very notable recently. Still, their programming remains a complex task. The co-execution of a single OpenCL kernel on several devices is a challenging endeavour, requiring consideration of the different computing capabilities of the devices and the behaviour of the application. OmpSs is a framework for task-based parallel applications, but it does not support co-execution across several devices. This paper presents an extension of OmpSs that solves two main issues. First, the automatic distribution of datasets and the management of the memory address spaces of the devices. Second, the implementation of a set of load-balancing algorithms that adapt to the particularities of applications and systems. All this is accomplished with negligible impact on programmability. Experimental results reveal that using all the devices in the system is beneficial in terms of both performance and energy consumption. Moreover, the Auto-Tune algorithm gives the best overall results without requiring manual parameter tuning.

The diverse nature of kernels prevents a single data-division strategy from maximising the performance and efficiency of a heterogeneous system. Aside from kernel behaviour, the other key factor for load distribution is the configuration of the heterogeneous system. For the load to be well balanced, each device must receive the right amount of work, adapted to its own capabilities. Therefore, a work distribution that has been hand-tuned for a given system is likely to underperform on a different one.

The OmpSs programming model represents a change of paradigm in many ways. It provides support for task parallelism because of its benefits in terms of performance, cross-platform flexibility and reduction of data motion [9]. The programmer divides the code into interrelated tasks, and OmpSs orchestrates their parallel execution while maintaining their control and data dependences.
To that end, OmpSs uses the information supplied by the programmer, via code annotations with pragmas, to determine at run time which parts of the code can be run in parallel. It enhances OpenMP with support for irregular and asynchronous parallelism, as well as support for heterogeneous architectures.
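As a minimal sketch of what such annotations look like, the fragment below uses OmpSs-style task pragmas (here with the `oss` sentinel of the newer OmpSs-2 spelling; classic OmpSs uses `omp`). The function names, chunking scheme and array sizes are illustrative assumptions, not taken from the paper. The `in()`/`out()` clauses declare which array regions a task reads and writes, so the runtime can build the dependence graph and schedule independent tasks in parallel. A plain C compiler simply ignores the pragmas, so the sketch also runs correctly, just sequentially:

```c
#include <stddef.h>

/* Hypothetical task annotation: in()/out() declare the regions this task
 * reads and writes, letting the OmpSs runtime derive data dependences.
 * Without an OmpSs compiler the pragma is ignored and the function is an
 * ordinary sequential call. */
#pragma oss task in(a[0;n]) out(b[0;n])
void scale_task(const int *a, int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = 2 * a[i];
}

/* The host code spawns one task per chunk; chunks touch disjoint regions,
 * so the runtime can execute them in parallel on different devices. */
void scale_all(const int *a, int *b, size_t n, size_t chunk)
{
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (n - off < chunk) ? (n - off) : chunk;
        scale_task(a + off, b + off, len);
    }
    #pragma oss taskwait   /* wait for all spawned tasks under OmpSs */
}
```

Because the dependences are declared rather than hard-coded into the control flow, the same annotated code can be scheduled onto one device or several without changes, which is the property the co-execution extension builds on.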
Heterogeneous systems are nowadays a common choice on the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal because the CPU remains idle during kernel executions while still consuming energy. Making the CPU contribute computing effort can improve both the performance and the energy consumption of the system. This paper analyses the advantages of this approach and establishes the limits of when it is beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising performance does not always yield optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers faced with the dilemma of whether to distribute their workload in these systems.
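One way to picture the performance side of such a model (a simplified sketch in my own notation, not the paper's actual formulation) is the classic balanced split: if the CPU sustains a throughput of `s_cpu` work-items per second and the accelerator `s_acc`, both devices finish at the same time when the CPU fraction `alpha` satisfies `alpha*W/s_cpu = (1-alpha)*W/s_acc`, giving `alpha = s_cpu / (s_cpu + s_acc)`:

```c
/* Balanced split of a divisible data-parallel workload between CPU and
 * accelerator.  Names and the zero-overhead assumption are illustrative,
 * not the paper's model. */
double cpu_share(double s_cpu, double s_acc)
{
    /* Fraction of the work-items the CPU should take so that both
     * devices finish simultaneously. */
    return s_cpu / (s_cpu + s_acc);
}

/* Under that split the co-execution time of `work` work-items is simply
 * work / (s_cpu + s_acc).  In this idealised model adding the CPU always
 * helps; the paper's contribution is precisely to bound when it still
 * pays off once offload overheads and energy costs are accounted for. */
double coexec_time(double work, double s_cpu, double s_acc)
{
    return work / (s_cpu + s_acc);
}
```

For example, with a CPU one third as fast as the accelerator (`s_cpu = 1`, `s_acc = 3`), `cpu_share` gives 0.25, and a workload that takes about 33.3 time units on the accelerator alone finishes in 25 when co-executed. Note that this balances only execution time; as the abstract points out, the time-optimal split need not be the energy- or efficiency-optimal one.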