Auto-tuning has become increasingly popular for optimizing non-functional parameters of parallel programs. The typically large search space requires sophisticated techniques to find well-performing parameter values in a reasonable amount of time. Since different parts of a program often perform best with different parameter values, we subdivide programs into several regions and optimize the parameter values for each region separately, rather than setting them globally for the entire program. As this enlarges the search space even further, existing auto-tuning techniques must be extended to obtain good results. In this paper we introduce a novel enhancement to the RS-GDE3 algorithm, which explores the search space when auto-tuning programs with multiple regions with respect to several objectives. We have implemented our auto-tuner using the Insieme compiler and runtime system. Compared to a non-optimized parallel version of the tested programs, our approach achieves up to 7.6-, 10.5-, and 61.6-fold improvements for the three tuned objectives: wall time, energy consumption, and resource usage, respectively.
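The core idea of per-region, multi-objective tuning can be illustrated with a minimal sketch. This is not the paper's RS-GDE3 implementation (which uses differential evolution to sample the space rather than exhaustive enumeration); it only shows the Pareto-dominance selection step over per-region candidates. All names (`REGIONS`, `PARAMS`, `measure`) and the synthetic cost model are assumptions for illustration.

```python
# Illustrative sketch: per-region multi-objective parameter selection.
# The cost model below is synthetic; a real auto-tuner would measure
# each region's wall time and energy at runtime.

PARAMS = [1, 2, 4, 8, 16]          # candidate values (e.g. thread counts)
REGIONS = ["loop_a", "loop_b"]     # program regions tuned independently

def measure(region, value):
    """Synthetic two-objective cost (time, energy) for one region.
    Time shrinks with more parallelism but plateaus; energy grows."""
    scale = 16 if region == "loop_a" else 32
    time = max(scale / value, 2.0)
    energy = float(value)
    return (time, energy)

def dominates(a, b):
    """Pareto dominance: a is no worse in every objective and
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def pareto_front(region):
    """Keep only the non-dominated parameter values for one region."""
    scored = [(v, measure(region, v)) for v in PARAMS]
    return [v for v, s in scored
            if not any(dominates(t, s) for _, t in scored)]

for r in REGIONS:
    print(r, pareto_front(r))
```

Note that the two regions end up with different Pareto fronts (for `loop_a`, the largest thread count is dominated because its time has plateaued), which is precisely why tuning each region separately can beat a single global setting.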
Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the rise of parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, although dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms.
Over the past years there has been a steady change in CPU design towards both many-core processors and power-aware hardware architectures. These two trends are combined in the Intel Single-chip Cloud Computer (SCC), an experimental prototype with 48 Pentium cores created by Intel Labs. The SCC is a highly configurable many-core chip which provides unique opportunities to optimize run time, communication, memory access, and power/energy consumption of parallel programs. The aim of this paper is to characterize the performance behavior of the chip under various power settings, mappings of processes/cores to memory controllers, etc., through benchmarking. Analytical models are used to verify and interpret the results. Our benchmark results show that data exchange based on message passing is faster than shared-memory data exchange and that, contrary to popular belief, the lowest energy consumption is not achieved at the fastest execution time. Furthermore, to improve memory access behavior, one should increase the clock frequency of both the mesh network and the memory controllers. In general, the results of our investigations can be used to analyze the effect of power settings and architecture properties on the performance and energy consumption of parallel programs, as well as to assist in choosing appropriate settings for specific workloads.