In recent years, the heterogeneity of both commodity and supercomputer hardware has increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often key to improving the speed and energy efficiency of highly parallel codes. However, due to the complexity of heterogeneous architectures, optimizing codes for a certain type of architecture, as well as porting codes across different architectures while maintaining a comparable level of performance, can be extremely challenging. Because it addresses the challenges of performance optimization and performance portability, autotuning has gained a lot of interest. Autotuning of performance-relevant source-code parameters allows applications to be tuned automatically without hard-coding optimizations and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead.
Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
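The idea of dynamic tuning described in this abstract can be illustrated with a minimal sketch. It is not the Kernel Tuning Toolkit API: the names `tuning_space`, `DynamicTuner`, `run`, and `retune` are hypothetical, the kernel is assumed to be a plain Python callable that accepts its tuning parameters as keyword arguments, and the search is assumed to be exhaustive. Each real invocation during the exploration phase tries one untested configuration, so the application keeps producing useful results while it tunes.

```python
import itertools
import time


def tuning_space(parameters, constraint=lambda cfg: True):
    """Yield every parameter combination that satisfies the constraint."""
    names = list(parameters)
    for values in itertools.product(*(parameters[name] for name in names)):
        cfg = dict(zip(names, values))
        if constraint(cfg):
            yield cfg


class DynamicTuner:
    """Hypothetical runtime tuner: explores during real kernel invocations."""

    def __init__(self, kernel, configurations, timer=time.perf_counter):
        self.kernel = kernel
        self.timer = timer
        self.retune(configurations)

    def retune(self, configurations):
        """Restart exploration, for example when the input data changes."""
        self.pending = list(configurations)
        if not self.pending:
            raise ValueError("the tuning space must not be empty")
        self.best_cfg = None
        self.best_time = float("inf")

    def run(self, data):
        """Run the kernel once; explore while untested configurations remain."""
        exploring = bool(self.pending)
        cfg = self.pending.pop(0) if exploring else self.best_cfg
        start = self.timer()
        result = self.kernel(data, **cfg)
        elapsed = self.timer() - start
        if exploring and elapsed < self.best_time:
            self.best_cfg, self.best_time = cfg, elapsed
        return result
```

With a moderately sized tuning space, the exploration phase ends after as many invocations as there are configurations, after which every call uses the fastest configuration observed. A production tuner would additionally need to handle measurement noise and configurations that fail to compile or run.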
Autotuning, the practice of automatically tuning applications to provide performance portability, has received increased attention in the research community, especially in high-performance computing. Ensuring high performance on a variety of hardware usually requires modifications to the code, often via different values of a selected set of parameters, such as tiling size, loop unrolling factor or data layout. However, the search space of all possible combinations of these parameters can be large, which can result in cases where the benefits of autotuning are outweighed by its cost, especially with dynamic tuning. Therefore, estimating the tuning time in advance or shortening it is very important in dynamic tuning applications. We have found that certain properties of tuning spaces do not vary much when the hardware is changed. In this paper, we demonstrate that it is possible to use historical data to reliably predict the number of tuning steps needed to find a well-performing configuration, and to reduce the size of the tuning space. We evaluate our hypotheses on a number of HPC benchmarks written in CUDA and OpenCL, using several different generations of GPUs and CPUs.
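One simple way to turn historical data into a step estimate is sketched below. The abstract does not specify the predictor used in the paper, so this is an illustration under stated assumptions: the search is uniform random sampling with replacement, a configuration counts as well-performing if its runtime is within a tolerance of the best historical runtime, and the fraction of such configurations carries over to the new hardware. The function names and default values are hypothetical.

```python
import math


def well_performing_fraction(runtimes, tolerance=0.1):
    """Fraction of historical configurations within `tolerance` of the best."""
    best = min(runtimes)
    good = sum(1 for t in runtimes if t <= best * (1 + tolerance))
    return good / len(runtimes)


def random_search_steps(fraction, confidence=0.95):
    """Steps needed so random search hits a good configuration.

    With a fraction p of good configurations, n independent draws miss all
    of them with probability (1 - p)**n, so we need the smallest n with
    (1 - p)**n <= 1 - confidence.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be in (0, 1)")
    if fraction == 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - fraction))
```

For example, if 30% of the configurations measured on older hardware were well-performing, `random_search_steps(0.3)` suggests that nine random steps suffice with 95% confidence, provided the assumptions above hold.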
The accuracy of biomolecular
simulations depends on the accuracy
of an empirical molecular mechanics potential known as a force field:
a set of parameters and expressions to estimate the potential from
atomic coordinates. Accurate parametrization of force fields for small
organic molecules is a challenge due to their high diversity. One
of the possible approaches is to apply a correction to the existing
force fields. Here, we propose an approach for estimating a density
functional theory (DFT)-derived force field correction that is calculated
during a molecular dynamics run without significantly affecting
its speed. Using the formula known as a property map collective variable,
we approximate the force field correction as a weighted average of
the corrections calculated for only a small series of
reference structures. To validate this method, we used seven AMBER
force fields and show that one force field can be made to behave
like another. We also present the force field
correction for the important anticancer drug Imatinib as a use
case. Our method appears to be suitable for adjusting the force
field for general drug-like molecules. We provide a pipeline that
generates the correction; this pipeline is available at .
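The weighted average behind a property map collective variable can be sketched as follows. For reference structures x_i with precomputed corrections c_i, the estimate for a structure x is sum_i c_i exp(-lambda D(x, x_i)) / sum_i exp(-lambda D(x, x_i)). The sketch makes several assumptions that the abstract does not state: D is taken to be the mean squared displacement between atoms without any superposition, lambda is a user-chosen parameter, and only the energy correction is evaluated, not the forces derived from it. The function names are hypothetical.

```python
import math


def squared_distance(a, b):
    """Mean squared displacement between two structures (no alignment).

    Each structure is a list of (x, y, z) tuples with matching atom order.
    """
    total = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return total / len(a)


def property_map_correction(structure, references, corrections, lam):
    """Weighted average of reference corrections for the given structure."""
    distances = [squared_distance(structure, ref) for ref in references]
    # Shifting by the smallest distance leaves the ratio unchanged but
    # keeps the largest weight at 1, which avoids underflow in the sum.
    d_min = min(distances)
    weights = [math.exp(-lam * (d - d_min)) for d in distances]
    return sum(w * c for w, c in zip(weights, corrections)) / sum(weights)
```

Because the expensive DFT calculations are needed only for the reference structures, evaluating the correction during a simulation reduces to computing distances and a weighted sum, which is consistent with the claim that the simulation speed is not significantly affected.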
Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimizing codes for a particular type of hardware and specific data characteristics can be extremely challenging. Autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can introduce unacceptable overhead if (i) the tuning space is vast and full of poorly performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in the processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of hardware performance counters (also known as profiling counters) collected during empirical tuning. These counters are used to steer the search towards faster implementations. The method requires the tuning space to be sampled on any GPU; from these samples it builds a problem-specific model, which can be used during autotuning on various, even previously unseen, inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that it is superior in terms of the number of search steps and typically outperforms other search methods in terms of convergence time.