Florian Wende scite author profile

Steinke

2013

Simulations of the critical Ising model by means of local update algorithms suffer from critical slowing down. One way to partially compensate for the influence of this phenomenon on the runtime of simulations is using increasingly faster and parallel computer hardware. Another approach is using algorithms that do not suffer from critical slowing down, such as cluster algorithms. This paper reports on the Swendsen-Wang multi-cluster algorithm on Intel Xeon Phi coprocessor 5110P, Nvidia Tesla M2090 GPU, and x86 multi-core CPU. We present shared memory versions of the said algorithm for the simulation of the two-and three-dimensional Ising model. We use a combination of local cluster search and global label reduction by means of atomic hardware primitives. Further, we describe an MPI version of the algorithm on Xeon Phi and CPU, respectively. Significant performance improvements over known implementations of the Swendsen-Wang algorithm are demonstrated.

On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

Cordes²,

Steinke

2012

General-purpose graphics processing units (GPUs) have been found to be viable solutions for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, only few is known about using GPUs for small-scale computations. To have the GPU not be under-utilized for small problem sizes, a meaningful approach is to perform as many small-scale computations as possible in a concurrent manner. On NVIDIA Fermi GPUs, the concept of Concurrent Kernel Execution (CKE) allows for the execution of up to 16 GPU kernels on a single device. While using CKE in single-threaded CUDA programs is straightforward, for multi-threaded programs it might become a challenge to manage multiple host threads interacting with the GPU device, and in addition to have the CKE concept work properly. It can be observed that CKE performance breaks down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since in real-world applications it is common that multiple host threads process their data independently, a mechanism is needed that helps avoiding CKE breakdown. We propose a producer-consumer principle approach to manage GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We are able to demonstrate significant performance improvements with this technique in a strong scaling simulation of a small molecule solvated within a nanodroplet.

OpenMP in VASP: Threading and SIMD

Int J of Quantum Chemistry

Marsman

Kim

et al. 2018

The Vienna Ab initio Simulation Package (VASP) is a widely used electronic structure code that originally exploits process‐level parallelism through the Message Passing Interface (MPI) for work distribution within and across nodes. Architectural changes of modern parallel processors urge programmers to address thread‐ and data‐level parallelism as well to benefit most from the available compute resources within a node. We describe for VASP how to approach for an MPI + OpenMP parallelization including data‐level parallelism through OpenMP SIMD constructs together with a generic high‐level vector coding scheme. We can demonstrate an improved scalability of VASP and more than 20% gain over the MPI‐only version as well as a 2× increased performance of collective operations using the multiple‐endpoint MPI feature. The high‐level vector coding scheme applied to VASP's general gradient approximation routine gives up 9× performance gain on AVX512 platforms with the Intel compiler.

A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters

Noack

Steinke

et al. 2014

Impact of mixed precision and storage layout on additive Schwarz smoothers

Schneck

Weiser

Numerical Linear Algebra App

2021

The growing discrepancy between CPU computing power and memory bandwidth drives more and more numerical algorithms into a bandwidth-bound regime. One example is the overlapping Schwarz smoother, a highly effective building block for iterative multigrid solution of elliptic equations with higher order finite elements. Two options of reducing the required memory bandwidth are sparsity exploiting storage layouts and representing matrix entries with reduced precision in floating point or fixed point format. We investigate the impact of several options on storage demand and contraction rate, both analytically in the context of subspace correction methods and numerically at an example of solid mechanics. Both perspectives agree on the favourite scheme: fixed point representation of Cholesky factors in nested dissection storage.