A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

Riesinger, Christoph; Bakhtiari, Arash; Schreiber, Martin; Neumann, Philipp; Bungartz, Hans‐Joachim

doi:10.3390/computation5040048

Cited by 30 publications

(17 citation statements)

References 48 publications


“…In case of larger differences, like e.g. using a GPU instead of a CPU to run the simulation [34,35], this might no longer be the case and the measurements and fits should be redone. A possible improvement to overcome this drawback would be to also add hardware details, like cache sizes, clock frequency, etc., to the estimator and try to come up with a performance model for hardware-aware predictions.…”

Section: Discussion (mentioning)

confidence: 99%

Dynamic Load Balancing Techniques for Particulate Flow Simulations

Rettinger, Rüde (2019), Computation


Parallel multiphysics simulations often suffer from load imbalances originating from the applied coupling of algorithms with spatially and temporally varying workloads. It is thus desirable to minimize these imbalances to reduce the time to solution and to better utilize the available hardware resources. Taking particulate flows as an illustrative example application, we present and evaluate load balancing techniques that tackle this challenging task. This involves a load estimation step in which the currently generated workload is predicted. We describe in detail how such a workload estimator can be developed. In a second step, load distribution strategies like space-filling curves or graph partitioning are applied to dynamically distribute the load among the available processes. To compare and analyze their performance, we apply these techniques to a benchmark scenario and observe a reduction of the load imbalances by almost a factor of four. This results in a decrease of the overall runtime by 14% for space-filling curves.


“…Our method can completely run on GPUs which achieves great performance for biomedical geometry extraction from CT and MRI images. Recently, LBM computations have been implemented on multi-core CPU platforms [35], GPU clusters [36] and heterogeneous CPU/GPU clusters [37]. Our LBM algorithm has similar computational structure and procedure to these methods, while the extra regularization step keeps the parallelism and locality.…”

Section: Discussion (mentioning)

confidence: 99%

Fully parallelized Lattice Boltzmann scheme for fast extraction of biomedical geometry

Wang, Zhao, et al. (2019), Journal of Parallel and Distributed Computing


We develop a fully parallel numerical method which quickly performs 2D and 3D segmentation on GPU to extract anatomical structures from medical images. The algorithm solves the level set equations completely within a Lattice Boltzmann model (LBM). Compared with existing LBM-based segmentation approaches, a parallel distance field regularization is added to the LBM computing scheme to keep computation stable with large time step iteration. This approach also avoids external regularization, which has been a major impediment to direct parallelization of level set evolution with LBM. It allows the whole computing process to be efficiently executed on GPU. Moreover, the method can be combined with different image features to adapt it to various image segmentation tasks. Therefore, our method enables fully GPU-accelerated geometric extraction from medical images, leading to the high computing performance demanded in many practical applications. This method is used to extract accurate 2D and 3D anatomical structures from many real-world CT and MRI images. The achieved results can also directly feed the required boundary information to LBM-based hemodynamics simulation.


“…This foreshadowed the trend for 2.5D blocking 1D streaming algorithm [13,24,31], which may be used in conjunction with temporal blocking [20]. The best performance of the applied stencil codes reaches about 30% of the peak theoretical performance [17,18,29].…”

Section: Introduction (mentioning)

confidence: 97%

Performance Limits Study of Stencil Codes on Modern GPGPUs

Pershin, Levchenko, Perepelkina (2019), JSFI


We study the performance limits of different algorithmic approaches to implementing a sample problem: solving the wave equation with a cross stencil scheme. With this, we aim to find the highest limit of the achievable performance efficiency for stencil computing. To estimate the limits, we use a quantitative Roofline model to make a thorough analysis of the performance bottlenecks and develop the model further to account for the latency of different levels of GPU memory. These estimates provide an incentive to use spatial and temporal blocking algorithms. Thus, we study stepwise, domain decomposition, and domain decomposition with halo algorithms, in that order. Knowing the limit motivates optimizing the implementation, which led to an analysis of the block synchronization methods in CUDA that is also provided in the text. After all optimizations, we have achieved 90% of the peak performance, which amounts to more than 1 trillion cell updates per second on one consumer-level GPU device.

