2016
DOI: 10.1016/j.parco.2016.08.005

Massively parallel lattice–Boltzmann codes on large GPU clusters

Abstract: This paper describes a massively parallel code for a state-of-the-art thermal Lattice Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters…

Cited by 66 publications (49 citation statements) | References 35 publications

“…In contrast to [12,16] in which a thread deals with an entire row of lattice cells, we assign one thread to each lattice cell [9,18]. This invokes a large amount of threads at runtime enabling memory latency hiding.…”
Section: Lattice Boltzmann Method Kernels for the GPU (mentioning)
confidence: 99%
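The one-thread-per-cell mapping quoted above lends itself to a short illustration. The sketch below is a minimal CUDA propagate kernel assuming a structure-of-arrays layout and periodic boundaries; the kernel name, lattice dimensions, and constant velocity tables are hypothetical placeholders, not code from the cited papers. Its only purpose is to show how assigning one thread to each lattice cell puts a very large number of threads in flight, which is what hides global-memory latency.

```cuda
// Minimal sketch (hypothetical names, not the authors' code): one CUDA thread
// per lattice cell. Each thread streams the NPOP populations of its own site.

#define NPOP 37                      // D2Q37: 37 populations per lattice site

// Discrete velocity components, copied from the host with cudaMemcpyToSymbol().
__constant__ int cx[NPOP];
__constant__ int cy[NPOP];

__global__ void propagate(const double *__restrict__ prv,   // populations at time t
                          double *__restrict__ nxt,         // populations at time t+1
                          int LX, int LY)                    // lattice size
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;          // this thread's cell
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= LX || iy >= LY) return;

    int nsites = LX * LY;
    int dst = iy * LX + ix;

    for (int p = 0; p < NPOP; ++p) {
        // Gather population p from the site it streams in from (periodic wrap
        // here; production codes use halo frames and precomputed offsets).
        int jx = (ix - cx[p] + LX) % LX;
        int jy = (iy - cy[p] + LY) % LY;
        nxt[p * nsites + dst] = prv[p * nsites + (jy * LX + jx)];
    }
}
```

A typical launch would use a 2D grid, e.g. dim3 block(32, 8) with a grid sized to cover LX × LY cells, so realistic lattice sizes already generate hundreds of thousands of concurrent threads.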
“…In the last years several implementations of this model were developed, which were used both for convective turbulence studies [30,31], as well as for a benchmarking application for programming models and HPC hardware architectures [32][33][34][35]. In this work we utilize three different implementations of the same model.…”
Section: Lattice Boltzmann (mentioning)
confidence: 99%
“…Furthermore, to exploit CPU vector units, they both use, respectively, AVX2 and NEON intrinsics. On the other hand, the third implementation, targeting NVIDIA GPUs, exploits MPI to divide computations across several processes and then each process manages one GPU device launching CUDA kernels [35] in it.…”
Section: Lattice Boltzmann (mentioning)
confidence: 99%
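As a rough illustration of the MPI-plus-CUDA structure described in this citation, the following skeleton has each MPI rank bind to one GPU on its node and then drive the per-step kernels on its own sub-lattice. The round-robin device binding, the step count, and the loop structure are assumptions made for the sketch, not details of the code in [35].

```cuda
/* Sketch only: one MPI rank per GPU; each rank launches the CUDA kernels
 * for its sub-lattice. Assumes ranks are placed consecutively on each node
 * and that at least one GPU is visible per node. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Bind this rank to one GPU on its node (round-robin over local devices). */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    const int nsteps = 1000;            /* hypothetical number of time steps */
    for (int step = 0; step < nsteps; ++step) {
        /* 1. exchange halo populations of the local sub-lattice with
              neighbouring ranks (e.g. MPI_Sendrecv or non-blocking calls); */
        /* 2. launch the propagate and collide kernels on this rank's GPU;  */
        /* 3. synchronize the device before the next halo exchange.         */
    }

    printf("rank %d of %d done\n", rank, nranks);
    MPI_Finalize();
    return 0;
}
```

Production codes of this kind typically overlap the halo exchange with computation on the bulk of the sub-lattice, which is what preserves scaling as the number of GPUs grows.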
“…The D2Q37 model has been implemented and extensively optimized on a wide range of parallel machines like BG/Q [28] as well as on a cluster of nodes based on traditional commodity x86 CPUs [29], GPUs [30][31][32], and Xeon-Phi [33,34]. It has been extensively used for large-scale production simulations of convective turbulence [35,36].…”
Section: Applications (mentioning)
confidence: 99%