Massively parallel lattice–Boltzmann codes on large GPU clusters

Calore, Enrico; Gabbana, Alessandro; Kraus, Jiří; Pellegrini, Enrico; Schifano, Sebastiano Fabio; Tripiccione, R.

doi:10.1016/j.parco.2016.08.005

Cited by 66 publications

(49 citation statements)

References 35 publications


“…In contrast to [12,16] in which a thread deals with an entire row of lattice cells, we assign one thread to each lattice cell [9,18]. This invokes a large amount of threads at runtime enabling memory latency hiding.…”

Section: Lattice Boltzmann Methods Kernels for the GPU (mentioning)

confidence: 99%

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

et al. 2017


Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utilize the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with an implementation using all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability behavior making it future-proof for heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90% are achieved leading to 2604.72 GLUPS utilizing 24,576 CPU cores and 2048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 × 10^9 lattice cells.
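The one-thread-per-cell mapping described in the citation statement above can be illustrated with a small host-side sketch. It only emulates the index arithmetic a GPU kernel would typically perform; it is not taken from the cited papers. The function names, the row-major cell ordering, and the tiny lattice size are assumptions chosen for illustration.

```python
# Host-side emulation of a one-thread-per-cell mapping (hypothetical names).
# Assumption: cells are numbered row-major, with x varying fastest.

def launch_grid(nx, ny, block_dim):
    """Number of thread blocks needed so that every lattice cell gets a thread."""
    return (nx * ny + block_dim - 1) // block_dim


def cell_of_thread(block_idx, thread_idx, block_dim, nx, ny):
    """Map a (block, thread) pair to a lattice cell (x, y), or None if surplus."""
    gid = block_idx * block_dim + thread_idx  # global thread id
    if gid >= nx * ny:
        return None  # surplus threads in the last block do no work
    return gid % nx, gid // nx  # consecutive threads touch consecutive cells


nx, ny, block_dim = 6, 4, 16
blocks = launch_grid(nx, ny, block_dim)
cells = [cell_of_thread(b, t, block_dim, nx, ny)
         for b in range(blocks) for t in range(block_dim)]
covered = [c for c in cells if c is not None]
```

With this mapping the number of threads grows with the number of lattice cells, which is what gives the hardware enough independent work to hide memory latency.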


“…In the last years several implementations of this model were developed, which were used both for convective turbulence studies [30,31], as well as for a benchmarking application for programming models and HPC hardware architectures [32][33][34][35]. In this work we utilize three different implementations of the same model.…”

Section: Lattice Boltzmann (mentioning)

confidence: 99%

“…Furthermore, to exploit CPU vector units, they both use, respectively, AVX2 and NEON intrinsics. On the other hand, the third implementation, targeting NVIDIA GPUs, exploits MPI to divide computations across several processes and then each process manages one GPU device launching CUDA kernels [35] in it.…”

Section: Lattice Boltzmann (mentioning)

confidence: 99%

“…Moreover, relieving these two threads from part of the propagate duties, while performing MPI transfers, allows to overlap MPI communications with computations. Concerning the GPU implementation, communications are handled by MPI processes, exploiting CUDA aware MPI capabilities [35] allowing us to fully overlap the communications with GPU computations [39].…”

Section: Lattice Boltzmann (mentioning)

confidence: 99%


Performance and Power Analysis of HPC Workloads on Heterogeneous Multi-Node Clusters

Mantovani, Calore 2018, JLPEA


Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU.
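The multi-GPU scheme mentioned in the citation statements above (MPI divides the computation across processes, and each process manages one GPU) can be sketched as a simple work split. The statements do not say how the lattice is partitioned, so the 1D slab decomposition along x, the rank-to-device rule, and all function names below are assumptions made for illustration only.

```python
# Sketch of splitting a lattice across MPI-style ranks, one GPU per rank.
# Assumptions: 1D slab decomposition along x; ranks are placed on nodes
# consecutively, so the local device index is rank modulo GPUs per node.

def slab_for_rank(rank, n_ranks, nx):
    """Half-open range [start, stop) of lattice columns owned by `rank`."""
    base, extra = divmod(nx, n_ranks)
    start = rank * base + min(rank, extra)  # first `extra` ranks get one more column
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def gpu_for_rank(rank, gpus_per_node):
    """Local GPU index a rank would select (hypothetical placement rule)."""
    return rank % gpus_per_node


slabs = [slab_for_rank(r, 4, 10) for r in range(4)]
devices = [gpu_for_rank(r, 2) for r in range(4)]
```

In a real code each rank would also exchange the boundary columns of its slab with its neighbours; that halo exchange is the communication the statements describe as overlapped with GPU computation.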


“…The D2Q37 model has been implemented and extensively optimized on a wide range of parallel machines like BG/Q [28] as well as on a cluster of nodes based on traditional commodity x86 CPUs [29], GPUs [30][31][32], and Xeon-Phi [33,34]. It has been extensively used for large-scale production simulations of convective turbulence [35,36].…”

Section: Applications (mentioning)

confidence: 99%

Power-Efficient Computing: Experiences from the COSA Project

Cesini, Corni, Falabella et al. 2017, Scientific Programming


Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, the performance assessment of processors with a high ratio of performance per watt is necessary to understand how to realize energy-efficient computing systems for scientific applications, using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015-2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and high energy-efficient systems based on GP-GPUs. In this work, we present the results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drained by applications while running.
