2019 18th International Symposium on Parallel and Distributed Computing (ISPDC)
DOI: 10.1109/ispdc.2019.000-2

Toward Full GPU Implementation of Fluid-Structure Interaction

Cited by 9 publications (10 citation statements). References 6 publications.
“…This is feasible since the most refined blood cell model has fewer points (discretized membrane) than the maximum allowed number of threads per block (hardware constraint). Keeping all points of a cell within a CUDA block allows us to compute the entire solver time step in one CUDA kernel call and make good use of cache and shared memory [44].…”
Section: Methods
confidence: 99%
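The excerpt describes a one-cell-per-CUDA-block layout. Below is a minimal, hypothetical CUDA sketch of that idea, assuming one thread per membrane vertex and a simple placeholder spring force; the names (cell_step, Vertex, NV) and the force model are illustrative, not taken from the cited code [44].

```cuda
#include <cuda_runtime.h>

// Illustrative vertex record for one membrane point (not the authors' layout).
struct Vertex { float3 pos, vel; };

// Hypothetical per-cell time step: block b owns cell b, thread t owns vertex t.
// NV = vertices per cell; it must stay below the max threads per block (1024).
template <int NV>
__global__ void cell_step(Vertex* cells, float dt)
{
    __shared__ float3 pos[NV];            // the whole cell staged in shared memory

    Vertex* cell = cells + blockIdx.x * NV;
    const int t = threadIdx.x;

    pos[t] = cell[t].pos;                 // one coalesced load from global memory
    __syncthreads();

    // Placeholder spring force from the two ring neighbours of vertex t;
    // neighbour positions come from fast shared memory, not global memory.
    const int l = (t + NV - 1) % NV, r = (t + 1) % NV;
    float3 f;
    f.x = pos[l].x + pos[r].x - 2.0f * pos[t].x;
    f.y = pos[l].y + pos[r].y - 2.0f * pos[t].y;
    f.z = pos[l].z + pos[r].z - 2.0f * pos[t].z;

    // Explicit Euler update; the entire step happens in this single kernel.
    cell[t].vel.x += dt * f.x;  cell[t].vel.y += dt * f.y;  cell[t].vel.z += dt * f.z;
    cell[t].pos.x = pos[t].x + dt * cell[t].vel.x;
    cell[t].pos.y = pos[t].y + dt * cell[t].vel.y;
    cell[t].pos.z = pos[t].z + dt * cell[t].vel.z;
}
```

A launch such as `cell_step<642><<<num_cells, 642>>>(d_cells, dt);` (642 vertices is, for example, a common subdivided-icosahedron membrane size) keeps each cell resident in one block, which is what allows the whole solver step to run in a single kernel call.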
“…Another counter-argument is that some numerical methods, such as the IBM, have a large and complex memory footprint that renders them less GPU-friendly. An earlier attempt [44] to port the whole framework to GPUs could not serve as a justification to move in this direction, the main bottleneck being the irregular memory access patterns of the IBM, even though all computations were performed on just one GPU (a data-locality advantage). Figure 5 presents the execution time per iteration for the hybrid (CPU/GPU) version and the CPU-only version.…”
Section: Performance Analysis
confidence: 99%
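As background to the "irregular memory patterns" remark, here is a hedged CUDA sketch of the IBM interpolation step: each membrane point gathers fluid velocity from a data-dependent 4×4×4 stencil of lattice nodes, so adjacent threads read scattered, uncoalesced addresses. The kernel, its names, and the AoS layout are assumptions for illustration; only Peskin's 4-point delta kernel is standard.

```cuda
#include <cuda_runtime.h>

// Peskin's 4-point discrete delta kernel (1D factor) -- standard in the IBM.
__device__ float weight(float r)
{
    r = fabsf(r);
    if (r < 1.0f) return 0.125f * (3.0f - 2.0f*r + sqrtf( 1.0f + 4.0f*r - 4.0f*r*r));
    if (r < 2.0f) return 0.125f * (5.0f - 2.0f*r - sqrtf(-7.0f + 12.0f*r - 4.0f*r*r));
    return 0.0f;
}

// Hypothetical IBM velocity interpolation (gather). u is the fluid velocity on
// an NX x NY x NZ lattice (AoS layout for brevity); points holds the current
// membrane-point positions in lattice units.
__global__ void ibm_interpolate(const float3* __restrict__ u,
                                const float3* __restrict__ points,
                                float3* __restrict__ point_vel,
                                int n_points, int NX, int NY, int NZ)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_points) return;

    float3 x = points[p];
    int i0 = (int)floorf(x.x) - 1;        // corner of the 4x4x4 support stencil
    int j0 = (int)floorf(x.y) - 1;
    int k0 = (int)floorf(x.z) - 1;

    float3 v = make_float3(0.0f, 0.0f, 0.0f);
    for (int k = 0; k < 4; ++k)
      for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i) {
            int I = i0 + i, J = j0 + j, K = k0 + k;
            if (I < 0 || I >= NX || J < 0 || J >= NY || K < 0 || K >= NZ) continue;
            // Data-dependent address: neighbouring threads gather from
            // unrelated cache lines, hence the uncoalesced access pattern.
            int idx = (K * NY + J) * NX + I;
            float w = weight(x.x - I) * weight(x.y - J) * weight(x.z - K);
            v.x += w * u[idx].x;  v.y += w * u[idx].y;  v.z += w * u[idx].z;
        }
    point_vel[p] = v;
}
```

Because the stencil corner depends on where each point currently sits, the addresses cannot be known at compile time, which is exactly what makes the IBM less GPU-friendly than the regular, stencil-on-a-grid LBM streaming step.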
“…Nvidia GPUs, CPUs and Nvidia GPUs [16-26], or mobile SoCs [116,117]; only a few use OpenCL [5-15]. With FluidX3D also being implemented in OpenCL, we are able to benchmark our code across a large variety of hardware, from the world's fastest data-center GPUs, through gaming GPUs and CPUs, to the GPUs of mobile-phone ARM SoCs.…”
Section: Memory and Performance Comparison
confidence: 99%
“…Nevertheless, only a few papers [17,32,33,49,57,61] provide some comparison of how floating-point formats affect the accuracy of the LBM, and they mostly find only insignificant differences between FP64 and FP32, except at very low velocity, where floating-point round-off leads to spontaneous symmetry breaking. Besides the question of accuracy, a quantitative performance comparison across different hardware microarchitectures is missing, as the vast majority of LBM software is written either only for CPUs [62-74] or only for Nvidia GPUs, or for CPUs and Nvidia GPUs [16-26].…”
Section: Introduction
confidence: 99%
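To make the round-off claim concrete, a small sketch, assuming the common second-order LBM equilibrium f_eq = w·ρ·(1 + 3(c·u) + 4.5(c·u)² − 1.5(u·u)): at u ≈ 1e-9 the term 3(c·u) falls below FP32 machine epsilon relative to the leading 1, so the velocity contribution is rounded away entirely, while FP64 retains it. The function below is illustrative, not from any of the cited codes.

```cuda
#include <cstdio>

// Generic second-order LBM equilibrium for one lattice direction:
//   f_eq = w * rho * (1 + 3(c.u) + 4.5(c.u)^2 - 1.5(u.u))
// Templated on the floating-point type so FP32 and FP64 can be compared.
template <typename T>
__host__ __device__ T feq(T w, T rho, T cu, T uu)
{
    return w * rho * (T(1) + T(3)*cu + T(4.5)*cu*cu - T(1.5)*uu);
}

int main()
{
    const double u = 1e-9;              // very low velocity in lattice units
    const double w = 1.0 / 18.0, rho = 1.0;

    double f64 = feq<double>(w, rho, u, u*u);
    float  f32 = feq<float>((float)w, (float)rho, (float)u, (float)(u*u));

    // In FP32, 1 + 3e-9 rounds back to exactly 1 (machine epsilon ~1.2e-7),
    // so the velocity contribution vanishes; FP64 still resolves it.
    std::printf("FP64: %.17g\n", f64);
    std::printf("FP32: %.9g\n", (double)f32);   // prints exactly w * rho
    return 0;
}
```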
“…Yao et al. (2017) accelerated the multiple-relaxation-time LBM model, computed an 896 × 768 × 4 fluid flow on six GPUs, and obtained a speedup of 95×. Bény et al. (2019) used CUDA to simulate fluid-structure interaction problems. The CPU code used for comparison was executed with the high-performance LBM library Palabos (Latt, 2009).…”
Section: Introduction
confidence: 99%