2017
DOI: 10.3390/computation5040048

A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

Abstract: Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utili…

Cited by 30 publications (17 citation statements)
References: 48 publications
“…In case of larger differences, like e.g. using a GPU instead of a CPU to run the simulation [34,35], this might no longer be the case and the measurements and fits should be redone. A possible improvement to overcome this drawback would be to also add hardware details, like cache sizes, clock frequency, etc., to the estimator and try to come up with a performance model for hardware-aware predictions.…”
Section: Discussion (mentioning; confidence: 99%)
“…Our method can completely run on GPUs which achieves great performance for biomedical geometry extraction from CT and MRI images. Recently, LBM computations have been implemented on multi-core CPU platforms [35], GPU clusters [36] and heterogeneous CPU/GPU clusters [37]. Our LBM algorithm has similar computational structure and procedure to these methods, while the extra regularization step keeps the parallelism and locality.…”
Section: Discussion (mentioning; confidence: 99%)
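The "computational structure and procedure" of an LBM solver that the citing authors compare against is usually a local collision step followed by a neighbour-to-neighbour streaming step. The sketch below illustrates that structure under stated assumptions: a 2D D2Q9 lattice, the single-relaxation-time BGK collision operator, and fully periodic boundaries. None of these choices is taken from the paper, whose abstract is truncated here, and the names `equilibrium` and `step` are hypothetical. It is a single-process NumPy illustration, not the heterogeneous CPU/GPU implementation the paper describes.

```python
import numpy as np

# Assumed D2Q9 lattice: discrete velocities (c_x, c_y) and their quadrature weights
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4)


def equilibrium(rho, ux, uy):
    """Second-order BGK equilibrium for all 9 directions; returns shape (9, nx, ny)."""
    cu = 3.0 * (C[:, 0, None, None] * ux + C[:, 1, None, None] * uy)
    usq = 1.5 * (ux ** 2 + uy ** 2)
    return rho * W[:, None, None] * (1.0 + cu + 0.5 * cu ** 2 - usq)


def step(f, tau):
    """One collide-and-stream update on a fully periodic grid; f has shape (9, nx, ny)."""
    # Macroscopic moments: density and velocity in every cell
    rho = f.sum(axis=0)
    ux = (C[:, 0, None, None] * f).sum(axis=0) / rho
    uy = (C[:, 1, None, None] * f).sum(axis=0) / rho
    # Collision: relax toward the local equilibrium (reads only the cell itself)
    f_post = f - (f - equilibrium(rho, ux, uy)) / tau
    # Streaming: shift each population one cell along its velocity (neighbour access)
    for i, (cx, cy) in enumerate(C):
        f_post[i] = np.roll(f_post[i], shift=(cx, cy), axis=(0, 1))
    return f_post


# Usage: a uniform flow initialised at equilibrium on a 32 x 16 periodic grid
nx, ny = 32, 16
f0 = equilibrium(np.ones((nx, ny)), np.full((nx, ny), 0.05), np.zeros((nx, ny)))
f = f0.copy()
for _ in range(10):
    f = step(f, tau=0.8)
print(np.isclose(f.sum(), f0.sum()))  # True: total mass is conserved
```

In this sketch the collision is purely local to each cell, so it parallelises trivially across cores or GPU threads, while streaming reads only direct neighbours. In a domain-decomposed version this neighbour access is what typically requires exchanging a one-cell halo between subdomains every time step.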
“…This foreshadowed the trend for 2.5D blocking 1D streaming algorithm [13,24,31], which may be used in conjunction with temporal blocking [20]. The best performance of the applied stencil codes reaches about 30% of the peak theoretical performance [17,18,29].…”
Section: Introduction (mentioning; confidence: 97%)