Exploring New Architectures in Accelerating CFD for Air Force Applications

Dongarra, Jack; Peterson, Gregory D.; Tomov, Stanimire; Allred, Jeffrey; Natoli, V.; Richie, David A.

doi:10.1109/dod.hpcmp.ugc.2008.12

Cited by 23 publications

(15 citation statements)

References 14 publications

“…In May, Volkov and Demmel [35] described LU, QR, and Cholesky factorizations running at up to 180 GFlop/s in single precision (with QR a little bit more). The first results on a pre-released next-generation G90 NVIDIA card were presented at UGC2008 in May, where Dongarra et al [11] reported Cholesky running at up to 327 GFlop/s in single precision. Using again the newest generation card, in this paper, we describe an LU algorithm running at up to 388 GFlop/s in single precision and 99.4 Gflop/s in double precision.…”

Section: GPUs for DLA (mentioning)

confidence: 99%

“…This is illustrated for Cholesky factorization (so called left-looking version) in Fig. 2 (the case reported in [11]). The matrix to be factorized is allocated on the GPU memory and the code is as in LAPACK with BLAS calls replaced by CUBLAS, which represents the first idea from the list above.…”

Section: GPUs for DLA (mentioning)

confidence: 99%
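
The left-looking blocked Cholesky scheme mentioned in the statement above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from [11]: the function name `left_looking_cholesky` and the block size `nb` are hypothetical, and plain loops stand in for the BLAS/LAPACK routines (SYRK, GEMM, POTRF, TRSM) that a GPU version would hand to CUBLAS.

```python
import math


def left_looking_cholesky(a, nb=2):
    """Return lower-triangular L with L * L^T == a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    `nb` is an illustrative block (panel) width, not a tuned value.
    """
    n = len(a)
    L = [list(map(float, row)) for row in a]
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        cols = range(j, j + jb)
        # "Left-looking" update: apply all previously factored columns
        # (0..j-1) to the current block column. On the diagonal block this
        # plays the role of SYRK, on the rows below it the role of GEMM.
        for r in range(j, n):
            for c in cols:
                L[r][c] -= sum(L[r][k] * L[c][k] for k in range(j))
        # Role of POTRF: unblocked Cholesky of the jb x jb diagonal block.
        for c in cols:
            d = L[c][c] - sum(L[c][k] ** 2 for k in range(j, c))
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            L[c][c] = math.sqrt(d)
            for r in range(c + 1, j + jb):
                s = sum(L[r][k] * L[c][k] for k in range(j, c))
                L[r][c] = (L[r][c] - s) / L[c][c]
        # Role of TRSM: solve X * L_jj^T = B for the rows below the block.
        for r in range(j + jb, n):
            for c in cols:
                s = sum(L[r][k] * L[c][k] for k in range(j, c))
                L[r][c] = (L[r][c] - s) / L[c][c]
    # Clear the strictly upper triangle so only L remains.
    for r in range(n):
        for c in range(r + 1, n):
            L[r][c] = 0.0
    return L
```

In a GPU port along the lines the statement describes, the matrix would stay in GPU memory and each commented step would become one library call; the loop structure over block columns would stay the same.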

“…Namely, while developing LAPACK-style one-sided matrix factorizations for GPUs, several groups [2,4,36] observed that the panel factorizations are often faster on the CPU than on the GPU (the approach described in Section 2.2), which led to the development of highly efficient factorizations [11,29,34] where a single CPU core is used to control the GPU and to factor the panels. In contrast to this previous work, the importance here is on developing algorithms that will efficiently use both a multicore and a GPU.…”

Section: Design Philosophy (mentioning)

confidence: 99%

Towards dense linear algebra for hybrid GPU accelerated manycore systems

2010

Self Cite

Abstract. We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.

“…Their use in general-purpose computations [24], and more specifically in CFD [5], is promising. Successful attempts were made to implement LBM solvers on the GPU [6].…”

Section: Introduction (mentioning)

confidence: 99%

The TheLMA project: A thermal lattice Boltzmann solver for the GPU

et al. 2012

In this paper, we consider the implementation of a thermal flow solver based on the lattice Boltzmann method (LBM) for graphics processing units (GPU). We first describe the hybrid thermal LBM model implemented, and give a concise review of the CUDA technology. The specific issues that arise with LBM on GPUs are outlined. We propose an approach for efficient handling of the thermal part. Performance is close to optimum and is significantly better than the one of comparable CPU solvers. We validate our code by simulating the differentially heated cubic cavity (DHC). The computed results for steady flow patterns are in good agreement with previously published ones. Finally, we use our solver to study the phenomenology of transitional flows in the DHC.

“…Nevertheless, this approach can lead to high performance, but only after some modifications and for routines that map well on the GPU, like Cholesky (e.g. Dongarra et al [8] report up to 327 GFlop/s in single precision on a pre-released at the time NVIDIA T10P). Naturally, previous attempts to wrap some of the work needed in transitions like this in frameworks, have also failed to produce convincing results.…”

Section: Introduction (mentioning)

confidence: 99%

A Note on Auto-tuning GEMM for GPUs

Dongarra

Tomov

2009

Lecture Notes in Computer Science

Self Cite

138

Abstract. The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13,11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even backward engineering to understand some undocumented insides about the architecture that have been of key importance in the development [12]. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas. Auto-tuning, as we show in this paper, is a very practical solution where in addition to getting an easy portability, we can often get substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280).
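
The auto-tuning idea in this abstract, trying several variants of a kernel and keeping the fastest one on the hardware at hand, can be illustrated with a toy sketch. This is not the method of the cited paper: the names `blocked_matmul` and `autotune_block_size`, the candidate block sizes, and the pure-Python kernel are assumptions made only for illustration, and a real GPU tuner would search over kernel parameters such as thread-block shapes rather than a CPU loop tile.

```python
import time


def blocked_matmul(a, b, bs):
    """Multiply a (n x m) by b (m x p) using square tiles of size bs."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, m, bs):
            for j0 in range(0, p, bs):
                for i in range(i0, min(i0 + bs, n)):
                    ci, ai = c[i], a[i]
                    for k in range(k0, min(k0 + bs, m)):
                        aik, bk = ai[k], b[k]
                        for j in range(j0, min(j0 + bs, p)):
                            ci[j] += aik * bk[j]
    return c


def autotune_block_size(n=48, candidates=(4, 8, 16, 48), repeats=3):
    """Time each candidate tile size on an n x n problem; return the fastest.

    Returns (best_block_size, timings), where timings maps each candidate
    to its best observed wall-clock time in seconds.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_matmul(a, b, bs)
            best = min(best, time.perf_counter() - t0)
        timings[bs] = best
    return min(timings, key=timings.get), timings
```

Calling `autotune_block_size()` returns the winning tile size together with the measured timings; the winner depends on the machine, which is the reason for measuring rather than hard-coding a value.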
