Batch Solution of Small PDEs with the OPS DSL

Reguly, István Z.; Moore, Branden J.; Schmielau, T.; Toit, Jacques du; Mudalige, Gihan R.

doi:10.1007/978-3-030-34356-9_12

Cited by 5 publications

(15 citation statements)

References 20 publications

“…However, we note that OpenCL could equally be used to implement the same design. Finally, we compare performance on the FPGA to an NVIDIA Tesla V100 GPU using the tridiagonal solver library tridsolver, implemented by László et al. [13], [1], using its batched version presented by Reguly et al. [22]. This GPU library has been shown [6] to provide matching or better performance than the two current batch tridiagonal solver functions, cusparse<t>gtsv2StridedBatch() and cusparse<t>gtsvInterleavedBatch(), in Nvidia's cuSPARSE library [4], [25].…”

Section: Performance (mentioning)

confidence: 99%

High Throughput Multidimensional Tridiagonal Systems Solvers on FPGAs

Kamalakkannan, Reguly, Fahmy, et al. 2022

Preprint

Self Cite

This paper presents a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new features of the library are (1) the unification of standard state-of-the-art techniques for implementing implicit numerical solvers with a number of novel high-gain optimizations such as vectorization and batching, motivated by multiple multi-dimensional systems common in real-world applications, (2) data-flow techniques that provide application-specific optimizations for both 2D and 3D problems, including integration of explicit loops commonplace in real workloads, and (3) the development of a predictive analytic model to explore the design space and obtain rapid resource and performance estimates. The new library provides an order of magnitude better performance when solving large batches of systems compared to Xilinx's current tridiagonal solver library. Two representative applications are implemented using the new solver on a Xilinx Alveo U280 FPGA, demonstrating over 85% predictive model accuracy. These are compared with a current state-of-the-art GPU library for solving multi-dimensional tridiagonal systems on an Nvidia V100 GPU, analyzing time to solution, bandwidth, and energy consumption. Results show the FPGAs achieving competitive or better runtime performance for a range of multi-dimensional mesh problems compared to the V100 GPU. Additionally, the significant energy savings offered by FPGA implementations, over 30% for the most complex application, are quantified. We discuss the algorithmic trade-offs required to obtain good performance on FPGAs, giving insights into the feasibility and profitability of FPGA implementations.
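
For readers unfamiliar with the direct solution of batched tridiagonal systems mentioned in the abstract above, the following is a minimal sketch of the classic Thomas algorithm applied to a batch of independent systems. It is an illustration only, not the FPGA library, tridsolver, or cuSPARSE; the function name `thomas_batch` and its argument layout are assumptions made for this example, and the sketch assumes each system is diagonally dominant so that no pivoting is needed.

```python
def thomas_batch(a, b, c, d):
    """Solve a batch of independent tridiagonal systems (Thomas algorithm).

    Hypothetical helper for illustration. Each argument is a list of rows,
    one row per system:
      a = sub-diagonal (a[k][0] is unused), b = main diagonal,
      c = super-diagonal (c[k][-1] is unused), d = right-hand side.
    Assumes diagonally dominant systems; no pivoting is performed.
    """
    solutions = []
    for ak, bk, ck, dk in zip(a, b, c, d):
        n = len(bk)
        cp = [0.0] * n  # modified super-diagonal
        dp = [0.0] * n  # modified right-hand side
        # Forward elimination
        cp[0] = ck[0] / bk[0]
        dp[0] = dk[0] / bk[0]
        for i in range(1, n):
            denom = bk[i] - ak[i] * cp[i - 1]
            cp[i] = ck[i] / denom
            dp[i] = (dk[i] - ak[i] * dp[i - 1]) / denom
        # Back substitution
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i + 1]
        solutions.append(x)
    return solutions
```

The outer loop over systems carries no dependencies between iterations, which is the property that batched GPU and FPGA solvers exploit; only the inner recurrences are sequential.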

“…Regarding runtime performance we measured the performance of a stochastic local volatility (SLV) model ported to OPS (see [14]). SLV models constitute state-of-the-art models to describe asset price processes, notably foreign exchange rates.…”

Section: Results (mentioning)

confidence: 99%

Automatic parallel implementations of adjoint codes for structured mesh applications

Balogh, Reguly 2020

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

Self Cite

Algorithmic Differentiation (AD) has been shown to be an essential tool for obtaining sensitivity information in multiple areas of science, such as Computational Fluid Dynamics (CFD) applications or finance. Yet no adequate tool exists to ease the cost of providing performance-portable AD codes, especially for modern hardware like GPU clusters. This paper sketches our plans and progress so far to extend the OPS framework with an adjoint tape (storage for descriptors of intermediate steps and intermediate states of variables) and shows preliminary performance results on CPU nodes. OPS (Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. Our work aims to exploit the benefits of OPS to provide performance-portable adjoint implementations for future structured mesh stencil applications using OPS with minimal modifications.
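
The adjoint tape described in this abstract (storage for descriptors of intermediate steps and intermediate states of variables) can be illustrated with a small scalar example. The sketch below is a generic reverse-mode AD tape, not the OPS implementation; the class `Tape` and its methods are hypothetical names chosen for this illustration.

```python
class Tape:
    """Minimal reverse-mode AD tape (illustrative only, not the OPS API)."""

    def __init__(self):
        self.values = []  # intermediate states of variables
        self.steps = []   # descriptors: (output index, [(input index, partial)])

    def variable(self, value):
        # Register a new value on the tape and return its index.
        self.values.append(float(value))
        return len(self.values) - 1

    def add(self, i, j):
        k = self.variable(self.values[i] + self.values[j])
        self.steps.append((k, [(i, 1.0), (j, 1.0)]))
        return k

    def mul(self, i, j):
        k = self.variable(self.values[i] * self.values[j])
        # Local partials: d(xy)/dx = y, d(xy)/dy = x
        self.steps.append((k, [(i, self.values[j]), (j, self.values[i])]))
        return k

    def gradient(self, out):
        # Replay the recorded steps in reverse, accumulating adjoints.
        adj = [0.0] * len(self.values)
        adj[out] = 1.0
        for k, partials in reversed(self.steps):
            for i, p in partials:
                adj[i] += p * adj[k]
        return adj
```

For example, recording f = x*y + x at x = 3, y = 4 and calling `gradient` on the output index gives the sensitivities of f with respect to every recorded variable in a single backward sweep.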

“…Thus if a large number of smaller meshes are to be solved, as is the case in financial applications [27], then processing one mesh at a time incurs significant latencies. This motivates the idea of grouping together meshes with the same dimensions in batches, increasing the overall throughput of the solve.…”

Section: B Batching (mentioning)

confidence: 99%
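
The grouping step described in the statement above, collecting meshes with the same dimensions so they can be processed as one batch, can be sketched as follows. This is a schematic illustration and not code from either paper; `group_meshes_by_shape` and the `(name, shape)` input format are assumptions made for this example.

```python
from collections import defaultdict


def group_meshes_by_shape(meshes):
    """Group mesh names by their dimensions.

    Hypothetical helper: `meshes` is a list of (name, shape) pairs, where
    shape is a tuple such as (64, 64). Meshes sharing a shape land in the
    same batch and could then be solved together in one pass.
    """
    batches = defaultdict(list)
    for name, shape in meshes:
        batches[shape].append(name)
    return dict(batches)
```

Each resulting batch can be handed to a solver as a single unit, amortizing per-launch latency across all meshes of that shape.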

“…Baseline FPGA performance is significantly better than on the V100, since the GPU is not saturated by this application. The batching of 2D meshes as in [27] improves GPU performance significantly and offers a closer comparison. The FPGA achieves a maximum speedup of about 30-34% for different mesh sizes and batching sizes of 100 (100B) and 1000 (1000B).…”

Section: A Poisson-5pt-2d (mentioning)

confidence: 99%