2020
DOI: 10.1109/access.2020.2993103
GPU Acceleration of a Non-Standard Finite Element Mesh Truncation Technique for Electromagnetics

Abstract: The emergence of General Purpose Graphics Processing Units (GPGPUs) provides new opportunities to accelerate applications involving a large number of regular computations. However, properly leveraging the computational resources of graphical processors is a very challenging task. In this paper, we use this kind of device to parallelize FE-IIEE (Finite Element-Iterative Integral Equation Evaluation), a non-standard finite element mesh truncation technique introduced by two of the authors. This application is co…

Cited by 4 publications (2 citation statements)
References 47 publications
“…As we increase both parameters, the spatial cost of the algorithm also increases, and we exhaust the resources available in the GPU. As we pointed out in [14], the main factor that limits the performance of this parallel algorithm is the number of registers available in each streaming multiprocessor. The computations involved by each iteration of the loop on S involve the use of a very large number of small vectors and scalar variables local to every CUDA thread, which use up even the large number of registers available on most modern GPUs.…”
Section: CUDA Results on GPU
confidence: 99%
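The register-pressure bottleneck described in this citation can be made concrete with a small sketch. The kernel below is purely illustrative (it is not the authors' FE-IIEE code): each thread holds several small per-thread arrays and scalars, which the compiler maps to registers, and the host then queries how many registers the compiler actually assigned via `cudaFuncGetAttributes`. When that per-thread count is high, fewer warps fit in each streaming multiprocessor and occupancy drops.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a register-heavy per-thread working set,
// mimicking the pattern described in the citation: many small local
// vectors and scalars per CUDA thread inside the loop over S.
__global__ void registerHeavyKernel(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Small per-thread arrays: the compiler keeps these in registers
    // when it can, which is fast but consumes the SM's register file.
    float r[3], rp[3], acc[3] = {0.f, 0.f, 0.f};
    for (int k = 0; k < 3; ++k) r[k] = in[3 * tid + k];

    for (int s = 0; s < 64; ++s) {        // placeholder loop over S
        for (int k = 0; k < 3; ++k) {
            rp[k] = r[k] - (float)s;      // placeholder geometry term
            acc[k] += rp[k] * rp[k];
        }
    }
    out[tid] = acc[0] + acc[1] + acc[2];
}

int main() {
    // Ask the runtime how many registers per thread the compiler used;
    // dividing the SM's register file by this number bounds how many
    // threads can be resident at once.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, registerHeavyKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` reports the same per-kernel register count at build time, which is the usual way to diagnose this limit.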
“…The algorithm cuda S, introduced in [14], is based on a kernel that implements all computations involved in each iteration of the loop on S. The algorithm tries to optimise the management of the different kinds of GPU memory by leveraging the register file and the shared memory of the GPU. Specifically, we copy the elements of vector currents to the shared memory of each block of threads in order to reduce the number of accesses to global memory.…”
Section: CUDA Parallelization on GPU
confidence: 99%
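The shared-memory staging strategy this citation describes can be sketched as follows. This is an illustrative tile-based pattern under stated assumptions, not the actual cuda S implementation from [14]: each block cooperatively copies a tile of the currents vector into shared memory once, so the repeated reads inside the loop on S hit fast on-chip memory instead of global memory. The names (`currents`, `TILE`, `iterateOverS`) are hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block and tile width (assumed)

// Sketch: stage a tile of `currents` into shared memory, then let every
// thread in the block reuse it across the loop on S, cutting the number
// of global-memory accesses roughly by the block size.
__global__ void iterateOverS(const float *currents, float *field, int nS) {
    __shared__ float sCurrents[TILE];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float acc = 0.f;
    for (int base = 0; base < nS; base += TILE) {
        // Each thread loads one element of the tile from global memory.
        int idx = base + threadIdx.x;
        sCurrents[threadIdx.x] = (idx < nS) ? currents[idx] : 0.f;
        __syncthreads();                  // tile fully staged

        int limit = min(TILE, nS - base);
        for (int s = 0; s < limit; ++s)   // every thread reuses the tile
            acc += sCurrents[s];          // placeholder for the kernel math
        __syncthreads();                  // safe to overwrite the tile
    }
    if (tid < nS) field[tid] = acc;
}
```

The two `__syncthreads()` barriers are essential: the first guarantees the tile is complete before any thread reads it, and the second prevents a fast thread from overwriting the tile while slower threads are still reading it.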