Abstract: We present an MPI-based software library for computing fast Fourier transforms on massively parallel, distributed memory architectures. Similar to established transpose FFT algorithms, we propose a parallel FFT framework that is based on a combination of local FFTs, local data permutations and global data transpositions. This framework can be generalized to arbitrary multi-dimensional data and process meshes. All performance-relevant building blocks can be implemented with the help of the FFTW software library. Therefore, our library offers great flexibility and portable performance. Like FFTW, we are able to compute FFTs of complex data, real data and even- or odd-symmetric real data. All the transforms can be performed completely in place. Furthermore, we propose an algorithm to calculate pruned FFTs more efficiently on distributed memory architectures. For example, we provide performance measurements of FFTs of size 512^3 and 1024^3 on up to 262144 cores on a BlueGene/P architecture.
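The combination of local FFTs, local data permutations and a global data transposition described in this abstract can be illustrated for a two-dimensional array. The following is only a single-process NumPy sketch under stated assumptions: the "processes" are emulated as entries of a Python list, the global transposition that would be an MPI all-to-all exchange is emulated by list indexing, and the function name transpose_fft2 is hypothetical. It is not the API of the library or of FFTW.

```python
import numpy as np

def transpose_fft2(blocks):
    """Emulated transpose FFT of a 2D array distributed as row slabs.

    blocks: list of P arrays of shape (n0/P, n1), the row slabs owned by
            P emulated processes (n0 and n1 must be divisible by P).
    Returns a list of P arrays of shape (n1/P, n0): row slabs of the
    TRANSPOSED 2D FFT, which is the natural output layout of this scheme.
    """
    P = len(blocks)

    # Local FFTs along axis 1, which is fully available on each process.
    blocks = [np.fft.fft(b, axis=1) for b in blocks]

    # Global transposition (emulated all-to-all): process p cuts its slab
    # into P column chunks and "sends" chunk q to process q.
    send = [np.split(b, P, axis=1) for b in blocks]
    recv = [[send[p][q] for p in range(P)] for q in range(P)]

    # Local permutation: stack the received chunks and transpose, so that
    # the former axis 0 becomes the contiguous, fully local axis.
    blocks = [np.concatenate(pieces, axis=0).T for pieces in recv]

    # Local FFTs along the axis that has just become local.
    return [np.fft.fft(b, axis=1) for b in blocks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 12)) + 1j * rng.standard_normal((8, 12))
    out = transpose_fft2(np.split(A, 4, axis=0))   # 4 emulated processes
    result = np.concatenate(out, axis=0)           # shape (12, 8)
    print(np.allclose(result, np.fft.fft2(A).T))
```

Leaving the output in transposed layout avoids a second global exchange; a caller that needs the original layout would have to transpose back.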
Based on a parallel scalable library for Coulomb interactions in particle systems, a comparison between the fast multipole method (FMM), multigrid-based methods, fast Fourier transform (FFT)-based methods, and a Maxwell solver is provided for the case of three-dimensional periodic boundary conditions. These methods are directly compared with respect to complexity, scalability, performance, and accuracy. To ensure comparable conditions for all methods and to cover typical applications, we tested all methods on the same set of computers using identical benchmark systems. Our findings suggest that, depending on system size and desired accuracy, the FMM- and FFT-based methods are most efficient in performance and stability.
We present an analysis of different methods to calculate the classical electrostatic Hartree potential created by charge distributions. Our goal is to provide the reader with an estimate of the performance, in terms of both numerical complexity and accuracy, of popular Poisson solvers, and to give an intuitive idea of the way these solvers operate. Highly parallelisable routines have been implemented in the first-principles simulation code OCTOPUS to be used in our tests, so that reliable conclusions about the capability of methods to tackle large systems in cluster computing can be obtained from our work.
Abstract. Starting from an approved serial algorithm, we develop a new parallel algorithm for calculating nonequispaced fast Fourier transforms on massively parallel distributed memory architectures. We demonstrate how to deal with the inherent load imbalance of the serial algorithm due to the use of oversampled FFT. This algorithm has been implemented in a new open source software library called PNFFT. Furthermore, we derive a new parallel distributed memory algorithm for the fast computation of fully Coulomb interactions in a charged particle system with nonperiodic boundary conditions based on a particle-mesh approximation scheme. We show that an appropriate adjustment of the underlying parallel nonequispaced fast Fourier transform circumvents severe load imbalance due to particle scaling. To prove the high scalability of our algorithms we provide performance results on a BlueGene/P system using up to 65536 cores.
Key words. parallel nonequispaced fast Fourier transform, parallel fast summation, parallel particle-mesh methods, NFFT
AMS subject classifications. 65T50, 65Y05
DOI. 10.1137/120888478
1. Introduction. A broad variety of mathematical algorithms and applications depends on the calculation of the nonequispaced discrete Fourier transform (NDFT), which is a generalization of the discrete Fourier transform to nonequispaced nodes. In particular, its fast approximate realization, called the nonequispaced fast Fourier transform (NFFT) [8,3,55,59,52,20,31], led to the development of a large number of fast numerical algorithms, e.g., in computerized tomography [16,9], particle simulation [50,23], and spectral methods on adaptive grids, just to name a few examples. An extensive list of applications can be found, e.g., in [20]. Roughly speaking, the NFFT consists of three steps: first, a deconvolution in the frequency domain; second, a fast Fourier transform (FFT); and, finally, a discrete convolution in the spatial domain.
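The three steps can be sketched in one dimension as follows. This is a minimal serial NumPy illustration, not the PNFFT implementation: the Gaussian window, the oversampling factor sigma = 2, the window cutoff m = 8, the assumption that N is even, and the function names nfft_1d and ndft_1d are all choices made for this example.

```python
import numpy as np

def ndft_1d(f_hat, x):
    """Direct evaluation of f(x_j) = sum_k f_hat[k] * exp(-2 pi i k x_j)
    for k = -N/2, ..., N/2 - 1 (slow reference, N assumed even)."""
    N = len(f_hat)
    k = np.arange(-N // 2, N // 2)
    return np.exp(-2j * np.pi * np.outer(x, k)) @ f_hat

def nfft_1d(f_hat, x, sigma=2, m=8):
    """Approximate the same sums with a Gaussian-window NFFT sketch.

    f_hat: N complex coefficients, x: nodes in [-1/2, 1/2),
    sigma: integer oversampling factor, m: window cutoff (assumed values).
    """
    N = len(f_hat)
    n = sigma * N                                   # oversampled FFT length
    b = (2 * sigma / (2 * sigma - 1)) * m / np.pi   # Gaussian shape parameter
    k = np.arange(-N // 2, N // 2)

    # Step 1: deconvolution in the frequency domain, i.e. division by the
    # Fourier coefficients of the window, zero-padded to length n.
    c_k = np.exp(-b * (np.pi * k / n) ** 2) / n
    g_hat = np.zeros(n, dtype=complex)
    g_hat[k % n] = f_hat / c_k

    # Step 2: one equispaced FFT of the oversampled length n.
    g = np.fft.fft(g_hat) / n

    # Step 3: discrete convolution in the spatial domain. Each node only
    # touches the 2m + 2 grid points nearest to it (truncated window).
    l0 = np.floor(n * x).astype(int)
    l = l0[:, None] + np.arange(-m, m + 2)[None, :]
    phi = np.exp(-((n * x[:, None] - l) ** 2) / b) / np.sqrt(np.pi * b)
    return np.sum(g[l % n] * phi, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_hat = rng.standard_normal(32) + 1j * rng.standard_normal(32)
    x = rng.uniform(-0.5, 0.5, size=50)
    err = np.max(np.abs(nfft_1d(f_hat, x) - ndft_1d(f_hat, x)))
    print(err / np.sum(np.abs(f_hat)))
```

Because step 3 reads only a few grid points around each node, a distributed version of this step needs data only from neighboring parts of the grid.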
The deconvolution and convolution are performed with a window function that is well localized in both the frequency and spatial domains. Therefore, these two convolution steps can be performed approximately in a fast way. Another advantage of the good localization is that parallel implementations of the convolution steps only require nearest neighbor communication.
The FFT plays a central role in the modular structure of the NFFT algorithm and is a perfect example of the important interplay between the development of fast algorithms and sustainable software engineering in order to produce high performance software. Without a doubt, the FFTW software library [18,19] is an outstanding implementation of the FFT and one of the most important software packages in scientific computing. It offers support for shared memory parallelism and also distributed memory parallelism based on a one-dimensional decomposition of the input