Abstract. Starting from an approved serial algorithm, we develop a new parallel algorithm for calculating nonequispaced fast Fourier transforms on massively parallel distributed memory architectures. We demonstrate how to deal with the inherent load imbalance of the serial algorithm due to the use of oversampled FFT. This algorithm has been implemented in a new open source software library called PNFFT. Furthermore, we derive a new parallel distributed memory algorithm for the fast computation of fully Coulomb interactions in a charged particle system with nonperiodic boundary conditions based on a particle-mesh approximation scheme. We show that an appropriate adjustment of the underlying parallel nonequispaced fast Fourier transform circumvents severe load imbalance due to particle scaling. To prove the high scalability of our algorithms we provide performance results on a BlueGene/P system using up to 65536 cores.

Key words. parallel nonequispaced fast Fourier transform, parallel fast summation, parallel particle-mesh methods, NFFT

AMS subject classifications. 65T50, 65Y05

DOI. 10.1137/120888478

1. Introduction. A broad variety of mathematical algorithms and applications depends on the calculation of the nonequispaced discrete Fourier transform (NDFT), which is a generalization of the discrete Fourier transform to nonequispaced nodes. In particular, its fast approximate realization, called the nonequispaced fast Fourier transform (NFFT) [8,3,55,59,52,20,31], has led to the development of a large number of fast numerical algorithms, e.g., in computerized tomography [16,9], particle simulation [50,23], and spectral methods on adaptive grids, to name just a few examples. An extensive list of applications can be found, e.g., in [20].

Roughly speaking, the NFFT consists of three steps: first, a deconvolution in the frequency domain; second, a fast Fourier transform (FFT); and, finally, a discrete convolution in the spatial domain.
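To make the three steps concrete, the following is a minimal one-dimensional serial sketch in Python with NumPy. It is not the PNFFT implementation. The Gaussian window, the oversampling factor `sigma`, the cutoff `m`, the sign convention of the exponent, the node interval [-1/2, 1/2), and the function names `ndft_direct` and `nfft_gaussian` are all assumptions chosen for illustration.

```python
import numpy as np


def ndft_direct(f_hat, x):
    """Reference NDFT: f(x_j) = sum_k f_hat[k] * exp(-2*pi*i*k*x_j), k = -N/2..N/2-1.

    Assumes N is even and the nodes x_j lie in [-1/2, 1/2). Cost is O(N*M).
    """
    N = f_hat.size
    k = np.arange(-N // 2, N // 2)
    return np.exp(-2j * np.pi * np.outer(x, k)) @ f_hat


def nfft_gaussian(f_hat, x, sigma=2, m=8):
    """Approximate NDFT in three steps, using a Gaussian window (an assumption)."""
    N = f_hat.size
    n = sigma * N  # length of the oversampled FFT
    b = 2.0 * sigma * m / ((2.0 * sigma - 1.0) * np.pi)  # Gaussian shape parameter
    k = np.arange(-N // 2, N // 2)

    # Step 1: deconvolution in the frequency domain. Each coefficient is divided
    # by n times the window's Fourier coefficient (1/n) * exp(-b * (pi*k/n)**2).
    d = f_hat * np.exp(b * (np.pi * k / n) ** 2)

    # Step 2: FFT of length n. Frequencies are zero padded, and negative indices
    # wrap around modulo n.
    padded = np.zeros(n, dtype=complex)
    padded[k % n] = d
    g = np.fft.fft(padded)

    # Step 3: discrete convolution in the spatial domain. Each node only touches
    # the 2m+2 grid points closest to it, where the window is not negligible.
    l0 = np.floor(n * x).astype(int)
    l = l0[:, None] + np.arange(-m, m + 2)[None, :]
    dist = x[:, None] - l / n
    phi = np.exp(-(n * dist) ** 2 / b) / np.sqrt(np.pi * b)
    return np.sum(g[l % n] * phi, axis=1)
```

Calling `nfft_gaussian(f_hat, x)` with a complex coefficient array of even length and nodes in [-1/2, 1/2) should agree with `ndft_direct(f_hat, x)` up to a small approximation error governed by `sigma` and `m`. Because step 3 reads only a few grid points around each node, it is the step that restricts a distributed implementation to communication between neighboring processes.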
The deconvolution and convolution are performed with a window function that is well localized in both the frequency and spatial domains. Therefore, these two steps can be computed approximately in a fast way. Another advantage of the good localization is that parallel implementations of the convolution steps only require nearest-neighbor communication.

The FFT plays a central role in the modular structure of the NFFT algorithm and is a perfect example of the important interplay between the development of fast algorithms and sustainable software engineering in order to produce high performance software. Without a doubt, the FFTW software library [18,19] is an outstanding implementation of the FFT and one of the most important software packages in scientific computing. It supports shared memory parallelism as well as distributed memory parallelism based on a one-dimensional decomposition of the input