Abstract: Many sparse matrix computations can be sped up if the matrix is first reordered. Reordering was originally developed for direct methods, but it has recently become popular for improving the cache locality of parallel iterative solvers: reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves an average speedup of 2.88 times over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
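To make the bandwidth claim concrete, the following is a minimal sketch of the textbook sequential Reverse Cuthill-McKee ordering, not the parallel algorithm the abstract describes. It assumes the matrix is structurally symmetric and is given as an adjacency dictionary; the names `rcm_order` and `bandwidth` are hypothetical, and the start node is simply a minimum-degree node rather than a pseudo-peripheral node as production implementations typically use.

```python
from collections import deque


def bandwidth(adj, order):
    """Largest |pos[u] - pos[v]| over all edges (u, v) under the given ordering."""
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def rcm_order(adj):
    """Sequential RCM on an undirected graph given as {node: set(neighbors)}.

    Assumption: a minimum-degree node starts each connected component.
    """
    visited = set()
    order = []
    # Visit components in order of their lowest-degree unvisited node.
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            # Cuthill-McKee rule: enqueue unvisited neighbors by increasing degree.
            for w in sorted(adj[u] - visited, key=lambda x: (len(adj[x]), x)):
                visited.add(w)
                queue.append(w)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order


# A path graph 0-3-1-5-2-4 whose labels are scrambled.
adj = {0: {3}, 3: {0, 1}, 1: {3, 5}, 5: {1, 2}, 2: {5, 4}, 4: {2}}
print(bandwidth(adj, sorted(adj)), bandwidth(adj, rcm_order(adj)))
```

On this small path graph the bandwidth drops from 4 under the original labeling to 1 after reordering, which is the property that keeps the vector entries touched by each SpMV row close together in memory.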
Current trends in high-performance computing are moving towards placing many cores on a single chip in order to achieve substantial speedup by exploiting the fine-grain parallelism of applications. At the forefront of this trend are modern Graphics Processing Units (GPUs), which, owing to their simpler design, can fit hundreds of independent processing units on a single chip, whereas CPUs currently include only a few cores per chip. In order to study the potential speedup of computationally intensive applications on the many-core architecture of GPUs, this paper presents a highly accelerated implementation of the finite-difference weighted essentially non-oscillatory (WENO) scheme. This method is suitable for direct numerical simulations (DNS) and large eddy simulations (LES) of compressible turbulence, and it requires large computing resources in order to reach high Reynolds numbers. Our implementation targets large-scale simulations using the CUDA parallel programming model and serves as a paradigm for GPU applications in CFD. The results of the current implementation demonstrate that such a computationally intensive application can be highly accelerated when running on the NVIDIA Tesla C1070 many-core GPU.
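As an illustration of the kind of per-point arithmetic a WENO scheme performs, here is a minimal sketch of the classical fifth-order reconstruction with Jiang-Shu smoothness indicators. The abstract does not state the order or variant used, so the fifth-order choice, the function name `weno5_interface`, and the value of `eps` are assumptions made for this example; it is plain Python, not the authors' CUDA implementation.

```python
def weno5_interface(v, eps=1e-6):
    """Left-biased fifth-order WENO value at x_{i+1/2}.

    v holds the five stencil values (v[i-2], v[i-1], v[i], v[i+1], v[i+2]).
    Assumption: classical Jiang-Shu weights; eps only guards against division by zero.
    """
    vm2, vm1, v0, vp1, vp2 = v

    # Third-order candidate reconstructions on the three sub-stencils.
    q = (
        (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0,
        (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0,
        (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0,
    )

    # Smoothness indicators: large where a sub-stencil crosses a discontinuity.
    beta = (
        13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2 + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2,
        13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2 + 0.25 * (vm1 - vp1) ** 2,
        13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2 + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2,
    )

    # Ideal linear weights, rescaled so that non-smooth sub-stencils are suppressed.
    d = (0.1, 0.6, 0.3)
    alpha = [dk / (eps + bk) ** 2 for dk, bk in zip(d, beta)]
    total = sum(alpha)
    return sum(a * qk for a, qk in zip(alpha, q)) / total


print(weno5_interface([0.0, 1.0, 2.0, 3.0, 4.0]))  # smooth (linear) data
print(weno5_interface([0.0, 0.0, 0.0, 1.0, 1.0]))  # data with a jump
```

Each interface value depends only on a five-point stencil, so every grid point can be computed independently; this is the sort of fine-grain parallelism that maps naturally onto one GPU thread per point.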