We show that a recently developed divide-and-conquer parallel algorithm for solving tridiagonal Toeplitz systems of linear equations can be easily and efficiently implemented on a variety of modern multicore and GPU architectures, as well as on hybrid systems. Our new portable implementation, which uses OpenACC, can be executed on both CPU-based and GPU-accelerated systems. More sophisticated variants of the implementation are suitable for systems with multiple GPUs and can use CPU and GPU cores together. We consider the use of both column-wise and row-wise storage formats for two-dimensional double-precision arrays and show how to efficiently convert between these two formats using cache memory. Numerical experiments performed on Intel CPUs and Nvidia GPUs show that our new implementation achieves relatively good performance.
The aim of this paper is to show that multidimensional Monte Carlo integration can be efficiently implemented on various distributed-memory parallel computers and clusters of multicore nodes using recently developed parallel versions of the linear congruential and lagged Fibonacci pseudorandom number generators. We show how to accelerate the overall performance by offloading some computations to Graphics Processing Units (GPUs), and we discuss how to transform Message Passing Interface (MPI) + OpenMP programs to the MPI + OpenMP + CUDA model. We explain how to utilize multiple CPU cores together with multiple GPU accelerators within a single node and how to achieve reasonable load balancing across all computational resources of GPU-accelerated multicore nodes. We present and discuss the results of experiments performed on the following target architectures: an IBM Blue Gene/Q parallel computer, a cluster of Intel Xeon E5-2660 servers, and a Tesla-based GPU cluster with Intel Xeon X5650 multicore processors. The results are presented from two points of view: strong scaling and weak scaling. We also compare the performance of all considered architectures.
The aim of this paper is to evaluate OpenMP, TBB, and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of adaptive Simpson's integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman-Ford algorithm for solving single-source shortest path problems, we show how to improve the performance of data-parallel algorithms by tuning data structures for better utilization of the vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Templates (SDLT) containers.
In this paper we present two algorithms for performing sparse matrix-dense vector multiplication (known as the SpMV operation). We show a parallel (multicore) version of the algorithm, which can be efficiently implemented on contemporary multicore architectures. Next, we show a distributed (so-called multinodal) version targeted at high-performance clusters. Both versions are thoroughly tested using different architectures, compiler tools, and sparse matrices of different sizes. The considered matrices come from The University of Florida Sparse Matrix Collection. The performance of the algorithms is compared to that of the SpMV routine from the widely used Intel Math Kernel Library.