Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices

Park, Jongsoo; Smelyanskiy, Mikhail; Vaidyanathan, Karthikeyan; Heinecke, Alexander; Kalamkar, Dhiraj D.; Liu, Xing; Patwary, Md. Mostofa Ali; Lu, Yutong; Dubey, Pradeep

doi:10.1109/sc.2014.82

Cited by 44 publications

(45 citation statements)

References 24 publications


“…We can also note that on certain iterative approaches, the domain decomposition negatively impacts the convergence [21,22] regardless of whether it is implemented with process or thread parallelism [21]. This is not the case in the matrix assembly part and therefore, the observed improvements are only related to our new parallelization strategy.…”

Section: Finite Element Methods Matrix Assembly (mentioning)

confidence: 86%

“…This leads to a better locality than the original ordering using the Cuthill-McKee approach [5]. We also observe that current coloring strategies [7,9,21] are not efficient on the very small data partition size of the fine grain task-based parallelism. We propose a new coloring heuristic to reveal data-parallelism in small partitions.…”

Section: Introduction (mentioning)

confidence: 85%

“…Efficient parallelization in shared memory is challenging [10,21] and recent many-core architectures expose the limit of current loop-level strategy [21,23]. The common approach in use is mesh coloring [7,9,21].…”

Section: Coloring Of Unstructured Meshes (mentioning)

confidence: 99%

“…The common approach in use is mesh coloring [7,9,21]. A pseudo-code and a regular 2D coloring example are given in Figure 2, and a state-of-art algorithm [9,21] is detailed in Section 3.2.3. Coloring avoids race conditions by assigning a different color to the elements sharing a reduction variable.…”

Section: Coloring Of Unstructured Meshes (mentioning)

confidence: 99%
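The coloring idea quoted in the statement above can be illustrated with a minimal sketch. It is not the algorithm of either paper: it assumes a plain greedy heuristic, and the names `color_elements`, `assemble_by_color`, and the tiny example mesh are hypothetical. Elements that share a node (the shared reduction variable) receive different colors, so all elements of one color can be accumulated without write conflicts.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring (assumed heuristic): elements sharing a node get different colors."""
    # Map each node to the elements that touch it.
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colors already taken by elements that share at least one node with e.
        used = {colors[o] for n in nodes for o in node_to_elems[n] if colors[o] != -1}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def assemble_by_color(elements, colors, num_nodes):
    """Accumulate one unit per element into each of its nodes, one color at a time."""
    acc = [0.0] * num_nodes
    for c in range(max(colors) + 1):
        # Elements of the same color touch disjoint nodes, so this inner loop
        # has no write conflicts and could be run in parallel.
        for e, nodes in enumerate(elements):
            if colors[e] == c:
                for n in nodes:
                    acc[n] += 1.0
    return acc


# Hypothetical 4-element mesh over 7 nodes.
elements = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (4, 5, 6)]
colors = color_elements(elements)
acc = assemble_by_color(elements, colors, num_nodes=7)
```

In this sketch the colors are processed sequentially while the work inside a color is independent, which is also why shared nodes are revisited once per color.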

“…Another drawback of the coloring is that nodes or edges of the mesh, i.e. CSR values, are updated by various colors and are loaded multiple times [21]. As large meshes do not fit in cache, data are accessed from the main memory and it multiplies the bandwidth by the number of colors.…”

Section: Coloring Of Unstructured Meshes (mentioning)

confidence: 99%


Scalable and efficient implementation of 3d unstructured meshes computation: a case study on matrix assembly

Thébault, Petit, Dinh. 2015. Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Exposing massive parallelism on 3D unstructured meshes computation with efficient load balancing and minimal synchronizations is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per node, especially with new many-core processors. In this paper, we propose a hybrid approach using domain decomposition to exploit distributed memory parallelism, Divide-and-Conquer (D&C) to exploit shared memory parallelism and improve locality, and mesh coloring at core level to exploit vectors. It illustrates a new trade-off for many-cores between structuredness, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamic code developed by Dassault Aviation. We compare our D&C approach to domain decomposition and to mesh coloring. D&C achieves a high parallel efficiency, a good data locality as well as an improved bandwidth usage. It competes on current nodes with the optimized pure MPI version with a minimum 10% speed-up. D&C shows an impressive 319x strong scaling on 512 cores (32 nodes) with only 2000 vertices per core. Finally, the Intel Xeon Phi version has a performance similar to 10 Intel E5-2665 Xeon Sandy Bridge cores and 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi (240 cores), D&C has 92% efficiency on the physical cores and performance similar to 33 Intel E5-2665 Xeon Sandy Bridge cores.


Data mining on vast data sets as a cluster system benchmark

Heinecke, Karlstetter, Pflüger, et al. 2015. Concurrency and Computation

Self Cite


Comparing different (accelerated) cluster architectures by a single application is a tough piece of work because this application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm which solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments, GPUs and coprocessors suffer from their tremendous amount of needed parallelism and get outperformed by dual socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). However, in weak-scaling scenarios, a speed-up larger than 2X over an entire CPU node can be achieved by a single accelerator.

…(NAS) Division parallel benchmark suite [9], which requires several application kernels to be run (including iterative solvers and fast Fourier transforms). In case of accelerated clusters, the scalable heterogeneous computing benchmark suite [10] is a good candidate which implements nearly all NAS benchmarks in OpenCL and CUDA, and can be easily executed on accelerators and GPUs. However, research and procurements performed in recent years have demonstrated that even running (just) application kernels might not be sufficient: Sandia Labs highlighted how mini-applications or proxy-applications can be used in order to understand the performance of a supercomputer and even influence its future development [11]. There, the benchmarks are not limited to kernels; they are simplified versions of real simulation codes stemming from several application domains. A similar approach was chosen for the procurement of the latest peta-scale system in Germany, called 'Super-MUC' at the Leibniz Supercomputing Centre: according to Brehm [12], 45% of the benchmarks required during this process were full applications. Finally, the proposal of having an additional ranking of the Top500 list machines (like the Green500 [13] list with respect to power consumption) based on a high-performance CG (HPCG) implementation was recently made [14].

In this work, we apply the idea of an HPC benchmark to a full and relevant application, classification and regression of vast data sets. It exhibits properties different and distinct from those of the benchmarks discussed earlier, poses additional challenges to current and future HPC systems, and we thus propose it as a further extension of an application benchmark portfolio. Furthermore, we demonstrate its use to benchmark different clusters and supercomputers. A fair application-driven comparison is ensured by optimizing our dat...
