Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126931
Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels

Cited by 46 publications (22 citation statements)
References 43 publications
“…Furthermore, the work of [76] performs several experiments on KNL with different applications, through which Roofline performance models are drawn for different configurations of KNL. The performance of the hybrid memory system of KNL is investigated in [77], which provides an analytic model for performance tuning. A Roofline model specifically for benchmarking the performance of a well-optimized OpenMP implementation of the tall-skinny matrix multiplication kernel for a molecular dynamics application code is proposed in [67], which essentially leverages the thread-level parallelism on KNL.…”
Section: State-of-the-art Shared-memory Optimizations
confidence: 99%
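The Roofline models mentioned above bound a kernel's attainable throughput by the lower of the compute peak and the memory bandwidth times arithmetic intensity. A minimal sketch of that bound, with purely illustrative peak numbers (not measured KNL figures) for an assumed fast on-package memory versus DDR:

```python
# Roofline bound: attainable GFLOP/s = min(compute peak, bandwidth * AI),
# where AI (arithmetic intensity) is FLOPs per byte moved from memory.

def roofline(peak_gflops: float, peak_bw_gbs: float, ai: float) -> float:
    """Attainable GFLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(peak_gflops, peak_bw_gbs * ai)

# Hypothetical dual-memory configuration (assumed values, for illustration):
PEAK_GFLOPS = 3000.0   # assumed compute peak
HBM_BW = 400.0         # assumed on-package memory bandwidth, GB/s
DDR_BW = 90.0          # assumed DDR bandwidth, GB/s

for ai in (0.25, 1.0, 4.0, 16.0):
    print(f"AI={ai:5.2f}: HBM {roofline(PEAK_GFLOPS, HBM_BW, ai):7.1f} GFLOP/s, "
          f"DDR {roofline(PEAK_GFLOPS, DDR_BW, ai):7.1f} GFLOP/s")
```

At low arithmetic intensity both memories are bandwidth-bound and the on-package memory's advantage is proportional to its bandwidth ratio; at high intensity both configurations hit the same compute roof, which is why such models are drawn per memory configuration.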
“…To summarize, we argue that GPUs are the promising platform for the ALS workload when taking both performance and power consumption into account. In the future, we will further investigate the performance gap between platforms and push the factorizing performance to the hardware limit (in particular on newer Intel Xeon Phi processors with on-package high-bandwidth memory [35,36], newer GPUs at the warp level [37,38], CTA level [39] and cache level [40], and other emergent accelerators such as Matrix-2000 [41]).…”
Section: Applying Optimizations
confidence: 99%
“…Data processing for high memory bandwidth: X-Stream accelerates graph processing with sequential access [55]. Recent work optimized quicksort [11], hash joins [14], scientific workloads [40,50], and machine learning [70] for KNL's HBM, but not streaming analytics. Beyond KNL, Mondrian [18] uses hardware support for analytics on high memory bandwidth in near-memory processing.…”
Section: Related Work
confidence: 99%