Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor

Doerfler, Douglas; Deslippe, Jack; Williams, Samuel; Oliker, Leonid; Cook, Brandon; Kurth, Thorsten; Lobet, Mathieu; Malas, Tareq B.; Vay, Jean‐Luc; Vincenti, Henri

doi:10.1007/978-3-319-46079-6_24

Cited by 57 publications

(35 citation statements)

References 13 publications

“…Their model is recently extended to explore KNL [46], which includes constructing several performance models for certain combinations of KNL clustering and memory modes. Furthermore, the work of [76] performs several experimentations on KNL with different applications, through which Roofline performance models are drawn for different configurations of KNL. The performance of the hybrid memory system of KNL is investigated in [77], which provides an analytic model for performance tuning.…”

Section: State-of-the-art Shared-memory Optimizations (mentioning)

confidence: 99%

Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

Farhan

Keyes

2018

IEEE Trans. Parallel Distrib. Syst.

“…Unlike the CARM that includes the complete memory hierarchy in a single plot, the ORM mainly considers the memory transfers between the last level cache and the DRAM, thus it provides fundamentally different perspective and insights when characterizing and optimizing applications [18]. Recently, the ORM was also instantiated on the KNL [19], without modifying the original model. The arithmetic intensity (AI) described in ORM is not to be confused with CARM AI because of the difference in the way how the memory traffic is observed.…”

Section: Related Work (mentioning)

confidence: 99%

“…The bandwidth measured also differs from the one measured in this paper, the latter being explicitly load bandwidth. In [19], the authors present several ORM-based optimization case studies, and compare the performance improvements between Haswell processor and KNL, with data in DDR4 memory or MCDRAM, and finally KNL with data in MCDRAM memory. However, the authors do not show how the model can help choosing between memories when working sets do not fit in the fastest one nor they provide a comparison with the cache mode.…”

Section: Related Work (mentioning)

confidence: 99%

Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model

Denoyelle

Goglin

Ilić

et al. 2017

Lecture Notes in Computer Science

Abstract. In order to fulfill modern applications' needs, computing systems become more powerful, heterogeneous and complex. NUMA platforms and emerging high bandwidth memories offer new opportunities for performance improvements. However, they also increase hardware and software complexity, thus making application performance analysis and optimization an even harder task. The Cache-Aware Roofline Model (CARM) is an insightful, yet simple model designed to address this issue. It provides feedback on potential application bottlenecks and shows how far the application performance is from the achievable hardware upper bounds. However, it does not encompass NUMA systems and next-generation processors with heterogeneous memories. Yet, some application bottlenecks belong to those memory subsystems, and would benefit from the CARM insights. In this paper, we fill the missing requirements to scope recent large shared memory systems with the CARM. We provide the methodology to instantiate and validate the model on a NUMA system as well as on the latest Xeon Phi processor equipped with configurable hybrid memory. Finally, we show the model's ability to exhibit several bottlenecks of such systems, which were not supported by CARM.
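The citation statements above stress that arithmetic intensity (AI) in the original Roofline model (ORM) is not the same quantity as AI in the CARM, because the two models observe memory traffic at different points. The following is a minimal Python sketch of that distinction. It assumes the ORM counts bytes moved between the last level cache and DRAM, as the statement describes, and that the CARM counts bytes as seen by the cores. The function name and all kernel counts are hypothetical placeholders, not measurements from any of the cited papers.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return floating-point operations per byte for one traffic count."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Hypothetical counts for a single kernel run (illustrative only).
flops = 2.0e9        # floating-point operations executed
core_bytes = 8.0e9   # bytes loaded/stored by the cores (CARM view, assumed)
dram_bytes = 2.0e9   # bytes moved between last level cache and DRAM (ORM view)

carm_ai = arithmetic_intensity(flops, core_bytes)
orm_ai = arithmetic_intensity(flops, dram_bytes)

print(f"CARM AI: {carm_ai:.2f} flop/byte")
print(f"ORM  AI: {orm_ai:.2f} flop/byte")
```

With these placeholder counts the same kernel has two different AI values, because cache hits reduce DRAM traffic but not the traffic the cores observe. This is why points from the two models should not be placed on the same plot.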

“…PICSAR is an open-source Particle-In-Cell FORTRAN+Python library designed to provide high-performance subroutines optimized for many-integrated core architectures [40], [41] that can be interfaced with WARP.…”

Section: Case Study 5 - Warp-picsar (mentioning)

confidence: 99%

“…Cartesian based PIC codes have a low flop/byte ratio that leads non-optimized algorithms to be highly memory-bound [41]. Large field and particle arrays cannot fit in cache in most simulations.…”

Section: Case Study 5 - Warp-picsar (mentioning)

confidence: 99%
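The link between a low flop/byte ratio and being memory-bound follows from the basic Roofline bound: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below illustrates this in Python. The function names are hypothetical, and the peak and bandwidth figures are round placeholder numbers chosen for easy arithmetic, not measured KNL values.

```python
def roofline_gflops(ai: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Attainable GFLOP/s: min(peak compute, AI * memory bandwidth)."""
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """AI (flop/byte) at which the bandwidth roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


def is_memory_bound(ai: float, peak_gflops: float, bandwidth_gbs: float) -> bool:
    """A kernel is memory-bound when its AI lies left of the ridge point."""
    return ai < ridge_point(peak_gflops, bandwidth_gbs)


# Placeholder machine parameters (assumptions, not measurements).
PEAK_GFLOPS = 2000.0
FAST_MEM_GBS = 400.0   # stands in for a high-bandwidth memory
SLOW_MEM_GBS = 100.0   # stands in for conventional DRAM

low_ai = 0.5  # hypothetical low flop/byte kernel

for name, bw in (("fast memory", FAST_MEM_GBS), ("slow memory", SLOW_MEM_GBS)):
    bound = roofline_gflops(low_ai, PEAK_GFLOPS, bw)
    print(f"{name}: bound {bound:.0f} GFLOP/s, "
          f"memory-bound: {is_memory_bound(low_ai, PEAK_GFLOPS, bw)}")
```

Under these assumptions a kernel at 0.5 flop/byte stays far below the compute roof in either memory, so its ceiling scales directly with bandwidth. Raising AI, for example through better cache reuse, is what moves such a kernel toward the ridge point.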

Evaluating and Optimizing the NERSC Workload on Knights Landing

Barnes

Cook

Deslippe

et al. 2016

2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS

Self Cite

Abstract. NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.
