The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications

Li, Bo; Chang, Hung-Ching; Song, Shuaiwen Leon; Su, Chun-Yi; Meyer, Timmy; Mooring, John; Cameron, Kirk W.

doi:10.1109/ipdpsw.2014.162

Cited by 25 publications

(21 citation statements)

References 14 publications


“…The work distribution at 240 threads is too fine-grain to hide the runtime overheads of these implementations, while the lightweight runtime of SWITCHES achieves the highest performance at 240 threads. Oversubscribing the Xeon Phi to 300 and 360 threads results in degraded performance as it can cause higher resource contention and pipeline latencies [33].…”

Section: Data-parallel Application (mentioning)

confidence: 99%

Switches

Diavastos

Trancoso

2017

ACM Trans. Archit. Code Optim.


SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop-tasks can be increased to favor data locality, even when there are dependences across different loops. SWITCHES introduces explicit task resource allocation mechanisms for efficient allocation of resources and adopts the latest OpenMP Application Programming Interface (API), so as to maintain high levels of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Performance on an Intel Xeon Phi shows good scalability and surpasses OpenMP by an average of 32%.


“…These benchmarks, which use algorithms in various domains to stress different processor components, have been used in several studies of accelerators. For example, [9] compares the many-core Intel® Xeon Phi™ to the Intel® Sandy Bridge Xeon E5-2620 multi-core processor and the many-core NVIDIA Tesla C2050 GPU (which employs the Fermi architecture). The SHOC benchmarks are used to compare the Phi™ with the Tesla in terms of power consumption and execution time, while the Rodinia benchmarks are used to compare the Phi™ to the Sandy Bridge in terms of execution time.…”

Section: Related Work (mentioning)

confidence: 99%

Cross-Accelerator Performance Profiling

Gallardo

Teller

Argueta

et al. 2016

Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale


The computing requirements of scientific applications have influenced processor design, and have motivated the introduction and use of many-core processors, i.e., accelerators, for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to be comprised of multiple computing devices, including accelerators. Although execution time can be used to compare the performance of different computing devices, there exists no standard way to analyze application performance across devices with very different architectural designs and, thus, understand why one outperforms another. Without this knowledge, a developer is handicapped when attempting to effectively tune application performance, as is a hardware designer when trying to understand how best to improve the design of computing devices. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three different accelerators: the Intel® Xeon Phi™ and the NVIDIA Fermi and Kepler GPUs. Our study shows that LULESH 1.0 exhibits similar execution-time behavior across the three accelerators, but runs up to 7X faster on the Kepler. Despite the significant architectural differences between the Xeon Phi™ and the GPUs, and the differences in the metrics used to characterize their performance, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi™. To do this, we compared their achieved instructions per cycle and vectorization usage, as well as their memory behavior and power and energy consumption.


“…An initial validation of the model is performed using either single- or multi-node computing platforms running the CoMD proxy application for molecular dynamics simulations [17,7]. Other related work on modeling and performance profiling of the Xeon Phi has been conducted in [18] and [16]. However, those research efforts do not combine the accelerator execution modes with the host operation, as proposed here for heterogeneous architectures with accelerators used to offload computations from the host CPU.…”

Section: Related Work (mentioning)

confidence: 99%

Modeling performance and energy for applications offloaded to Intel Xeon Phi

Lawson

Sundriyal

Sosonkina

et al. 2015

Proceedings of the 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing


Accelerators are adopted to increase performance, reduce time-to-solution, and minimize energy-to-solution. However, employing them efficiently, given system and application characteristics, is often a daunting task. A goal of this work is to propose a general model that predicts performance and power requirements for an application, computational portions of which are offloaded to an accelerator. Intel Xeon Phi is the only accelerator type investigated here, and only in offload execution mode. This mode is also employed by other accelerator types, such as GPU; thus the proposed model is applicable directly. The predictive capabilities of the model are demonstrated by determining the best hardware-software configuration instances with respect to the minimum energy consumption for the CoMD proxy application executed on single or multiple nodes. For the CoMD problem sizes investigated here, the best modeled configuration was relatively close to the best measured configuration with relative error under 5% of the energy consumed for most configurations. Initial model validation also confirmed the model accuracy for a variety of model parameters, such as host computation time and power consumption on the host and accelerator. The model also provides estimates of the total data movement and computational throughput as well as of some key metrics, such as FLOPs-
Co-HPC2015, November 15-20, 2015, Austin, TX, USA. © 2015 ACM. ISBN 978-1-4503-3992-6/15/11.

