Modeling communication in cache-coherent SMP systems

Ramos, Sabela; Hoefler, Torsten

doi:10.1145/2493123.2462916

Cited by 50 publications

(22 citation statements)

References 18 publications


“…Performance Analytical Models: The work of [1], [75] optimizes for cache line awareness, where an analytical performance model is built to tune the cache line transfers of different architectures, including KNC and Sandy Bridge. Their model is recently extended to explore KNL [46], which includes constructing several performance models for certain combinations of KNL clustering and memory modes.…”

Section: State-of-the-art Shared-memory Optimizations (mentioning)

confidence: 99%

“…Some of this work is inherited and customized to our application code. For instance, SoA of [68], AoSoA of [22], low-level, MCDRAM-aware allocator of [39], data dependency conflicts migration of [71], Hilbert-based recursive tiling/blocking of [74], cache line aware optimization of [1], [46], [75], and partial coloring of [79]. In our work, we deal with irregular memory access patterns through optimizing for the cache line awareness based upon minimizing memory reference arithmetic and pointer chasing, as well as localizing a large bulk of computations inside a compute core.…”

Section: State-of-the-art Shared-memory Optimizations (mentioning)

confidence: 99%


Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

Farhan; Keyes. 2018. IEEE Trans. Parallel Distrib. Syst.


“…Finally, Ramos and Hoefler [13] propose a model for dissemination barrier synchronization and also compare with the Intel OpenMP barrier. However, the authors only show equivalent performance with the Intel implementation.…”

Section: Related Work (mentioning)

confidence: 99%

Effective Barrier Synchronization on Intel Xeon Phi Coprocessor

Rodchenko; Nisbet; Pop; et al. 2015. Lecture Notes in Computer Science


Abstract. Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid barrier implementation that exploits the Xeon Phi's topology, memory hierarchy and streaming stores to achieve a 3× lower overhead than the Intel OpenMP barrier implementation (ICC 14.0.0), thus outperforming, to the best of our knowledge, all other implementations, and which we evaluate on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC barrier OpenMP microbenchmark.


“…A few recent studies have proposed performance models for other manycore architectures [21,24]. Our approach is similar to the one used in these papers.…”

Section: Related Work (mentioning)

confidence: 99%

“…They all cover the same communication scenarios as the LogP model [11] (or its extensions) that is commonly used in message-passing systems. The main difference is that the underlying communication system considered in these studies are different from the one of this chapter: [21] models RMA-based communication and targets the Intel SCC processor; [24] models point-to-point communication on top of cache-coherent shared memory and targets the Intel Xeon Phi processor.…”

Section: Related Work (mentioning)

confidence: 99%

High-Throughput Maps on Message-Passing Manycore Architectures: Partitioning versus Replication

Shahmirzadi; Ropars; Schiper. 2014. Lecture Notes in Computer Science


The advent of manycore architectures raises new scalability challenges for concurrent applications. Implementing scalable data structures is one of them. Several manycore architectures provide hardware message passing as a means to efficiently exchange data between cores. In this paper, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieve high throughput in a message-passing system. Our paper presents and compares different strongly-consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain.

