2015
DOI: 10.1007/978-3-662-48096-0_45
Abstract: Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL, and Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in their tradeoffs between critical path length, communication traffic patterns, and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, …

Cited by 23 publications (13 citation statements). References 14 publications (26 reference statements).
“…The OpenMP standard defines that the barrier construct applies to all threads within a team, but vertical or tracing threads that are initialized within an already-threaded region over element will be on different teams, and thus not synchronize across all threads in a standard “OMP BARRIER” call. Initially, we implemented a simple “spin” barrier, but later replaced that with a namelist-selectable algorithm that offers three different kinds of barriers—a standard, single-team “OMP BARRIER,” the initial “spin” barrier, and a “dissemination” barrier inspired by work by Rodchenko et al. (2015). In theory, the “dissemination” barrier offers substantial performance advantages, and synthetic benchmarks confirm this, but within our application these differences appear to be negligible, at least for small numbers of threads.…”
Section: Computational Aspects
Confidence: 99%
“…By experimenting we found that balanced affinity was generally better, and it was used in the results presented in Figure 9. However, following the methodology presented in [22] to demonstrate the effect of affinity on the BFS algorithm, we ran a 48-thread version manually controlling the affinity to achieve one, two, three, and four threads per core (1T/core, 2T/core, 3T/core and 4T/core), thus using 48 cores with one thread per core down to 12 cores with four threads per core. The thread affinity is controlled through the environment variable KMP_AFFINITY.…”
Section: Thread Affinity
Confidence: 99%
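The affinity setup described above can be sketched as a shell fragment; `KMP_AFFINITY=balanced` is a real Intel OpenMP runtime setting that spreads threads evenly across cores, while the benchmark binary name is hypothetical:

```shell
# Balanced affinity, as used for the Figure 9 results in the citation above.
export OMP_NUM_THREADS=48
export KMP_AFFINITY=balanced
# ./bfs_benchmark   # hypothetical binary standing in for the cited BFS run
echo "KMP_AFFINITY=$KMP_AFFINITY OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Varying threads per core (1T/core through 4T/core) would then be done by re-running with different affinity settings while keeping the thread count fixed.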
“…However, for synchronization, spinlocking techniques are typically used, which can take considerable performance and programming effort [RNPL15]. FPGAs achieve outstanding performance on stream-processing problems and whenever pipelining can be applied to large data sets. This is for example the case when some code contains many if-then-else or case statements.…”
Section: Flexibility and Customization
Confidence: 99%