2018
DOI: 10.1007/s11227-017-2231-3
Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus

Abstract: The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform a source code of Adaptive Simpson's Integration to programs that can utilize multiple cores of modern processors. Using the example of the Bellman-Ford algorithm for solving single-source shortest…

Cited by 13 publications (7 citation statements)
References 10 publications
“…Conversely, Cilk++, TBB and CUDA graphs all require some refactoring of the code, for different reasons: (a) Cilk++ does not provide data-flow dependencies, but full synchronizations instead; (b) TBB decouples the description of the graph from its execution, and requires specific functions for starting the graph and joining results; and (c) CUDA graphs provide a low-level API that forces programmers to manage data copies and point-to-point synchronizations. A performance comparison between these models is beyond the scope of this paper, but several works have already tackled this topic, showing OpenMP performance competitive with the other parallel models [12,17].…”
Section: The TDG: A Door for Expanding Portability
confidence: 99%
“…Also, some works use OpenMP task pragmas for parallelization [33,52,63]. Table 5 (parallelization language/library used on Phi): MPI [4,9,12,16,22,24,25,28,35,36,43,44,50,54,59,60,63,66,76,82,86]; others: Pthreads [11,23,76,84,91,95], Intel TBB [8,48], Cilk Plus [48], OpenCL [49,100]. Chatzikonstantis et al. [28] study the inferior-olivary nucleus (InfOli) simulation, which is used in brain modeling. They accelerate the simulation using (i) MPI, (ii) OpenMP, and (iii) hybrid MPI+OpenMP.…”
Section: Hou et al. [88] Present a Technique for Automatically Generating…
confidence: 99%
“…[12–14,16–18,20,21,24,27–30,33–37,39,42–48,50,52,55,57,59,63,66,84,86,90,92,93,95,99]; Intel MKL [2,17,19,31,32,40,93,99]…”
confidence: 99%
“…Our OpenMP (version 3.1) implementation of this method achieved very good speedups on Intel Xeon CPUs (up to 5.06) and Intel Xeon Phi (up to 29.45). While this approach can be further improved using more sophisticated vectorization techniques, such as intrinsics [1, 8, 11], doing so comes at the cost of portability between different architectures. OpenACC is a standard for accelerated computing [2, 6].…”
Section: Introduction
confidence: 99%