2014 IEEE 13th International Symposium on Parallel and Distributed Computing
DOI: 10.1109/ispdc.2014.11
A Parallel Task-Based Approach to Linear Algebra

Abstract: Processors with large numbers of cores are becoming commonplace. In order to take advantage of the available resources in these systems, the programming paradigm has to move towards increased parallelism. However, increasing the level of concurrency in the program does not necessarily lead to better performance. Parallel programming models have to provide flexible ways of defining parallel tasks and, at the same time, efficiently manage the created tasks. OpenMP is a widely accepted programming model…



Cited by 4 publications (5 citation statements). References 13 publications.
“…Tousimojarad and Vanderbauwhede [33] cleverly reduce access latencies to uniformly distributed data by using copies whose home cache is local to the accessing thread on the TILEPro64 processor. Zhou and Demsky [2] build a NUMA-aware adaptive garbage collector that migrates objects to improve locality on manycore processors.…”
Section: Related Work
confidence: 99%
“…Deriving the parallel kernel from the generated single-threaded code is mostly a matter of replacing the loops by the OpenCL indexing calls (get_global_id, get_local_id, etc.), and in the case where the original code has multiple loops, as is the case [omitted for blind review]…”
Section: OpenCL Implementation Details
confidence: 99%
“…Since all the tasks defined in the GPC code will be executed in parallel, a seq pragma is required to run the two phases sequentially. Each phase uses a partial continuous for, par_cont_for [3], in order to parallelise the outer loop over rows, and a #pragma simd to help the compiler vectorise the inner loop over columns. par_cont_for is a sequential for loop that works as follows:…”
Section: GPRM Implementation Details
confidence: 99%