Proceedings of the 13th International Conference on Supercomputing 1999
DOI: 10.1145/305138.305206
Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors

Abstract: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a …

Cited by 52 publications (31 citation statements). References 7 publications.
“…We attribute the performance degradation in the directive implementation of LU to less data locality and larger synchronization overhead in the 1-D pipeline used in the OpenMP version as compared to the 2-D pipeline used in the MPI version. This is consistent with the result of a study from [12].…”
Section: The NAS Parallel Benchmarks (supporting)
confidence: 93%
“…The pipeline algorithm is used for parallelizing the NAS benchmark LU in Sect. 4.1 and also described in [12].…”
Section: Pipeline Setup (mentioning)
confidence: 98%
“…The NANOS compiler [12], based on Parafrase-2, has been trying to exploit multi-level parallelism, including coarse-grain parallelism, by using an extended OpenMP API. The OSCAR multigrain parallelizing compiler [13] exploits coarse-grain task parallelism among loops, subroutines and basic blocks [14], and near-fine-grain parallelism among statements inside a basic block [15], in addition to the conventional loop parallelism among iterations.…”
Section: Introduction (mentioning)
confidence: 99%
“…The NANOS compiler [3] exploits multi-level parallelism by using an extended OpenMP API. The PROMIS compiler [4] integrates loop-level parallelism and instruction-level parallelism using a common intermediate language.…”
Section: Introduction (mentioning)
confidence: 99%