Proceedings of the 2004 ACM Symposium on Applied Computing 2004
DOI: 10.1145/967900.968184

Automatic parallel code generation for tiled nested loops

Abstract: This paper presents an overview of our work on a complete end-to-end framework for automatically generating message-passing parallel code for tiled nested for-loops. It considers general parallelepiped tiling transformations and general convex iteration spaces. We address all problems regarding both the generation of sequential tiled code and its parallelization. We have implemented our techniques in a tool which automatically generates MPI parallel code and conducted several series of experiments, co…
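To make the idea concrete, the following is a minimal sketch of what tiled loop code and its message-passing parallelization look like in principle. It is not the tool's generated output: it assumes a square N x N iteration space, a simple rectangular TI x TJ tile shape, a dependence-free loop body, and a cyclic distribution of tile rows over MPI ranks, whereas the framework described in the abstract handles general parallelepiped tilings, convex iteration spaces, and the communication implied by data dependences. The names N, TI, TJ and the loop body are illustrative only.

/* Minimal illustration of tiled loop code plus MPI parallelization.
 * NOT the paper's generated code: rectangular tiles, rectangular
 * iteration space, and no inter-tile communication are assumed. */
#include <mpi.h>
#include <stdio.h>

#define N  1024          /* iteration space extent (assumed square) */
#define TI 64            /* tile size in the i dimension            */
#define TJ 64            /* tile size in the j dimension            */

static double A[N][N];   /* data touched by the placeholder body    */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Sequential tiled code would enumerate every tile (ii, jj).
     * Here each MPI rank executes a cyclic slice of the tile rows,
     * so a tile is the unit of work distribution. */
    for (int ii = rank * TI; ii < N; ii += size * TI)   /* tile loop i  */
        for (int jj = 0; jj < N; jj += TJ)               /* tile loop j  */
            /* intra-tile (point) loops, clamped to the space bounds */
            for (int i = ii; i < ii + TI && i < N; i++)
                for (int j = jj; j < jj + TJ && j < N; j++)
                    A[i][j] = (double)(i + j);           /* placeholder body */

    if (rank == 0)
        printf("done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, each rank executes only its own tiles. In the paper's setting, non-rectangular (parallelepiped) tiles additionally require careful loop-bound computation and message passing between dependent tiles, which this sketch deliberately omits.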

Cited by 9 publications (6 citation statements)
References 27 publications (43 reference statements)
“…In the digital signal processing domain, Bondalapati [7] tries to parallelize nested loops. In [24], the authors automatically parallelize tiled loop nests using a message-passing interface (MPI [22]). Loop-level parallelism for coarse-grained reconfigurable architectures is introduced in [43], while Hogstedt et al [27] investigate the parallel execution time of tiled loop nests.…”
Section: Discussion Of Related Work
confidence: 99%
“…To parallelize such loops, they exploit the distributed memory available in the reconfigurable architecture by implementing a data context switching technique. Goumas et al [36] propose a framework to automatically generate parallel code for tiled nested loops. They have implemented several loop transformations within the proposed approach using MPI [22], the Message Passing Interface.…”
Section: Relevant Prior Work
confidence: 99%
“…This introduces rescaling to q processors. Next we apply rules (9), (14), and (16)–(18) to formally parallelize:…”
Section: Rescaling FFTs Using SPIRAL
confidence: 99%
“…A compiler framework for generating MPI code for arbitrarily tiled for-loop nests by performing various loop transformations to gain inherent coarse-grained parallelism is presented in [14]. [18] describes the generation of collective communication MPI code by automatically searching for the best algorithm on a given system.…”
Section: Introduction
confidence: 99%
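As a rough illustration of the idea attributed to [18] above (selecting, by measurement, the best collective algorithm for a given system), the sketch below times two functionally equivalent broadcast implementations and reports which was faster. It is not the method of [18]: the candidate set, the message size, and the helper names time_builtin_bcast and time_flat_bcast are assumptions made purely for illustration.

/* Sketch of empirically choosing a collective implementation on a given
 * system (in the spirit of [18], not its actual algorithm): time two
 * equivalent broadcasts and keep the faster one for this message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Candidate 1: the MPI library's built-in broadcast. */
static double time_builtin_bcast(double *buf, int n, MPI_Comm comm)
{
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, comm);
    return MPI_Wtime() - t0;
}

/* Candidate 2: a naive flat broadcast built from point-to-point sends. */
static double time_flat_bcast(double *buf, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    if (rank == 0)
        for (int p = 1; p < size; p++)
            MPI_Send(buf, n, MPI_DOUBLE, p, 0, comm);
    else
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
    return MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 16;                      /* message size under test */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(n * sizeof *buf);
    memset(buf, 0, n * sizeof *buf);

    double t1 = time_builtin_bcast(buf, n, MPI_COMM_WORLD);
    double t2 = time_flat_bcast(buf, n, MPI_COMM_WORLD);

    if (rank == 0)                               /* report rank 0's timings */
        printf("builtin %.6f s vs flat %.6f s -> use %s\n",
               t1, t2, t1 <= t2 ? "builtin" : "flat");

    free(buf);
    MPI_Finalize();
    return 0;
}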