Abstract: Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet…
“…Additional pre/post-processing is limited to point (pixel-wise) operators. Polyhedral optimization must be considered [38], [39] for extending to stencil or areawise operations.…”
“…In the case of iterative stencils we use the problem size of the outermost space dimension. As an illustration, given a problem size of 1024 and a 2D dominating array, the upper bound produced would be $\log_{1.15}(64)$.…”
Section: Modeling Resource Constraints
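The upper bound quoted above, the base-1.15 logarithm of 64, can be evaluated numerically with a change of base. The short sketch below only computes that expression; the excerpt does not state how the values 64 and 1.15 follow from the problem size of 1024 and the 2D array, so no derivation is attempted here. The helper name `log_base` is hypothetical and is not taken from the cited paper.

```python
import math


def log_base(value: float, base: float) -> float:
    """Return the logarithm of `value` in the given `base` (change of base)."""
    return math.log(value) / math.log(base)


# Evaluate the bound quoted in the citation statement: log base 1.15 of 64.
# The inputs 64 and 1.15 are copied from the quote, not derived here.
bound = log_base(64, 1.15)
print(f"log_1.15(64) = {bound:.2f}")
```

The result is a little under 30, so under this reading the bound is a small two-digit number rather than something on the order of the problem size.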
“…Other works such as Flextended Tiles [64] improve on the traditional overlapped tiles [37] with the goal of reducing redundant computations. Previously, split tiling [20] and hexagonal tiling [19] were also used in multiple automated code generators, in particular, for iterative stencil computations, to enhance the parallelism.…”
Loop tiling is a key high-level transformation that is known to maximize locality in loop-intensive programs. It has been successfully applied to a number of applications, including tensor contractions, iterative stencils and machine learning. The technique has also been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation depends largely on one set of inputs, the tile sizes, because of the complex trade-off between locality and parallelism. This problem is exacerbated on GPGPU architectures by limited hardware resources such as the available shared memory. In this paper we present a new technique to compute resource-conscious tile sizes for affine programs. We use Integer Linear Programming (ILP) constraints and objectives in a cross-compiler fashion to faithfully and effectively mimic the transformations applied in a polyhedral GPU compiler (PPCG). Our approach significantly reduces the need for experimental auto-tuning by generating only two tile size configurations that achieve strong out-of-the-box performance. We evaluate the effectiveness of our technique using the PolyBench benchmark suite on two GPGPUs, an AMD Radeon VII and an NVIDIA Tesla V100, using the OpenCL and CUDA programming models. Experimental validation reveals that our approach achieves nearly 75% of the performance of the best empirically found tile configuration across both architectures.
CCS CONCEPTS: • Software and its engineering → Compilers; • General and reference → Performance; • Mathematics of computing → Combinatorial optimization; • Computer systems organization → Parallel architectures.