Abstract: Optimizing compilers apply numerous interdependent optimizations, leading to the notoriously difficult phase-ordering problem: that of deciding which transformations to apply and in which order. Fortunately, new infrastructures such as the polyhedral compilation framework host a variety of transformations, facilitating the efficient exploration and configuration of multiple transformation sequences. Many powerful optimizations, however, remain external to the polyhedral framework, including vectorization. The l…
“…In this work we automatically apply post-transformations to expose parallelism at the innermost loop level, if possible. Previous work on SIMD vectorization for affine programs has proposed effective solutions to expose inner-loop-level parallelism [12,37], and we seamlessly reuse those techniques to enable effective loop pipelining on the FPGA. This is achieved by using additional constraints during and after the tiling hyperplane computation, to preserve one level of inner parallelism.…”
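The shape this transformation aims for can be sketched with a small, hypothetical kernel (the kernel, tile size, and names are illustrative, not taken from the cited framework): after tiling, one parallel loop is kept innermost, so its iterations carry no dependence and can be pipelined at initiation interval 1 on an FPGA, or vectorized.

```c
#include <assert.h>

#define N 64
#define T 8  /* tile size; value is illustrative */

/* Tiled matrix-vector product y = A^T x. The reduction dimension i is
 * tiled, while the parallel dimension j is kept innermost: consecutive
 * j iterations touch distinct y[j] and are independent, so the inner
 * loop can be pipelined (or vectorized) without synchronization. */
void mv_tiled(const int A[N][N], const int x[N], int y[N]) {
    for (int j = 0; j < N; j++)
        y[j] = 0;
    for (int ii = 0; ii < N; ii += T)          /* tile over reduction dim */
        for (int i = ii; i < ii + T; i++)
            for (int j = 0; j < N; j++)        /* preserved parallel inner loop */
                y[j] += A[i][j] * x[i];
}
```

In an HLS flow one would typically mark the inner loop for pipelining; the point here is only that the schedule leaves a dependence-free loop at the innermost level.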
Section: Loop Pipelining and Task Parallelism
Many applications, such as medical imaging, generate intensive data traffic between the FPGA and off-chip memory. Significant improvements in execution time can be achieved with effective utilization of on-chip (scratchpad) memories, combined with careful software-based data reuse and communication scheduling techniques. We present a fully automated C-to-FPGA framework to address this problem. Our framework effectively implements data reuse through aggressive loop transformation-based program restructuring. In addition, our proposed framework automatically implements critical performance optimizations such as task-level parallelization, loop pipelining, and data prefetching. We leverage the power and expressiveness of the polyhedral compilation model to develop a multi-objective optimization system for off-chip communication management. Our technique can satisfy hardware resource constraints (scratchpad size) while aggressively exploiting data reuse. Our approach can also be used to reduce the on-chip buffer size subject to bandwidth constraints. We also implement a fast design space exploration technique for effective optimization of program performance using the Xilinx high-level synthesis tool.
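The data-reuse idea described above can be illustrated with a minimal sketch (the stencil, buffer name, and sizes are hypothetical, not the framework's actual interface): a tile of the input plus its halo is copied once into a small local buffer standing in for an on-chip scratchpad, and each element is then read several times from that buffer instead of being re-fetched from off-chip memory.

```c
#include <string.h>

#define N 64
#define T 16  /* tile size; bounded by scratchpad capacity in practice */

/* Software-managed data reuse for a 3-point stencil: each tile of T
 * outputs needs T + 2 inputs (tile plus halo). The memcpy models a
 * single burst prefetch into on-chip memory; each buffered element is
 * then read up to three times locally rather than three times off-chip. */
void stencil_reuse(const int in[N + 2], int out[N]) {
    int buf[T + 2];  /* models an on-chip scratchpad buffer */
    for (int t = 0; t < N; t += T) {
        memcpy(buf, &in[t], (T + 2) * sizeof(int));  /* burst prefetch */
        for (int i = 0; i < T; i++)
            out[t + i] = buf[i] + buf[i + 1] + buf[i + 2];
    }
}
```

Shrinking T trades on-chip buffer size against the number of off-chip bursts, which is the kind of size-versus-bandwidth trade-off the abstract describes.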
“…Several previous studies have shown how tiling, parallelization, vectorization, or data locality enhancement can be efficiently addressed in an affine transformation framework [21], [34], [14], [24], [36]. Any loop transformation can be expressed in the polyhedral representation, and composing arbitrarily complex sequences of loop transformations is seamlessly handled by the framework.…”
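The claim that arbitrary loop transformations compose seamlessly follows from the model itself: in the polyhedral representation a transformation is an affine map on iteration vectors, so composing transformations is just composing matrices. A minimal sketch (2-deep nest, integer schedules only; names are illustrative):

```c
/* A polyhedral schedule for a 2-deep loop nest, restricted here to a
 * linear map: theta(i, j) = M * (i, j). Loop interchange is the
 * permutation matrix [[0,1],[1,0]]; composing two transformations is
 * multiplying their matrices, so interchange twice yields the identity. */
static void apply_schedule(const int M[2][2], const int it[2], int out[2]) {
    for (int r = 0; r < 2; r++)
        out[r] = M[r][0] * it[0] + M[r][1] * it[1];
}

static void compose(const int A[2][2], const int B[2][2], int C[2][2]) {
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            C[r][c] = A[r][0] * B[0][c] + A[r][1] * B[1][c];
}
```

Real polyhedral tools work with affine maps over parametric integer sets rather than bare matrices, but the composition principle is the same.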
Section: Optimization Space
“…Our approach to vectorization leverages recent analytical modeling results by Trifunovic et al. [36]. We take advantage of the polyhedral representation to restructure imperfectly nested programs and expose vectorizable inner loops.…”
Section: SIMD-level Parallelization
“…The most important part of the transformation to enable vectorization comes from the selection of which parallel loop is moved to the innermost position. The cost model selects a synchronization-free loop that minimizes the memory stride of the data accessed by two contiguous iterations of the loop [36]. Note that this interchange may not always lead to the optimal vectorization, and is simply useless on a machine that does not support SIMD instructions.…”
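The stride criterion in this snippet can be made concrete with a small sketch (the helper and its inputs are hypothetical, not the cited cost model's actual interface): for a row-major access `A[i][j]` on an array with NC columns, advancing `i` by one iteration moves the address by NC elements while advancing `j` moves it by 1, so among the parallel candidate loops the one with the smallest stride is moved innermost to obtain contiguous SIMD loads.

```c
/* Illustrative stride-based selection: stride[k] is the per-iteration
 * memory stride (in elements) of a representative access along the
 * k-th parallel candidate loop. The minimal-stride loop is chosen as
 * the innermost loop, since stride-1 accesses map to contiguous
 * (unit-stride) vector loads and stores. */
static int pick_innermost(const int stride[], int nloops) {
    int best = 0;
    for (int k = 1; k < nloops; k++)
        if (stride[k] < stride[best])
            best = k;
    return best;
}
```

For `A[i][j]` on a 64-column array, the candidates `{i, j}` have strides `{64, 1}`, so `j` is moved innermost, which is exactly the classic interchange that turns a column-wise traversal into a vectorizable row-wise one.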
Abstract: Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. However, existing static cost models for polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative to these models for finding the most effective transformations. Since the number of polyhedral optimization alternatives can be enormous, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multi-core platforms.
“…[15,26,28,30,43]. These works usually focus either on the back end, that is, the actual SIMD code generation from a parallel loop [15,28,30], or only on the high-level loop transformation angle [12,26,38,40]. To the best of our knowledge, our work is the first to address both problems simultaneously by setting a well-defined interface between a powerful polyhedral high-level transformation engine and a specialized SIMD code generator.…”
Data locality and parallelism are critical optimization objectives for performance on modern multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain parallelism (e.g., vector SIMD) must be effectively exploited, but despite decades of progress at both ends, current compiler optimization schemes that attempt to address data locality and both kinds of parallelism often fail at one of the three objectives. We address this problem by proposing a 3-step framework, which aims for integrated data locality, multi-core parallelism and SIMD execution of programs. We define the concept of vectorizable codelets, with properties tailored to achieve effective SIMD code generation for the codelets. We leverage the power of a modern high-level transformation framework to restructure a program to expose good ISA-independent vectorizable codelets, exploiting multi-dimensional data reuse. Then, we generate ISA-specific customized code for the codelets, using a collection of lower-level SIMD-focused optimizations. We demonstrate our approach on a collection of numerical kernels that we automatically tile, parallelize and vectorize, exhibiting significant performance improvements over existing compilers.