Mehdi Amini scite author profile

Abstract. We present an automatic, static program transformation that schedules and generates ecient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the pips/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled eectively at compile-time. We present experimental results obtained with the Polybench 2.0, some Rodinia benchmarks, and with a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naïve parallelization using a modern gpu with Par4All, hmpp, and pgi, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.

show abstract

Pythran: enabling static optimization of scientific Python programs

Guelton

Brunet

Amini³

et al. 2015

Comput. Sci. Disc.

View full text Add to dashboard Cite

Pythran is a young open source static compiler that turns modules written in a subset of Python into native ones. Based on the fact that scientific modules do not rely much on the dynamic features of the language, it trades them in favor of powerful, eventually inter procedural, optimizations. These include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE.In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox (NT2) for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical ones such as expression templates.The input code remains compatible with the Python interpreter, and output code is generally as efficient as the annotated Cython equivalent, if not more, without the backward compatibility loss of Cython. Numpy expressions run faster than when compiled with numexpr, without any change of the original code.

show abstract

ThinLTO: Scalable and incremental LTO

Johnson¹,

Amini²,

Li³

2017

View full text Add to dashboard Cite

Pythran: Enabling Static Optimization of Scientific Python Programs

Guelton¹,

Brunet²,

Raynaud³

et al. 2013

View full text Add to dashboard Cite

Pythran is a young open source static compiler that turns modules written in a subset of Python into native ones. Based on the fact that scientific modules do not rely much on the dynamic features of the language, it trades them in favor of powerful, eventually inter procedural, optimizations. These include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox (NT2) for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical ones such as expression templates. The input code remains compatible with the Python interpreter, and output code is generally as efficient as the annotated Cython equivalent, if not more, without the backward compatibility loss of Cython. Numpy expressions run faster than when compiled with numexpr, without any change of the original code.

show abstract

A Particle-Mesh Integrator for Galactic Dynamics Powered by GPGPUs

Aubert

Amini

David

2009

View full text Add to dashboard Cite

Abstract. We present a particle-mesh N-body integrator running on GPU using CUDA. Relying on a grid-based description of the gravitational potential, it can simulate the evolution of self-interacting 'stars' in order to model e.g. galaxies. All the steps of the application have been ported on the GPU , namely 1/ an histogramming algorithm with CUDPP, 2/ of the resolution of the Poisson equation by means of FFT with CUFFT and multi-grid relaxation, 3/ of an optimized finitedifference scheme to compute the accelerations of stars and 4/ of an update procedure for positions and velocities. We present several tests at different resolution, and reach a speedup from 2 to 50 depending on the resolution and on the test case.

show abstract

Program Sequentially, Carefully, and Benefit from Compiler Advances for Parallel Heterogeneous Computing

Amini¹,

Ancourt²,

Creusillet³

et al.

View full text Add to dashboard Cite

International audienceThe current microarchitecture trend leads toward heterogeneity. This evolution is driven by the end of Moore's law and the frequency wall due to the power wall. Moreover, with the spreading of smartphone, some constraints from the mobile world drive the design of most new architectures. An immediate consequence is that an application has to be executable on various targets. Porting and maintaining multiple versions of the code base requires different skills and the efforts required in the process as well as the increased complexity in debugging and testing are time consuming, thus expensive. Some solutions based on compilers emerge. They are based either on directives added to C like in openhmpp or openacc or on automatic solution like pocc, Pluto, ppcg, or Par4All. However compilers cannot retarget in an efficient way any program written in a low-level language such as unconstrained C. Programmers should follow good practices when writing code so that compilers have more room to perform the transformations required for efficient execution on heterogeneous targets. This chapter explores the impact of different patterns used by programmers, and defines a set of good practices allowing a compiler to generate efficient code

show abstract

Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Guelton

Amini

Creusillet³

2013

View full text Add to dashboard Cite

Abstract. Automatic data transfer generation is a critical step for guided or automatic code generation for accelerators using distributed memories. Although good results have been achieved for loop nests, more complex control ows such as switches or while loops are generally not handled. This paper shows how to leverage the convex array regions abstraction to generate data transfers. The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control ow in loop bodies. Generated transfers are approximated when an exact solution cannot be found. Array regions are also used to extend redundant load store elimination to array variables. The approach has been successfully applied to GPUs and domain-specic hardware accelerators.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Mehdi Amini

MLIR: Scaling Compiler Infrastructure for Domain Specific Computation

Static Compilation Analysis for Host-Accelerator Communication Optimization

Pythran: enabling static optimization of scientific Python programs

ThinLTO: Scalable and incremental LTO

Pythran: Enabling Static Optimization of Scientific Python Programs

A Particle-Mesh Integrator for Galactic Dynamics Powered by GPGPUs

Program Sequentially, Carefully, and Benefit from Compiler Advances for Parallel Heterogeneous Computing

Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Contact Info

Product

Resources

About