Abstract. We present an automatic, static program transformation that schedules and generates ecient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the pips/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled eectively at compile-time. We present experimental results obtained with the Polybench 2.0, some Rodinia benchmarks, and with a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naïve parallelization using a modern gpu with Par4All, hmpp, and pgi, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.
Pythran is a young open source static compiler that turns modules written in a subset of Python into native ones. Based on the fact that scientific modules do not rely much on the dynamic features of the language, it trades them in favor of powerful, eventually inter procedural, optimizations. These include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE.In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox (NT2) for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical ones such as expression templates.The input code remains compatible with the Python interpreter, and output code is generally as efficient as the annotated Cython equivalent, if not more, without the backward compatibility loss of Cython. Numpy expressions run faster than when compiled with numexpr, without any change of the original code.
No abstract
Pythran is a young open source static compiler that turns modules written in a subset of Python into native ones. Based on the fact that scientific modules do not rely much on the dynamic features of the language, it trades them in favor of powerful, eventually inter procedural, optimizations. These include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox (NT2) for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical ones such as expression templates. The input code remains compatible with the Python interpreter, and output code is generally as efficient as the annotated Cython equivalent, if not more, without the backward compatibility loss of Cython. Numpy expressions run faster than when compiled with numexpr, without any change of the original code.
Abstract. We present a particle-mesh N-body integrator running on GPU using CUDA. Relying on a grid-based description of the gravitational potential, it can simulate the evolution of self-interacting 'stars' in order to model e.g. galaxies. All the steps of the application have been ported on the GPU , namely 1/ an histogramming algorithm with CUDPP, 2/ of the resolution of the Poisson equation by means of FFT with CUFFT and multi-grid relaxation, 3/ of an optimized finitedifference scheme to compute the accelerations of stars and 4/ of an update procedure for positions and velocities. We present several tests at different resolution, and reach a speedup from 2 to 50 depending on the resolution and on the test case.
International audienceThe current microarchitecture trend leads toward heterogeneity. This evolution is driven by the end of Moore's law and the frequency wall due to the power wall. Moreover, with the spreading of smartphone, some constraints from the mobile world drive the design of most new architectures. An immediate consequence is that an application has to be executable on various targets. Porting and maintaining multiple versions of the code base requires different skills and the efforts required in the process as well as the increased complexity in debugging and testing are time consuming, thus expensive. Some solutions based on compilers emerge. They are based either on directives added to C like in openhmpp or openacc or on automatic solution like pocc, Pluto, ppcg, or Par4All. However compilers cannot retarget in an efficient way any program written in a low-level language such as unconstrained C. Programmers should follow good practices when writing code so that compilers have more room to perform the transformations required for efficient execution on heterogeneous targets. This chapter explores the impact of different patterns used by programmers, and defines a set of good practices allowing a compiler to generate efficient code
Abstract. Automatic data transfer generation is a critical step for guided or automatic code generation for accelerators using distributed memories. Although good results have been achieved for loop nests, more complex control ows such as switches or while loops are generally not handled. This paper shows how to leverage the convex array regions abstraction to generate data transfers. The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control ow in loop bodies. Generated transfers are approximated when an exact solution cannot be found. Array regions are also used to extend redundant load store elimination to array variables. The approach has been successfully applied to GPUs and domain-specic hardware accelerators.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.