High Performance Fortran (HPF) was developed to support data parallel programming for single-instruction multiple-data (SIMD) and multiple-instruction multiple-data (MIMD) machines with distributed memory. The programmer is provided a familiar uniform logical address space and specifies the data distribution by directives. The compiler then exploits these directives to allocate arrays in the local memories, to assign computations to elementary processors, and to migrate data between processors when required. We show here that linear algebra is a powerful framework to encode HPF directives and to synthesize distributed code with space-efficient array allocation, tight loop bounds, and vectorized communications forINDEPENDENTloops. The generated code includes traditional optimizations such as guard elimination, message vectorization and aggregation, and overlap analysis. The systematic use of an affine framework makes it possible to prove the compilation scheme correct.
International audienceThis paper provides both theoretical and experimental evidence for the existence of an Energy/Frequency Convexity Rule, which relates energy consumption and CPU frequency on mobile devices. We monitored a typical smartphone running a speci c computing-intensive kernel of multiple nested loops written in C using a high-resolution power gauge. Data gathered during a week-long acquisition campaign suggest that energy consumed per input element is strongly correlated with CPU frequency, and, more interestingly, the curve exhibits a clear minimum over a 0.2 GHz to 1.6 GHz window. We provide and motivate an analytical model for this behavior, which ts well with the data. Our work should be of clear interest to researchers focusing on energy usage and minimization for mobile devices, and provide new insights for optimization opportunities
International audienceModular static analyzers use procedure abstractions, a.k.a. summarizations, to ensure that their execution time increases linearly with the size of analyzed programs. A similar abstraction mechanism is also used within a procedure to perform a bottom-up analysis. For instance, a sequence of instructions is abstracted by combining the abstractions of its components, or a loop is abstracted using the abstraction of its loop body: fixed point iterations for a loop can be replaced by a direct computation of the transitive closure of the loop body abstraction. More specifically, our abstraction mechanism uses affine constraints, i.e. polyhedra, to specify pre- and post-conditions as well as state transformers. We present an algorithm to compute the transitive closure of such a state transformer, and we illustrate its performance on various examples. Our algorithm is simple, based on discrete differentiation and integration: it is very different from the usual abstract interpretation fixed point computation based on widening. Experiments are carried out using previously published examples. We obtain the same results directly, without using any heuristic
Abstract-We introduce and experimentally validate a new macro-level model of the CPU temperature/power relationship within nanometer-scale application processors or system-onchips. By adopting a holistic view, this model is able to take into account many of the physical effects that occur within such systems. Together with two algorithms described in the paper, our results can be used, for instance by engineers designing power or thermal management units, to cancel the temperatureinduced bias on power measurements. This will help them gather temperature-neutral power data while running multiple instance of their benchmarks. Also power requirements and system failure rates can be decreased by controlling the CPU's thermal behavior.Even though it is usually assumed that the temperature/power relationship is exponentially related, there is however a lack of publicly available physical temperature/power measurements to back up this assumption, something our paper corrects. Via measurements on two pertinent platforms sporting nanometerscale application processors, we show that the power/temperature relationship is indeed very likely exponential over a 20• C to 85• C temperature range. Our data suggest that, for application processors operating between 20• C and 50• C, a quadratic model is still accurate and a linear approximation is acceptable.
Abstract. We present an automatic, static program transformation that schedules and generates ecient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the pips/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled eectively at compile-time. We present experimental results obtained with the Polybench 2.0, some Rodinia benchmarks, and with a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naïve parallelization using a modern gpu with Par4All, hmpp, and pgi, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.
Algebraic properties such as associativity or distributivity allow the manipulation of a set of mathematically equivalent expressions. However, as shown in this paper, the cost of evaluating such expressions on a computer is not constant within this domain. We suggest the use of algebraic transformations to improve the performance of computationally intensive applications on modern computer architectures. We claim that taking into account instruction-level parallelism and the new capabilities of processors when applying these transformations leads to large run-time improvements. Due to a combinatorial explosion, associativecommutative pattern-matching techniques cannot systematically be used in this context. Thus, we introduce two performance enhancing algorithms providing factorization and multiply-add extraction heuristics and choice criteria based on a simple cost model. This paper describes our approach and a first implementation. Experiments on real code, including an excerpt from SPEC FP95, are very promising since we automatically obtain the same results as manual transformations, with a performance improvement by a factor of up to 3.
Applications with varying array access patterns require to dynamically change array mappings on distributed-memory parallel machines. Hpf (High Performance Fortran) provides such remappings, on data that can be replicated, explicitly through the realign and redistribute directives and implicitly at procedure calls and returns. However such features are left out of the hpf subset or of the currently discussed hpf kernel for e ciency reasons. This paper presents a new compilation technique to handle hpf remappings for message-passing parallel architectures. The rst phase is global and removes all useless remappings that appear naturally in procedures. The code generated by the second phase takes advantage of replications to shorten the remapping time. It is proved optimal: A minimal number of messages, containing only the required data, is sent over the network.The technique is fully implemented in hpfc, our prototype hpf compiler. Experiments were performed on a Dec Alpha farm.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.