Register constraints are usually taken into account during the scheduling pass of an acyclic data dependence graph (DAG): any schedule of the instructions inside a basic block must keep the register requirement below a certain limit. In this work, we show how to handle register pressure before the instruction scheduling of a DAG. We mathematically study an approach that consists in computing the exact upper bound of the register need over all valid schedules of a given DAG, independently of the functional unit constraints. We call this computed limit the register saturation (RS) of the DAG. Its aim is to detect obsolete register constraints, i.e., cases where RS does not exceed the number of available registers. If RS does exceed it, we add serial edges to the original DAG so that the worst-case register need does not exceed the number of available registers. We propose an appropriate mathematical formalism for this problem. Our generic processor model takes into account superscalar, VLIW and EPIC/IA64 architectures. Our in-depth analysis of the problem and our formal methods enable us to provide nearly optimal heuristics and strategies for register optimization in the presence of ILP.
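As an illustration of the quantity being bounded, the sketch below evaluates the register need (MAXLIVE) of one given schedule of a DAG; the register saturation described above is the maximum of this quantity over all valid schedules, which the paper characterises without enumerating them. The dictionary-based representation here is an assumption for illustration, not the paper's formalism.

```python
def register_need(schedule, consumers):
    """schedule: {instr: issue cycle}; consumers: {instr: list of consumer instrs}.
    Returns the maximum number of simultaneously live values (MAXLIVE)."""
    # a value lives from its definition cycle to the cycle of its last use
    intervals = [(schedule[d], max(schedule[u] for u in users))
                 for d, users in consumers.items() if users]
    cycles = {c for s, e in intervals for c in range(s, e + 1)}
    # count, cycle by cycle, how many live ranges overlap
    return max((sum(1 for s, e in intervals if s <= c <= e) for c in cycles),
               default=0)
```

For example, with two values both consumed by a third instruction, the two live ranges overlap and the need is 2; a scheduler constrained by RS would try to keep this maximum under the number of available registers.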
Register allocation in loops is generally performed after or during the software pipelining process, because a conventional register allocation performed as a first step, without assuming a schedule, lacks the information about interferences between value live ranges. The register allocator may then introduce an excessive amount of false dependences that dramatically reduce the ILP (instruction-level parallelism). We present a new theoretical framework for controlling the register pressure before software pipelining. It is based on inserting anti-dependence edges (register reuse edges) labeled with reuse distances directly into the data dependence graph. In this new graph, we are able to fix the register pressure, measured as the number of simultaneously alive variables in any schedule. The determination of reuse registers and distances is parameterized by the desired minimum initiation interval (MII) as well as by the register pressure constraints: either can be minimized while the other is fixed. After scheduling, register allocation is done on conventional register sets or on rotating register files. We give an optimal exact model, and an approximation that generalizes the Ning-Gao [22] buffer optimization method. We provide experimental results which show good improvement compared to [22]. Our theoretical model considers superscalar, VLIW and EPIC/IA64 processors.
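To make the role of the MII concrete, here is a minimal, stdlib-only sketch of the recurrence-constrained MII of a loop's data dependence graph, computed by enumerating circuits (exponential in general, so only suitable for small graphs; the edge encoding is an assumption for illustration, not the paper's model):

```python
from math import ceil

def simple_cycles(graph):
    """Enumerate simple circuits by DFS (fine for small dependence graphs).
    Each circuit is found once, rooted at its smallest node label."""
    cycles = []
    def dfs(start, node, path):
        for succ in graph.get(node, []):
            if succ == start:
                cycles.append(path[:])
            elif succ not in path and succ > start:
                dfs(start, succ, path + [succ])
    for n in sorted(graph):
        dfs(n, n, [n])
    return cycles

def recurrence_mii(edges):
    """edges: {(u, v): (latency, distance)}. Returns the recurrence MII:
    max over circuits C of ceil(sum of latencies / sum of distances)."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    mii = 1
    for cyc in simple_cycles(graph):
        circuit = list(zip(cyc, cyc[1:] + cyc[:1]))  # close the circuit
        lat = sum(edges[e][0] for e in circuit)
        dist = sum(edges[e][1] for e in circuit)
        if dist > 0:
            mii = max(mii, ceil(lat / dist))
    return mii
```

In the framework above, inserted reuse edges carry distances that enter exactly this kind of circuit computation, which is why reuse distances can be traded against the MII.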
In the area of high-performance computing and embedded systems, numerous code optimisation methods exist to accelerate the speed of computation (or to optimise another performance criterion). They are usually evaluated by making multiple observations of the initial and the optimised execution times of a programme in order to declare a speedup. Even with a fixed input and execution environment, programme execution times vary in general. Hence, different kinds of speedups may be reported: the speedup of the average execution time, the speedup of the minimal execution time, the speedup of the median, and others. Many published speedups in the literature are observations from a set of experiments. To improve the reproducibility of experimental results, this article presents a rigorous statistical methodology for programme performance analysis. We rely on well-known statistical tests (Shapiro-Wilk's test, Fisher's F-test, Student's t-test, Kolmogorov-Smirnov's test and Wilcoxon-Mann-Whitney's test) to study whether the observed speedups are statistically significant or not. By fixing a desired risk level 0 < α < 1, we are able to analyse the statistical significance of the average execution time as well as the median. We can also check whether P[X > Y] > 1/2, the probability that an individual execution of the optimised code is faster than an individual execution of the initial code. In addition, we can compute the confidence interval of the probability of obtaining a speedup on a randomly selected benchmark that does not belong to the initial set of tested benchmarks. Our methodology is a consistent improvement over the usual performance analysis method in high-performance computing. We explain in each situation the hypotheses that must be checked to declare a correct risk level for the statistics. The Speedup-Test protocol, which certifies the observed speedups with rigorous statistics, is implemented and distributed as an open source tool based on the R software.
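As an illustrative fragment of such a protocol, the stdlib-only sketch below implements one of the listed tests, Wilcoxon-Mann-Whitney with the usual normal approximation, to check whether P[X > Y] > 1/2 at risk level α. The actual Speedup-Test tool is distributed as an R package; this function and its name are hypothetical.

```python
from math import sqrt, erf

def mann_whitney_speedup(old_times, new_times, alpha=0.05):
    """One-sided Wilcoxon-Mann-Whitney test (normal approximation):
    H1 = an individual run of the optimised code tends to be faster
    than an individual run of the initial code, i.e. P[X > Y] > 1/2."""
    n, m = len(old_times), len(new_times)
    # U = number of (old, new) pairs where the old run is slower (ties count 1/2)
    u = sum(1.0 for x in old_times for y in new_times if x > y) \
        + sum(0.5 for x in old_times for y in new_times if x == y)
    mean_u = n * m / 2
    sd_u = sqrt(n * m * (n + m + 1) / 12)  # no-ties variance approximation
    z = (u - mean_u) / sd_u
    p = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # one-sided p-value
    return p < alpha
```

The full protocol also checks normality (Shapiro-Wilk) to decide between parametric and non-parametric tests for the mean and median; this sketch covers only the non-parametric comparison of individual runs.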
The first principle is to provide a mathematical proof, given a theoretical model, that the published code optimisation method is correct and/or efficient: this is the hardest part of research in computer science, because if the theoretical model is too simple, it does not represent the real world, and if the model is too close to the real world, the mathematics becomes too complex to digest. A second principle for code optimisation in general is to propose and implement a code transformation technique and to apply it on a set of chosen benchmarks in order to evaluate its efficiency. This article concerns this last point: how can we use rigorous statistics to compare the performances of two versions of the same programme? What makes a binary programme's execution time vary on a modern multicore processor, even if we use the same data input, the same binary and the same execution environment? Here are some factors: 1. Intrinsic factors of the programme itself: synchronisation functions, OS calls and others. 2. Factors related to the execution environment: ...
This paper solves an open problem regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates reuse circuits spanning multiple loop iterations. These circuits require periodic register allocation, which in turn yields a code generation challenge, generally addressed through: (1) hardware support (rotating register files), deemed too expensive for embedded processors; (2) insertion of register moves, with a high risk of reducing the computation throughput, i.e. the initiation interval (II), of software pipelining; and (3) post-pass loop unrolling, which does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers are sufficient for periodic register allocation [2, 3, 5]; yet the only heuristic to control the amount of post-pass loop unrolling does not achieve this bound and leads to undesired register spills [4, 7]. We propose a periodic register allocation technique allowing a software-only code generation that does not trade the optimality of the II for compactness of the generated code. Our idea is based on using the remaining registers: calling Rarch the number of architectural registers of the target processor, the number of remaining registers that can be used for minimising the unrolling degree is equal to Rarch − MAXLIVE. We provide a complete formalisation of the problem and algorithm, followed by extensive experiments. We achieve practical loop unrolling degrees in most cases, with no increase of the II, while state-of-the-art techniques would either induce register spilling, degrade the II or lead to unacceptable code growth.
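A toy version of the idea can be sketched as follows: reuse circuits with distances d1, ..., dk consume sum(di) registers and force an unrolling degree of lcm(d1, ..., dk); the spare registers (Rarch − MAXLIVE) can pad some distances to shrink that lcm. The brute-force search below is illustrative only and is not the paper's algorithm:

```python
from math import gcd
from functools import reduce
from itertools import product

def lcm(nums):
    """Least common multiple of a list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), nums, 1)

def min_unroll(distances, spare):
    """Distribute up to `spare` extra registers over the reuse circuits
    (adding to their distances) to minimise the unrolling degree.
    Exhaustive search, so only suitable for small inputs."""
    best = lcm(distances)
    for extra in product(range(spare + 1), repeat=len(distances)):
        if sum(extra) <= spare:
            best = min(best, lcm([d + e for d, e in zip(distances, extra)]))
    return best
```

For instance, circuits of distances 3 and 4 force an unrolling degree of 12, but one spare register padding the first circuit to distance 4 reduces it to 4, which is the kind of trade the paper exploits without increasing the II.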
Program performance optimisations, feedback-directed iterative compilation and auto-tuning systems [1] all assume a fixed estimate of execution time for a fixed input data of the program. However, in practice we observe non-negligible program performance variations on hardware platforms. While these variations are insignificant for sequential applications, we show that parallel native OpenMP programs have less performance stability. This article does not try to quantify or qualify the factors influencing the variations of program execution times; we leave this for future work. It demonstrates three observations: 1) the performance variations of sequential applications are insignificant; 2) OpenMP program execution times on multi-core platforms show important variations; 3) the distribution of execution times is not Gaussian in almost all cases. We finish with a discussion explaining why neither the minimal nor the mean execution time within a sample of experiments is the best estimate of program performance.
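The closing point can be illustrated with a small, stdlib-only sketch contrasting the three estimators on a sample containing one slow outlier (the helper name is hypothetical):

```python
import statistics

def summarize(times):
    """Compare common summaries of a sample of execution times."""
    times = sorted(times)
    return {
        'min': times[0],                     # optimistic: ignores variability
        'mean': statistics.fmean(times),     # skewed by rare slow outliers
        'median': statistics.median(times),  # robust central tendency
    }
```

On a sample such as [1.0, 1.1, 1.2, 9.0] seconds, the minimum (1.0) hides the variability and the mean (3.075) is dominated by the single slow run, while the median (1.15) stays representative of a typical execution; this is why, for non-Gaussian timing distributions, the median is usually the more defensible summary.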