Traditional speedup models, such as Amdahl's Law, Gustafson's, and Sun and Ni's models, have helped the research community and industry to better understand the performance capabilities of systems and the parallelizability of applications. Mostly targeting homogeneous hardware platforms, or a limited form of processor heterogeneity, these models do not cover the newly emerging multi-core heterogeneous architectures. This paper reports novel speedup and energy consumption models based on a more general representation of heterogeneity, called normal form heterogeneity, which supports a wide range of heterogeneous many-core architectures. The modelling method aims to predict system energy efficiency and performance ranges, and facilitates research and development at the hardware and system software levels. Extensive experimentation on an off-the-shelf big.LITTLE heterogeneous platform validates the models, showing less than 1% error for speedup and less than 4% error for power dissipation. The practical use of the method is demonstrated with a quantitative study of system load-balancing efficiency.
Traditional speedup models, such as Amdahl's Law, facilitate the study of the impact of running parallel workloads on many-core systems. However, these models are typically based on software characteristics, assuming ideal hardware behaviour. As such, the applicability of these models for energy- and/or performance-driven system optimization is limited by two factors: firstly, speedup cannot be measured without instrumenting the original software code, and secondly, the parallelization factor of an application running on specific hardware is generally unknown. In this paper, we propose a novel method whereby standard performance counters found in modern many-core platforms can be used to derive speedup without instrumenting applications for time measurements. We postulate that speedup can be accurately estimated as the ratio of instructions per cycle for a parallel many-core system to the instructions per cycle of a single-core system. By studying the application instructions and system instructions for the first time, our method leads to the determination of the parallelization factor and the optimal system configuration for energy and/or performance. The method is extensively demonstrated through experiments on three different platforms with core counts ranging from 4 to 61, running parallel benchmark applications (including synthetic and PARSEC benchmarks) on the Linux operating system. Speedup and parallelization estimations using our method, and their extensive cross-validations, show small errors (up to 8%) on these systems.
Additionally, we demonstrate the effectiveness of our method in exploring parallelization-aware energy-efficient system configurations for many-core systems using energy-delay-product based formulations.
Index Terms—Many-core processors; speedup; performance counters; power-normalized performance; energy-delay product.
Our contributions are:
• Extend Amdahl's speedup model by considering application-related and system-software-related overheads separately.
• Propose a new method to model parallelization and speedup via performance counters, avoiding the need to instrument applications. We show that speedup can be accurately estimated as the ratio of instructions retired per cycle of a parallel many-core system to that of a single-core system.
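The counter-based estimation summarized above can be sketched as follows. The counter readings and the Amdahl-style inversion used to recover the parallel fraction are illustrative (the exact formulation in the paper may differ), not measurements from its platforms.

```python
# Speedup estimated as an IPC ratio, per the method summarized above.
# Counter values (instructions retired, cycles) are illustrative.

def ipc(instructions, cycles):
    """Instructions retired per clock cycle."""
    return instructions / cycles

def speedup_from_counters(par_instr, par_cycles, seq_instr, seq_cycles):
    """Speedup ~ IPC(parallel run) / IPC(single-core run)."""
    return ipc(par_instr, par_cycles) / ipc(seq_instr, seq_cycles)

def parallel_fraction(speedup, n_cores):
    """Invert Amdahl's law, S = 1/((1-f) + f/n), to estimate f."""
    return (1 / speedup - 1) / (1 / n_cores - 1)

# Hypothetical counter readings for the same workload:
seq = (8.0e9, 4.0e9)   # single core: 8G instructions in 4G cycles -> IPC 2.0
par = (8.4e9, 1.2e9)   # 8 cores: some extra system instructions, fewer cycles

s = speedup_from_counters(par[0], par[1], seq[0], seq[1])
print(round(s, 2))                      # 7.0 / 2.0 = 3.5
print(round(parallel_fraction(s, 8), 3))  # ~0.816
```

Note that the parallel run retires slightly more instructions than the sequential one; separating such system-software instructions from application instructions is precisely the distinction the abstract highlights.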
For over 50 years, Amdahl's Law has been the hallmark model for reasoning about performance bounds for homogeneous parallel computing resources. As heterogeneous many-core parallel resources continue to permeate the modern server and embedded domains, there has been growing interest in developing realistic extensions and assumptions in keeping with newer use cases. This study provides a comprehensive review of the scope and insights of the extensive body of work related to Amdahl's Law to date, focusing on computation speedup. The authors show that a significant portion of these studies has analysed the scalability of the model considering both workload and system heterogeneity in real-world applications. The focus has been on improving the definition and semantic power of the two key parameters of the original model: the parallel fraction (f) and the computation capability improvement index (n). More recently, researchers have proposed normal-form and multi-fraction extensions that can account for wider ranges of heterogeneity, validated on many-core systems running realistic workloads. Speedup models from Amdahl's Law onwards have seen a wide range of uses, such as the optimisation of system execution, and these uses are even more important with the advent of the heterogeneous many-core era.
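For reference, the two parameters named above enter the original model in the standard textbook form, with Gustafson's scaled-speedup counterpart shown for contrast (these are the classical formulas, not contributions of the surveyed work):

```latex
S_{\text{Amdahl}}(f, n) = \frac{1}{(1 - f) + \dfrac{f}{n}},
\qquad
S_{\text{Gustafson}}(f, n) = (1 - f) + f\,n
```

Here f is the parallel fraction of the workload and n the number of (or capability improvement of) the processing units.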
Parallelization has been used to maintain a reasonable balance between energy consumption and performance in computing platforms, especially in modern multi- and many-core systems. This paper studies the interplay between performance and energy, and their relationships with parallelization scaling, in the context of the reliable operating region (ROR), focusing on the effectiveness of parallelization scaling in throughput-power trade-offs. Theoretical and experimental explorations show that a meaningful cross-platform analysis of this interplay can be achieved using the proposed method of binormalization of the ROR. The concept of this interplay is captured in an online tool for finding optimal operating points. In digital CMOS circuits, a higher supply voltage (henceforth called Vdd) usually permits a higher operating (clock) frequency, and hence a higher throughput, given the same hardware platform. The scheme of dynamic voltage and frequency scaling (DVFS) scales Vdd and the clock frequency (henceforth called f) together in order to obtain the best throughput under a given power budget, or to save power for a given throughput requirement [1]. It is possible to increase system throughput for a given power limit, or to reduce power whilst maintaining throughput, by combining DVFS with parallelization, i.e. scaling to multiple computation units if the computation can be parallelized [2]. A major challenge for the precise analysis of the effectiveness of using parallelization for these goals is determining the parallelizability of any particular execution, which is related to complex issues such as software and hardware architecture details and must be modelled on a per-execution basis [3]. Another challenge is that quantitative studies of power and/or throughput improvements for any DVFS decision need complicated execution-dependent models [4].
This paper explores the interplay between DVFS and parallelization scalability with respect to performance and power. The interplay is captured using the concept of a reliable operating region (ROR), which can be established from knowledge of system reliability through experiments or simulations. The ROR therefore provides containment for platform and application specifics, helping to make the further analysis steps generic. The focus of this paper is the effectiveness of parallelization scaling. The ROR-based method can explore across the entire voltage range of a platform, from the sub-threshold to the super-threshold region. The explorations and models presented in this paper confirm and explain the general view that combined DVFS and parallelization scaling produces the best advantage when the supply voltage is scaled down to near-threshold values. This is known as near-threshold computing.
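The DVFS-plus-parallelization advantage described above rests on the first-order CMOS dynamic-power relation, P ≈ αCV²f. The sketch below uses that textbook relation (not the paper's ROR formulation) with purely illustrative constants, comparing one fast core against four slower, lower-voltage cores delivering the same aggregate throughput for a fully parallel workload.

```python
# First-order CMOS dynamic-power model: P_dyn ~ alpha * C * Vdd^2 * f.
# alpha (switching activity) and c_eff (effective capacitance) are
# illustrative constants, not values measured in the paper.

def dynamic_power(alpha, c_eff, vdd, freq):
    """Dynamic power of one core (W), to first order."""
    return alpha * c_eff * vdd ** 2 * freq

alpha, c_eff = 0.2, 1e-9
# 1 core at 1.1 V, 2 GHz vs 4 cores at 0.8 V, 0.5 GHz each
# (same 2 GHz aggregate clock throughput, assuming perfect parallelization):
p_single = dynamic_power(alpha, c_eff, 1.1, 2.0e9)
p_quad = 4 * dynamic_power(alpha, c_eff, 0.8, 0.5e9)
print(f"{p_single:.3f} W vs {p_quad:.3f} W")  # 0.484 W vs 0.256 W
```

The quadratic dependence on Vdd is why scaling down toward near-threshold voltages while adding cores can cut power at constant throughput, up to the limits set by parallelizability and the reliable operating region.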
Performance and energy efficiency considerations have shifted computing paradigms from single-core to many-core architectures. At the same time, traditional speedup models such as Amdahl's Law face challenges in run-time reasoning about system performance and energy efficiency, because these models typically assume limited variation of the parallel fraction. Moreover, the parallel fraction, which varies dynamically across workloads, is generally unknown at run-time without application-level instrumentation. This paper describes novel performance/energy trade-off models based on realistic architectural considerations, which express the parallel fraction and speedup as functions of performance counter values available in modern processors, removing the need for application-level instrumentation. These are then used to develop a Parallelization-Aware Run-time Management (PARMA) approach. PARMA aims to control core allocations and operating voltage/frequency points for energy efficiency, according to the varying workload parallel fractions. The efficacy of our models and the PARMA approach is extensively validated using a number of PARSEC benchmark applications, involving two performance/energy trade-off metrics: energy-delay product (EDP), typically used in high-performance applications, and energy per instruction (EPI), suitable for energy-aware applications. Up to 48% and 68% improvements in EDP and EPI, respectively, have been observed using the PARMA approach compared with parallelization-agnostic methods.
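The two trade-off metrics named above are simple functions of quantities a run-time manager can obtain from energy sensors and performance counters. The sketch below uses illustrative numbers (not PARSEC measurements) to show how the two metrics can prefer different configurations of the same workload.

```python
# Energy-delay product and energy per instruction, the two metrics used
# to compare system configurations. Numbers below are illustrative.

def edp(energy_j, delay_s):
    """Energy-delay product: favours configurations both fast and frugal."""
    return energy_j * delay_s

def epi(energy_j, instructions):
    """Energy per instruction: favours efficiency regardless of speed."""
    return energy_j / instructions

# Two hypothetical configurations running the same 6G-instruction workload:
fast = {"energy": 12.0, "delay": 2.0, "instr": 6.0e9}    # more cores, high V/f
frugal = {"energy": 8.0, "delay": 4.0, "instr": 6.0e9}   # fewer cores, low V/f

print(edp(fast["energy"], fast["delay"]),      # 24.0 -> fast wins on EDP
      edp(frugal["energy"], frugal["delay"]))  # 32.0
print(epi(fast["energy"], fast["instr"]),      # 2.0e-9 J/instr
      epi(frugal["energy"], frugal["instr"]))  # ~1.33e-9 -> frugal wins on EPI
```

That the two metrics can disagree is exactly why the abstract distinguishes high-performance (EDP) from energy-aware (EPI) use cases.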
Designing energy-efficient hardware continues to be challenging due to the complexity of arithmetic circuits. The problem is further exacerbated in systems powered by energy harvesters, as variable power levels can limit their computation capabilities. In this work, we propose a run-time configurable adaptive approximation method for multiplication that is capable of managing energy and performance trade-offs, making it ideally suited to these systems. Central to our approach is a Significance-Driven Logic Compression (SDLC) multiplier architecture that can dynamically adjust the level of approximation depending on the run-time power/accuracy constraints. The architecture can be configured to operate in the exact mode (no approximation) or in progressively higher approximation modes (i.e. 2- to 4-bit SDLC). Our method is implemented in both ASIC and FPGA. The implementation results indicate that our design has only a 2.3% silicon overhead on top of what is required by a traditional exact multiplier. We evaluate the efficiency of the proposed design through a number of case studies. We show that our method achieves similar image fidelity to existing approximate methods, without a delay penalty. Further, the inclusion of the dynamic approximation techniques is justified by up to 62.6% energy savings when processing an image with a 4-bit SDLC multiplier, and 35% energy savings when using 2-bit SDLC. In addition, case study results show that the proposed approach incurs negligible loss in output quality, with a worst-case PSNR of 30 dB when using the 4-bit SDLC multiplier.
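The accuracy/effort trade-off behind run-time configurable approximation can be illustrated in software. The sketch below is not the SDLC architecture itself (which compresses operand logic by bit significance in hardware); it is a loose analogue that truncates the k least-significant bits of each operand before multiplying, showing how error grows as the approximation level is raised.

```python
# A loose software analogue of configurable approximate multiplication:
# drop the k least-significant bits of each operand, multiply the shorter
# operands, then shift back. NOT the SDLC hardware design, only an
# illustration of trading accuracy for arithmetic effort.

def approx_mul(a, b, k):
    """Approximate a*b by truncating k low-order bits of each operand."""
    return ((a >> k) * (b >> k)) << (2 * k)

exact = 173 * 219                     # 37887, the exact product
for k in (0, 2, 4):
    approx = approx_mul(173, 219, k)
    err = 100 * abs(exact - approx) / exact
    print(k, approx, f"{err:.1f}%")   # k=0 exact; error grows with k
```

As in the abstract's exact-to-4-bit modes, k = 0 reproduces the exact product, while larger k shrinks the effective operand width (and hence hardware effort) at a progressively higher accuracy cost.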