Reinforcement Learning-Based Inter- and Intra-Application Thermal Optimization for Lifetime Improvement of Multicore Systems

Das, Anup; Shafik, Rishad; Merrett, Geoff V.; Al-Hashimi, Bashir M.; Kumar, Akash; Veeravalli, Bharadwaj

doi:10.1145/2593069.2593199

Cited by 78 publications

(77 citation statements)

References 17 publications

“…In [27], a design methodology that minimizes energy consumption of and temperature-induced wear on multiprocessor systems is introduced; yet neither energy nor temperature is modeled with an awareness of uncertainty due to process variation. A similar observation can be made with respect to the work reported in [28] where a reinforcement-learning algorithm is used in order to improve the lifetime of multiprocessor systems. An extensive survey of reliability-aware system-level design techniques given in [26] confirms the trend emphasized above: the widespread device-level models of failure mechanisms generally ignore the impact of process variation on temperature.…”

Section: Previous Work (supporting)

confidence: 70%

“…Similarly, workload uncertainty has not been deprived of attention; see, for instance, [32,88,96,98,105,124]. Aging uncertainty has also been studied extensively in the literature; see, for instance, [24,28,39,50,61,83]. However, certain important problems have not been addressed yet, and in the case of the ones that have been considered, the proposed solutions are often restricted in use, which is due in part to the unrealistic assumptions that these solutions make.…”

Section: Previous Work (mentioning)

confidence: 99%

System-Level Analysis and Design under Uncertainty

Ukhov

One major problem for the designer of electronic systems is the presence of uncertainty, which is due to phenomena such as process and workload variation. Very often, uncertainty is inherent and inevitable. If ignored, it can lead to degradation of the quality of service in the best case and to severe faults or burnt silicon in the worst case. Thus, it is crucial to analyze uncertainty and to mitigate its damaging consequences by designing electronic systems in such a way that uncertainty is effectively and efficiently taken into account. We begin by considering techniques for deterministic system-level analysis and design of certain aspects of electronic systems. These techniques do not take uncertainty into account, but they serve as a solid foundation for those that do. Our attention revolves primarily around power and temperature, as they are of central importance for attaining robustness and energy efficiency. We develop a novel approach to dynamic steady-state temperature analysis of electronic systems and apply it in the context of reliability optimization. We then proceed to develop techniques that address uncertainty. The first technique is designed to quantify the variability in process parameters, which is induced by process variation, across silicon wafers based on indirect and potentially incomplete and noisy measurements. The second technique is designed to study diverse system-level characteristics with respect to the variability originating from process variation. In particular, it allows for analyzing transient temperature profiles as well as dynamic steady-state temperature profiles of electronic systems. This is illustrated by considering a problem of design-space exploration with probabilistic constraints related to reliability. The third technique that we develop is designed to efficiently tackle the case of sources of uncertainty that are less regular than process variation, such as workload variation. This technique is exemplified by analyzing the effect that workload units with uncertain processing times have on the timing-, power-, and temperature-related characteristics of the system under consideration. We also address the issue of runtime management of electronic systems that are subject to uncertainty. In this context, we perform an early investigation into the utility of advanced prediction techniques for the purpose of fine-grained long-range forecasting of resource usage in large computer systems. All the proposed techniques are assessed by extensive experimental evaluations, which demonstrate the superior performance of our approaches to analysis and design of electronic systems compared to existing techniques. The research presented in this thesis has been partially funded by the National Computer Science Graduate School (CUGS) in Sweden.

“…However, as shown in [Faruque et al 2010], these approaches cannot guarantee to minimize a system's thermal overhead effectively for all applications. A cross-layer thermal optimization technique is proposed in [Das et al 2014] to manage temperature-related emergencies. Although these studies have shown improvement in thermal profile leading to extended lifetime reliability using scaled voltage and frequency, thermal cycling and energy consumption are not jointly addressed.…”

Section: Related Work (mentioning)

confidence: 99%

“…As shown in [Das et al 2014], temperature of an embedded system can be controlled significantly by controlling the processor power states (i.e., their voltage and frequency) and the application thread allocation (that limits context switching). However, the amount of thermal control achieved using these control levers is dependent on the application, its cross-layer interaction with the system software and the hardware, and also on the working environment.…”

Section: Motivation For Machine Learning (mentioning)

confidence: 99%

“…As such, static compile-time policies [Rai et al 2011; Schor et al 2013; Das et al 2015a] (with limited knowledge of application-specific variations) are often outperformed, even by naive run-time managers, both in terms of thermal overhead and energy consumption - the two key design aspects of modern systems. This has motivated researchers in recent years to investigate run-time approaches, thriving the development of intelligent run-time systems for energy and thermal management [Cochran et al 2011b; Javaid et al 2011; Juan et al 2013; Ye and Xu 2014; Srinivasan et al 2004; Sharifi et al 2013; Shi et al 2013; Faruque et al 2010; Ge and Qiu 2011; Coskun et al 2009a; Mercati et al 2013; Das et al 2014; Coskun et al 2009b; Ebi et al 2009; Ebi et al 2011; Shen et al 2012].…”

Section: Introduction (mentioning)

confidence: 99%

Adaptive and Hierarchical Runtime Manager for Energy-Aware Thermal Management of Embedded Systems

Das, Al-Hashimi, Merrett

2016

ACM Trans. Embed. Comput. Syst.

Self Cite

Modern embedded systems execute applications that interact with the operating system and hardware differently depending on the type of workload. These cross-layer interactions result in wide variations in the chip-wide thermal profile. In this paper, a reinforcement learning-based run-time manager is proposed that guarantees application-specific performance requirements and controls the POSIX thread allocation and voltage/frequency scaling for energy-efficient thermal management. This controls three thermal aspects: peak temperature, average temperature and thermal cycling. Contrary to existing learning-based run-time approaches that optimize energy and temperature individually, the proposed run-time manager is the first approach to combine the two objectives, simultaneously addressing all three thermal aspects. However, determining thread allocation and core frequencies to optimize energy and temperature is an NP-hard problem. This leads to an exponential growth in the learning table (significant memory overhead) and a corresponding increase in the exploration time to learn the most appropriate thread allocation and core frequency for a particular application workload. To confine the learning space and to minimize the learning cost, the proposed run-time manager is implemented in a two-stage hierarchy: a heuristic-based thread allocation at a longer time interval to improve thermal cycling, followed by a learning-based hardware frequency selection at a much finer interval to improve average temperature, peak temperature and energy consumption. This enables finer control over temperature in an energy-efficient manner, while simultaneously addressing scalability, which is a crucial aspect for multi-/many-core embedded systems. The proposed hierarchical run-time manager is implemented for Linux running on nVidia's Tegra SoC, featuring four ARM Cortex-A15 cores. Experiments conducted with a range of embedded and CPU-intensive applications demonstrate that the proposed run-time manager not only reduces energy consumption by an average of 15% with respect to Linux, but also improves all the thermal aspects: average temperature by 14 °C, peak temperature by 16 °C and thermal cycling by 54%.
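
The learning-based frequency selection stage described in the abstract can be illustrated with a generic tabular Q-learning loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the frequency levels, temperature bin edges, reward shape, and all names (`FREQS_MHZ`, `TEMP_BINS`, `temp_state`, `reward`, `FrequencyLearner`) are hypothetical, and the heuristic thread-allocation stage is not modeled.

```python
import random

# Hypothetical frequency levels and temperature bin edges (assumptions,
# not values taken from the paper).
FREQS_MHZ = [600, 1000, 1400, 1800]
TEMP_BINS = [50.0, 60.0, 70.0]  # degrees Celsius


def temp_state(temp_c):
    """Map a temperature reading to a discrete state index (0..len(TEMP_BINS))."""
    return sum(temp_c >= edge for edge in TEMP_BINS)


def reward(temp_c, energy_j, met_deadline, w_temp=1.0, w_energy=1.0):
    """Assumed reward: fixed penalty on a missed performance requirement,
    otherwise the negative weighted sum of temperature and energy."""
    if not met_deadline:
        return -100.0
    return -(w_temp * temp_c + w_energy * energy_j)


class FrequencyLearner:
    """Tabular Q-learning over (temperature state, frequency action) pairs."""

    def __init__(self, n_states, n_actions, alpha=0.5, gamma=0.9,
                 epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of a frequency index for the given state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        row = self.q[state]
        return max(range(self.n_actions), key=row.__getitem__)

    def update(self, state, action, r, next_state):
        """Standard Q-learning update toward r + gamma * max_a Q(next_state, a)."""
        td_target = r + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])


# One illustrative decision step with made-up sensor readings.
learner = FrequencyLearner(n_states=len(TEMP_BINS) + 1, n_actions=len(FREQS_MHZ))
s = temp_state(63.0)
a = learner.choose(s)
learner.update(s, a, reward(61.0, 2.5, met_deadline=True), temp_state(61.0))
```

The table has one row per temperature bin and one column per frequency level, so restricting the learner to frequency selection keeps it small. This loosely mirrors the motivation the abstract gives for the two-stage hierarchy.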
