Abstract: First, I want to thank my supervisor Prof. Thambipillai Srikanthan for giving me the opportunity of undertaking a Ph.D. under his sage guidance. His vision and insightful advice have shaped this endeavor into a fruitful culmination, from choosing the topic to developing ideas, identifying missing pieces, and writing the thesis. The motivation and kind support that he has provided have enabled the smooth continuation of the Ph.D., especially around the birth of my second daughter. This thesis would not hav…
“…Some DRM research accounts for thermal cycling as a controllable metric [22], while others focus on electromigration [37], and others also consider the time-dependent dielectric breakdown [35,36] or on negative bias temperature instability [24,27]. Only one work accounts for process variation [38] and one for hot carry injection [24].…”
The advent of manycore systems has led to the need for efficient dynamic thermal and reliability management techniques to increase system reliability. Increasing power density and thermal hotspots in manycore systems pose significant challenges to reliability and performance. Existing techniques often fail to scale effectively or consider long-term reliability impacts. This work aims to develop a lightweight and scalable management strategy for manycore systems that integrates dynamic thermal management (DTM) and dynamic reliability management (DRM) using application mapping and task migration. The primary contribution is the introduction of the FIT-aware Learning Heuristic for Application Allocation (FLEA), which leverages Q-learning to optimize task allocation and migration based on Failure In Time (FIT) monitoring. FLEA operates in two phases: a design phase that uses Q-learning to train a policy table (Q-table) and a runtime phase that utilizes this Q-table to make decisions on task allocation and migration. The Q-table is populated with values representing the best task deployment patterns, minimizing thermal hotspots and maximizing system reliability. The evaluation of FLEA demonstrates improvements over state-of-the-art techniques. FLEA effectively reduces the thermal amplitude, peak temperature, and spatial thermal distribution, resulting in enhanced Mean Time To Failure (MTTF) for the system.
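The two-phase structure described above (a design phase that trains a Q-table, then a runtime phase that only reads it) can be illustrated with a minimal tabular Q-learning sketch. Everything specific here is an assumption made for illustration and is not taken from FLEA itself: the 3x3 mesh, the state encoding (the set of busy cores), the action (the core that receives the next task), and above all the reward, which is a toy stand-in that penalizes placing a task next to busy cores instead of using a real FIT model. The helper `mttf_hours_from_fit` uses only the standard definition of FIT as failures per 10^9 device-hours under a constant failure rate.

```python
import random

GRID = 3                  # hypothetical 3x3 manycore mesh
N_CORES = GRID * GRID
N_TASKS = 4               # tasks placed per training episode


def neighbors(core):
    """Return the 4-connected mesh neighbors of a core index."""
    r, c = divmod(core, GRID)
    out = []
    if r > 0:
        out.append(core - GRID)
    if r < GRID - 1:
        out.append(core + GRID)
    if c > 0:
        out.append(core - 1)
    if c < GRID - 1:
        out.append(core + 1)
    return out


def reward(busy, core):
    """Toy proxy for a FIT penalty (assumption): -1 per busy neighbor."""
    return float(-sum(n in busy for n in neighbors(core)))


def free_cores(busy):
    return [c for c in range(N_CORES) if c not in busy]


def train(episodes=5000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Design phase: fill a Q-table with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = {}  # (frozenset of busy cores, chosen core) -> value
    for _ in range(episodes):
        busy = frozenset()
        for _ in range(N_TASKS):
            free = free_cores(busy)
            if rng.random() < epsilon:
                a = rng.choice(free)  # explore
            else:
                a = max(free, key=lambda c: q.get((busy, c), 0.0))  # exploit
            nxt = busy | {a}
            done = len(nxt) == N_TASKS
            future = 0.0 if done else max(
                q.get((nxt, c), 0.0) for c in free_cores(nxt))
            old = q.get((busy, a), 0.0)
            # Standard Q-learning update toward reward + discounted best next value.
            q[(busy, a)] = old + alpha * (reward(busy, a) + gamma * future - old)
            busy = nxt
    return q


def allocate(q, n_tasks=N_TASKS):
    """Runtime phase: table lookups only, no further learning."""
    busy, placement = frozenset(), []
    for _ in range(n_tasks):
        a = max(free_cores(busy), key=lambda c: q.get((busy, c), 0.0))
        placement.append(a)
        busy = busy | {a}
    return placement


def adjacent_pairs(placement):
    """Count pairs of occupied cores that are mesh neighbors."""
    busy = set(placement)
    return sum(n in busy for c in placement for n in neighbors(c)) // 2


def mttf_hours_from_fit(fit):
    """FIT counts failures per 1e9 device-hours, so MTTF = 1e9 / FIT hours."""
    return 1e9 / fit


if __name__ == "__main__":
    table = train()
    cores = allocate(table)
    print("placement:", cores, "adjacent pairs:", adjacent_pairs(cores))
```

The split keeps the expensive part (exploration and value updates) offline, so the runtime cost per decision is a handful of dictionary lookups. In this sketch, packing four tasks into cores 0 to 3 leaves three adjacent pairs, whereas the learned table spreads the tasks so that no two occupied cores touch. A realistic reward would be derived from monitored temperature and FIT values, not from a neighbor count.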