On the Measurement of Safe Fault Failure Rates in High-Performance Compute Processors

Bramley, Richard; Huang, Yuanding; Duan, Guangshan; Saxena, N.R.; Racunas, Paul

doi:10.1109/itc44778.2020.9325239

Cited by 2 publications

(2 citation statements)

References 12 publications

“…To this purpose, we exploit fine grained kernels profiling and machine learning (ML) methodologies. Previous work exists on exploiting ML approaches for deriving predictive models for latency deterioration caused by memory conflicts [21]-[25]. Saeed et al in [21] proposed a mechanism that is able to predict the execution time of two co-running applications in a multicore processor; their predictive model is based on hardware performance events that have been previously selected using the Spearman correlation coefficient.…”

Section: Related Work (mentioning)

confidence: 99%

“…Since the profiling tools provided in Nvidia embedded boards allow the user to collect metrics, counters and other execution facts in a much finer granularity, we argue that those are paramount instruments to exploit. Such tools, although in a different context, have been extensively used in Bramley et al [25]. The authors conducted a detailed analysis in the field of road vehicles and functional safety to determine whether bit faults in SDRAM cause computational errors in GPU kernels.…”

Section: Related Work (mentioning)

confidence: 99%

Machine Learning Techniques for Understanding and Predicting Memory Interference in CPU-GPU Embedded Systems

Masola, Capodieci, Rouxel et al. 2023

2023 IEEE 29th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)

Nowadays, heterogeneous embedded platforms are extensively used in various low-latency applications, including the automotive industry, real-time IoT systems, and automated factories. These platforms utilize specific components, such as CPUs, GPUs, and neural network accelerators, for efficient task processing and to solve specific problems with lower power consumption than more traditional systems. However, since these accelerators share resources such as the global memory, it is crucial to understand how workloads behave under high computational loads to determine how parallel computational engines on modern platforms can interfere and adversely affect the system's predictability and performance. One area that remains unclear is the interference effect on shared memory resources between the CPU and GPU: more specifically, the latency degradation experienced by GPU kernels when memory-intensive CPU applications run concurrently. In this work, we first analyze the metrics that characterize the behavior of different kernels under various board conditions caused by CPU memory-intensive workloads on an Nvidia Jetson Xavier. Then, we exploit various machine learning methodologies aiming to estimate the latency degradation of kernels based on their metrics. As a result, we are able to identify the metrics that could potentially have the most significant impact when predicting the kernels' completion latency degradation.
