2013
DOI: 10.1145/2490301.2451128

Production-run software failure diagnosis via hardware performance counters

Abstract: Sequential and concurrency bugs are widespread in deployed software. They cause severe failures and huge financial loss during production runs. Tools that diagnose production-run failures with low overhead are needed. The state-of-the-art diagnosis techniques use software instrumentation to sample program properties at run time and use off-line statistical analysis to identify properties most correlated with failures. Although promising, these techniques suffer from high run-time overhead, which is sometimes o…



Cited by 6 publications (13 citation statements)
References 41 publications (69 reference statements)
“…The high detector shows a limited performance overhead [30]. The detection results are shown in Figure 5. After adding the high detector, the SDC rate decreased from 20.49% to 4.45%, showing that the high detector's detection effect was clearly positive.…”
Section: Proposed Detection Mechanisms
confidence: 99%
“…In other words, although sampling collects less data from each run at each end-user, to achieve statistical significance, more runs/end-users need to be involved and their data need to be transferred, leading to increased latency for failure diagnosis and delayed patch design. For example, under the common 1/100 or 1/1000 sampling rate, hundreds or thousands more failure runs need to be traced before sufficient predicates get sampled to produce statistically meaningful results [4,19,23]. Furthermore, a whole-program sampling infrastructure may lead to a large baseline overhead (e.g., more than 50%) that cannot be amortized through sampling [6].…”
Section: Problems and Motivation
confidence: 99%
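The sampling arithmetic in the statement above can be illustrated with a minimal sketch. The 1/100 and 1/1000 rates come from the quoted passage; the baseline run count is a hypothetical figure chosen for illustration:

```python
def runs_needed(base_runs: int, sampling_rate: float) -> int:
    """Estimate how many failure runs must be traced under sampling to
    collect, on average, as many predicate observations as full tracing
    would yield from `base_runs` runs."""
    return round(base_runs / sampling_rate)

# Suppose full tracing of 10 failure runs yields statistically
# significant predicate counts (hypothetical baseline).
print(runs_needed(10, 1 / 100))   # 100x more runs at a 1/100 rate
print(runs_needed(10, 1 / 1000))  # 1000x more runs at a 1/1000 rate
```

This is the trade-off the quoted passage describes: lowering the sampling rate reduces per-run overhead but multiplies the number of failure runs that must be collected before diagnosis is possible.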
“…The outcomes of these predicates are obtained through software instrumentation or hardware support [4], and constitute the profile of each run. Finally, a profile consists of a set of predicate counts, each recording the number of times a predicate is observed true during the run.…”
Section: Predicates
confidence: 99%
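The profile structure described in the quoted passage, a set of predicate counts recording how often each predicate was observed true during a run, can be sketched as follows. The predicate names and observed values here are hypothetical:

```python
from collections import Counter

# One run's profile: predicate name -> number of times observed true.
profile: Counter = Counter()

def record(predicate_name: str, observed_true: bool) -> None:
    """Increment the predicate's count when it evaluates true."""
    if observed_true:
        profile[predicate_name] += 1

# Simulated observations from a single run (hypothetical predicates).
for x in [3, -1, 7, -2]:
    record("x_is_negative", x < 0)
    record("x_exceeds_5", x > 5)

print(profile["x_is_negative"])  # observed true twice (-1 and -2)
print(profile["x_exceeds_5"])    # observed true once (7)
```

Off-line statistical analysis then compares such profiles from failing and successful runs to find the predicates most correlated with failure.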