Triage

Tucek, Joseph; Lu, Shan; Huang, Chengdu; Xanthos, Spiros; Zhou, Yuanyuan

doi:10.1145/1294261.1294275

Cited by 120 publications

(11 citation statements)

References 37 publications

(34 reference statements)


“…Triage [Tucek et al 2007] uses dynamic slicing to diagnose failures at the user's site, which obviates privacy concerns. Despite that, it has limited support for concurrency bugs, being able to provide root cause isolation only for multithreaded programs running on uniprocessors.…”

Section: Related Work

mentioning

confidence: 99%

Concurrency Debugging with Differential Schedule Projections

Machado, Quinta, Lucia, et al. 2016

ACM Trans. Softw. Eng. Methodol.


We present Symbiosis: a concurrency debugging technique based on novel differential schedule projections (DSPs). A DSP shows the small set of memory operations and dataflows responsible for a failure, as well as a reordering of those elements that avoids the failure. To build a DSP, Symbiosis first generates a full, failing, multithreaded schedule via thread path profiling and symbolic constraint solving. Symbiosis selectively reorders events in the failing schedule to produce a nonfailing, alternate schedule. A DSP reports the ordering and dataflow differences between the failing and nonfailing schedules. Our evaluation on buggy real-world software and benchmarks shows that, in practical time, Symbiosis generates DSPs that both isolate the small fraction of event orders and dataflows responsible for the failure and report which event reorderings prevent failing. In our experiments, DSPs contain 90% fewer events and 96% fewer dataflows than the full failure-inducing schedules. We also conducted a user study that shows that, by allowing developers to focus on only a few events, DSPs reduce the amount of time required to understand the bug's root cause and find a valid fix.

CCS Concepts: • Software and its engineering → Software testing and debugging;


“…An open-source flight simulator has been used to assess the proposal. In Tucek et al (2007) authors propose a system, called Triage, that automatically performs onsite software failure diagnosis. The system makes use of both kernel-level components and multiple re-executions of the target software to support failure diagnosis; during each re-execution, detailed data are collected via dynamic binary instrumentation to conduct the analysis of occurred failure and its causes.…”

Section: Code Instrumentation Approaches

mentioning

confidence: 99%

“…In addition, the approach in Hiller et al (2004) requires measuring the error permeability for each input of each module, leading to a low scalability of the approach; while the tool (Hiller et al 2002a) addresses only single process software. The system proposed in Tucek et al (2007) uses kernel-level components and dynamic binary instrumentation, which is not allowed in critical production environments (e.g., mission critical systems) with stringent constraints imposed by certification standards and the use of obsolete kernel versions. Finally, the approaches (Hiller et al 2004;2002a;Johansson and Suri 2005) only address data errors, while those presented in Johansson and Suri (2005) and Calhoun et al (2017) are conceived only for OS device drivers and MPI applications, respectively.…”

Section: Code Instrumentation Approaches

mentioning

confidence: 99%

“…To this aim, many existing approaches rely on quite convoluted data sources that entail a substantial degree of system internals' knowledge and source code visibility. For example, Jhumka and Leeke (2011), Abdelmoez et al (2004), Popic et al (2005), Cortellessa and Grassi (2007), and Voas (1997) require operation details, such as states and failure rates, for each system component, Hiller et al (2004), Hiller et al (2002a), Leeke and Jhumka (2010), and Michael and Jones (1997) leverage data obtained by instrumenting variables, while Tucek et al (2007) uses dynamic binary instrumentation.…”

mentioning

confidence: 99%


An empirical analysis of error propagation in critical software systems

2020


Error propagation analysis is a consolidated practice to gain insights into error modes and effects that pertain to the activation of faults in software systems. A variety of approaches, such as architecture-based, source code instrumentation and variable tracing, have been proposed so far to address software error propagation analysis. Although valuable, existing approaches entail a substantial degree of system internals' knowledge, visibility and code manipulation that is not well-suited for real-life production environments. This paper proposes an empirical analysis of error propagation. We specifically address the challenges in using fault data and error events in the logs, which are a convenient byproduct of the system's execution. The approach puts forth the construction of error reporting graphs. We apply the approach to 2,042 failure data points from two real-world critical systems from the Air Traffic Control domain by a top industry provider. The approach contributes to develop a deep understanding on error modes and propagation paths, which can be leveraged by practitioners to make informed decisions on the placement of error detection mechanisms.

Keywords: Error analysis • Error propagation • Critical systems • Monitoring

Introduction: Error propagation analysis is a consolidated practice to gain insights into the dependability of software systems. It allows to infer error modes, intermediate paths and effects


“…Software contains latent bugs [Tucek et al 2007]. Although software testing helps identify these bugs, the schedule pressure often causes vendors to release software without comprehensive testing.…”

Section: Introduction

mentioning

confidence: 99%

WATCHER: in-situ failure diagnosis

Liu, Silvestro, Zhang, et al. 2020

Proc. ACM Program. Lang.


Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposing a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named Watcher, that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support together to perform the diagnosis without human involvement. It further proposes two optimizations to reduce the diagnosis time and diagnose failures with control flow hijacks. Watcher can be easily deployed, without requiring custom hardware or operating system, program modification, or recompilation. We evaluate Watcher with 24 program failures in real-world deployed software, including large-scale applications, such as Memcached, SQLite, and OpenJPEG. Experimental results show that Watcher can accurately identify the root causes in only a few seconds.

CCS Concepts: • Software and its engineering → Software testing and debugging; Dynamic analysis.
