Pervasive detection of process races in deployed systems

Laadan, Oren; Viennot, Nicolas; Tsai, Chia-Che; Blinn, Chris; Yang, Junfeng; Nieh, Jason

doi:10.1145/2043556.2043589

Cited by 23 publications

(19 citation statements)

References 32 publications


“…First, CRANE can be leveraged by other replication concepts (e.g., byzantine fault tolerance [22,38]) and record-replay [39,41,46] because they also suffer from nondeterminism. Second, promising results in REPFRAME [30] have shown that CRANE's transparent replication architecture can enable multiple types of program analysis tools within one execution, making a server program enjoy benefits of multiple analyses.…”

Section: Applications (mentioning)

confidence: 99%

Paxos made transparent

Cui, Liu, et al. 2015

Proceedings of the 25th Symposium on Operating Systems Principles

Self Cite


State machine replication (SMR) leverages distributed consensus protocols such as PAXOS to keep multiple replicas of a program consistent in the face of replica failures or network partitions. This fault tolerance makes it enticing to implement a principled SMR system that replicates general programs, especially server programs that demand high availability. Unfortunately, SMR assumes deterministic execution, but most server programs are multithreaded and thus nondeterministic. Moreover, existing SMR systems provide narrow state machine interfaces to suit specific programs, and fitting a general program into these interfaces can be strenuous and error-prone. This paper presents CRANE, an SMR system that transparently replicates general server programs. CRANE achieves distributed consensus on the socket API, a common interface to almost all server programs. It leverages deterministic multithreading (specifically, our prior system PARROT) to make multithreaded replicas deterministic. It uses a new technique we call time bubbling to efficiently tackle the difficult challenge of nondeterministic network input timing. Evaluation on five widely used server programs (e.g., Apache, ClamAV, and MySQL) shows that CRANE is easy to use, has moderate overhead, and is robust. CRANE's source code is at github.com/columbia/crane.


“…However, this approach can be both ineffective and inefficient as it cannot determine the location at which to issue an interrupt. Laadan et al [10] note the importance of process races and design a technique to detect and replay them. However, their technique can produce false negatives, and cause replay divergence failures in which the actual system environment does not match the replayed execution.…”

Section: B Related Work (mentioning)

confidence: 99%

An observable and controllable testing framework for modern systems

Yu

2013

2013 35th International Conference on Software Engineering (ICSE)


Modern computer systems are prone to various classes of runtime faults due to their reliance on features such as concurrency and peripheral devices such as sensors. Testing remains a common method for uncovering faults in these systems. However, commonly used testing techniques that execute the program with test inputs and inspect program outputs to detect failures are often ineffective. To test for concurrency and temporal faults, test engineers need to be able to observe faults as they occur instead of relying on observable incorrect outputs. Furthermore, they need to be able to control thread or process interleavings so that they are deterministic. This research will provide a framework that allows engineers to effectively test for subtle and intermittent faults in modern systems by providing them with greater observability and controllability.


“…Our previous work on Scribe [16] replays a recorded application execution until a specified point, and then transitions to live execution instead of replaying the rest of the log. Our previous work on Racepro [17] detects process races due to dependencies in the ordering of system calls by recording an application execution to a log, identifying a pair of system calls that may be racy, truncating the log at the occurrence of the pair of system calls, inverting their order, and then replaying the truncated log with the reordered system calls to detect process races. However, Racepro only supports changes that reorder system calls and does not support changes in the middle of replay.…”

Section: Related Work (mentioning)

confidence: 99%

Transparent mutable replay for multicore debugging and patch validation

2013

Self Cite


We present DORA, a mutable record-replay system which allows a recorded execution of an application to be replayed with a modified version of the application. This feature, not available in previous record-replay systems, enables powerful new functionality. In particular, DORA can help reproduce, diagnose, and fix software bugs by replaying a version of a recorded application that is recompiled with debugging information, reconfigured to produce verbose log output, modified to include additional print statements, or patched to fix a bug. DORA uses lightweight operating system mechanisms to record an application execution by capturing nondeterministic events to a log without imposing unnecessary timing and ordering constraints. It replays the log using a modified version of the application even in the presence of added, deleted, or modified operations that do not match events in the log. DORA searches for a replay that minimizes differences between the log and the replayed execution of the modified program. If there are no modifications, DORA provides deterministic replay of the unmodified program. We have implemented a Linux prototype which provides transparent mutable replay without recompiling or relinking applications. We show that DORA is useful for reproducing, diagnosing, and fixing software bugs in real-world applications, including Apache and MySQL. Our results show that DORA (1) captures bugs and replays them with applications modified or reconfigured to produce additional debugging output for root cause diagnosis, (2) captures exploits and replays them with patched applications to validate that the patches successfully eliminate vulnerabilities, (3) records production workloads and replays them with patched applications to validate patches with realistic workloads, and (4) maintains low recording overhead on commodity multicore hardware, making it suitable for production systems.
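The citation statement for this entry summarizes the Racepro workflow as: record an execution to a log, identify a pair of system calls that may be racy, truncate the log at that pair, invert their order, and replay. A minimal Python sketch of the "identify, truncate, invert" steps might look like the following. Everything here is a hypothetical illustration and not the Racepro implementation: the `SyscallEvent` record, the `find_racy_pairs` and `truncate_and_invert` helpers, and the naive conflict rule (different processes, same resource, at least one write) are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SyscallEvent:
    """Hypothetical log record: one system call made by one process."""
    pid: int
    name: str
    resource: str   # e.g. a path; assumed to identify the shared object
    writes: bool    # True if the call modifies the resource


def find_racy_pairs(log):
    """Return index pairs (i, j), i < j, that may race.

    Assumed rule: calls from different processes touch the same
    resource and at least one of them modifies it.
    """
    pairs = []
    for i, j in combinations(range(len(log)), 2):
        a, b = log[i], log[j]
        if a.pid != b.pid and a.resource == b.resource and (a.writes or b.writes):
            pairs.append((i, j))
    return pairs


def truncate_and_invert(log, i, j):
    """Build a truncated log that ends with the pair in inverted order.

    Events between the pair that belong to the first call's process are
    dropped, since they cannot run before that call; everything after
    the pair is cut off.
    """
    first, second = log[i], log[j]
    between = [e for e in log[i + 1:j] if e.pid != first.pid]
    return log[:i] + between + [second, first]


recorded = [
    SyscallEvent(1, "stat", "/tmp/lock", False),
    SyscallEvent(2, "getpid", "proc", False),
    SyscallEvent(1, "write", "/tmp/out", True),
    SyscallEvent(2, "unlink", "/tmp/lock", True),
    SyscallEvent(1, "close", "/tmp/out", False),
]

for i, j in find_racy_pairs(recorded):
    reordered = truncate_and_invert(recorded, i, j)
    print([e.name for e in reordered])
```

A replay engine would then execute the reordered log and check whether the application fails once it continues past the inverted pair; that step depends on the recording infrastructure and is not modeled here.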
