Abstract: We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs from four widely deployed cloud-scale datacenter distributed systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. We analyze DC-bug characteristics along several axes, such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, collectively stored as 2,083 classification labe…
“…Concurrency error detection. Another group of works focuses on detecting concurrency errors, including Razzer [9] and TaxDC [11] [1]. Existing methods are mostly based either on monitoring memory accesses to find concurrency errors or on static analysis of code fragments.…”
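As an illustration of the memory-access-monitoring approach mentioned in the excerpt, the classic lockset check (in the style of the Eraser race detector) tracks, for every shared variable, the set of locks held on every access so far; if that set becomes empty, no single lock consistently protects the variable. The sketch below is a simplified version written for illustration only. It is not the method of Razzer or TaxDC, all class and method names are hypothetical, and it deliberately omits Eraser's per-variable state machine, so it also flags unsynchronized single-thread initialization.

```python
from collections import defaultdict
from typing import Dict, Set


class LocksetDetector:
    """Simplified Eraser-style lockset race detector (illustrative sketch).

    Assumption: no per-variable state machine, so even unsynchronized
    initialization by a single thread is reported as a race.
    """

    def __init__(self) -> None:
        self.held: Dict[str, Set[str]] = defaultdict(set)  # thread -> locks held now
        self.candidates: Dict[str, Set[str]] = {}          # variable -> candidate lockset
        self.races: Set[str] = set()                       # variables with no common lock

    def acquire(self, thread: str, lock: str) -> None:
        self.held[thread].add(lock)

    def release(self, thread: str, lock: str) -> None:
        self.held[thread].discard(lock)

    def access(self, thread: str, var: str) -> None:
        """Record a read or write of `var` by `thread` and refine its lockset."""
        held = self.held[thread]
        if var not in self.candidates:
            # First access: any lock held right now might be the protecting lock.
            self.candidates[var] = set(held)
        else:
            # Keep only the locks that were held on every access so far.
            self.candidates[var] &= held
        if not self.candidates[var]:
            self.races.add(var)  # no single lock consistently protects var


# Usage: x is accessed once without any lock; y is always guarded by lock m.
detector = LocksetDetector()
detector.acquire("T1", "m")
detector.access("T1", "x")
detector.release("T1", "m")
detector.access("T2", "x")
detector.acquire("T1", "m")
detector.access("T1", "y")
detector.release("T1", "m")
detector.acquire("T2", "m")
detector.access("T2", "y")
detector.release("T2", "m")
print(detector.races)  # {'x'}
```

Lockset checking reports potential races even when the bad interleaving never actually occurred in the observed run, which is why it is attractive for non-deterministic bugs; the price is false positives, which the omitted state machine is meant to reduce.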
With the increasing popularity of multi-core processors and multithreaded languages and frameworks, race conditions, which are nondeterministic by nature, are becoming a major root cause of concurrency bugs. They open the door to malicious attacks such as remote code execution and denial of service, potentially putting millions of users in danger. Yet such nondeterministic race conditions are often difficult to identify or reproduce in standard program testing. In this paper, we focus on the garbage-collection (GC) feature, which is known to be a frequent victim of concurrency bugs in many software systems. We develop a new approach that facilitates the testing of GC-related bugs through critical condition restoration. In particular, we propose a risk-score mechanism that quantifies the risk of GC-related bugs in target functions, and we use the score to select testing parameters and a garbage generation strategy that are more likely to produce the critical condition. Our experimental results show that the proposed approach can significantly improve the probability of finding GC-related bugs (from 0 bugs with condition-oblivious testing to 14.8 bugs identified in our experiment) while incurring only around 26% execution overhead. CCS CONCEPTS • Software and its engineering → Software testing and debugging; Garbage collection; Software reliability.
“…Semantic-aware model checking (SAMC) [49] verifies distributed systems by supplying a system-specific test and monitoring harness on top of a framework for verifying networked systems. In that framework, all possible schedules (interleavings) of networked messages are tested, based on a given system test that executes a particular sequence of API calls for each client.…”
Section: Related Work
“…In that framework, all possible schedules (interleavings) of networked messages are tested, based on a given system test that executes a particular sequence of API calls for each client. Existing work has applied this approach to a model of ZooKeeper [49] that is written in Java but is about an order of magnitude smaller than the real implementation. In contrast, our work tests many different sequences of API calls but executes one thread schedule each time, on the actual implementation of ZooKeeper rather than on a model of it.…”
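The excerpt contrasts testing all message interleavings for one fixed test with running many tests under a single schedule each. A minimal sketch of the exhaustive side is shown below: it enumerates every global delivery order that preserves each sender's FIFO order and runs a model on each one. The function names and the single-znode toy model are hypothetical illustrations, not SAMC's harness or ZooKeeper's semantics; production model checkers typically add reduction techniques to tame the combinatorial growth, and none are shown here.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Delivery = Tuple[str, str]  # (sender, message)


def interleavings(queues: Dict[str, List[str]]) -> Iterator[List[Delivery]]:
    """Yield every global delivery order that keeps each sender's messages in FIFO order."""
    senders = [s for s, q in queues.items() if q]
    if not senders:
        yield []
        return
    for s in senders:
        remaining = dict(queues)
        remaining[s] = queues[s][1:]  # deliver s's oldest pending message next
        for tail in interleavings(remaining):
            yield [(s, queues[s][0])] + tail


def explore(
    queues: Dict[str, List[str]],
    apply: Callable[[List[Delivery]], object],
    invariant: Callable[[object], bool],
) -> List[List[Delivery]]:
    """Run the model on every schedule; return the schedules that violate the invariant."""
    return [sch for sch in interleavings(queues) if not invariant(apply(sch))]


# Hypothetical toy model of one znode: "set" fails if the node does not exist.
# For simplicity, "delete" on a missing node is silently ignored.
def apply_ops(schedule: List[Delivery]) -> List[Delivery]:
    exists, errors = False, []
    for sender, op in schedule:
        if op == "create":
            exists = True
        elif op == "delete":
            exists = False
        elif op == "set" and not exists:
            errors.append((sender, op))
    return errors


queues = {"c1": ["create", "set"], "c2": ["delete"]}
print(len(list(interleavings(queues))))  # 3
print(explore(queues, apply_ops, lambda errs: not errs))
# [[('c1', 'create'), ('c2', 'delete'), ('c1', 'set')]]
```

Only one of the three orders exposes the failure, which is the reason a single fixed schedule can miss it: with senders holding a and b pending messages, there are C(a+b, a) orders to cover.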
In this paper, we extend work on model-based testing for Apache ZooKeeper to handle watchers (triggers) and to improve scalability. In an asynchronous distributed shared storage system such as ZooKeeper, watchers deliver notifications of state changes. They are difficult to test because a watcher notification involves an initial action that sets the watcher, followed by another action that changes the previously seen state. We show how to generate test cases for concurrent client sessions executing against ZooKeeper with the tool Modbat. The tests are verified against an oracle that takes into account all possible timings of network communication. The oracle has to verify that there exists a chain of events that triggers both the initial callback and the subsequent watcher notification. We show in detail how the oracle computes whether watch triggers are correct and how the model was adapted and improved to handle these features. Together with a new search improvement that increases both speed and accuracy, this allows us to verify large test setups and confirm several defects with our model.
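The oracle idea described above (accept an observed outcome only if some chain of events could have produced it) can be sketched as a search over all orderings of concurrent session operations against a small model of one-shot data watches. The sketch below is a simplified illustration under stated assumptions, not Modbat's oracle: the names are hypothetical, there is a single znode, a write fires every pending watch once and clears it, and every notification is assumed to have been delivered by the time the oracle runs (the real oracle must also account for notifications still in flight).

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Op = Tuple[str, Optional[str]]  # ("get_watch", None) or ("set", new_value)


def interleavings(sessions: Dict[str, List[Op]]) -> Iterator[List[Tuple[str, Op]]]:
    """Yield every global order that preserves each session's own program order."""
    active = [s for s, ops in sessions.items() if ops]
    if not active:
        yield []
        return
    for s in active:
        rest = dict(sessions)
        rest[s] = sessions[s][1:]
        for tail in interleavings(rest):
            yield [(s, sessions[s][0])] + tail


def run(schedule: List[Tuple[str, Op]], initial: str) -> Tuple[Dict[str, Optional[str]], Set[str]]:
    """Toy single-znode model with one-shot data watches."""
    data: Optional[str] = initial
    watches: Set[str] = set()             # sessions with a pending watch
    seen: Dict[str, Optional[str]] = {}   # value each session read
    notified: Set[str] = set()            # sessions whose watch fired
    for session, (kind, value) in schedule:
        if kind == "get_watch":
            seen[session] = data          # read current data and register a watch
            watches.add(session)
        elif kind == "set":
            data = value
            notified |= watches           # every pending watch fires...
            watches.clear()               # ...and is then removed (one-shot)
    return seen, notified


def oracle(sessions: Dict[str, List[Op]], initial: str,
           seen: Dict[str, Optional[str]], notified: Set[str]) -> bool:
    """Accept an observation if at least one timing could have produced it."""
    return any(run(s, initial) == (seen, notified) for s in interleavings(sessions))


sessions = {"A": [("get_watch", None)], "B": [("set", "v1")]}
print(oracle(sessions, "v0", {"A": "v0"}, {"A"}))  # True: read before the write, then notified
print(oracle(sessions, "v0", {"A": "v1"}, set()))  # True: read after the write, no later change
print(oracle(sessions, "v0", {"A": "v0"}, set()))  # False: stale read, yet the watch never fired
```

The third case shows why the initial callback and the notification must be checked together: each observation is plausible on its own, but no single ordering explains both.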
Over the years, organizations have acquired disparate software systems, each answering one specific need. Currently, the desirable outcomes of integrating these systems (higher degrees of automation and better system consistency) are often outweighed by the complexity of mitigating their discrepancies. These problems are magnified in decentralized settings (e.g., cross-organizational cases), where integration is usually handled with ad hoc "glue" connectors, each integrating two or more systems. Since the overall logic of the integration is spread among many glue connectors, these solutions are difficult to program correctly (making them prone to misbehaviors and system blocks), maintain, and evolve. In response to these problems, we propose ChIP, an integration process that advocates choreographic programs as intermediate artifacts to refine high-level global specifications (e.g., UML Sequence Diagrams), defined by the domain experts of each partner, into concrete, distributed implementations. In ChIP, once the stakeholders agree upon a choreographic integration design, they can automatically generate the respective local connectors, which are guaranteed to faithfully implement the described distributed logic. In the paper, we illustrate ChIP with a pilot from the EU EIT Digital project SMAll, aimed at integrating pre-existing systems from government, university, and the transport industry.
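Generating local connectors from an agreed choreography rests, in choreographic programming generally, on endpoint projection: each participant's local behavior is extracted from the single global description. The sketch below illustrates that idea for straight-line message sequences only. It is not ChIP's generator; the function name, the participants, and the message labels are hypothetical. Real choreographic languages also project choices and loops, which requires checking that every affected participant learns which branch was taken.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, str]  # (sender, receiver, message label)


def project(choreography: List[Interaction]) -> Dict[str, List[str]]:
    """Derive each participant's local send/receive sequence from a global choreography."""
    local: Dict[str, List[str]] = defaultdict(list)
    for sender, receiver, label in choreography:
        # One global step becomes a send at one endpoint and a receive at the other.
        local[sender].append(f"send {label} to {receiver}")
        local[receiver].append(f"recv {label} from {sender}")
    return dict(local)


# Hypothetical three-party flow; names are illustrative, not taken from the SMAll pilot.
flow = [
    ("University", "Transport", "enrolledStudents"),
    ("Transport", "Government", "subsidyRequest"),
    ("Government", "Transport", "subsidyApproval"),
]
for participant, actions in project(flow).items():
    print(participant, actions)
```

Because both halves of every interaction come from the same global step, each send has a matching receive at its partner, in the same relative order; this is the intuition behind connectors that faithfully implement the described distributed logic by construction.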