Robert Utterback scite author profile

Fineman

et al. 2016

If a parallel program has determinacy races, different schedules can result in memory accesses that observe different values. Various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in programs with fork-join (or nested) parallelism. Given a fork-join program with $T_1$ work and $T_\infty$ span, WSP-Order executes it while also maintaining SP relationships in $O(T_1/P + T_\infty)$ time on $P$ processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented C-RACER, a race detector based on WSP-Order within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as the previous best sequential race detectors. More importantly, when run in parallel, it achieves almost as much speedup as the original program without race detection.
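To make the notion of SP maintenance concrete, here is a minimal sequential sketch of one well-known way to answer "are these two strands logically parallel?": label the leaves of an SP parse tree with two total orders (often called English and Hebrew) and compare ranks. This is not WSP-Order itself, which maintains such information on the fly under a work-stealing scheduler. The `Node` class and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    kind: str                      # "S" (series), "P" (parallel), or "leaf"
    name: str = ""                 # only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def label(root: Node) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (english, hebrew) ranks for every leaf of the parse tree."""
    eng: Dict[str, int] = {}
    heb: Dict[str, int] = {}

    def walk(node: Node, order: Dict[str, int], flip_parallel: bool) -> None:
        if node.kind == "leaf":
            order[node.name] = len(order)
            return
        kids = node.children
        # The Hebrew order visits the children of P-nodes right to left.
        if flip_parallel and node.kind == "P":
            kids = list(reversed(kids))
        for child in kids:
            walk(child, order, flip_parallel)

    walk(root, eng, False)
    walk(root, heb, True)
    return eng, heb


def precedes(a: str, b: str, eng: Dict[str, int], heb: Dict[str, int]) -> bool:
    """a is in series before b iff it comes first in both orders."""
    return eng[a] < eng[b] and heb[a] < heb[b]


def logically_parallel(a: str, b: str, eng: Dict[str, int], heb: Dict[str, int]) -> bool:
    """Two distinct leaves are parallel iff the two orders disagree."""
    return a != b and not precedes(a, b, eng, heb) and not precedes(b, a, eng, heb)


# a; (b || c); d  -- a fork-join program with one parallel region
leaf = lambda n: Node("leaf", n)
root = Node("S", children=[leaf("a"), Node("P", children=[leaf("b"), leaf("c")]), leaf("d")])
eng, heb = label(root)
print(logically_parallel("b", "c", eng, heb), precedes("a", "d", eng, heb))
```

In this sketch each query costs two dictionary lookups per strand, but the labels are computed from a complete parse tree up front; maintaining them while the program runs in parallel is the harder problem the abstract addresses.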

Race Detection and Reachability in Nearly Series-Parallel DAGs

Devietti

Fineman

et al. 2018

A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program since they lead to non-deterministic program behavior: different schedules of the program can lead to different results. Most prior work on detecting these races focuses on a subclass of programs with series-parallel or nested parallelism. This paper presents a race-detection algorithm for detecting races in a more general class of programs, namely programs that include arbitrary ordering constraints in addition to the series-parallel constructs. The algorithm performs a serial execution of the program, augmented to detect races, in $O(T_1 + k^2)$ time, where $T_1$ is the sequential running time of the original program and $k$ is the number of non-series-parallel constraints. The main technical novelty of this paper is a new data structure, R-Sketch, for answering reachability queries in nearly series-parallel (SP) directed acyclic graphs (DAGs). Given as input a graph comprising an $n$-node series-parallel graph and $k$ additional non-SP edges, the total construction time of the data structure is $O(n + k^2)$, and each reachability query can be answered in $O(1)$ time. The data structure is traversally incremental, meaning that it supports the insertion of nodes/edges, but only as they are discovered through a graph traversal.
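The definition of a determinacy race can be checked directly on a small DAG. The sketch below is a deliberately naive baseline, assuming the whole graph and all accesses are known in advance: it precomputes reachability by memoized depth-first search and tests every pair of accesses. It is not R-Sketch and has none of its time bounds; the function names and the tuple layout `(node, location, is_write)` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def reachable_sets(nodes, edges):
    """Map each node to the set of nodes reachable from it (graph must be a DAG)."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    reach = {}

    def dfs(u):
        if u in reach:
            return reach[u]
        out = set()
        for v in succ[u]:
            out.add(v)
            out |= dfs(v)
        reach[u] = out
        return out

    for n in nodes:
        dfs(n)
    return reach


def find_races(nodes, edges, accesses):
    """Return (u, v, location) for every pair of racing accesses.

    Two accesses race if they touch the same location, at least one is a
    write, and neither node can reach the other (they are logically parallel).
    """
    reach = reachable_sets(nodes, edges)
    races = []
    for (u, loc_u, w_u), (v, loc_v, w_v) in combinations(accesses, 2):
        if u == v or loc_u != loc_v or not (w_u or w_v):
            continue
        if v not in reach[u] and u not in reach[v]:
            races.append((u, v, loc_u))
    return races


# Two parallel chains s->a->b->t and s->c->d->t form a series-parallel DAG.
nodes = ["s", "a", "b", "c", "d", "t"]
sp_edges = [("s", "a"), ("a", "b"), ("b", "t"), ("s", "c"), ("c", "d"), ("d", "t")]
accesses = [("a", "x", True), ("d", "x", False), ("b", "y", True), ("c", "y", True)]

print(find_races(nodes, sp_edges, accesses))
# One extra non-SP ordering constraint a->d orders the two accesses to x.
print(find_races(nodes, sp_edges + [("a", "d")], accesses))
```

The extra edge `a -> d` is an example of an ordering constraint that cannot be expressed with fork-join nesting alone, which is the kind of edge counted by $k$ in the abstract.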

Processor-Oblivious Record and Replay

Utterback

Lee

et al. 2017

Record-and-replay systems are useful tools for debugging non-deterministic parallel programs by first recording an execution and then replaying that execution to produce the same access pattern. Existing record-and-replay systems generally target thread-based execution models, and record the behaviors and interleavings of individual threads. Dynamic multithreaded languages and libraries, such as the Cilk family, OpenMP, TBB, etc., do not have a notion of threads. Instead, these languages provide a processor-oblivious model of programming, where programs expose task parallelism using high-level constructs such as spawn/sync without regard to the number of threads/cores available to run the program. Thread-based record-and-replay would violate the processor-oblivious nature of these programs, since it incorporates the number of threads into the recorded information, constraining the replayed execution to the same number of threads. In this paper, we present a processor-oblivious record-and-replay scheme for such languages where record and replay can use different numbers of processors and both are scheduled using work stealing. We provide theoretical guarantees for our record-and-replay scheme, namely that record is optimal for programs with one lock and replay is near-optimal for all cases. In addition, we implemented this scheme in the Cilk Plus runtime system and our evaluation indicates that processor-obliviousness does not cause substantial overheads.
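The basic record-then-enforce idea for a single lock can be sketched as follows. This is only an illustration of what gets recorded (the order of lock acquisitions, keyed by task rather than by thread) and how a replay can enforce it; it uses one OS thread per task for simplicity, so it does not reproduce the paper's work-stealing scheduling or its guarantees. The function names `record` and `replay` are hypothetical, and task ids are assumed to be unique.

```python
import threading


def record(task_ids):
    """Run one thread per task; return the order in which tasks took the lock."""
    log = []
    lock = threading.Lock()

    def body(tid):
        with lock:               # the single lock whose acquisition order is recorded
            log.append(tid)

    threads = [threading.Thread(target=body, args=(t,)) for t in task_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def replay(task_ids, log):
    """Re-run the tasks, forcing critical sections to follow the recorded order."""
    observed = []
    turn = [0]                   # index into log of the next task allowed to proceed
    cond = threading.Condition()

    def body(tid):
        with cond:
            # Block until the recorded order says it is this task's turn.
            cond.wait_for(lambda: log[turn[0]] == tid)
            observed.append(tid)
            turn[0] += 1
            cond.notify_all()

    threads = [threading.Thread(target=body, args=(t,)) for t in task_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed


ids = list(range(8))
log = record(ids)
# Start the tasks in a different order; the replay still follows the log.
print(replay(list(reversed(ids)), log) == log)
```

Note that running this replay on a fixed-size pool with fewer workers than tasks could deadlock: a worker may block on a task whose turn has not come while the task that must go first is still queued. Avoiding that kind of stall without tying the log to a thread count is part of what makes the processor-oblivious setting difficult.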

Provably good scheduling for parallel programs that use data structures through implicit batching

Fineman

et al. 2014

Although concurrent data structures are commonly used in practice on shared-memory machines, even the most efficient concurrent structures often lack performance theorems guaranteeing linear speedup for the enclosing parallel program. Moreover, efficient concurrent data structures are difficult to design. In contrast, parallel batched data structures do provide provable performance guarantees, since processing a batch in parallel is easier than dealing with the arbitrary asynchrony of concurrent accesses. They can limit programmability, however, since restructuring a parallel program to use a batched data structure instead of a concurrent data structure can often be difficult or even infeasible. This paper presents BATCHER, a scheduler that achieves the best of both worlds through the idea of implicit batching, and a corresponding general performance theorem. BATCHER takes as input (1) a dynamically multithreaded program that makes arbitrary parallel accesses to an abstract data type, and (2) an implementation of the abstract data type as a batched data structure that need not cope with concurrent accesses. BATCHER extends a randomized work-stealing scheduler and guarantees provably good performance to parallel algorithms that use these data structures. In particular, suppose a parallel algorithm has $T_1$ work, $T_\infty$ span, and $n$ data-structure operations. Let $W(n)$ be the total work of data-structure operations and let $s(n)$ be the span of a size-$P$ batch. Then BATCHER executes the program in $O((T_1 + W(n) + n\,s(n))/P + s(n)\,T_\infty)$ expected time on $P$ processors. For higher-cost data structures like search trees and large enough $n$, this bound becomes $O((T_1 + n \lg n)/P + T_\infty \lg n)$, provably matching the work of a sequential search tree but with nearly linear speedup, even though the data structure is accessed concurrently. The BATCHER runtime bound also readily extends to data structures with amortized bounds.
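The division of labor behind implicit batching can be illustrated with a small sequential simulation: the program issues individual operations as if the structure were concurrent, an intermediary groups pending operations into batches of at most $P$, and the data structure only ever sees one whole batch at a time. The class names `BatchedSortedSet` and `ImplicitBatcher` are hypothetical, and the sketch processes each batch sequentially; in BATCHER the grouping happens inside the work-stealing scheduler and each batch can itself be processed in parallel.

```python
import bisect


class BatchedSortedSet:
    """A batched structure: it never has to cope with concurrent calls."""

    def __init__(self):
        self.keys = []
        self.batch_sizes = []     # bookkeeping so the batching is observable

    def apply_batch(self, keys):
        """Insert a whole batch; return, per key, whether it was newly added."""
        self.batch_sizes.append(len(keys))
        added = []
        for key in keys:
            i = bisect.bisect_left(self.keys, key)
            fresh = i == len(self.keys) or self.keys[i] != key
            if fresh:
                self.keys.insert(i, key)
            added.append(fresh)
        return added


class ImplicitBatcher:
    """Collects individual operations and hands them over in batches."""

    def __init__(self, structure, max_batch):
        self.structure = structure
        self.max_batch = max_batch    # plays the role of P, the processor count
        self.pending = []

    def submit(self, key):
        # The caller writes ordinary single-operation code.
        self.pending.append(key)
        if len(self.pending) == self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.structure.apply_batch(batch)


tree = BatchedSortedSet()
batcher = ImplicitBatcher(tree, max_batch=4)
for k in [5, 3, 8, 3, 1, 9, 2, 8, 7, 4]:
    batcher.submit(k)
batcher.flush()
print(tree.batch_sizes, tree.keys)
```

The caller never constructs a batch explicitly, which is the programmability benefit the abstract describes; the performance theorem concerns how a scheduler can form and run such batches without hurting the overall running time.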