Ming-Dong Feng scite author profile

¹

,

Leiserson

²

1997

A parallel multithreaded program that is ostensibly deterministic may nevertheless behave nondeterministically due to bugs in the code. These bugs are called determinacy races, and they result when one thread updates a location in shared memory while another thread is concurrently accessing the location. We have implemented a provably efficient determinacy-race detector for Cilk, an algorithmic multithreaded programming language. If a Cilk program run on a given input data set has a determinacy race, our debugging tool, which we call the "Nondeterminator," guarantees to detect and localize the race.The core of the Nondeterminator is an asymptotically efficient serial algorithm (inspired by Tarjan's nearly linear-time leastcommon-ancestors algorithm) for detecting determinacy races in series-parallel directed acyclic graphs. For a Cilk program that runs in T time on one processor and uses v shared-memory locations, the Nondeterminator runs in O (T α(v; v)) time, where α is Tarjan's functional inverse of Ackermann's function, a very slowly growing function which, for all practical purposes, is bounded above by 4. The Nondeterminator uses at most a constant factor more space than does the original program. On a variety of Cilk program benchmarks, the Nondeterminator exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which we contend is an acceptable slowdown for debugging purposes.

Detecting data races in Cilk programs that use locks

Cheng¹,

²

,

Leiserson³

et al. 1998

When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a "data race" occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple "determinacy" races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input.We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k << n locks simultaneously, ALL-SETS runs in O(r#Tcx(V,V)) time and O(dV) space, where a is Tarjan's functional inverse of Ackermann's function.Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the "umbrella" locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT cx(V, V)) time using O(kV) space.We also prove that any "abelian" Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is datarace free. Thus, the Nondeterminator-2's two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.

Efficient Detection of Determinacy Races in Cilk Programs

Theory of Computing Systems

¹

,

Leiserson²

1999

A parallel multithreaded program that is ostensibly deterministic may nevertheless behave nondeterministically due to bugs in the code. These bugs are called determinacy races, and they result when one thread updates a location in shared memory while another thread is concurrently accessing the location. We have implemented a provably efficient determinacy-race detector for Cilk, an algorithmic multithreaded programming language. If a Cilk program run on a given input data set has a determinacy race, our debugging tool, which we call the "Nondeterminator," guarantees to detect and localize the race.The core of the Nondeterminator is an asymptotically efficient serial algorithm (inspired by Tarjan's nearly linear-time leastcommon-ancestors algorithm) for detecting determinacy races in series-parallel directed acyclic graphs. For a Cilk program that runs in T time on one processor and uses v shared-memory locations, the Nondeterminator runs in O (T α(v; v)) time, where α is Tarjan's functional inverse of Ackermann's function, a very slowly growing function which, for all practical purposes, is bounded above by 4. The Nondeterminator uses at most a constant factor more space than does the original program. On a variety of Cilk program benchmarks, the Nondeterminator exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which we contend is an acceptable slowdown for debugging purposes.

SilkRoad: a multithreaded runtime system with software distributed shared memory for SMP clusters

Wong

¹

,

²

,

³

Multithreaded parallel system with software Distributed Shared Memory (DSM) management, load balancing, etc) are kept consistent by means of the backing store, just as it is in the original distributed Cilk, while the user's cluster wide shared data are kept consistent by LRC. With LRC, SilkRoad programmers are allowed to define and use shared variables between the threads running on different nodes in a cluster, and this greatly enlarged the scope of supported programming paradigms in Cilk. The result is a system that supports work-stealing and a true shared memory programming paradigm. To show the benefits of integration, we compared SilkRoad with the original distributed Cilk. We also compared SilkRoad with TreadMarks, a LRC software DSM implementation for clusters with no support of multithreading. The results show that with the hybrid memory model of dag-consistency and LRC, multithreaded SilkRoad programs written in a divide-and-conquer fashion with good data locality can achieve comparable performance with the corresponding multipleprocess TreadMarks programs.

Distributed Linda tuplespace algorithms and implementations

¹

,

Gao

²

,

Yuen

³

1994

Parallel multiplication

Yuen

¹

,

²

1994

OctFormer: Efficient Octree-Based Transformer for Point Cloud Compression with Local Enhancement

Cui

¹

,

Long

²

,

³

et al. 2023

AAAI

Point cloud compression with a higher compression ratio and tiny loss is essential for efficient data transportation. However, previous methods that depend on 3D convolution or frequent multi-head self-attention operations bring huge computations. To address this problem, we propose an octree-based Transformer compression method called OctFormer, which does not rely on the occupancy information of sibling nodes. Our method uses non-overlapped context windows to construct octree node sequences and share the result of a multi-head self-attention operation among a sequence of nodes. Besides, we introduce a locally-enhance module for exploiting the sibling features and a positional encoding generator for enhancing the translation invariance of the octree node sequence. Compared to the previous state-of-the-art works, our method obtains up to 17% Bpp savings compared to the voxel-context-based baseline and saves an overall 99% coding time compared to the attention-based baseline.

Dynamic load balancing on a distributed system

¹

,

²

We consider the problem of load balancing on loosely coupled multiprocessor systems. Ruring run time, a task may create subtasks, which are dynamically distributed by the load balancer. Different load-balancing strategies (receiver-initiated, sender-initiated and mixture of both) are studied and evaluated on Transputers. We test three commonly used benchmark problems (fibonacci function, N-queen and 15-puzzle) to observe the effect of load balancing. Our experiments involve up to 18 Transputers, and we observe speed improvements from 12 to 16 times over a sequential program. The mixed strategy was the best in most cases. We also find that the longer a problem takes to solve using sequential implementation, the more likely it is to benefit from parallel execution. The load balancing algorithms presented here are applicable to any distributed systems where processor interconnection is modifiable.