Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2012
DOI: 10.1145/2145816.2145841

Deterministic parallel random-number generation for dynamic-multithreading platforms

Abstract: Existing concurrency platforms for dynamic multithreading do not provide repeatable parallel random-number generators. This paper proposes that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation. Experiments with the open-source MIT Cilk runtime system show that the overhead for maintaining pedigrees is negligible. Specifically, on a suite of 10 benchmarks, the relative overhead of Cilk with pedigrees to the original Cilk has a geome…
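The idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' runtime implementation: the task structure, the hash function (SHA-256), and the seed below are assumptions chosen for illustration. The point it demonstrates is that if every task carries a pedigree (the sequence of spawn ranks on the path from the root), then hashing that pedigree with a fixed seed produces random numbers that do not depend on how the scheduler interleaves the tasks.

```python
# Sketch of pedigree-based deterministic parallel RNG (illustrative only).
# Each task's pedigree is the tuple of spawn ranks from the root task;
# hashing (seed, pedigree) gives a value independent of scheduling order.

import hashlib
from concurrent.futures import ThreadPoolExecutor

SEED = 0x12345678  # hypothetical fixed seed, treated as part of the input


def pedigree_rand(pedigree):
    """Map a pedigree (tuple of spawn ranks) to a deterministic 64-bit value."""
    data = SEED.to_bytes(8, "little") + b"".join(
        r.to_bytes(8, "little") for r in pedigree)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")


def work(pedigree, depth):
    """Toy divide-and-conquer computation that consumes random numbers."""
    if depth == 0:
        return pedigree_rand(pedigree)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Child i extends the parent's pedigree with its spawn rank i.
        futures = [pool.submit(work, pedigree + (i,), depth - 1)
                   for i in range(2)]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # The printed total is identical across runs and thread schedules,
    # because each leaf's value depends only on its pedigree and the seed.
    print(work((), 4))
```

Because every leaf's value is a pure function of its pedigree and the seed, rerunning the program with a different worker count or interleaving yields the same result, which is the repeatability property the paper targets.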

Cited by 37 publications (30 citation statements) · References 31 publications
“…The NUMA extension supports non-commuting reductions [6] and pedigrees [16]. Both constructs depend on the execution order of function calls, which the helper function disrupts.…”
Section: F. A NUMA-Aware Cilk Extension
confidence: 99%
“…Figure 1 shows the performance loss of the reproducible implementation of transcendental functions with respect to the standard library installed on different systems, i.e., Glibc for the CPU and the CUDA toolkit for GPUs. The loss of performance is defined as the ratio of the time required by the deterministic implementation to the time required by the standard one on the corresponding platform to perform the task of evaluating the function on 2^22 input values. Figure 2 shows the geometric mean of the performance loss of all implemented functions for every architecture.…”
Section: Case Study: Standard Transcendental Functions
confidence: 99%
“…Other studies on non-determinism caused by parallelism have been performed by Bergan et al [20], Bocchino et al [21], Leiserson et al [22], Olszewski et al [23].…”
Section: Related Work
confidence: 99%
“…Not only must the output of the program be deterministic, but all intermediate values returned from operations must also be deterministic. We note that this does not preclude the use of pseudorandom numbers, where one can use, for example, the approach of Leiserson et al [33] to generate deterministic pseudorandom numbers in parallel from a single seed, which can be part of the input.…”
Section: Programming Model
confidence: 99%