2007
DOI: 10.1007/s00224-007-9098-2

The Cache Complexity of Multithreaded Cache Oblivious Algorithms

Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs O(n^3/√Z + (Pn)^{1/3} n^2) cache misses when executed by the Cilk scheduler on a machine with P processors…
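
For concreteness, here is a minimal serial sketch of the cache oblivious divide-and-conquer matrix multiplication the bound refers to. This is our illustration, not the paper's code: the base-case cutoff of 16 is arbitrary, square power-of-two dimensions are assumed, and the Cilk version would spawn the recursive calls rather than run them serially.

#include <stdio.h>
#include <stdlib.h>

/* C += A * B for n x n blocks with row stride ld.  Recursing on all
   three operands shrinks the working set geometrically, so once a
   subproblem fits in a cache of size Z it runs with O(n^2) misses;
   summing over subproblems gives the O(n^3/√Z) sequential term. */
static void matmul_rec(const double *A, const double *B, double *C,
                       int n, int ld)
{
    if (n <= 16) {                          /* arbitrary base-case cutoff */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;                          /* split into quadrants */
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double       *C11 = C,          *C12 = C + h,
                 *C21 = C + h * ld, *C22 = C + h * ld + h;
    /* Eight half-size multiplications; in Cilk, calls that update
       distinct quadrants of C would be spawned in parallel. */
    matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
    matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
    matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
    matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
}

int main(void)
{
    int n = 256;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; }
    matmul_rec(A, B, C, n, n);
    printf("C[0] = %g (expect %d)\n", C[0], n);  /* all-ones inputs */
    free(A); free(B); free(C);
    return 0;
}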

Cited by 50 publications (76 citation statements)
References 26 publications
“…Part (b) of the following lemma is obtained by considering the schedule that executes each subproblem of size (n/√p) × (n/√p) entirely on a single processor. This schedule gives a better result than the one given in part (a) for the work-stealing scheduler Cilk [12]; the bound in part (a) is obtained by applying a result in [13] on the caching performance of parallel algorithms whose sequential cache complexity is a concave function of work. …”
Section: Cache complexity
confidence: 94%
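
The “concave function of work” result that the snippet attributes to [13] (the paper under discussion) can be sketched as follows. This is our reconstruction from the abstract; S, W, w_i, and T_∞ are our notation, not the snippet's. Work stealing partitions the computation into S+1 maximal subcomputations with works w_0, …, w_S, each of which may start with a cold cache, so if the sequential cache complexity Q_1 is concave in the work,

\[
Q_P \;\le\; \sum_{i=0}^{S} Q_1(w_i) + O(S Z)
    \;\le\; (S+1)\, Q_1\!\left(\frac{W}{S+1}\right) + O(S Z),
\]

where the second step is Jensen's inequality and, for Cilk, the expected number of steals satisfies E[S] = O(P·T_∞). For matrix multiplication, Q_1(W) = O(W/√Z + W^{2/3}) as a function of the work W = Θ(n^3), and T_∞ = O(n) for the standard parallel divide-and-conquer multiply; substituting gives the O(n^3/√Z + (Pn)^{1/3} n^2) bound quoted in the abstract, up to the lower-order cold-cache term.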
“…The figure shows the pseudocode for Trap operating on a 3-dimensional zoid. For didactic purposes, we have abstracted away many details, which are well described in previous studies. Trap works as follows.…”
Section: The Trap and Trapple Algorithms
confidence: 99%
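
As background for the snippet, here is a minimal serial 1-D version of the trapezoidal recursion in the style of Frigo and Strumpen's cache oblivious stencil algorithm, which Trap generalizes to multidimensional zoids. This is our sketch, not Pochoir's code; kernel, u, and N are placeholder names, and the hypothetical 3-point stencil simply clamps at the grid boundary.

#include <stdio.h>

#define N 100
static double u[2][N];   /* two time levels of a 1-D stencil grid */

/* Placeholder point update: average of the three neighbors at the
   previous time level (a hypothetical 3-point stencil, clamped). */
static void kernel(int t, int x)
{
    int xl = (x == 0) ? 0 : x - 1;
    int xr = (x == N - 1) ? N - 1 : x + 1;
    u[(t + 1) & 1][x] =
        (u[t & 1][xl] + u[t & 1][x] + u[t & 1][xr]) / 3.0;
}

/* Traverses the space-time trapezoid spanning times t0..t1, with
   left boundary starting at x0 (slope dx0) and right boundary
   starting at x1 (slope dx1). */
static void trap(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    if (dt == 1) {
        for (int x = x0; x < x1; ++x)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* Space cut along a line of slope -1 through the center;
               the left piece never depends on the right piece. */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            trap(t0, t1, x0, dx0, xm, -1);
            trap(t0, t1, xm, -1, x1, dx1);
        } else {
            /* Time cut: lower half first, then the shifted upper half. */
            int s = dt / 2;
            trap(t0, t0 + s, x0, dx0, x1, dx1);
            trap(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}

int main(void)
{
    for (int x = 0; x < N; ++x)
        u[0][x] = (double)x;      /* arbitrary initial condition */
    trap(0, 64, 0, 0, N, 0);      /* 64 time steps over the whole grid */
    printf("u[64][50] = %f\n", u[64 & 1][50]);
    return 0;
}

The width test decides between a space cut, which splits along a line of slope -1 and so respects the stencil's dependencies, and a time cut, which runs the lower half before the shifted upper half; both halve the problem, which is what makes the traversal cache oblivious.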
“…Since our focus is on autotuning serial codes, we modified Trap to disable parallelism. We also disabled the “hyperspace cuts,” which enhance the parallelism of Trap, and replaced them with sequential cuts, as in the original algorithms due to Frigo and Strumpen. The modified code performs equivalently to Pochoir's original Trap code when run serially.…”
Section: Introduction
confidence: 99%
“…Figure 6 compares the speedup of our fully optimized MD optimization-3 with that of the baseline as a function of the number of cores/threads. Although the preprocessing leverages the EDC framework to achieve locality through the memory hierarchy [12,13], the scalability of the baseline begins to deteriorate when the number of cores exceeds 32. Additional optimizations, which take advantage of architectural features to maximize data locality and exploit data reuse, make optimization-3 scale almost linearly up to 64 cores, with an on-chip strong-scaling parallel efficiency of 0.99 on 64 cores.…”
Section: Performance Tests and Analysis of MD on…
confidence: 99%
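
As a quick check on the quoted number (our arithmetic, not the snippet's): strong-scaling parallel efficiency on p cores is the speedup divided by p, so

\[
E_p = \frac{S_p}{p} \quad\Longrightarrow\quad S_{64} = 0.99 \times 64 \approx 63.4,
\]

i.e., the fully optimized code runs roughly 63 times faster on 64 cores than on one, assuming efficiency is measured against the same code on a single core.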