2007
DOI: 10.1007/s00224-007-9098-2

The Cache Complexity of Multithreaded Cache Oblivious Algorithms

Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs O(n^3/√Z + (Pn)^{1/3} n^2) cache misses when executed by the Cilk scheduler on a machine with P processors…
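
For concreteness, here is a minimal serial sketch of the cache oblivious divide-and-conquer matrix multiplication the bound refers to. This is our illustration, not the paper's code: the base-case cutoff of 16 is arbitrary, square power-of-two dimensions are assumed, and the Cilk version would spawn the recursive calls rather than run them serially.

#include <stdio.h>
#include <stdlib.h>

/* C += A * B for n x n blocks with row stride ld.  Recursing on all
   three operands shrinks the working set geometrically, so once a
   subproblem fits in a cache of size Z it runs with O(n^2) misses;
   summing over subproblems gives the O(n^3/√Z) sequential term. */
static void matmul_rec(const double *A, const double *B, double *C,
                       int n, int ld)
{
    if (n <= 16) {                          /* arbitrary base-case cutoff */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;                          /* split into quadrants */
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double       *C11 = C,          *C12 = C + h,
                 *C21 = C + h * ld, *C22 = C + h * ld + h;
    /* Eight half-size multiplications; in Cilk, calls that update
       distinct quadrants of C would be spawned in parallel. */
    matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
    matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
    matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
    matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
}

int main(void)
{
    int n = 256;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; }
    matmul_rec(A, B, C, n, n);
    printf("C[0] = %g (expect %d)\n", C[0], n);  /* all-ones inputs */
    free(A); free(B); free(C);
    return 0;
}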

Cited by 50 publications (76 citation statements)
References 26 publications
“…Part (b) of the following lemma is obtained by considering the schedule that executes each subproblem of size (n/√p) × (n/√p) entirely on a single processor. This schedule gives a better result than the one given in part (a) for the work-stealing scheduler Cilk [12]; the bound in part (a) is obtained by applying a result in [13] on the caching performance of parallel algorithms whose sequential cache complexity is a concave function of work. …”
Section: Cache complexity
confidence: 94%
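
The “concave function of work” result that the snippet attributes to [13] (the paper under discussion) can be sketched as follows. This is our reconstruction from the abstract; S, W, w_i, and T_∞ are our notation, not the snippet's. Work stealing partitions the computation into S+1 maximal subcomputations with works w_0, …, w_S, each of which may start with a cold cache, so if the sequential cache complexity Q_1 is concave in the work,

\[
Q_P \;\le\; \sum_{i=0}^{S} Q_1(w_i) + O(S Z)
    \;\le\; (S+1)\, Q_1\!\left(\frac{W}{S+1}\right) + O(S Z),
\]

where the second step is Jensen's inequality and, for Cilk, the expected number of steals satisfies E[S] = O(P·T_∞). For matrix multiplication, Q_1(W) = O(W/√Z + W^{2/3}) as a function of the work W = Θ(n^3), and T_∞ = O(n) for the standard parallel divide-and-conquer multiply; substituting gives the O(n^3/√Z + (Pn)^{1/3} n^2) bound quoted in the abstract, up to the lower-order cold-cache term.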
“…The figure shows the pseudocode for Trap operating on a 3-dimensional zoid. For didactic purposes, we have abstracted away many details, which are well described in previous studies. Trap works as follows.…”
Section: The Trap and Trapple Algorithms
confidence: 99%
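
As background for the snippet, here is a minimal serial 1-D version of the trapezoidal recursion in the style of Frigo and Strumpen's cache oblivious stencil algorithm, which Trap generalizes to multidimensional zoids. This is our sketch, not Pochoir's code; kernel, u, and N are placeholder names, and the hypothetical 3-point stencil simply clamps at the grid boundary.

#include <stdio.h>

#define N 100
static double u[2][N];   /* two time levels of a 1-D stencil grid */

/* Placeholder point update: average of the three neighbors at the
   previous time level (a hypothetical 3-point stencil, clamped). */
static void kernel(int t, int x)
{
    int xl = (x == 0) ? 0 : x - 1;
    int xr = (x == N - 1) ? N - 1 : x + 1;
    u[(t + 1) & 1][x] =
        (u[t & 1][xl] + u[t & 1][x] + u[t & 1][xr]) / 3.0;
}

/* Traverses the space-time trapezoid spanning times t0..t1, with
   left boundary starting at x0 (slope dx0) and right boundary
   starting at x1 (slope dx1). */
static void trap(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    if (dt == 1) {
        for (int x = x0; x < x1; ++x)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* Space cut along a line of slope -1 through the center;
               the left piece never depends on the right piece. */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            trap(t0, t1, x0, dx0, xm, -1);
            trap(t0, t1, xm, -1, x1, dx1);
        } else {
            /* Time cut: lower half first, then the shifted upper half. */
            int s = dt / 2;
            trap(t0, t0 + s, x0, dx0, x1, dx1);
            trap(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}

int main(void)
{
    for (int x = 0; x < N; ++x)
        u[0][x] = (double)x;      /* arbitrary initial condition */
    trap(0, 64, 0, 0, N, 0);      /* 64 time steps over the whole grid */
    printf("u[64][50] = %f\n", u[64 & 1][50]);
    return 0;
}

The width test decides between a space cut, which splits along a line of slope -1 and so respects the stencil's dependencies, and a time cut, which runs the lower half before the shifted upper half; both halve the problem, which is what makes the traversal cache oblivious.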
“…Since our focus is on autotuning serial codes, we modified Trap to disable parallelism. We also disabled the “hyperspace cuts,” which enhance the parallelism of Trap, and replaced them with sequential cuts, as in the original algorithms due to Frigo and Strumpen. The modified code performs equivalently to Pochoir's original Trap code when run serially.…”
Section: Introduction
confidence: 99%
“…Figure 6 compares the speedup of our fully optimized MD optimization-3 with that of the baseline as a function of the number of cores/threads. Although the preprocessing leverages the EDC framework to achieve locality through the memory hierarchy [12,13], the scalability of the baseline begins to deteriorate when the number of cores exceeds 32. Additional optimizations, which take advantage of architectural features to maximize data locality and exploit data reuse, make optimization-3 scale almost linearly up to 64 cores, with an on-chip strong-scaling parallel efficiency of 0.99 on 64 cores.…”
Section: Performance Tests and Analysis of MD on…
confidence: 99%
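
As a quick check on the quoted number (our arithmetic, not the snippet's): strong-scaling parallel efficiency on p cores is the speedup divided by p, so

\[
E_p = \frac{S_p}{p} \quad\Longrightarrow\quad S_{64} = 0.99 \times 64 \approx 63.4,
\]

i.e., the fully optimized code runs roughly 63 times faster on 64 cores than on one, assuming efficiency is measured against the same code on a single core.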