Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2015
DOI: 10.1145/2688500.2688514

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Abstract: State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the…
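To make the abstract's point concrete, below is a minimal sketch (our illustration, not code from the paper; the input strings, function names, and base-case cutoff are made up) of the fork-join divide-and-conquer schedule it describes, applied to an LCS-style DP table. The join before the bottom-right quadrant is exactly the kind of artificial dependency at issue: cells near the top-left corner of Q22 become ready as soon as the adjacent cells of Q12 and Q21 are done, yet fork-join makes them wait for both quadrants in full.

```cpp
#include <algorithm>
#include <cstdio>
#include <future>
#include <string>
#include <vector>

static const std::string A = "GATTACA", B = "TACGATCA";  // made-up inputs
static std::vector<std::vector<int>> C;  // C[i][j], 1-indexed over A and B

// Base case: fill cells [ilo,ihi) x [jlo,jhi) serially in row-major order.
static void base(int ilo, int ihi, int jlo, int jhi) {
  for (int i = ilo; i < ihi; ++i)
    for (int j = jlo; j < jhi; ++j)
      C[i][j] = (A[i-1] == B[j-1]) ? C[i-1][j-1] + 1
                                   : std::max(C[i-1][j], C[i][j-1]);
}

// Quadrant order: Q11 first; then Q12 and Q21 in parallel; then Q22.
static void rec(int ilo, int ihi, int jlo, int jhi) {
  if (ihi - ilo <= 2 || jhi - jlo <= 2) { base(ilo, ihi, jlo, jhi); return; }
  int im = (ilo + ihi) / 2, jm = (jlo + jhi) / 2;
  rec(ilo, im, jlo, jm);                                    // Q11
  auto q12 = std::async(std::launch::async, rec, ilo, im, jm, jhi);
  rec(im, ihi, jlo, jm);                                    // Q21, with Q12
  q12.get();  // artificial barrier: early cells of Q22 were ready long ago
  rec(im, ihi, jm, jhi);                                    // Q22
}

int main() {
  C.assign(A.size() + 1, std::vector<int>(B.size() + 1, 0));
  rec(1, (int)A.size() + 1, 1, (int)B.size() + 1);
  std::printf("LCS length = %d\n", C[A.size()][B.size()]);
}
```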

Cited by 29 publications (28 citation statements)
References 53 publications
“…For the problems that we consider in this paper, the parallel DP algorithms were already discussed in a rich literature in the eighties and nineties (e.g., [49,51,42,58,57,72]). Later work not only considers parallelism, but also optimizes symmetric cache complexity (e.g., [46,34,36,31,20,60,77,74,75,41,73,32]). Algorithms in linear algebra that share similar computation structures (but with different orders of computation) are also discussed (e.g., [36,41,83,78,25,40,11,65]).…”
Section: Preliminaries and Related Work
confidence: 99%
“…For other problems (GAP, RNA, protein accordion folding, knapsack), the bounds in the symmetric setting are also improved. Some previous work [75,41] achieves linear span for several problems. We note that they assume a much stronger model to guarantee the sequential and parallel execution order, so their algorithms need specially designed schedulers [41,30].…”
Section: Preliminaries and Related Work
confidence: 99%
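For intuition on where the linear span comes from, the sketch below (ours, not from any of the cited papers; a plain wavefront traversal, not the cache-oblivious wavefront algorithm itself, and the inputs and helper names are made up) sweeps the same kind of 2D DP table by anti-diagonals. All cells on one anti-diagonal are mutually independent, so the span is Θ(m + n) diagonal steps, with no barriers beyond the one ending each step.

```cpp
#include <algorithm>
#include <cstdio>
#include <future>
#include <string>
#include <vector>

int main() {
  const std::string A = "GATTACA", B = "TACGATCA";  // made-up inputs
  const int m = (int)A.size(), n = (int)B.size();
  std::vector<std::vector<int>> C(m + 1, std::vector<int>(n + 1, 0));
  // Sweep anti-diagonals d = i + j; cells on one diagonal are independent.
  for (int d = 2; d <= m + n; ++d) {
    const int ilo = std::max(1, d - n), ihi = std::min(m, d - 1);
    auto fill = [&](int lo, int hi) {       // fill cells i in [lo,hi]
      for (int i = lo; i <= hi; ++i) {
        const int j = d - i;
        C[i][j] = (A[i-1] == B[j-1]) ? C[i-1][j-1] + 1
                                     : std::max(C[i-1][j], C[i][j-1]);
      }
    };
    // Two-way split of the diagonal; a real runtime would use work stealing.
    const int mid = (ilo + ihi) / 2;
    auto half = std::async(std::launch::async, fill, ilo, mid);
    fill(mid + 1, ihi);
    half.get();  // the only synchronization per diagonal step
  }
  std::printf("LCS length = %d\n", C[m][n]);
}
```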
“…This overhead may be reduced to some extent by falsely reporting a greater size for Cholesky tasks at lower levels of recursion, forcing the SB scheduler to be more aggressive about load balancing. Further, Cholesky factorization could also achieve better performance by relaxing the false dependencies introduced when the algorithm is expressed in the fork-join paradigm, using techniques recently introduced by Tang et al. [2015]. This would reduce the depth of the algorithm to O(n/L) and remove all serial points in the DAG except the start and the end.…”
Section: Algorithms
confidence: 99%
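To illustrate the false dependencies this snippet refers to, here is a small tiled right-looking Cholesky sketch in plain fork-join style (our own illustration, not from Tang et al. [2015] or the SB scheduler; the tile sizes, kernels, and test matrix are made up and simplified). Each step k ends in full joins after the triangular solves and again after the trailing updates, even though the true tile-level DAG would, for example, let the step-(k+1) diagonal factorization begin as soon as its own tile is updated; relaxing those joins is what shrinks the critical path toward O(n/L).

```cpp
#include <cmath>
#include <cstdio>
#include <future>
#include <vector>

constexpr int B = 4, T = 3, N = B * T;    // tile size, tiles per side
using Tile = std::vector<double>;         // row-major B x B tile
std::vector<Tile> A(T * T, Tile(B * B));  // tile (i,j) stored at A[i*T+j]
double& at(int i, int j, int r, int c) { return A[i*T+j][r*B+c]; }

void potrf(int k) {                       // in-place Cholesky of tile (k,k)
  for (int j = 0; j < B; ++j) {
    for (int p = 0; p < j; ++p) at(k,k,j,j) -= at(k,k,j,p) * at(k,k,j,p);
    at(k,k,j,j) = std::sqrt(at(k,k,j,j));
    for (int i = j + 1; i < B; ++i) {
      for (int p = 0; p < j; ++p) at(k,k,i,j) -= at(k,k,i,p) * at(k,k,j,p);
      at(k,k,i,j) /= at(k,k,j,j);
    }
  }
}
void trsm(int i, int k) {                 // tile (i,k) <- (i,k) * L(k,k)^-T
  for (int r = 0; r < B; ++r)
    for (int c = 0; c < B; ++c) {
      for (int p = 0; p < c; ++p) at(i,k,r,c) -= at(i,k,r,p) * at(k,k,c,p);
      at(i,k,r,c) /= at(k,k,c,c);
    }
}
void update(int i, int j, int k) {        // tile (i,j) -= (i,k) * (j,k)^T
  for (int r = 0; r < B; ++r)
    for (int c = 0; c < B; ++c)
      for (int p = 0; p < B; ++p) at(i,j,r,c) -= at(i,k,r,p) * at(j,k,c,p);
}

int main() {
  // Symmetric positive-definite test matrix (diagonally dominant);
  // only the lower triangle is stored and factored.
  for (int i = 0; i < N; ++i)
    for (int j = 0; j <= i; ++j)
      at(i/B, j/B, i%B, j%B) = (i == j) ? N + 1.0 : 1.0 / (1.0 + i + j);
  for (int k = 0; k < T; ++k) {
    potrf(k);                             // serial point at every step
    std::vector<std::future<void>> fs;
    for (int i = k + 1; i < T; ++i)
      fs.push_back(std::async(std::launch::async, trsm, i, k));
    for (auto& f : fs) f.get();           // join #1: full barrier
    fs.clear();
    for (int j = k + 1; j < T; ++j)
      for (int i = j; i < T; ++i)
        fs.push_back(std::async(std::launch::async, update, i, j, k));
    for (auto& f : fs) f.get();           // join #2: full barrier
  }
  std::printf("L(0,0) = %f (expect sqrt(%d))\n", at(0,0,0,0), N + 1);
}
```

In this baseline the two joins per step make the critical path grow with the number of steps times the per-step barrier cost; the technique the snippet describes replaces the barriers with the exact tile-level dependencies, leaving only the start and end of the DAG as serial points.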