The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation

Chowdhury, Rezaul; Ramachandran, Vijaya

doi:10.1007/s00224-010-9273-8

Cited by 45 publications

(63 citation statements)

References 25 publications

(45 reference statements)

“…In most of these computations, each process also has full knowledge about its future request sequence of its current task. For instance, the computation of Gaussian elimination paradigm as discussed by Chowdhury and Ramachandran [9] has this type of behavior. Even the computation of matrix multiplication and fast Fourier transform have this type of behavior.…”

Section: Disjoint and Shared Memory Framework (mentioning)

confidence: 97%

“…In several of such applications, processes also have perfect knowledge about the sequence of requests they plan to request in the future since they work on a well-defined computation like matrix multiplication or Gaussian elimination paradigm [9,10]. Observe that in these computations, the interleaving of requests from different processes reaching the shared cache still remains adversarial since the interleaving depends on factors like the difference in the clock period, interrupts from the operating systems, etc.…”

Section: Shared Memory Framework Description (mentioning)

confidence: 99%

Competitive Cache Replacement Strategies for Shared Cache Environments

Katti

Ramachandran

2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium

Self Cite

“…The Gaussian elimination paradigm of Chowdhury and Ramachandran [13] provides a cache-oblivious framework for these problems, similar to Toledo's recursive blocked LU factorization [41]. Our APSP work is orthogonal to that of Chowdhury and Ramachandran in the sense we provide distributed memory algorithms that minimize internode communication (both latency and bandwidth), while their method focuses on cache-obliviousness and multithreaded (shared memory) implementation.…”

Section: Previous Work (mentioning)

confidence: 99%

Minimizing Communication in All-Pairs Shortest Paths

Solomonik

Buluç

Demmel

2013

Abstract: We consider distributed memory algorithms for the all-pairs shortest paths (APSP) problem. Scaling the APSP problem to high concurrencies requires both minimizing inter-processor communication as well as maximizing temporal data locality. The 2.5D APSP algorithm, which is based on the divide-and-conquer paradigm, satisfies both of these requirements: it can utilize any extra available memory to perform asymptotically less communication, and it is rich in semiring matrix multiplications, which have high temporal locality. We start by introducing a block-cyclic 2D (minimal memory) APSP algorithm. With a careful choice of block-size, this algorithm achieves known communication lower-bounds for latency and bandwidth. We extend this 2D block-cyclic algorithm to a 2.5D algorithm, which can use $c$ extra copies of data to reduce the bandwidth cost by a factor of $c^{1/2}$, compared to its 2D counterpart. However, the 2.5D algorithm increases the latency cost by $c^{1/2}$. We provide a tighter lower bound on latency, which dictates that the latency overhead is necessary to reduce bandwidth along the critical path of execution. Our implementation achieves impressive performance and scaling to 24,576 cores of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory algorithm.
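To make the abstract's phrase "divide-and-conquer ... rich in semiring matrix multiplications" concrete, here is a minimal sequential sketch of a recursive APSP over the (min, +) semiring. It is not the authors' 2.5D distributed implementation and says nothing about communication cost. The function names `min_plus`, `ew_min`, and `dc_apsp` are hypothetical, and the sketch assumes a dense weight matrix with zero diagonal and no negative cycles.

```python
INF = float("inf")

def min_plus(X, Y):
    """(min, +) semiring product of a p x q and a q x r matrix."""
    cols = list(zip(*Y))
    return [[min(x + y for x, y in zip(row, col)) for col in cols] for row in X]

def ew_min(X, Y):
    """Elementwise minimum of two equally shaped matrices."""
    return [[min(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def dc_apsp(A):
    """Recursive (divide-and-conquer) all-pairs shortest paths.

    Assumption: A[i][i] == 0 and the graph has no negative cycles.
    Almost all of the work is done inside min_plus products.
    """
    n = len(A)
    if n == 1:
        return [[min(A[0][0], 0)]]
    h = n // 2
    A11 = [row[:h] for row in A[:h]]
    A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]

    A11 = dc_apsp(A11)                      # paths inside the first block
    A12 = min_plus(A11, A12)                # leave the first block
    A21 = min_plus(A21, A11)                # enter the first block
    A22 = dc_apsp(ew_min(A22, min_plus(A21, A12)))  # second block, detours allowed
    A12 = min_plus(A12, A22)
    A21 = min_plus(A22, A21)
    A11 = ew_min(A11, min_plus(A12, A21))   # detours through the second block

    top = [r1 + r2 for r1, r2 in zip(A11, A12)]
    bottom = [r1 + r2 for r1, r2 in zip(A21, A22)]
    return top + bottom
```

Calling `dc_apsp(W)` on an n x n list-of-lists weight matrix `W` (with `INF` for missing edges) returns a new matrix of shortest-path distances; the input is not modified.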

“…However, their analysis is limited to the hierarchical divide-and-conquer problems and a moderate level of parallelism. Chowdhury and Ramachandran [9] consider cache-complexity in both private- and shared-cache models for matrix-based computations, including all-pairs shortest paths algorithm of Floyd-Warshall. They also consider parallel dynamic programming algorithms in private-, shared- and multicore-cache models [10].…”

Section: A Prior Related Work (mentioning)

confidence: 99%

Parallel external memory graph algorithms

Arge

Goodrich

Sitchinava

2010

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

Abstract: In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest common ancestors, tree contraction and expression tree evaluation. We also study the problems of computing the connected and biconnected components of a graph, minimum spanning tree of a connected graph and ear decomposition of a biconnected graph. All our solutions on a P-processor PEM model provide an optimal speedup of Θ(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts.
