This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
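As a rough illustration of the format, here is a minimal Python sketch of CSB storage and block-wise multiplication. The triple-per-block layout and the choice β ≈ √n follow the abstract; everything else (the function names, the dictionary of blocks, the serial loop standing in for the parallel traversal) is an assumption for illustration, and the paper's compact index encoding and recursive block traversal are omitted.

```python
import math
from collections import defaultdict

def build_csb(triples, n, beta=None):
    """Group the nonzeros of an n x n matrix into beta x beta blocks.
    Each block stores (row-in-block, col-in-block, value) triples, so the
    same structure serves both A x and A^T x."""
    beta = beta or max(1, math.isqrt(n))   # beta ~ sqrt(n), as in the paper
    blocks = defaultdict(list)
    for i, j, v in triples:
        blocks[(i // beta, j // beta)].append((i % beta, j % beta, v))
    return blocks, beta

def csb_spmv(blocks, beta, x, n, transpose=False):
    """Compute y = A x (or y = A^T x when transpose=True). This loops over
    blocks serially; the paper processes blockrows (blockcolumns for A^T x)
    in parallel to obtain the stated work and span bounds."""
    y = [0.0] * n
    for (bi, bj), entries in blocks.items():
        for r, c, v in entries:
            i, j = bi * beta + r, bj * beta + c
            if transpose:
                y[j] += v * x[i]
            else:
                y[i] += v * x[j]
    return y
```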
The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order, adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate depends on only a subset of the previous iterates (i.e., knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow, then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm.

In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log² n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallel algorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism.
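To make the dependence structure concrete, the following Python sketch replays the greedy MIS loop in synchronous rounds, deciding a vertex only once all of its earlier neighbors are decided. This conservative rule reproduces the sequential result exactly, and its round count upper-bounds the dependence length discussed in the abstract; the names and the readiness rule are illustrative simplifications, not the paper's algorithm.

```python
import random

def greedy_mis_rounds(adj, order):
    """Round-synchronous simulation of greedy MIS over the given order.
    The round count upper-bounds the dependence length; the paper proves
    the dependence length is O(log^2 n) w.h.p. for a random order, using
    a sharper rule that also lets a vertex decide as soon as any earlier
    neighbor joins the MIS."""
    rank = {v: i for i, v in enumerate(order)}
    status = {v: None for v in order}        # None = undecided
    rounds = 0
    while any(s is None for s in status.values()):
        rounds += 1
        ready = [v for v in order
                 if status[v] is None
                 and all(status[u] is not None
                         for u in adj[v] if rank[u] < rank[v])]
        for v in ready:                      # parallel in a real implementation
            status[v] = not any(status[u] for u in adj[v] if rank[u] < rank[v])
    return {v for v, s in status.items() if s}, rounds

# Example: a 4-cycle 0-1-2-3 under a random vertex order.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
order = list(adj)
random.shuffle(order)
print(greedy_mis_rounds(adj, order))
```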
Making efficient use of cache hierarchies is essential for achieving good performance on multicore and other shared-memory parallel machines. Unfortunately, designing algorithms for complicated cache hierarchies can be difficult and tedious. To address this, recent work has developed high-level models that expose locality in a manner that is oblivious to particular cache or processor organizations, placing the burden of making effective use of a parallel machine on a runtime task scheduler rather than the algorithm designer/programmer. This paper continues this line of work by (i) developing a new model for parallel cache cost, (ii) developing a task scheduler for irregular tasks on cache hierarchies, and (iii) proving that the scheduler assigns tasks to processors in a work-efficient manner (including cache costs) relative to the model. As with many previous models, our model allows algorithms to be analyzed using a single level of cache with parameters M (cache size) and B (cache-line size), and algorithms can be written cache-obliviously (with no choices made based on machine parameters). Unlike previous models, our cost Q_α(n; M, B), for problem size n, captures costs due to work-space imbalance among tasks, and we prove a lower bound showing that some sort of penalty is needed to achieve work efficiency. Nevertheless, for many algorithms, Q_α() is asymptotically equal to the standard sequential cache cost Q().

Our task scheduler is a specific "space-bounded scheduler," which assigns subtasks to caches based on their space usage. Our scheduler extends prior work by efficiently scheduling "irregular" computations with arbitrary work imbalance among parallel subtasks, reflected by the Q_α() cost. Moreover, in addition to proving bounds on cache complexity, we also provide a bound on the total running time of a program execution using our scheduler. Specifically, our scheduler executes a program on a homogeneous h-level parallel memory hierarchy having p processors in time O((v_h/p) · Σ_{i=0}^{h} Q_α(n; M_i, B) · C_i), where M_i is the size of the level-i cache, B is the cache-line size, C_i is the cost of a level-i cache miss, and v_h is an overhead defined in the paper. Ignoring the overhead v_h, which may be small (or even constant) when certain side conditions hold, this bound is optimal whenever Q_α() matches the sequential cache complexity: at every level of the hierarchy it divides the total cache cost by the total number of processors.
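To show how the displayed bound composes across levels, here is a small, purely illustrative Python evaluator for it; the toy Q_α and the per-level parameters (M_i, C_i) are invented numbers, not values from the paper.

```python
def scheduler_time_bound(q_alpha, levels, p, v_h):
    """Evaluate (v_h/p) * sum_i Q_alpha(n; M_i, B) * C_i, the abstract's
    running-time bound with the constant factor dropped. `levels` lists
    (M_i, C_i) pairs for levels 0..h; q_alpha(M) is the modeled cache
    cost at cache size M, with B fixed inside it."""
    return (v_h / p) * sum(q_alpha(M) * C for M, C in levels)

# Hypothetical numbers: a streaming computation over n words whose cost is
# ~ n/B while the data exceeds the cache, and ~1 once it fits.
n, B = 1 << 24, 8
q = lambda M: n / B if M < n else 1.0
levels = [(1 << 15, 4), (1 << 21, 12), (1 << 25, 60)]   # (M_i, C_i) per level
print(scheduler_time_bound(q, levels, p=32, v_h=2.0))
```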
A streaming B-tree is a dictionary that efficiently implements insertions and range queries. We present two cache-oblivious streaming B-trees: the shuttle tree and the cache-oblivious lookahead array (COLA).

For block-transfer size B and on N elements, the shuttle tree implements searches in optimal O(log_{B+1} N) transfers, range queries of L successive elements in optimal O(log_{B+1} N + L/B) transfers, and insertions in O((log_{B+1} N)/B^{Θ(1/(log log B)²)} + (log² N)/B) amortized transfers, which is an asymptotic speedup over traditional B-trees if B ≥ (log N)^{1+c/log log log N} for any constant c > 1. A COLA implements searches in O(log N) transfers, range queries in O(log N + L/B) transfers, and insertions in amortized O((log N)/B) transfers, matching the bounds for a (cache-aware) buffered repository tree. A partially deamortized COLA matches these bounds but reduces the worst-case insertion cost to O(log N) if the memory size satisfies M = Ω(log N). We also present a cache-aware version of the COLA, the lookahead array, which achieves the same bounds as Brodal and Fagerberg's (cache-aware) B^ε-tree.

We compare our COLA implementation to a traditional B-tree. Our COLA implementation runs 790 times faster for random insertions, 3.1 times slower for insertions of sorted data, and 3.5 times slower for searches.
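A bare-bones Python sketch of the lookahead-array idea may help: it keeps geometrically growing sorted levels and cascades merges on insertion, but omits the lookahead pointers (and the deamortization) that give the COLA its stated search and worst-case bounds, so it is an assumption-laden toy rather than the paper's data structure.

```python
import bisect

class COLA:
    """Minimal lookahead-array sketch: level k is empty or holds a sorted
    array of 2^k keys, and an insertion cascades merges down the levels,
    which is what yields amortized O((log N)/B) transfers when the merges
    are done cache-efficiently. Without lookahead pointers, a search here
    pays a binary search per level, i.e., O(log^2 N) comparisons."""

    def __init__(self):
        self.levels = []          # level k: [] or a sorted list of 2^k keys

    def insert(self, key):
        carry = [key]             # carry has size 2^k on arrival at level k
        for k in range(len(self.levels) + 1):
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry
                return
            carry = sorted(self.levels[k] + carry)   # stand-in for a linear merge
            self.levels[k] = []

    def contains(self, key):
        for lvl in self.levels:
            i = bisect.bisect_left(lvl, key)
            if i < len(lvl) and lvl[i] == key:
                return True
        return False
```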
The virtues of deterministic parallelism have been argued for decades, and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and it implies a sequential semantics, i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages, including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally, and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special-purpose and/or (ii) slower or less scalable.

In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model, including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
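The abstract names "deterministic reservations" without defining it; the sketch below shows the reserve/commit round structure on a toy loop (iterates claiming slots, a hypothetical example not from the paper), where a min-index priority write makes the outcome match the sequential loop exactly.

```python
import random

def greedy_slot_claim(want, m):
    """Deterministic-reservations sketch on a toy loop: iterate i claims
    slot want[i] iff no earlier iterate claimed it. Rounds of a reserve
    phase (priority write, minimum index wins) and a commit phase over a
    prefix of the remaining iterates reproduce the sequential answer no
    matter how the phases are parallelized. The prefix size is the
    work/parallelism knob mentioned in the abstract."""
    owner = [None] * m                     # final owner of each slot
    remaining = list(range(len(want)))
    while remaining:
        prefix = remaining[:max(1, len(remaining) // 2)]
        reserved = [None] * m
        for i in prefix:                   # reserve phase (parallel in practice)
            s = want[i]
            if owner[s] is None and (reserved[s] is None or i < reserved[s]):
                reserved[s] = i            # priority write: min index wins
        done = set()
        for i in prefix:                   # commit phase (parallel in practice)
            s = want[i]
            if reserved[s] == i:
                owner[s] = i               # i wins its slot
                done.add(i)
            elif owner[s] is not None:
                done.add(i)                # a lower-index iterate owns s
        remaining = [i for i in remaining if i not in done]
    return owner

print(greedy_slot_claim([random.randrange(4) for _ in range(10)], 4))
```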