Binary Decision Diagrams (BDDs) are widely used in formal verification. They are also widely known for consuming large amounts of memory. For larger problems, a BDD computation will often start thrashing due to lack of memory within minutes. This work uses the parallel disks of a cluster or a SAN (storage area network) as an extension of RAM, in order to efficiently compute with BDDs that are orders of magnitude larger than what fits in the RAM of a typical computer. The use of parallel disks overcomes the bandwidth problem of single-disk methods, since the aggregate bandwidth of 50 disks is similar to the bandwidth of a single RAM subsystem. To overcome the latency of disk, the Roomy library is used for its latency-tolerant data structures. A breadth-first algorithm is implemented. A further advantage of the algorithm is that its RAM usage can be very modest, since its largest use of RAM is as buffers for open files. The success of the method is demonstrated by solving the 16-queens problem, and by solving a more unusual problem: counting the number of tie games on a three-dimensional 4×4×4 tic-tac-toe board.
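For readers unfamiliar with the data structure, the sketch below shows the core of a reduced, ordered BDD in its conventional RAM-based form: a unique table used for hash-consing, plus a satisfying-assignment counter. This is an illustrative toy only, not the paper's implementation; the approach described above would replace the in-RAM unique table with Roomy's disk-based, latency-tolerant data structures, and the names here are not the Roomy API.

```python
# Minimal reduced, ordered BDD with a unique table (hash-consing).
# Variables are numbered 1..nvars; node ids 0 and 1 are the terminals.
NODES = {}             # (var, lo, hi) -> node id (the "unique table")
TABLE = [None, None]   # node id -> (var, lo, hi)

def mk(var, lo, hi):
    """Return the canonical node for (var, lo, hi), applying BDD reduction."""
    if lo == hi:                    # both branches agree: node is redundant
        return lo
    key = (var, lo, hi)
    if key not in NODES:            # hash-consing keeps the DAG canonical
        NODES[key] = len(TABLE)
        TABLE.append(key)
    return NODES[key]

def sat_count(node, nvars):
    """Count satisfying assignments of a BDD over variables 1..nvars."""
    def go(n, level):
        if n == 0:
            return 0
        if n == 1:
            return 2 ** (nvars - level + 1)   # remaining variables are free
        var, lo, hi = TABLE[n]
        skipped = 2 ** (var - level)          # levels jumped over by the edge
        return skipped * (go(lo, var + 1) + go(hi, var + 1))
    return go(node, 1)

# Example: build x1 XOR x2 and count its satisfying assignments.
x2_pos = mk(2, 0, 1)            # true iff x2 = 1
x2_neg = mk(2, 1, 0)            # true iff x2 = 0
xor12 = mk(1, x2_pos, x2_neg)   # branch on x1
```

Counting tie games or queens placements amounts to building the BDD for the constraint and running such a count over it; the paper's contribution is doing the construction breadth-first with the large tables on parallel disks.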
In order to keep up with the demand for solutions to problems with ever-increasing data sets, both academia and industry have embraced commodity computer clusters with locally attached disks or SANs as an inexpensive alternative to supercomputers. With the advent of tools for parallel disk programming, such as MapReduce, STXXL, and Roomy, that allow the developer to focus on higher-level algorithms, programmer productivity for memory-intensive programs has increased many-fold. However, such parallel tools were primarily targeted at iterative programs.

We propose a programming model for migrating recursive RAM-based legacy algorithms to parallel disks. Many memory-intensive symbolic algebra algorithms are most easily expressed as recursive algorithms. In this case, the programming challenge is multiplied, since the developer must restructure such an algorithm with two criteria in mind: converting a naturally recursive algorithm into an iterative one, while simultaneously exposing any potential data parallelism (as needed for parallel disks). This model alleviates the large effort that goes into the design phase of an external-memory algorithm. Research in this area over the past 10 years has focused on per-problem solutions, without providing much insight into the connection between legacy algorithms and out-of-core algorithms. Our method shows how legacy algorithms employing recursion and non-streaming memory access can be more easily translated into efficient parallel disk-based algorithms.

We demonstrate the ideas on the largest computation of its kind: the determinization via subset construction, and subsequent minimization, of very large nondeterministic finite state automata (NFA). To our knowledge, this is the largest subset construction reported in the literature. Determinization of large NFAs has long been a computational hurdle in the study of permutation classes defined by token passing networks.
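The restructuring the model calls for can be illustrated on the simplest possible case: graph reachability. The first function below is the natural recursive (depth-first) formulation; the second is the level-synchronous iterative reformulation, in which each level is a batch that exposes data parallelism. This is a hypothetical RAM-based sketch under the assumption of a plain adjacency-dict graph; in the disk-based setting the `seen` and `frontier` sets would become latency-tolerant disk-resident collections such as Roomy's.

```python
def reachable_recursive(graph, v, seen=None):
    # Legacy recursive (depth-first) form: hard to batch, poor fit for disks.
    if seen is None:
        seen = set()
    seen.add(v)
    for w in graph.get(v, ()):
        if w not in seen:
            reachable_recursive(graph, w, seen)
    return seen

def reachable_batched(graph, start):
    # Iterative, level-by-level form: each frontier is a batch that can be
    # streamed through disk-based sets and processed in parallel.
    seen = {start}
    frontier = {start}
    while frontier:
        next_frontier = {w for v in frontier for w in graph.get(v, ())} - seen
        seen |= next_frontier
        frontier = next_frontier
    return seen

# A small diamond-shaped example graph.
g = {0: [1, 2], 1: [3], 2: [3], 3: []}
```

Both forms compute the same set; only the second makes the per-level work visible to a parallel disk runtime.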
The programming model was used to design and implement an efficient NFA determinization algorithm that solves the next stage in analyzing token passing networks representing two stacks in series.
Finite state automata (FSA) are ubiquitous in computer science. Two of the most important algorithms for FSA processing are the conversion of a nondeterministic finite automaton (NFA) to a deterministic finite automaton (DFA), and then the production of the unique minimal DFA for the original NFA. We exhibit a parallel disk-based algorithm that uses a cluster of 29 commodity computers to produce an intermediate DFA with almost two billion states, and then continues by producing the corresponding unique minimal DFA with fewer than 800,000 states. The largest previous such computation in the literature was carried out on a 512-processor CM-5 supercomputer in 1996. That computation produced an intermediate DFA with 525,000 states and an unreported number of states for the corresponding minimal DFA. The work provides strong experimental evidence for a conjecture on a series of token passing networks. The conjecture concerns stack-sortable permutations for a finite stack and a 3-buffer. The origins of this problem lie in the work on restricted permutations begun by Knuth and Tarjan in the late 1960s. The parallel disk-based computation is also compared with both a single-threaded and a multi-threaded RAM-based implementation running on a large shared-memory computer with 16 cores and 128 GB of RAM.
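For concreteness, here is the textbook subset construction in its simplest RAM-based form (without ε-transitions) — a hypothetical sketch, not the paper's parallel algorithm. The disk-based version described above processes the frontier of subsets breadth-first through disk-resident tables rather than an in-RAM queue; accepting-state bookkeeping is omitted here for brevity.

```python
from collections import deque

def determinize(nfa, start, alphabet):
    """Subset construction.

    nfa maps (state, symbol) -> iterable of successor states.
    Returns the DFA transition table, keyed by frozensets of NFA states.
    """
    start_set = frozenset([start])
    dfa = {}                       # frozenset -> {symbol: frozenset}
    frontier = deque([start_set])  # breadth-first queue of unexplored subsets
    while frontier:
        subset = frontier.popleft()
        if subset in dfa:
            continue
        dfa[subset] = {}
        for a in alphabet:
            # Union of the NFA moves of every state in the subset.
            target = frozenset(s for q in subset for s in nfa.get((q, a), ()))
            dfa[subset][a] = target
            if target not in dfa:
                frontier.append(target)
    return dfa

# Example NFA over {a, b} accepting strings containing "ab"
# (state 2 would be the accepting state).
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'a'): {2}, (2, 'b'): {2},
}
dfa = determinize(nfa, 0, 'ab')
```

The exponential blow-up this construction can suffer — up to 2^n subsets for an n-state NFA — is exactly why the two-billion-state intermediate DFA above requires parallel disks.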