2011
DOI: 10.1007/978-3-642-23397-5_42
Lessons Learned from Exploring the Backtracking Paradigm on the GPU

Abstract: We explore the backtracking paradigm with properties seen as sub-optimal for GPU architectures, using as a case study the maximal clique enumeration problem, and find that the presence of these properties limits GPU performance to approximately 1.4-2.25 times a single CPU core. The GPU performance "lessons" we find critical to providing this performance include a coarse-and-fine-grain parallelization of the search space, a low-overhead load-balanced distribution of work, global memory latency hiding t…
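The abstract mentions a coarse-and-fine-grain parallelization of the search space. The paper's actual scheme is not reproduced here; the following is a minimal, assumed sketch of what such a two-level split can look like: coarse grain assigns independent search subtrees to thread blocks, fine grain has the threads of one block each take a strided slice of the candidates inside a subtree. All function names are illustrative, not from the paper.

```python
# Hypothetical sketch of a coarse-and-fine-grain work split (illustrative only;
# this models the partitioning, not the maximal clique enumeration itself).

def partition(subtrees, num_blocks):
    """Coarse grain: round-robin independent subtrees across thread blocks."""
    blocks = [[] for _ in range(num_blocks)]
    for i, t in enumerate(subtrees):
        blocks[i % num_blocks].append(t)
    return blocks

def fine_grain(candidates, num_threads):
    """Fine grain: each thread of a block processes a strided slice
    of one subtree's candidate set."""
    return [candidates[t::num_threads] for t in range(num_threads)]

print(partition(list(range(8)), num_blocks=4))    # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(fine_grain(list(range(6)), num_threads=3))  # [[0, 3], [1, 4], [2, 5]]
```

Round-robin (rather than contiguous) assignment is one simple way to spread subtrees of unknown size across blocks; strided slices likewise give neighboring threads adjacent elements, which favors coalesced memory access on real hardware.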

Cited by 40 publications (24 citation statements); references 18 publications.
“…The described GPU backtracking strategy performs well in regular scenarios, but it faces a decrease of performance in more irregular ones, being outperformed even by the serial CPU implementation in some situations. The main reason for this decrease of performance is that GPUs suffer from load imbalance and diverging instruction flow.…”
Section: Background and Related Work
confidence: 99%
“…The main reason for this decrease of performance is that GPUs suffer from load imbalance and diverging instruction flow. Thus, to achieve a proper utilization of the multiprocessors, this parallel backtracking strategy must launch a huge amount of GPU threads…”
Section: Background and Related Work
confidence: 99%
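The load-imbalance problem this citation describes can be illustrated with a small model of SIMT lockstep execution: threads in a warp advance together, so a warp runs as long as its slowest thread, and irregular backtracking subtrees concentrate that cost. This is an assumed sketch for intuition, not code from the paper; the workload numbers are invented.

```python
# Hypothetical model: why irregular backtracking subtrees cause load imbalance
# on SIMT hardware. A warp's running time is the MAXIMUM of its threads'
# workloads, not the mean, because threads execute in lockstep.

def warp_time(workloads, warp_size=32):
    """Total lockstep time: each warp runs as long as its slowest thread."""
    total = 0
    for i in range(0, len(workloads), warp_size):
        total += max(workloads[i:i + warp_size])
    return total

# Regular search space: 64 threads, each exploring a subtree of depth 10.
regular = [10] * 64                       # total work: 640 units
# Irregular space with the SAME total work, concentrated in two deep subtrees.
irregular = [1] * 62 + [289, 289]         # total work: 640 units

print(warp_time(regular))    # 20  (two warps, each bounded by depth 10)
print(warp_time(irregular))  # 290 (one warp stalls on its two deep subtrees)
```

This is why, as the quote notes, such strategies compensate by launching huge numbers of threads: oversubscription gives the scheduler spare warps to run while unlucky warps are stuck on deep subtrees.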
“…First, GPU operations are based on warps (which are groups of threads to be executed in single-instruction-multiple-data fashion), and different execution paths generated by backtracking algorithms may cause a so-called warp divergence problem. Second, GPU implementations for coalesced memory accesses are no longer straightforward due to irregular access patterns [19].…”
Section: Introduction
confidence: 99%
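The warp divergence problem mentioned above can be modeled the same way: when threads of one warp take different branches, the hardware serializes the branch paths, so every thread effectively pays for every path any warp-mate takes. A minimal, assumed sketch (names and the cost model are illustrative, not from the cited papers):

```python
# Hypothetical cost model of warp divergence: the warp executes the UNION of
# branch paths taken by its threads, serially, with inactive threads masked.

def divergent_cost(paths_per_thread, warp_size=32):
    """Cost in path-executions: each warp serializes the set of distinct
    branch paths taken by any of its threads."""
    total = 0
    for i in range(0, len(paths_per_thread), warp_size):
        warp = paths_per_thread[i:i + warp_size]
        taken = set().union(*warp)   # distinct paths the warp must serialize
        total += len(taken)
    return total

# Uniform warp: all 32 threads take branch "A" -> one path executed.
print(divergent_cost([{"A"}] * 32))                 # 1
# Divergent warp: half take "A", half "B" -> both paths run serially.
print(divergent_cost([{"A"}] * 16 + [{"B"}] * 16))  # 2
```

Backtracking is a worst case for this model: each thread's prune/descend decisions depend on its own subtree, so warp-mates rarely agree on a branch.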
“…Because of their massive data processing capability and their remarkable cost efficiency, GPUs are an attractive choice for providing the computing power needed to solve larger problem instances. The efficient implementation of the B&B algorithm on GPUs is a challenging task because the GPU programming model is at odds with the algorithm's highly irregular nature [1]. In this paper, we present a multi-GPU B&B algorithm for solving large permutation-based combinatorial optimization problems.…”
confidence: 99%