Jason Mak scite author profile

et al. 2012

We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation is dominated by BWT performance and is 2.78× slower than bzip2, with BWT and MTF+Huffman respectively 2.89× and 1.34× slower on average.
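For orientation, the first two stages named in the abstract can be sketched serially. This is a minimal reference sketch in Python, not the paper's GPU algorithms: the BWT here sorts all rotations naively rather than using a two-level hierarchical sort, and the MTF is the textbook sequential version rather than a scan-based parallel one. The function names `bwt`, `mtf_encode` and `mtf_decode` are hypothetical.

```python
from typing import List, Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform (assumption: no sentinel byte).

    Sorts every rotation of the input and returns the last column plus
    the row index of the original string, which a decoder would need.
    """
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    return last_column, order.index(0)


def mtf_encode(data: bytes) -> List[int]:
    """Sequential move-to-front: emit each byte's current list position."""
    table = list(range(256))
    out = []
    for byte in data:
        idx = table.index(byte)
        out.append(idx)
        table.pop(idx)
        table.insert(0, byte)  # recently seen bytes get small indices
    return out


def mtf_decode(indices: List[int]) -> bytes:
    """Inverse of mtf_encode."""
    table = list(range(256))
    out = bytearray()
    for idx in indices:
        byte = table.pop(idx)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)
```

BWT tends to group equal bytes into runs, which MTF then turns into many zeros and small values that a Huffman coder can represent compactly. For example, `bwt(b"banana")` returns `(b"nnbaaa", 3)`.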

Numerical ocean modeling and simulation with CUDA

Choboter

Lupo

2011

ROMS is software that models and simulates an ocean region using a finite difference grid and time stepping. ROMS simulations can take from hours to days to complete due to the compute-intensive nature of the software. As a result, the size and resolution of simulations are constrained by the performance limitations of modern computing hardware. To address these issues, the existing ROMS code can be run in parallel with either OpenMP or MPI. In this work, we implement a new parallelization of ROMS on a graphics processing unit (GPU) using CUDA Fortran. We exploit the massive parallelism offered by modern GPUs to gain a performance benefit at a lower cost and with less power. To test our implementation, we benchmark with idealistic marine conditions as well as real data collected from coastal waters near central California. Our implementation yields a speedup of up to 8x over a serial implementation and 2.5x over an OpenMP implementation, while demonstrating comparable performance to an MPI implementation.

GPU-accelerated and efficient multi-view triangulation for scene reconstruction

Hess-Flores

Recker

et al. 2014

This paper presents a framework for GPU-accelerated N-view triangulation in multi-view reconstruction that improves processing time and final reprojection error with respect to methods in the literature. The framework uses an algorithm based on optimizing an angular error-based L1 cost function and it is shown how adaptive gradient descent can be applied for convergence. The triangulation algorithm is mapped onto the GPU and two approaches for parallelization are compared: one thread per track and one thread block per track. The better performing approach depends on the number of tracks and the lengths of the tracks in the dataset. Furthermore, the algorithm uses statistical sampling based on confidence levels to successfully reduce the quantity of feature track positions needed to triangulate an entire track. Sampling aids in load balancing for the GPU's SIMD architecture and for exploiting the GPU's memory hierarchy. When compared to a serial implementation, a typical performance increase of 3-4x can be achieved on a 4-core CPU. On a GPU, large track numbers are favorable and an increase of up to 40x can be achieved. Results on real and synthetic data show that reprojection errors are similar to those of the best performing current triangulation methods at only a fraction of the computation time, allowing for efficient and accurate triangulation of large scenes.
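The idea of minimizing an angular L1 cost with an adaptive step can be illustrated with a small serial sketch. Everything here is an assumption made for illustration and not the paper's implementation: the cost is taken to be the sum of angles between each observed ray and the direction from that camera center to the candidate point, the gradient is approximated by central differences, and the step size simply grows after an accepted step and shrinks after a rejected one. The names `angular_cost`, `numerical_gradient` and `triangulate` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def angular_cost(point, centers, rays) -> float:
    """Sum of angles (radians) between each observed ray and the
    direction from that camera center to the candidate 3D point."""
    total = 0.0
    for center, ray in zip(centers, rays):
        v = _sub(point, center)
        denom = math.sqrt(_dot(v, v)) * math.sqrt(_dot(ray, ray))
        if denom == 0.0:
            continue  # point coincides with a camera center; skip it
        cos_angle = max(-1.0, min(1.0, _dot(v, ray) / denom))
        total += math.acos(cos_angle)
    return total


def numerical_gradient(point, centers, rays, h: float = 1e-6) -> Vec3:
    """Central-difference gradient (assumption: no analytic gradient)."""
    grad = []
    for k in range(3):
        plus, minus = list(point), list(point)
        plus[k] += h
        minus[k] -= h
        diff = angular_cost(plus, centers, rays) - angular_cost(minus, centers, rays)
        grad.append(diff / (2.0 * h))
    return (grad[0], grad[1], grad[2])


def triangulate(centers, rays, start, step: float = 0.1, iters: int = 200):
    """Gradient descent that accepts only cost-decreasing steps."""
    point = tuple(start)
    cost = angular_cost(point, centers, rays)
    for _ in range(iters):
        grad = numerical_gradient(point, centers, rays)
        candidate = tuple(p - step * g for p, g in zip(point, grad))
        new_cost = angular_cost(candidate, centers, rays)
        if new_cost < cost:
            point, cost = candidate, new_cost
            step *= 1.2  # accepted: try a larger step next time
        else:
            step *= 0.5  # rejected: back off
    return point, cost
```

In a GPU mapping such as the one the abstract describes, each feature track would run a loop like `triangulate` independently, either in one thread or cooperatively across one thread block.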

A Comparative Study of GPU-Accelerated Multi-view Sequential Reconstruction Triangulation Methods for Large-Scale Scenes

Hess-Flores

Recker

et al. 2015

The angular error-based triangulation method and the parallax path method are both high-performance methods for large-scale multi-view sequential reconstruction that can be parallelized on the GPU. We map parallax paths to the GPU and test its performance and accuracy as a triangulation method for the first time. To this end, we compare it with the angular method on the GPU for both performance and accuracy. Furthermore, we improve the recovery of path scales and perform more extensive analysis and testing compared with the original parallax paths method. Although parallax paths requires sequential and piecewise-planar camera positions, in such scenarios, we can achieve a speedup of up to 14x over angular triangulation, while maintaining comparable accuracy.

Efficient dense reconstruction using geometry and image consistency constraints

Shashkov

Recker

et al. 2015

We introduce a method for creating very dense reconstructions of datasets, particularly turntable varieties. The method takes in initial reconstructions (of any origin) and makes them denser by interpolating depth values in two-dimensional image space within a superpixel region and then optimizing the interpolated value via image consistency analysis across neighboring images in the dataset. One of the core assumptions in this method is that depth values per pixel will vary gradually along a gradient for a given object. As such, turntable datasets, such as the dinosaur dataset, are particularly easy for our method. Our method modernizes some existing techniques and parallelizes them on a GPU, which produces results faster than other densification methods.