Proceedings of the 23rd International Conference on Parallel Architectures and Compilation 2014
DOI: 10.1145/2628071.2628101
Warp-aware trace scheduling for GPUs

Abstract: GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling […]
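To make the limitation concrete, the sketch below shows the kind of rewrite a trace-scheduling pass would automate; the kernels and names are hypothetical illustrations, not code from the paper. In the first kernel the load of b[i] sits beyond a branch, so neither a basic-block scheduler nor the hardware (which has no branch prediction) can start it early; the second kernel speculatively hoists the load above the branch so its latency overlaps the comparison.

```cuda
// Hypothetical example: ILP is limited because the load of b[i]
// cannot be issued until the branch on a[i] has been resolved.
__global__ void scale_if(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (a[i] > 0.0f) {
        out[i] = a[i] * b[i];      // long-latency load of b[i] starts only here
    } else {
        out[i] = a[i];
    }
}

// Trace-scheduling-style rewrite (done by hand here): the load is issued
// speculatively before the branch, which is safe because b[i] is in bounds
// whenever the thread is active; an unused result is simply discarded.
__global__ void scale_if_hoisted(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float av = a[i];
    float bv = b[i];               // speculated load overlaps the comparison
    out[i] = (av > 0.0f) ? av * bv : av;
}
```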

Cited by 18 publications (9 citation statements, 2015–2023) · References 27 publications
“…Thus TLP and ILP are in opposition, and attaining full utilization requires carefully balancing both techniques. While TLP is commonly used across all of GPU computing, ILP is a less explored area, with prior work limited to dense linear algebra [22] and microcode optimization [23].…”
Section: Latency Hiding with TLP and ILP
confidence: 99%
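The ILP technique this snippet alludes to (explored for dense linear algebra in its reference [22]) can be sketched as follows; the kernel is an assumed CUDA example, not taken from the cited works. Each thread keeps several independent accumulators, so several loads and adds are in flight per thread and memory latency is hidden even when occupancy, i.e. TLP, is reduced.

```cuda
// Assumed illustration of per-thread ILP: four independent dependence
// chains per thread instead of one.
__global__ void sum4(const float* x, float* partial, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x * 4;
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;   // independent accumulators
    for (int i = tid * 4; i + 3 < n; i += stride) { // tail handling omitted for brevity
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    partial[tid] = s0 + s1 + s2 + s3;               // one partial sum per thread
}
```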
“…A cl_context must be created, which specifies on which device to run and also manages the resources on that device. All OpenCL work is performed within this context. GPUs are equipped with little or no branch prediction hardware, unlike most CPUs [14]. As a result, it is crucial to write GPU kernels with as little branching as possible to maximize performance.…”
Section: Execution Model
confidence: 99%
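The snippet's advice about branching can be illustrated with a small sketch (written in CUDA rather than OpenCL only to keep the examples in one language; the kernels are hypothetical). When a condition depends on data, lanes of a warp may take different paths and the warp executes both sides; computing both outcomes and selecting one lets the compiler predicate the code so the warp never splits.

```cuda
// Divergent version: lanes of a warp may disagree on the condition,
// so both sides of the branch are executed serially by the warp.
__global__ void relu_scale_branchy(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f) x[i] = 0.0f;
        else             x[i] = 2.0f * x[i];
    }
}

// Branch-free version: a select replaces the data-dependent branch,
// which typically compiles to predicated instructions with no warp split.
__global__ void relu_scale_branchless(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        x[i] = (v < 0.0f) ? 0.0f : 2.0f * v;
    }
}
```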
“…also propose a framework supporting a number of widely-used parallel patterns for efficient nested parallelism. [39] introduces warp-aware trace scheduling for GPUs based on speculating loads and arithmetic instructions upon divergence in order to exploit ILP. Recently, Schaub et al.…”
Section: Related Work On Divergence
confidence: 99%