This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG stands out for its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs. We evaluate our algorithms and tool on the entire PolyBench suite.
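As a rough illustration of what two levels of tiling mean for a loop nest, the sketch below tiles a matrix product in plain Python. It is not PPCG output: the function names, the `block` and `thread` tile sizes, and the mapping suggested in the comments (outer tiles to thread blocks, inner tiles to threads) are assumptions made for this example only.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a * b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, block=4, thread=2):
    """Same product with two tile levels (hypothetical sizes)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer tile level: on a GPU, each (ib, jb) tile could map to a thread block.
    for ib in range(0, n, block):
        for jb in range(0, p, block):
            # Inner tile level: each (it, jt) sub-tile could map to one thread.
            for it in range(ib, min(ib + block, n), thread):
                for jt in range(jb, min(jb + block, p), thread):
                    for k in range(m):
                        # Point loops over the elements of one sub-tile;
                        # the min() bounds handle partial tiles at the edges.
                        for i in range(it, min(it + thread, ib + block, n)):
                            for j in range(jt, min(jt + thread, jb + block, p)):
                                c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions visit the same iterations, only in a different order, so they return identical results for integer inputs. The sketch leaves out the memory management the abstract refers to.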
The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS on current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
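To make the FBS/LS distinction concrete, here is a minimal sketch of one decomposition level computed both ways. It assumes an unnormalized Haar wavelet and an even-length input, chosen only because it is the shortest case; the abstract does not say which filters the study used, and the function names are hypothetical. The sketch shows that the two schemes produce the same coefficients; it says nothing about their relative speed on a GPU.

```python
def haar_filter_bank(x):
    """Filter Bank Scheme: apply a low-pass and a high-pass filter,
    keeping one output per pair of samples (downsampling by 2)."""
    low, high = (0.5, 0.5), (-1.0, 1.0)
    approx, detail = [], []
    for i in range(0, len(x) - 1, 2):
        # Each output depends only on the input, so all outputs are independent.
        approx.append(low[0] * x[i] + low[1] * x[i + 1])
        detail.append(high[0] * x[i] + high[1] * x[i + 1])
    return approx, detail


def haar_lifting(x):
    """Lifting Scheme: split into even/odd samples, then predict and update."""
    even = [float(v) for v in x[0::2]]
    odd = [float(v) for v in x[1::2]]
    # Predict step: the detail is the error of predicting odd from even.
    detail = [o - e for e, o in zip(even, odd)]
    # Update step: depends on the predict step, so the two steps are sequential.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail
```

For the input `[4, 6, 10, 12, 8, 6, 5, 5]`, both functions return the approximation `[5.0, 11.0, 7.0, 5.0]` and the detail `[2.0, 2.0, -2.0, 0.0]`.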
This paper addresses the implementation of a 2-D Discrete Wavelet Transform on general-purpose microprocessors, focusing on both memory hierarchy and SIMD parallelization issues. Both topics are somewhat related, since SIMD extensions are only useful if the memory hierarchy is efficiently exploited. In this work, locality has been significantly improved by means of a novel approach called pipelined computation, which complements previous techniques based on loop tiling and non-linear layouts. As experimental platforms we have employed a Pentium-III (P-III) and a Pentium-4 (P-4) microprocessor. However, our SIMD-oriented tuning has been exclusively performed at source code level. Basically, we have reordered some loops and introduced some modifications that allow automatic vectorization. Taking into account the abstraction level at which the optimizations are carried out, the speedups obtained on the investigated platforms are quite satisfactory, even though further improvement can be obtained by dropping the level of abstraction (compiler intrinsics or assembly code).
Spatial/spectral algorithms have been shown in previous work to be a promising approach to the problem of extracting image endmembers from remotely sensed hyperspectral data. Such algorithms map nicely on high-performance systems such as massively parallel clusters and networks of computers. Unfortunately, these systems are generally expensive and difficult to adapt to onboard data processing scenarios, in which low-weight and low-power integrated components are highly desirable to reduce mission payload. An exciting new development in this context is the emergence of graphics processing units (GPUs), which can now satisfy extremely high computational requirements at low cost. In this letter, we propose a GPU-based implementation of the automated morphological endmember extraction algorithm, which is used in this letter as a representative case study of joint spatial/spectral techniques for hyperspectral image processing. The proposed implementation is quantitatively assessed in terms of both endmember extraction accuracy and parallel efficiency, using two generations of commercial GPUs from NVidia. Combined, these parts offer a thoughtful perspective on the potential and emerging challenges of implementing hyperspectral imaging algorithms on commodity graphics hardware.
Information retrieval from large databases is becoming crucial for many applications in different fields such as content searching in multimedia objects, text retrieval or computational biology. These databases are usually indexed off-line to enable an acceleration of on-line searches. Furthermore, the available parallelism has been exploited using clusters to improve query throughput. Recently some authors have proposed the use of Graphics Processing Units (GPUs) to accelerate brute-force searching algorithms for metric-space databases. In this work we improve existing GPU brute-force implementations and explore the viability of GPUs to accelerate indexing techniques. This exploration includes an interesting discussion about the performance of both brute-force and indexing-based algorithms that takes into account the intrinsic dimensionality of the elements of the database.
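The contrast between brute-force and index-based search in a metric space can be sketched as follows. This is an illustrative example only: the abstract does not name the indexing technique it studies, so the pivot table below is a generic stand-in, and the Manhattan distance, function names, and parameters are all assumptions made for the sketch.

```python
def manhattan(a, b):
    """L1 distance; a true metric, so the triangle inequality holds."""
    return sum(abs(x - y) for x, y in zip(a, b))


def range_search_brute(db, q, r, dist=manhattan):
    """Brute force: evaluate the distance to every database element."""
    return [i for i, x in enumerate(db) if dist(q, x) <= r]


def build_pivot_index(db, pivots, dist=manhattan):
    """Off-line step: store the distance from each element to each pivot."""
    return [[dist(x, p) for p in pivots] for x in db]


def range_search_pivots(db, pivots, table, q, r, dist=manhattan):
    """On-line step: discard elements using the triangle inequality.

    For any pivot p, |d(q, p) - d(x, p)| <= d(q, x), so if that lower
    bound exceeds r the element cannot be within the query radius.
    Returns the hits and the number of full distance evaluations.
    """
    qd = [dist(q, p) for p in pivots]
    hits, evaluated = [], 0
    for i, x in enumerate(db):
        if any(abs(qp - xp) > r for qp, xp in zip(qd, table[i])):
            continue  # pruned without computing dist(q, x)
        evaluated += 1
        if dist(q, x) <= r:
            hits.append(i)
    return hits, evaluated
```

Both searches return the same hits; the pivot version trades a precomputed table for fewer distance evaluations at query time. How much it prunes depends on the data, which is in line with the abstract's remark about intrinsic dimensionality.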
Hyperspectral analysis algorithms exhibit inherent parallelism at multiple levels, and map nicely on high performance systems such as massively parallel clusters and networks of computers. Unfortunately, these systems are generally expensive and difficult to adapt to onboard data processing scenarios, in which low-weight and low-power integrated components are desirable to reduce mission payload. An exciting new development in this field is the emergence of programmable graphics hardware. Driven by the ever-growing demands of the game industry, graphics processing units (GPUs) have evolved from expensive, application-specific units into highly parallel and programmable systems which can satisfy extremely high computational requirements at low cost. In this paper, we investigate GPU-based implementations of a morphological endmember extraction algorithm, which is used as a representative case study of joint spatial/spectral techniques for hyperspectral analysis. The proposed implementations are quantitatively compared and assessed in terms of both endmember extraction accuracy and parallel efficiency. Combined, these parts offer a thoughtful perspective on the potential and emerging challenges of implementing hyperspectral imaging algorithms on commodity graphics hardware.
In this paper we discuss several issues relevant to the vectorization of a 2-D Discrete Wavelet Transform on current microprocessors. Our research is based on previous studies about the efficient exploitation of the memory hierarchy, due to its tremendous impact on performance. We have extended this work with a more detailed analysis based on hardware performance counters and a study of vectorization; in particular, we have used the Intel Pentium SSE instruction set. Most of our optimizations are performed at source code level to allow automatic vectorization, though some compiler intrinsic functions have been introduced to enhance performance. Taking into account the abstraction level at which the optimizations are performed, the results obtained on an Intel Pentium III microprocessor are quite satisfactory, even though further improvement can be obtained by a more extensive use of compiler intrinsics.