Shader Performance Analysis on a Modern GPU Architecture

Moya, Victor; Gonzalez, C. Diez; Roca, Josep; Fernandez, A.; Espasa, Roger

doi:10.1109/micro.2005.30

Cited by 34 publications

(19 citation statements)

References 11 publications


“…For instance, the vertex and fragment processing stages include several replicated units known as vertex processor (VP) and fragment processor (FP), respectively. The overall idea is that the GPU launches a thread per incoming vertex (or per group of fragments), which is dispatched to an idle processor [43], [45].…”

Section: The Graphics Pipeline (mentioning)

confidence: 99%

Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting

Tenllado, Setoaín, Prieto, et al. 2008

IEEE Trans. Parallel Distrib. Syst.


The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.


“…The 3D games were taken from the ATTILA project webpage [ATTILA traces 2011]. The traces were gathered for frames 100 to 139 and 200 to 239 for each benchmark, similar to [Moya et al 2005]. The traces are listed in Table IV.…”

Section: Experimental Framework (mentioning)

confidence: 99%

Utilizing RF-I and intelligent scheduling for better throughput/watt in a mobile GPU memory system

Therdsteerasukdi, Byun, Cong, et al. 2012

ACM Trans. Archit. Code Optim.


Smartphones and tablets are becoming more and more powerful, replacing desktops and laptops as the users' main computing system. As these systems support higher and higher resolutions with more complex 3D graphics, a high throughput and low power memory system is essential for the mobile GPU. In this article, we propose to improve throughput/watt in a mobile GPU memory system by using intelligent scheduling to reduce power, and multi-band radio frequency interconnect (MRF-I) to offset any throughput degradation caused by our intelligent scheduling. Overall, we are able to improve throughput 17% up to 66% while increasing throughput per watt by an average of 18% up to 26%.


“…This architecture is referred to as the unified shader model. Another fundamental aspect that GPUs exploit in order to speed up the rendering process is parallelization. The graphics pipeline is devised in such a way that interdependencies are avoided as much as possible, so performance can be easily increased by replicating hardware.…”

Section: Introduction (mentioning)

confidence: 99%

Area-delay trade-offs of texture decompressors for a graphics processing unit

Súñer, Ituero, López‐Vallejo 2011

VLSI Circuits and Systems V


Graphics Processing Units have become a booster for the microelectronics industry. However, due to intellectual property issues, there is a serious lack of information on implementation details of the hardware architecture that is behind GPUs. For instance, the way texture is handled and decompressed in a GPU to reduce bandwidth usage has never been dealt with in depth from a hardware point of view. This work addresses a comparative study on the hardware implementation of different texture decompression algorithms for both conventional (PCs and video game consoles) and mobile platforms. Circuit synthesis is performed targeting both a reconfigurable hardware platform and a 90nm standard cell library. Area-delay trade-offs have been extensively analyzed, which allows us to compare the complexity of decompressors and thus determine suitability of algorithms for systems with limited hardware resources.

