Flexible software profiling of GPU architectures

Stephenson, Mark W.; Hari, Siva Kumar Sastry; Lee, Yunsup; Ebrahimi, Eiman; Johnson, D.; Nellans, David; O’Connor, Mike; Keckler, Stephen W.

doi:10.1145/2872887.2750375

Cited by 26 publications

(21 citation statements)

References 21 publications


“…It allows users to practically assess the impact of errors on GPU applications. SASSIFI is based on the SASSI GPU assembly language instrumentation tool also developed by the NVIDIA Architecture Research Group [21]. Although not an official part of the CUDA software toolkit, SASSI and SASSIFI are research prototypes which provide a selective instrumentation framework for NVIDIA GPU applications.…”

Section: GPU Application Error Resilience Testing (mentioning)

confidence: 99%

Error Resilient GPU Accelerated Image Processing for Space Applications

Davidson

Bridges

2018

IEEE Trans. Parallel Distrib. Syst.


Abstract: Significant advances in spaceborne imaging payloads have resulted in new big data problems in the Earth Observation (EO) field. These challenges are compounded onboard satellites due to a lack of equivalent advancement in onboard data processing and downlink technologies. We have previously proposed a new GPU accelerated onboard data processing architecture and developed parallelised image processing software to demonstrate the achievable data processing throughput and compression performance. However, the environmental characteristics are distinctly different to those on Earth, such as available power and the probability of adverse single event radiation effects. In this paper, we analyse new performance results for a low power embedded GPU platform, investigate the error resilience of our GPU image processing application and offer two new error resilient versions of the application. We utilise software based error injection testing to evaluate data corruption and functional interrupts. These results inform the new error resilient methods that also leverage GPU characteristics to minimise time and memory overheads. The key results show that our targeted redundancy techniques reduce the data corruption from a probability of up to 46% to now less than 2% for all test cases, with a typical execution time overhead of 130%.


“…While non-unrolled vector addition is a simple example, address generation also makes up a significant amount of total dynamic instructions across complex and optimized code. To further illustrate this concept, we instrumented a set of CUDA benchmarks (Section 7.3) using SASSI (Stephenson et al 2015) to generate dynamic instruction execution histograms by instruction PC. In order to allocate integer instructions into the address generation (Agen), control (Control), and compute (Compute_Int, Compute_FP) categories, we further performed a backtrace using the source registers of relevant instructions.…”

Section: Address Generation Overheads (mentioning)

confidence: 99%

Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs

Crago

Stephenson

Keckler

2018

ACM Trans. Archit. Code Optim.

Self Cite


Modern computing workloads often have high memory intensity, requiring high bandwidth access to memory. The memory request patterns of these workloads vary and include regular strided accesses and indirect (pointer-based) accesses. Such applications require a large number of address generation instructions and a high degree of memory-level parallelism. This article proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address calculation instructions by offloading addressing to dedicated hardware, and reduce destructive memory request interference by grouping related requests together. Our results show that we can eliminate 33% of dynamic instructions across 16 GPU benchmarks. These improvements result in an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%. CCS Concepts: • Computer systems organization → Parallel architectures;


“…First, modern architectures are giving unprecedented insight into the GPU's hardware activity; for example, NVIDIA Tesla architectures can access over 200 counters and metrics. Furthermore, we expect the trend of increased GPU hardware transparency to continue; for example, new research to create more flexible tools for profiling of GPU hardware events is underway (see SASSI of Stephenson et al [2015]). Yet most power modeling research seeks to learn model parameters from power observations of ∼50 benchmarks.…”

Section: Conclusion and Future Research Directions (mentioning)

confidence: 99%

“…are a few questions beginning to surface in the research. Additional research is probing how to apply the power modeling research for an optimal balance of power and performance (e.g., see Jia et al [2015]) and pioneering flexible profiling tools for monitoring GPU processes (e.g., see Stephenson et al [2015]). Only in the most recent architectures are a large number of the GPU hardware events observable, and how to harness these for accurate understanding of power is thinly addressed.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding GPU Power

2016


Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high-throughput applications. Although GPUs consume large amounts of power, their use for high-throughput applications facilitates state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. This work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. As direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling are discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Last, possible directions for future research are discussed.

