Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth

Meng, Jiayuan; Sheaffer, Jeremy W.; Skadron, Kevin

doi:10.1109/ipdps.2012.20

Cited by 19 publications

(17 citation statements)

References 31 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Most of these works require the reorganization of threads every cycle and introduce hardware overhead to support such reorganization. Several works studied the impact of SIMD width on performance [Meng et al 2012;Lashgar et al 2012] and concluded that large warps improve the memory access coalescing but suffer more synchronization overhead, memory divergences, and branch divergences. Their evaluations showed that most benchmarks exhibited performance degradation when running large warps where NK-threaded warps running on K-wide SIMD lanes.…”

Section: Related Workmentioning

confidence: 99%

“…As shown in Figure 9, the Fermi GPU chip has roughly 23mm × 23mm die size and is manufactured in 40nm technology [Valich 2010], so we estimate that the communication stage takes one-cycle latency to traverse the distance between two adjacent SMs, which is roughly 5mm long. The delay of 5mm wire is 0.3ns (3.3GHz) reported by CACTI 6.0 [Muralimanohar et al 2009]. We clock the NoC at 1.4GHz, which not only meets one cycle latency but also provides headroom for low-swing voltage operation that causes 0.6ns wire delay and 0.1ns signal regeneration that still meets timing.…”

Section: Communications In Clustersmentioning

confidence: 99%

See 1 more Smart Citation

Buddy SM

Zhang

Jing

Jiang

et al. 2015

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

A modern general-purpose graphics processing unit (GPGPU) usually consists of multiple streaming multiprocessors (SMs), each having a pipeline that incorporates a group of threads executing a common instruction flow. Although SMs are designed to work independently, we observe that they tend to exhibit very similar behavior for many workloads. If multiple SMs can be grouped and work in the lock-step manner, it is possible to save energy by sharing the front-end units among multiple SMs, including the instruction fetch, decode, and schedule components. However, such sharing brings architectural challenges and sometime causes performance degradation. In this article, we show our design, implementation, and evaluation for such an architecture, which we call Buddy SM. Specifically, multiple SMs can be opportunistically grouped into a buddy cluster. One SM becomes the master, and the rest become the slaves. The front-end unit of the master works actively for itself as well as for the slaves, whereas the front-end logics of the slaves are power gated. For efficient flow control and program correctness, the proposed architecture can identify unfavorable conditions and ungroup the buddy cluster when necessary. We analyze various techniques to improve the performance and energy efficiency of Buddy SM. Detailed experiments manifest that 37.2% front-end and 7.5% total GPU energy reduction can be achieved. This article is extended from a six-page conference paper entitled "Dynamic Front-End Sharing in Graphics Processing Units," presented at the 32nd IEEE International Conference on Computer Design (ICCD). We do the further research and extend the conference paper in several directions: -We propose an adaptive buddy cluster formation technique that can improve performance and save an average of 10% more energy than that in the conference paper (Sections 4.8 and 6.3). -We construct simple large warp architectures and an eight-buddy architecture to compare with two-buddy and four-buddy architectures (Section 6.5). We evaluate additional regrouping strategies (Sections 4.6 and 6.2). -We run additional experiments to evaluate the characteristics of the benchmarks and categorize them so that we can thoroughly analyze and explain the final performance results (Section 5.2). -We perform new experiments to investigate the direct impact of the Buddy SM architecture on applications:issuing stalls and the memory access latency (Section 6.4). -We add a background section to introduce the components in the SM front-end and the mechanism of instruction issue to help readers understanding of this work better (Section 3).

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Communications In Clustersmentioning

confidence: 99%

Buddy SM

Zhang

Jing

Jiang

et al. 2015

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

show abstract

“…Similarly, long wavefronts can be time-sliced to execute on these narrow SIMD units or wide SIMDs as usual. Prior research has also investigated using a dynamic approach adapting SIMD width [30].…”

Section: B Heterogeneous Simdsmentioning

confidence: 99%

Pannotia: Understanding irregular GPGPU graph applications

Che

Beckmann

Reinhardt

et al. 2013

2013 IEEE International Symposium on Workload Characterization (IISWC)

Self Cite

146

View full text Add to dashboard Cite

Abstract-GPUs have become popular recently to accelerate general-purpose data-parallel applications. However, most existing work has focused on GPU-friendly applications with regular data structures and access patterns. While a few prior studies have shown that some irregular workloads can also achieve speedups on GPUs, this domain has not been investigated thoroughly.Graph applications are one such set of irregular workloads, used in many commercial and scientific domains. In particular, graph mining -as well as web and social network analysis-are promising applications that GPUs could accelerate. However, implementing and optimizing these graph algorithms on SIMD architectures is challenging because their data-dependent behavior results in significant branch and memory divergence.To address these concerns and facilitate research in this area, this paper presents and characterizes a suite of GPGPU graph applications, Pannotia, which is implemented in OpenCL and contains problems from diverse and important graph application domains. We perform a first-step characterization and analysis of these benchmarks and study their behavior on real hardware. We also use clustering analysis to illustrate the similarities and differences of the applications in the suite. Finally, we make architectural and scheduling suggestions that will improve their execution efficiency on GPUs.

show abstract

“…However, practically speaking, only few of those components are executing code at a time; and depending on the benchmark, some components are exercised more than others. To better estimate the energy consumption breakdown in the processor during typical benchmarks, we used the architectural simulator gem5 [35] to count the number of instructions that triggers the different components during the execution of several benchmarks used before in [37]. Those benchmarks are extracted from MineBench [38], SPLASH-2 [39], and Rodinia [40].…”

Section: Overall Energymentioning

confidence: 99%

Maximizing Energy Efficiency through Vertically Integrated Dynamic Reconfigurability

Arrabi¹

View full text Add to dashboard Cite

Power and energy consumption have become some of the main hurdles for performance advancement in modern digital systems. High transistor densities and clock speeds have created significant thermal and power delivery issues, and increasingly pervasive portable electronic devices have limited energy sources but high performance expectations. Power affects and is affected by almost every aspect of digital system design, enticing designers and researchers to approach this problem from every possible aspect.This work explores two such aspects: dynamic reconfigurability and vertical integration. Dynamic reconfigurability is the ability to change the configuration of the system dynamically during runtime, and vertical integration is using low-level design information in high-level decisions (and vice-versa) as well as co-optimizing multiple design levels simultaneously. Both of these concepts can be targeted to various system metrics; in this work, they are used to reduce the total energy consumption of the system with the goal of increasing system battery lifetime while still providing the necessary performance and functional capabilities.This work explores dynamic reconfigurability and vertical integration through the use of three practical systems. We go through the steps and analysis in designing these systems that will show the potential benefits of our methods. Through the steps of designing the systems, we show the impact of several significant aspects of vertically integrated dynamically configurable systems, thus providing a framework for identifying potential benefits of reconfigurability and co-optimization and for general guidelines of how to implement them on other systems. We also look at the granularity of configurability and the overhead that it incurs, as well as the circuit-level details and information that might affect and be affected by architectural or system-level decisions.The first system is a wireless body sensor node that consumes most of its energy wirelessly transmitting data. We investigate how adaptive software can compress the data before sending without sacrificing information using only the limited processing and memory resources that typically reside on body sensors. We go through the various aspects that affect our solution and the different design levels we have to take into account when applying the compression.The second example system is a custom digital signal-processing system utilizing Panoptic Dynamic Voltage Scaling (PDVS), in which we investigate adding fine-grained spatial and temporal granularities of voltage scaling to the processor. We use a vertically integrated approach to explore the trade-offs of the ability to switch the voltage of a single arithmetic component (spatial) for a single operation (temporal), as well as the different variables that need to be considered for scheduling such systems.The third system is a Field Programmable Core Array (FPCA), in which we investigate various methods for creating a configurable processor capable of switching betwee...

show abstract

Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth

Cited by 19 publications

References 31 publications

Buddy SM

Buddy SM

Pannotia: Understanding irregular GPGPU graph applications

Maximizing Energy Efficiency through Vertically Integrated Dynamic Reconfigurability

Contact Info

Product

Resources

About