2012 IEEE 26th International Parallel and Distributed Processing Symposium 2012
DOI: 10.1109/ipdps.2012.20

Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth

Abstract: Architectures that aggressively exploit SIMD often have many datapaths execute in lockstep and use multithreading to hide latency. They can yield high throughput in terms of area- and energy-efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike software-managed streaming memories used in conventional…

Cited by 19 publications (17 citation statements)
References 31 publications
“…Most of these works require the reorganization of threads every cycle and introduce hardware overhead to support such reorganization. Several works studied the impact of SIMD width on performance [Meng et al. 2012; Lashgar et al. 2012] and concluded that large warps improve the memory access coalescing but suffer more synchronization overhead, memory divergences, and branch divergences. Their evaluations showed that most benchmarks exhibited performance degradation when running large warps where NK-threaded warps running on K-wide SIMD lanes.…”
Section: Related Work (mentioning)
confidence: 99%
“…As shown in Figure 9, the Fermi GPU chip has roughly 23mm × 23mm die size and is manufactured in 40nm technology [Valich 2010], so we estimate that the communication stage takes one-cycle latency to traverse the distance between two adjacent SMs, which is roughly 5mm long. The delay of 5mm wire is 0.3ns (3.3GHz) reported by CACTI 6.0 [Muralimanohar et al 2009]. We clock the NoC at 1.4GHz, which not only meets one cycle latency but also provides headroom for low-swing voltage operation that causes 0.6ns wire delay and 0.1ns signal regeneration that still meets timing.…”
Section: Communications In Clusters (mentioning)
confidence: 99%
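The timing argument in the statement above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the figures quoted there (0.3 ns delay for a 5 mm wire, a 1.4 GHz NoC clock, 0.6 ns low-swing wire delay plus 0.1 ns signal regeneration). The function and constant names are illustrative assumptions, not taken from the citing paper.

```python
def clock_period_ns(freq_ghz: float) -> float:
    """Return the clock period in nanoseconds for a frequency given in GHz."""
    return 1.0 / freq_ghz


def meets_one_cycle(delay_ns: float, freq_ghz: float) -> bool:
    """True if a signal with the given delay arrives within one clock period."""
    return delay_ns <= clock_period_ns(freq_ghz)


# Figures quoted in the citation statement (names are hypothetical).
NOC_CLOCK_GHZ = 1.4          # NoC clock frequency
FULL_SWING_DELAY_NS = 0.3    # delay of a 5 mm wire
LOW_SWING_DELAY_NS = 0.6     # wire delay under low-swing voltage operation
REGEN_DELAY_NS = 0.1         # signal regeneration delay

period = clock_period_ns(NOC_CLOCK_GHZ)                # roughly 0.714 ns
max_full_swing_ghz = 1.0 / FULL_SWING_DELAY_NS         # roughly 3.3 GHz
low_swing_total = LOW_SWING_DELAY_NS + REGEN_DELAY_NS  # roughly 0.7 ns

full_swing_ok = meets_one_cycle(FULL_SWING_DELAY_NS, NOC_CLOCK_GHZ)
low_swing_ok = meets_one_cycle(low_swing_total, NOC_CLOCK_GHZ)

print(f"clock period at {NOC_CLOCK_GHZ} GHz: {period:.3f} ns")
print(f"full-swing wire fits in one cycle: {full_swing_ok}")
print(f"low-swing wire plus regeneration fits in one cycle: {low_swing_ok}")
```

Under these figures the low-swing path leaves only about 0.014 ns of slack against the 1.4 GHz period, which is consistent with the statement that it "still meets timing".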
“…Similarly, long wavefronts can be time-sliced to execute on these narrow SIMD units or wide SIMDs as usual. Prior research has also investigated using a dynamic approach adapting SIMD width [30].…”
Section: B. Heterogeneous SIMDs (mentioning)
confidence: 99%
“…However, practically speaking, only few of those components are executing code at a time; and depending on the benchmark, some components are exercised more than others. To better estimate the energy consumption breakdown in the processor during typical benchmarks, we used the architectural simulator gem5 [35] to count the number of instructions that triggers the different components during the execution of several benchmarks used before in [37]. Those benchmarks are extracted from MineBench [38], SPLASH-2 [39], and Rodinia [40].…”
Section: Overall Energy (mentioning)
confidence: 99%