2010
DOI: 10.1007/978-3-642-14122-5_8

Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations

Abstract: We present a hardware mechanism which dynamically detects uniform and affine vectors used in Graphics Processing Units, to minimize pressure on the register file and reduce power consumption with minimal architectural modifications. A preliminary experimental analysis conducted with a simulator shows that this optimization can benefit up to 34% of register file reads and 22% of the computations of GPGPU applications.
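A uniform vector holds the same value in every lane of a warp, while an affine vector holds values of the form base + lane_id * stride. As a rough illustration of the property the proposed hardware detects at run time, the following host-side sketch classifies one warp-wide register (illustrative only; the function names and the 32-lane warp width are assumptions, not taken from the paper, and real hardware would track the tag incrementally rather than scan the register afterwards):

#include <cstdint>
#include <cstdio>

enum class LaneClass { Uniform, Affine, Generic };

// Classify the 32 lane values of one warp register.
// Uniform: every lane holds the same value.
// Affine:  lane i holds base + i * stride for some constant stride.
// (Assumed 32-lane warp; hypothetical names, not from the paper.)
LaneClass classify_warp_register(const int32_t lanes[32]) {
    bool uniform = true, affine = true;
    const int32_t stride = lanes[1] - lanes[0];
    for (int i = 1; i < 32; ++i) {
        if (lanes[i] != lanes[0])              uniform = false;
        if (lanes[i] - lanes[i - 1] != stride) affine  = false;
    }
    if (uniform) return LaneClass::Uniform;   // stride == 0 is also affine
    if (affine)  return LaneClass::Affine;    // e.g. base + threadIdx * 4
    return LaneClass::Generic;
}

int main() {
    int32_t uni[32], aff[32];
    for (int i = 0; i < 32; ++i) { uni[i] = 7; aff[i] = 100 + 4 * i; }
    printf("uniform: %d, affine: %d\n",
           classify_warp_register(uni) == LaneClass::Uniform,
           classify_warp_register(aff) == LaneClass::Affine);
    return 0;
}

A register tagged uniform or affine can be stored once per warp (a single value, or a base plus a stride) instead of once per lane, which is where the register-file read and computation savings reported in the abstract come from.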

Cited by 28 publications (30 citation statements)
References 8 publications (6 reference statements)
“…On the contrary, close inter-thread locality would be harmful in the context of multi-core platforms with coherent private caches, by causing false sharing of cache lines. Collange has observed a substantially different behavior in GPGPU applications [18]. In that case, inter-thread proximity is much more common, as this type of locality contributes notably to performance improvements.…”
Section: Memory Access Patterns (mentioning)
confidence: 99%
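As an illustration of the inter-thread proximity discussed in that statement, consider the following hypothetical kernel (not code from either cited paper): adjacent threads of a warp touch adjacent words of the same cache line, so on a GPU the accesses coalesce into a few wide memory transactions, whereas the same pattern spread across CPU cores with coherent private caches would cause false sharing.

__global__ void scale(float *a, float s, int n) {
    // Consecutive threads access consecutive elements: the lanes of one
    // warp read neighboring words within a single cache line / segment.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = s * a[i];
}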
“…For instance, more flexibility could be obtained using Dynamic Warp Formation [24] or Simultaneous Branch Interweaving [25]; Dynamic Warp Subdivision [9] could improve latency tolerance by allowing threads to diverge on partial cache misses; and Dynamic Scalarization [29] could further unify redundant dataflow across threads.…”
Section: Discussion (mentioning)
confidence: 99%
“…MMT and Execution Drafting primarily target data-flow redundancy. DITVA targets control-flow redundancy, although it could be extended to exploit data-flow redundancy through dynamic scalarization techniques proposed for SIMT [29]. Both MMT and Execution Drafting seek to run all threads together in lockstep as much as possible.…”
Section: E. Power and Energy (mentioning)
confidence: 99%
“…The first is the instructions that are not dependent on the thread id. These operations are scalar in nature and are also referred to as uniform vector operations [6]. The second is a special case of control divergence, where the SIMT kernel has the following 'if' statement: 'if (threadIdx.x == K) {…}' where K is a constant.…”
Section: Collaborative Execution Paradigm III: Scalar Workload (mentioning)
confidence: 99%
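A minimal kernel illustrating the two cases described in that statement (a hypothetical example; the kernel name, parameters, and the choice K = 0 are assumptions, not taken from the cited paper):

__global__ void example(const float *in, float *out, float *block_flag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Case 1: a computation that does not depend on the thread id.
    // Every lane of the warp produces the same value, so this is a scalar
    // (uniform vector) operation that could be executed once per warp.
    float scale = 0.5f * (float)n;

    if (i < n)
        out[i] = scale * in[i];

    // Case 2: control divergence where only a single thread id takes the
    // branch, i.e. 'if (threadIdx.x == K)' with K = 0 here.
    if (threadIdx.x == 0)
        block_flag[blockIdx.x] = scale;
}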