39th Annual International Symposium on Computer Architecture (ISCA), 2012
DOI: 10.1109/isca.2012.6237005

Simultaneous branch and warp interweaving for sustained GPU performance

Citing publications span 2013–2023.
Cited by 56 publications (54 citation statements). References 25 publications (17 reference statements).
“…For instance, more flexibility could be obtained using Dynamic Warp Formation [24] or Simultaneous Branch Interweaving [25]; Dynamic Warp Subdivision [9] could improve latency tolerance by allowing threads to diverge on partial cache misses; and Dynamic Scalarization [29] could further unify redundant dataflow across threads.…”
Section: Discussion
confidence: 99%
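
As a rough illustration of the lane underutilization these regrouping schemes target, here is a minimal CUDA sketch (ours, not taken from any of the cited works; it assumes a CUDA 9+ toolchain for `__activemask()`, and the kernel and variable names are hypothetical). Inside one side of a divergent branch it counts how many of each warp's 32 lanes are still active:

```cuda
#include <cstdio>

// Hypothetical measurement kernel: after a data-dependent branch,
// __activemask() reports which lanes of the 32-wide warp reached this
// point. Counting those lanes exposes the SIMD underutilization that
// regrouping schemes such as Dynamic Warp Formation or Simultaneous
// Branch Interweaving try to recover.
__global__ void countActiveLanes(const int *in, int *activeA, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0) {                      // only even-data lanes enter
        unsigned mask = __activemask();        // lanes active on this path
        int lane = threadIdx.x & 31;
        if (lane == __ffs(mask) - 1)           // lowest active lane reports
            atomicAdd(activeA, __popc(mask));  // e.g. 16 of 32 lanes busy
    }
}

int main() {
    const int n = 256;
    int *in, *activeA;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&activeA, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;     // alternating even/odd data
    *activeA = 0;
    countActiveLanes<<<1, n>>>(in, activeA, n);
    cudaDeviceSynchronize();
    // 8 warps, each with only 16 of 32 lanes on the even path: 8*16 = 128.
    printf("active lane-slots on even path: %d\n", *activeA);
    cudaFree(in);
    cudaFree(activeA);
    return 0;
}
```

With this alternating input every warp executes the even path at half its SIMD width; that idle half is exactly the capacity Dynamic Warp Formation or Simultaneous Branch Interweaving would try to fill with threads from other warps or from the other branch path.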
“…They do not support exceptions or interrupts, which prevents their use with a general-purpose system software stack. Various works extend the SIMT model to support more generic code [22], [23] or more flexible execution [24], [25], [26]. However, they all target applications specifically written for GPUs, rather than general-purpose parallel applications.…”
Section: E. Power and Energy
confidence: 99%
“…The major bottleneck of this GPU deployment was control-flow divergence, which is especially penalizing given the GPU's partial SIMD (Single Instruction, Multiple Data) execution. Hardware [9] and software [10], [11] solutions have recently been proposed to address this problem on GPUs. However, these solutions are not efficient in our context, as each GPU thread has a very fine computation grain.…”
Section: Motivations and Contributions
confidence: 99%
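
To make the quoted bottleneck concrete, the sketch below (a hypothetical CUDA kernel of ours, unrelated to the cited solutions [9]-[11]) shows the divergence pattern in question: when lanes of one 32-thread warp take different sides of a data-dependent branch, the SIMD hardware runs the two paths back to back with the inactive lanes masked off, so the warp pays the latency of both.

```cuda
#include <cstdio>

// Hypothetical kernel illustrating control-flow divergence: lanes of a
// 32-thread warp that take different branch directions are serialized,
// i.e. the warp executes path A with the odd-data lanes masked off,
// then path B with the even-data lanes masked off.
__global__ void divergent(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;   // path A: only even-data lanes active
    else
        out[i] = in[i] + 1;   // path B: only odd-data lanes active
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;   // alternating even/odd data
    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0]=%d out[1]=%d\n", out[0], out[1]);  // 0 and 2
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With this alternating input each warp runs both paths at half width; the finer the per-thread computation grain, the larger the fraction of runtime lost to such masked-off execution.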
“…Resource underutilization due to branch divergence or thread-level divergence has been well studied [2]. Another reason for resource underutilization, however, is TB-level resource management. Shared memory multiplexing [26] targets shared memory management and is complementary to our proposed WarpMan scheme.…”
Section: Related Work
confidence: 99%