Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP

Kalathingal, Sajith; Collange, Sylvain; Swamy, Bharath Narasimha; Seznec, André

doi:10.1109/sbac-pad.2016.11

Cited by 4 publications

(2 citation statements)

References 34 publications


“…In addition, it does not vectorize loads and stores. Other approaches are adopted by Kalathingal et al [Kalathingal et al 2016], which use instructions present in different threads to compose large shared vector instructions. Stephens et al [Stephens et al 2017] present a set of size-agnostic vector instructions.…”

Section: Related Work

mentioning

confidence: 99%

On the SPEC-CPU 2017 opportunities for dynamic vectorization possibilities on PIM architectures

Sokulski, Santos, Alves

2022

Anais Do XXIII Simpósio Em Sistemas Computacionais De Alto Desempenho (SSCAD 2022)


Processing-In-Memory (PIM) devices usually implement vector instructions to efficiently utilize the large main memory bandwidth. One possible way to vectorize applications for such PIM systems is to convert CPU instructions into PIM vector instructions dynamically. In this work, we present a study on the feasibility of the dynamic conversion between these instructions for the Vector-In-Memory Architecture (VIMA). Our results show that 24% of the loops from some SPEC-CPU 2017 applications are suitable for this conversion. Furthermore, we conclude that dynamic conversion mechanisms must be able to deal efficiently with memory access conflicts, a problem present in 99% of all possible conversions to VIMA.


“…branch unit. Evaluations for all baselines consider a multi-core environment by simulating the processors using a throughput-limited DRAM memory of 2GB/s per core, representative of current multi-core systems [35]. Several configurations for the ConSSTEP architecture were also tested, as detailed in the forthcoming chapter. Of the FUs specified in the data input file, half the FUs are assigned complex integer functionality by the simulator, where the rest are assigned basic ALU functionality.…”

mentioning

confidence: 99%

Configurable simultaneously single-threaded (multi-)engine processor

Tino

2021

Preprint


As the multi-core computing era continues to progress, the need to increase single-thread performance, throughput, and seemingly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP is able to increase single-thread Instructions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.


DITVA: Dynamic Inter-Thread Vectorization Architecture

Kalathingal

Collange

Swamy

et al. 2018

Journal of Parallel and Distributed Computing

Self Cite


In the Single-Program Multiple-Data (SPMD) programming model, threads of an application exhibit very similar control flows and often execute the same instructions, but on different data. In this paper, we propose the Dynamic Inter-thread Vectorization Architecture (DITVA) to leverage the implicit Data Level Parallelism that exists across threads on SPMD applications. By assembling dynamic vector instructions at runtime, DITVA extends an in-order SMT processor with a dynamic inter-thread vector execution mode akin to the Single-Instruction, Multiple-Thread model of Graphics Processing Units. In this mode, multiple scalar threads running in lockstep share a single instruction stream and their respective instruction instances are aggregated into SIMD instructions. DITVA can leverage existing SIMD units and maintains binary compatibility with existing CPU architectures. To balance thread- and data-level parallelism, threads are statically grouped into fixed-size independently scheduled warps. Additionally, to maximize dynamic vectorization opportunities, we adapt the fetch steering policy to favor thread synchronization within warps and thus improve lockstep execution. Our experimental evaluation of the DITVA architecture on the SPMD applications from the PARSEC and Rodinia OpenMP benchmarks shows that a 4-warp × 4-lane 4-issue DITVA architecture with a realistic bank-interleaved cache achieves 1.55× higher performance compared to a 4-thread 4-issue SMT architecture with AVX instructions, while fetching and issuing 51% fewer instructions, and achieving an overall 24% energy reduction. DITVA also enables applications limited by memory to scale with higher bandwidth architectures. For instance, when the bandwidth is increased from 2GB/s to 16GB/s, we find that memory bound applications show an improvement in performance by 3× in comparison with the baseline SMT. Therefore, DITVA appears as a cost-effective design for achieving very high single-core performance on SPMD parallel sections.
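The aggregation step described in this abstract can be illustrated with a toy software model. The sketch below is only an assumption-laden illustration, not the hardware mechanism from the paper: the function name `form_vector_instructions`, the `warp_size` parameter, and the example program counters are all hypothetical. It groups threads statically into fixed-size warps and, within each warp, merges the threads that currently sit at the same program counter into a single vector instruction.

```python
from collections import defaultdict


def form_vector_instructions(thread_pcs, warp_size=4):
    """Toy model (hypothetical): merge same-PC threads of each warp.

    thread_pcs[i] is the current program counter of thread i.
    Returns a list of (pc, lanes) pairs, one per issued instruction.
    """
    issued = []
    for start in range(0, len(thread_pcs), warp_size):
        # Static, fixed-size grouping of threads into a warp.
        warp = range(start, min(start + warp_size, len(thread_pcs)))
        by_pc = defaultdict(list)
        for tid in warp:
            by_pc[thread_pcs[tid]].append(tid)
        # Threads in lockstep (same PC) share one instruction.
        for pc, lanes in sorted(by_pc.items()):
            issued.append((pc, tuple(lanes)))
    return issued


# Assumed example: 8 threads, 2 warps; thread 6 has diverged from its warp.
pcs = [0x40, 0x40, 0x40, 0x40, 0x80, 0x80, 0x84, 0x80]
groups = form_vector_instructions(pcs)
for pc, lanes in groups:
    print(f"pc={pc:#x} lanes={lanes}")
print(f"{len(groups)} instructions issued instead of {len(pcs)} scalar ones")
```

In this example the first warp is fully in lockstep and issues one 4-lane instruction, while the second warp issues two because one thread has diverged. That divergence is the kind of lost opportunity that a fetch policy favoring resynchronization within a warp would aim to reduce.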

