2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2016
DOI: 10.1109/sbac-pad.2016.11

Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP

Abstract: Threads of Single-Program Multiple-Data (SPMD) applications often execute the same instructions on different data. We propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage this implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA extends a SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. In this mode, multiple scalar threads running in lockstep share a single instruction stream …
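
The idea of assembling vector instructions from threads that run in lockstep can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the DITVA hardware design: the function names `form_dynamic_vectors` and `issue_add`, the use of a program counter as the grouping key, and the register lists are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def form_dynamic_vectors(thread_pcs):
    """Group thread ids by their current program counter (PC).

    Assumption for this toy model: threads that share a PC are about to
    execute the same instruction, so they can be issued together as one
    vector instruction with one lane per thread.
    """
    groups = defaultdict(list)
    for tid, pc in enumerate(thread_pcs):
        groups[pc].append(tid)
    return dict(groups)


def issue_add(groups, regs_a, regs_b):
    """Issue one 'add' per PC group, computing one lane per member thread."""
    results = {}
    for pc, lanes in groups.items():
        # One instruction fetch/decode serves every thread in the group.
        results[pc] = [regs_a[t] + regs_b[t] for t in lanes]
    return results


# Four scalar threads: threads 0, 1 and 3 are in lockstep, thread 2 has diverged.
pcs = [0x40, 0x40, 0x44, 0x40]
groups = form_dynamic_vectors(pcs)
sums = issue_add(groups, [1, 2, 3, 4], [10, 20, 30, 40])
```

In this model the three threads at the same PC share a single issued instruction, while the diverged thread is issued on its own; how a real design detects and restores lockstep execution is outside what this sketch covers.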

Cited by 4 publications (2 citation statements)
References 34 publications
“…In addition, it does not vectorize loads and stores. Other approaches are adopted by Kalathingal et al [Kalathingal et al 2016], which use instructions present in different threads to compose large shared vector instructions. Stephens et al [Stephens et al 2017] present a set of size-agnostic vector instructions.…”
Section: Related Work
mentioning
confidence: 99%
“…branch unit. Evaluations for all baselines consider a multi-core environment by simulating the processors using a throughput-limited DRAM memory of 2GB/s per core, representative of current multi-core systems [35]. Several configurations for the ConSSTEP architecture were also tested, as detailed in the forthcoming chapter. Of the FUs specified in the data input file, half the FUs are assigned complex integer functionality by the simulator, where the rest are assigned basic ALU functionality.…”
mentioning
confidence: 99%