2013
DOI: 10.1145/2514641.2514644

Multicore-based vector coprocessor sharing for performance and energy gains

Abstract: For most of the applications that make use of a dedicated vector coprocessor, its resources are not highly utilized due to the lack of sustained data parallelism which often occurs due to vector-length variations in dynamic environments. The motivation of our work stems from: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the need…

Cited by 14 publications (7 citation statements); references 15 publications.
“…The major differences of this paper from our earlier work [15,20] are: (a) Actual implementations on an FPGA using synthesizable VHDL (instead of higher-level SystemVerilog); (b) a larger number of benchmarked applications involving also many more scenarios; (c) the inclusion of fused multiply-add (MADD) and divide (VDIV) instructions in the vector lanes; (d) the production of power and energy consumption results followed by a relevant analysis; (e) scalability analysis for various configurations of the vector coprocessor involving 2, 4, 8, 16 and 32 lanes; (f) performance results for random scenarios involving two threads that contain vector kernels interleaved with idle times; and (g) synthesis frequency scalability analysis.…”
Section: Introduction
confidence: 65%
“…This context resembles fine-grain multithreading in superscalar processors, and increased throughput is expected because there are no data dependencies between instructions coming from different processors. More details about our VP architecture and Scheduler can be found in [20]. Table 2 shows resource consumption figures for our VP with 8 lanes and 8 memory banks configuration implemented in the Virtex XC5VLX100T FPGA device.…”
Section: Scheduling Procedures
confidence: 99%
“…“Ideal” times are obtained by removing any MB delay in issuing instructions to the VP. “Ideal without private memories” times are similar to ideal but, instead of having a private memory in each lane, each lane has access to all memory banks in the vector memory using a crossbar that connects lanes to memories (similar to the architecture in [13]). Under the worst case scenario for vector load and store instructions, only one element per clock cycle can be transferred between the lanes and the vector memory.…”
Section: Comparison With Prior Work
confidence: 99%
“…However, these vector-oriented designs do not address: a) the need to share resources in multicores for higher utilization while releasing silicon for the implementation of more cores or the enhancement of existing cores; b) runtime resource management of vector resources assigned to the cores since the collective needs of simultaneously running applications are normally in a fluid state; and c) runtime energy saving techniques that take into account individual application needs for vector processing [12,13].…”
Section: Introduction
confidence: 99%
“…VP sharing increases efficiency and lowers energy consumption. We present here the 40nm ASIC VP realization of a shared VP design that we first proposed in [12,13] in order to demonstrate its feasibility, and also investigate interesting design tradeoffs for embedded-system implementations. Sections II and III summarize the shared VP architecture and the ASIC design flow.…”
Section: Introduction
confidence: 99%