2012
DOI: 10.1145/2086696.2086711

Improving performance of nested loops on reconfigurable array processors

Abstract: Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained Reconfigurable Architectures (CGRAs) used as a coprocessor to a main processor, pipeline setup can take much longer due to the communication delay between the two processors, and can become significant if it is repeated in an outer loop of a loop nest. In this paper w…
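
The abstract's argument reduces to a simple cost model: if the CGRA kernel is re-launched from the main processor on every outer-loop iteration, the setup cost is paid once per iteration instead of once per loop nest. A minimal sketch of that arithmetic follows; the timing values and names (t_setup, t_kernel, n_outer) are illustrative assumptions, not figures from the paper.

```c
/* Minimal cost-model sketch (not from the paper): if a pipelined CGRA kernel
 * is re-launched from the host on every outer-loop iteration, the setup cost
 * is paid n_outer times. All parameter values below are made-up placeholders. */
#include <stdio.h>

int main(void) {
    double t_setup  = 5000.0;  /* cycles: host<->CGRA communication + configuration (assumed) */
    double t_kernel =  800.0;  /* cycles: steady-state kernel time of one innermost loop (assumed) */
    int    n_outer  =  256;    /* outer-loop trip count (assumed) */

    double setup_every_iter = n_outer * (t_setup + t_kernel);  /* setup repeated in the outer loop */
    double setup_once       = t_setup + n_outer * t_kernel;    /* setup hoisted out of the loop nest */

    printf("setup repeated per outer iteration: %.0f cycles\n", setup_every_iter);
    printf("setup paid once for the loop nest:  %.0f cycles\n", setup_once);
    printf("slowdown from repeated setup: %.2fx\n", setup_every_iter / setup_once);
    return 0;
}
```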

Cited by 26 publications (23 citation statements); references 21 publications.

Citation statements (ordered by relevance):
“…Besides the hardware structure of the CGRA, the configuration process plays an increasingly important role in improving performance and reducing power consumption [10,11,12]. The proposed CGRA features dynamic configuration, in which no further configuration contexts are updated during its computation.…”
Section: Hierarchical Context Cache Structure (mentioning)
confidence: 99%
“…Several software pipelining techniques [4]-[7] have been proposed for CGRAs. For example, in [4], simulated annealing-based modulo scheduling is used to map loops onto CGRAs for the first time.…”
Section: Related Work (mentioning)
confidence: 99%
“…This is because the existing methods [4]-[7] are only good at pipelining a single loop level, i.e., the innermost loop of the nested loop. Although they can find the best II for the innermost loops, they lead to a poor PE utilization rate when the parallelism of the innermost loop is insufficient compared with the CGRA resources.…”
Section: Introduction (mentioning)
confidence: 99%
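
The utilization point in this snippet can be made concrete with the standard resource-constrained lower bound on the initiation interval, ResMII = ceil(ops_per_iteration / num_PEs). The sketch below uses made-up sizes (64 PEs, 20 operations per innermost-loop iteration) purely for illustration; they are not taken from the cited papers.

```c
/* Back-of-the-envelope sketch (illustrative numbers, not from the cited work):
 * when only the innermost loop is software-pipelined, the achievable II is
 * bounded below by ResMII = ceil(ops_per_iter / num_pes), and utilization of a
 * large CGRA drops when the innermost loop carries too few operations. */
#include <stdio.h>

int main(void) {
    int num_pes      = 64;  /* PEs in the CGRA (assumed) */
    int ops_per_iter = 20;  /* operations in one innermost-loop iteration (assumed) */

    int res_mii = (ops_per_iter + num_pes - 1) / num_pes;           /* ceiling division */
    double util = (double)ops_per_iter / (double)(res_mii * num_pes);

    printf("ResMII = %d, PE utilization = %.1f%%\n", res_mii, 100.0 * util);
    /* 20 ops on 64 PEs: II = 1, but only ~31% of the PEs do useful work each
     * cycle, which is the "poor PE utilization" the snippet refers to. */
    return 0;
}
```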
“…With interleaved iteration assignment, the four cores will first access A[0], A[1], A[2], and A[3], which are all in different banks, thus no bank conflict. However, with sequential iteration assignment, the cores will first access A[0], A[4], A[8], and A[12], which are all in the same bank, thus generating many bank conflicts. If the stride of the array access expression is greater than one (e.g., A[2i]), only some banks may have all the array elements ever accessed; others have… [Figure: (a) example code, (b) bank conflict.]…”
Section: Microcore Mapping (mentioning)
confidence: 99%
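
The behavior described in this snippet follows from element-interleaved banking, where A[i] resides in bank i mod NUM_BANKS. Below is a small sketch under that assumption; the 4 cores, 4 banks, and 16-iteration loop are illustrative choices, not necessarily the cited paper's configuration.

```c
/* Sketch of the bank-conflict scenario quoted above, assuming 4 cores, 4 memory
 * banks, and element-interleaved banking (A[i] lives in bank i % 4). Prints the
 * bank each core touches on its first access under the two iteration assignments. */
#include <stdio.h>

#define NUM_CORES 4
#define NUM_BANKS 4
#define N 16                      /* loop trip count (assumed) */

static int bank_of(int i) { return i % NUM_BANKS; }

int main(void) {
    int chunk = N / NUM_CORES;    /* block size for sequential assignment */

    puts("interleaved assignment (core c gets iterations c, c+4, c+8, ...):");
    for (int c = 0; c < NUM_CORES; c++)
        printf("  core %d first accesses A[%d] -> bank %d\n", c, c, bank_of(c));

    puts("sequential assignment (core c gets iterations c*chunk .. c*chunk+chunk-1):");
    for (int c = 0; c < NUM_CORES; c++)
        printf("  core %d first accesses A[%d] -> bank %d\n", c, c * chunk, bank_of(c * chunk));

    /* Interleaved: banks 0,1,2,3 (no conflict). Sequential: every core hits bank 0. */
    return 0;
}
```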
“…While one can avoid the compiler scalability issue on larger CGRAs by executing multiple, unrelated threads/applications simultaneously [2,3,4], a preferable solution would be a scalable framework that allows not only a large CGRA to be used in its entirety but also the size of the CGRA mapping target to be changed depending on the workload or application requirements. Despite its apparent challenge, this can be done easily by exploiting SIMD (Single Instruction Multiple Data) or data parallelism existing in many multimedia and graphics applications.…”
Section: Introduction (mentioning)
confidence: 99%
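
One way to picture the SIMD-style scaling mentioned here is to replicate a small loop mapping across groups of PEs, with each group processing a disjoint slice of the data-parallel iteration space. The sketch below is my own illustration of that idea, not the cited framework; the CGRA size, group size, and trip count are assumed values.

```c
/* Illustrative sketch (not the cited framework): a small mapping that needs
 * GROUP_PES processing elements is replicated across a larger CGRA, and each
 * replica handles a disjoint slice of a data-parallel loop. */
#include <stdio.h>

#define TOTAL_PES  64   /* size of the large CGRA (assumed) */
#define GROUP_PES  16   /* PEs needed by one replicated mapping (assumed) */
#define N          1024 /* data-parallel iterations (assumed) */

int main(void) {
    int groups = TOTAL_PES / GROUP_PES;      /* identical copies of the small mapping */
    int slice  = (N + groups - 1) / groups;  /* iterations per group */

    for (int g = 0; g < groups; g++) {
        int lo = g * slice;
        int hi = (lo + slice < N) ? lo + slice : N;
        printf("group %d (PEs %d-%d) runs iterations [%d, %d)\n",
               g, g * GROUP_PES, (g + 1) * GROUP_PES - 1, lo, hi);
    }
    return 0;
}
```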