2012
DOI: 10.1145/2086696.2086711

Improving performance of nested loops on reconfigurable array processors

Abstract: Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained Reconfigurable Architectures (CGRAs) used as a coprocessor to a main processor, pipeline setup can take much longer due to the communication delay between the two processors, and can become significant if it is repeated in an outer loop of a loop nest. In this paper w…
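
The abstract's argument reduces to a simple cost model: if the CGRA kernel is re-launched from the main processor on every outer-loop iteration, the setup cost is paid once per iteration instead of once per loop nest. A minimal sketch of that arithmetic follows; the timing values and names (t_setup, t_kernel, n_outer) are illustrative assumptions, not figures from the paper.

```c
/* Minimal cost-model sketch (not from the paper): if a pipelined CGRA kernel
 * is re-launched from the host on every outer-loop iteration, the setup cost
 * is paid n_outer times. All parameter values below are made-up placeholders. */
#include <stdio.h>

int main(void) {
    double t_setup  = 5000.0;  /* cycles: host<->CGRA communication + configuration (assumed) */
    double t_kernel =  800.0;  /* cycles: steady-state kernel time of one innermost loop (assumed) */
    int    n_outer  =  256;    /* outer-loop trip count (assumed) */

    double setup_every_iter = n_outer * (t_setup + t_kernel);  /* setup repeated in the outer loop */
    double setup_once       = t_setup + n_outer * t_kernel;    /* setup hoisted out of the loop nest */

    printf("setup repeated per outer iteration: %.0f cycles\n", setup_every_iter);
    printf("setup paid once for the loop nest:  %.0f cycles\n", setup_once);
    printf("slowdown from repeated setup: %.2fx\n", setup_every_iter / setup_once);
    return 0;
}
```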

Cited by 26 publications (23 citation statements); references 21 publications.

Citation statements (ordered by relevance):
“…Besides the hardware structure of the CGRA, the configuration process plays an increasingly important role in improving performance and reducing power consumption [10,11,12]. The proposed CGRA features dynamic configuration, in which no further configuration contexts are updated during its computation.…”
Section: Hierarchical Context Cache Structure (mentioning)
confidence: 99%
“…Several software pipelining techniques [4]-[7] have been proposed for CGRAs. For example, in [4], simulated annealing-based modulo scheduling is used to map loops onto CGRAs for the first time.…”
Section: Related Work (mentioning)
confidence: 99%
“…This is because the existing methods [4]-[7] are only good at pipelining a single loop level, i.e., the innermost loop of the nested loop. Although they can find the best II for the innermost loops, they lead to a poor PE utilization rate when the parallelism of the innermost loop is insufficient compared with the CGRA resources.…”
Section: Introduction (mentioning)
confidence: 99%
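
The utilization point in this snippet can be made concrete with the standard resource-constrained lower bound on the initiation interval, ResMII = ceil(ops_per_iteration / num_PEs). The sketch below uses made-up sizes (64 PEs, 20 operations per innermost-loop iteration) purely for illustration; they are not taken from the cited papers.

```c
/* Back-of-the-envelope sketch (illustrative numbers, not from the cited work):
 * when only the innermost loop is software-pipelined, the achievable II is
 * bounded below by ResMII = ceil(ops_per_iter / num_pes), and utilization of a
 * large CGRA drops when the innermost loop carries too few operations. */
#include <stdio.h>

int main(void) {
    int num_pes      = 64;  /* PEs in the CGRA (assumed) */
    int ops_per_iter = 20;  /* operations in one innermost-loop iteration (assumed) */

    int res_mii = (ops_per_iter + num_pes - 1) / num_pes;           /* ceiling division */
    double util = (double)ops_per_iter / (double)(res_mii * num_pes);

    printf("ResMII = %d, PE utilization = %.1f%%\n", res_mii, 100.0 * util);
    /* 20 ops on 64 PEs: II = 1, but only ~31% of the PEs do useful work each
     * cycle, which is the "poor PE utilization" the snippet refers to. */
    return 0;
}
```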
“…With interleaved iteration assignment, the four cores will first access A[0], A[1], A[2], and A[3], which are all in different banks, thus no bank conflict. However, with sequential iteration assignment, the cores will first access A[0], A[4], A[8], and A[12], which are all in the same bank, thus generating many bank conflicts. If the stride of the array access expression is greater than one (e.g., A[2i]), only some banks may have all the array elements ever accessed; others have… [Figure: (a) example code, (b) bank conflict.]…”
Section: Microcore Mapping (mentioning)
confidence: 99%
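
The behavior described in this snippet follows from element-interleaved banking, where A[i] resides in bank i mod NUM_BANKS. Below is a small sketch under that assumption; the 4 cores, 4 banks, and 16-iteration loop are illustrative choices, not necessarily the cited paper's configuration.

```c
/* Sketch of the bank-conflict scenario quoted above, assuming 4 cores, 4 memory
 * banks, and element-interleaved banking (A[i] lives in bank i % 4). Prints the
 * bank each core touches on its first access under the two iteration assignments. */
#include <stdio.h>

#define NUM_CORES 4
#define NUM_BANKS 4
#define N 16                      /* loop trip count (assumed) */

static int bank_of(int i) { return i % NUM_BANKS; }

int main(void) {
    int chunk = N / NUM_CORES;    /* block size for sequential assignment */

    puts("interleaved assignment (core c gets iterations c, c+4, c+8, ...):");
    for (int c = 0; c < NUM_CORES; c++)
        printf("  core %d first accesses A[%d] -> bank %d\n", c, c, bank_of(c));

    puts("sequential assignment (core c gets iterations c*chunk .. c*chunk+chunk-1):");
    for (int c = 0; c < NUM_CORES; c++)
        printf("  core %d first accesses A[%d] -> bank %d\n", c, c * chunk, bank_of(c * chunk));

    /* Interleaved: banks 0,1,2,3 (no conflict). Sequential: every core hits bank 0. */
    return 0;
}
```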
“…While one can avoid the compiler scalability issue on larger CGRAs by executing multiple, unrelated threads/applications simultaneously [2,3,4], a preferable solution would be a scalable framework that allows not only a large CGRA to be used in its entirety but also the size of the CGRA mapping target to be changed depending on the workload or application requirements. Despite its apparent challenge, this can be done easily by exploiting SIMD (Single Instruction Multiple Data) or data parallelism existing in many multimedia and graphics applications.…”
Section: Introduction (mentioning)
confidence: 99%
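
One way to picture the SIMD-style scaling mentioned here is to replicate a small loop mapping across groups of PEs, with each group processing a disjoint slice of the data-parallel iteration space. The sketch below is my own illustration of that idea, not the cited framework; the CGRA size, group size, and trip count are assumed values.

```c
/* Illustrative sketch (not the cited framework): a small mapping that needs
 * GROUP_PES processing elements is replicated across a larger CGRA, and each
 * replica handles a disjoint slice of a data-parallel loop. */
#include <stdio.h>

#define TOTAL_PES  64   /* size of the large CGRA (assumed) */
#define GROUP_PES  16   /* PEs needed by one replicated mapping (assumed) */
#define N          1024 /* data-parallel iterations (assumed) */

int main(void) {
    int groups = TOTAL_PES / GROUP_PES;      /* identical copies of the small mapping */
    int slice  = (N + groups - 1) / groups;  /* iterations per group */

    for (int g = 0; g < groups; g++) {
        int lo = g * slice;
        int hi = (lo + slice < N) ? lo + slice : N;
        printf("group %d (PEs %d-%d) runs iterations [%d, %d)\n",
               g, g * GROUP_PES, (g + 1) * GROUP_PES - 1, lo, hi);
    }
    return 0;
}
```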