Proceedings of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems
DOI: 10.1145/1755888.1755892

Operation and data mapping for CGRAs with multi-bank memory


Cited by 12 publications (8 citation statements)
References 15 publications
“…For the CGRA coprocessor, we assume that its input/output are provided on its local memory, which may be multi-banked to provide high bandwidth toward the processing elements in the coprocessor [Bougard et al 2008;Kim et al 2010]. Recent CGRA coprocessors [Bougard et al 2008;Mei et al 2004] can access any data on its local memory using addressed load/store operations, but the addresses must be linear (or at least easily computable using arithmetic operations only).…”
Section: System Architecture
confidence: 99%
“…Recent studies [16], [17] propose various solutions to reduce the data transfer overhead between the system memory, a local memory, and the processing elements from the architecture and compiler perspective. In particular, the ADRES architecture [18] allows tight coupling between main processor and CGRA, by reconfiguring some processing elements of the CGRA as a VLIW processor.…”
Section: Related Work
confidence: 99%
“…For the CGRA coprocessor we assume that its input/output are provided on its local memory, which may be multi-banked to provide high bandwidth toward the processing elements in the coprocessor [17], [18]. Recent CGRA coprocessors [18], [19] can access any data on its local memory using addressed load/store operations, but the addresses must be linear (or at least easily computable using arithmetic operations only).…”
Section: System Architecture
confidence: 99%
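The quote above notes that CGRA local-memory addresses "must be linear (or at least easily computable using arithmetic operations only)". A minimal sketch of what such an affine access function looks like is given below; the function and parameter names (`affine_address`, `base`, `strides`) are illustrative, not from the cited papers.

```python
def affine_address(base, strides, indices):
    """Byte address = base + sum(stride_k * index_k).

    An access of this form is affine in the loop indices, so a CGRA
    address generator can compute it with arithmetic alone (no pointer
    chasing through memory).
    """
    addr = base
    for stride, index in zip(strides, indices):
        addr += stride * index
    return addr

# Example: element A[2][3] of a 64x64 row-major int32 array at base 0x1000:
# 0x1000 + 2 * (64 * 4) + 3 * 4 = 4620
addr = affine_address(0x1000, strides=(64 * 4, 4), indices=(2, 3))
```

Accesses that cannot be expressed this way (e.g., indirect loads `A[B[i]]`) would fall outside the "easily computable" class the quote describes.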
“…If all the code and data of that task that is mapped to the SPE fit in the local memory of the SPE, then very power-efficient execution is achieved. In fact, the peak power-efficiency of the IBM Cell processor is 5.1 Giga operations per second per watt [17]. Contrast this with the power-efficiency of traditional shared memory multi-cores, e.g., the Intel Core2 Quad is only 0.35 Giga operations per second per watt [17].…”
Section: Introduction
confidence: 98%
“…In fact, the peak power-efficiency of the IBM Cell processor is 5.1 Giga operations per second per watt [17]. Contrast this with the power-efficiency of traditional shared memory multi-cores, e.g., the Intel Core2 Quad is only 0.35 Giga operations per second per watt [17]. However, if the code and data of the application do not fit into the local memory, then the global memory must be leveraged to contain them through explicit DMA calls.…”
Section: Introduction
confidence: 99%
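The arithmetic behind the comparison quoted above, using only the figures the citing paper gives (5.1 GOPS/W for the IBM Cell, 0.35 GOPS/W for the Intel Core2 Quad):

```python
# Back-of-envelope ratio of the two power-efficiency figures quoted
# in the citation statement above.
cell_gops_per_watt = 5.1          # IBM Cell, peak
core2_quad_gops_per_watt = 0.35   # Intel Core2 Quad

ratio = cell_gops_per_watt / core2_quad_gops_per_watt
# ratio is roughly 14.6, i.e. a ~14-15x peak power-efficiency advantage
# for the local-memory (scratchpad) design over the shared-memory multi-core.
```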