Seongseok Seo scite author profile

et al. 2014

For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GP-GPUs, nested loops represent an important source of parallelism. Existing solutions to mapping nested loops on CGRAs, however, are either designed for perfectly nested loops only, or expensive and inflexible. Efficient CGRA mapping of imperfect loops with arbitrary nesting depth still remains a challenge. In this paper we propose a compiler-hardware co-operative approach that is flexible and yet able to generate efficient mappings for imperfect nested loops. It is based on loop flattening, but to mitigate the negative impact of flattening we combine loop fission and a light-weight architecture extension that is designed to accelerate common operation patterns appearing frequently in flattened loops. Our experimental results using imperfect loops from multimedia and DSP domains demonstrate that our special operations can cover a large portion of nested loop operations, improve performance of nested loops by nearly 30% over using loop flattening only, and achieve near-ideal executions on CGRAs for imperfect loops.

Configurable range memory for effective data reuse on programmable accelerators

ACM Trans. Des. Autom. Electron. Syst.

Paek

et al. 2014

While programmable accelerators such as application-specific processors and reconfigurable architectures can dramatically speed up compute-intensive kernels of an application, application performance can still be severely limited by the communication between processors. To minimize the communication overhead, a shared memory such as a scratchpad memory may be employed between the main processor and the accelerator coprocessor. However, this setup poses a significant challenge to the main processor, which now must manage data on the scratchpad explicitly, resulting in superfluous data copying due to the inflexibility of a scratchpad. In this article, we present an enhancement of a scratchpad, Configurable Range Memory (CRM), whose address range can be reprogrammed to minimize unnecessary data copying between processors and therefore promote data reuse on the accelerator, and also present a software management algorithm for the CRM. Our experimental results involving detailed simulation of full multimedia applications demonstrate that our CRM architecture can reduce the communication overhead quite effectively, reducing the kernel execution time by up to 28% and the application runtime by up to 12.8%, in addition to considerable system energy reduction, compared to the conventional architecture based on a scratchpad. ACM Reference Format:Jongeun Lee, Seongseok Seo, Jongkyung Paek, and Kiyoung Choi. 2014. Configurable range memory for effective data reuse on programmable accelerators.

Evaluator-executor transformation for efficient pipelining of loops with conditionals

Jeong¹,

ACM Trans. Archit. Code Optim.

2013

Control divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches of a conditional statement regardless of their execution frequencies. For those loops with unbalanced conditionals, we propose a software transformation that divides a loop into two or three smaller loops so that the condition is evaluated only in the first loop, while the less frequent branch is executed in the second loop in a way that is much more efficient than in the original loop. To reduce the overhead of extra data transfer caused by the loop fission, we also present a hardware extension for a class of Coarse-Grained Reconfigurable Architectures (CGRAs). Our experiments using MiBench and computer vision benchmarks on a CGRA demonstrate that our techniques can improve the performance of loops over predicated execution by up to 65% (37.5%, on average), when the hardware extension is enabled. Without any hardware modification, our software-only version can improve performance by up to 64% (33%, on average), while simultaneously reducing the energy consumption of the entire CGRA including configuration and data memory by 22%, on average.

Mapping Imperfect Loops to Coarse-Grained Reconfigurable Architectures

Sim

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

et al. 2016

journalofmediterraneanareastudies

A study on the "salvo et reservato" Clause in Genoa's Medieval Insuarence Contracts

Seo¹

2007

Evaluator-executor transformation for efficient pipelining of loops with conditionals

Jeong¹,

2013

TACO