Ozana Silvia Dragomir scite author profile

Loops are an important source of optimization. In this paper, we propose a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization and programming paradigm as our framework. The method we propose extends our previous work on loop unrolling for reconfigurable architectures by combining unrolling with shifting to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on the GPP) execute in parallel with multiple instances of the kernel (running on the FPGA). The algorithm is based on profiling information about the kernel's execution times on the GPP and the FPGA, memory transfers, and area utilization. In the experimental part, we apply this method to a loop nest extracted from the MPEG2 encoder containing the DCT kernel. The achieved speedup is 19.65x over software execution and 1.8x over loop unrolling.
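The combination of unrolling and shifting described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the functions `sw` and `kernel`, the unroll factor `u`, and the list-based data are hypothetical stand-ins, and the "parallel" stages are only simulated sequentially.

```python
def sw(x):
    """Hypothetical software function (would run on the GPP)."""
    return x + 1


def kernel(x):
    """Hypothetical kernel (would run on the FPGA)."""
    return x * 2


def original_loop(data):
    """Reference loop: each iteration calls sw, then the kernel on its result."""
    return [kernel(sw(x)) for x in data]


def unrolled_shifted_loop(data, u):
    """Unroll by u, then shift so kernel calls lag one batch behind sw calls."""
    if len(data) % u != 0:
        raise ValueError("this sketch assumes len(data) is a multiple of u")
    if not data:
        return []
    batches = [data[i:i + u] for i in range(0, len(data), u)]
    out = []
    # Prologue: software stage for the first batch only.
    pending = [sw(x) for x in batches[0]]
    for batch in batches[1:]:
        # The next two statements do not depend on each other, so on real
        # hardware the u kernel instances could run while the GPP executes sw.
        results = [kernel(p) for p in pending]
        next_pending = [sw(x) for x in batch]
        out.extend(results)
        pending = next_pending
    # Epilogue: kernel stage for the last batch.
    out.extend(kernel(p) for p in pending)
    return out
```

In this sketch the transformed loop produces the same results as the original loop; only the order in which the software and kernel calls are issued changes.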


Optimal Loop Unrolling and Shifting for Reconfigurable Architectures

Dragomir

Stefanov

Bertels

2009

ACM Trans. Reconfigurable Technol. Syst.


In this article, we present a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which is used to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on the GPP) execute in parallel with multiple instances of the kernel (running on the FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation (which can be the combination of unrolling plus shifting or either of the two). This method is based on profiling information about the kernel's execution times on the GPP and the FPGA, memory transfers, and area utilization. In the experimental part, we apply this method to several kernels from loop nests extracted from real-life applications (DCT and SAD from the MPEG2 encoder, Quantizer from JPEG, and Sobel's Convolution) and perform an analysis of the results, comparing them with the theoretical maximum speedup by Amdahl's Law and showing when and how our transformations are beneficial.
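For readers unfamiliar with the bound mentioned in this abstract, the textbook form of Amdahl's Law can be sketched as below. The function names and the parameters `p` (fraction of run time that is accelerated) and `s` (speedup of that fraction) are assumptions made for this illustration; the abstract does not state how the article parameterizes the law.

```python
import math


def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is sped up by s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p):
    """Upper bound on the overall speedup as s grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 1.0:
        return math.inf
    return 1.0 / (1.0 - p)
```

Under this form, a kernel that accounts for 75% of the run time bounds the overall speedup at 4x, however fast the hardware implementation is.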


Loop distribution for K-loops on Reconfigurable Architectures

Dragomir

Bertels

2011


Within the context of Reconfigurable Architectures, we define a kernel loop (K-loop) as a loop containing in the loop body one or more kernels mapped on the reconfigurable hardware. In this paper, we analyze how loop distribution can be used in the context of K-loops. We propose an algorithm for splitting K-loops that contain more than one kernel and intra-iteration dependencies. The purpose is to create smaller loops (K-sub-loops) that have more speedup potential when parallelized. Making use of partial reconfigurability, the K-sub-loops can take advantage of having more area available for multiple kernel instances to execute in parallel on the FPGA. In order to study the potential for performance improvement of using loop distribution on K-loops, we make use of a suite of randomly generated test cases. The results show an improvement of more than 40% over previously proposed methods in more than 60% of the cases. The algorithm is also validated with a K-loop extracted from the MJPEG application. A maximum speedup of 8.22 is achieved when mapping MJPEG on the Virtex-II Pro with partial reconfiguration, and 13.41 when statically mapping it on the Virtex-4.
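Loop distribution itself, applied to a loop with two dependent kernels, can be illustrated with a small sketch. The kernels `k1` and `k2` and the temporary list `tmp` are hypothetical; the abstract's algorithm for choosing where to split, and the reconfiguration between sub-loops, are not modeled here.

```python
def k1(x):
    """Hypothetical first kernel."""
    return x * x


def k2(y):
    """Hypothetical second kernel; consumes the output of k1."""
    return y + 3


def k_loop(data):
    """Original K-loop: both kernels in one body, linked within each iteration."""
    out = []
    for x in data:
        t = k1(x)          # intra-iteration dependency: k2 needs t
        out.append(k2(t))
    return out


def distributed_k_loop(data):
    """Distributed version: one K-sub-loop per kernel."""
    # K-sub-loop 1: only k1 is needed, so the whole fabric could hold k1 copies.
    # The scalar t is expanded into a list so its values survive the split.
    tmp = [k1(x) for x in data]
    # K-sub-loop 2: only k2 is needed once the first sub-loop has finished.
    return [k2(t) for t in tmp]
```

The split keeps the results unchanged because the only dependency runs from `k1` to `k2` inside the same iteration; the price in this sketch is the extra storage for `tmp`.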


Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems

Nawaz

Dragomir

Marconi

et al.

2007


Loops are an important source of performance improvement, for which there exists a large number of compiler-based optimizations. Few optimizations assume that the loop will be fully mapped on hardware. In this paper, we discuss a loop transformation called Recursive Variable Expansion, which can be efficiently implemented in hardware. It removes all the data dependencies from the program, so that the parallelism is bounded only by the amount of available resources. To show the performance improvement and the utilization of resources, we have chosen four kernels from widely used applications (FIR, DCT, Sobel edge detection algorithm, and matrix multiplication). The hardware implementation of these kernels proved to be 1.5 to 77 times faster (depending on the application) than the code compiled and run on a PowerPC.
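The general idea of removing a loop-carried dependency by expanding a variable into its inputs can be shown on a toy running sum. This is only an illustration of the idea as the abstract describes it, under the assumption that each expanded expression is evaluated independently; the function names are hypothetical and the paper's own formulation and hardware mapping are not reproduced.

```python
def running_sum_sequential(a):
    """Each result depends on the previous one (loop-carried dependency)."""
    out = []
    acc = 0
    for x in a:
        acc = acc + x      # acc from the previous iteration is required
        out.append(acc)
    return out


def running_sum_expanded(a):
    """Each result is written directly in terms of the inputs.

    Substituting the definition of the accumulator repeatedly gives
    out[i] = a[0] + a[1] + ... + a[i], so no entry depends on another entry
    and all of them could be computed at once, at the cost of redundant adds.
    """
    return [sum(a[:i + 1]) for i in range(len(a))]
```

The expanded version does more arithmetic in total, which matches the abstract's point that the achievable parallelism is then limited by the resources available.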


