Scientific kernels on VIRAM and imagine media processors

Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques

Xue

et al. 2008

The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.


Section: Reusing Streams (mentioning)

confidence: 99%

Exploiting loop-dependent stream reuse for stream processors

Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques

Xue

et al. 2008


“…Memory access still dominates most stream programs' performance [Ahn et al 2006], especially for scientific stream programs [Narayanan et al 2002]. The only way to reduce the off-chip memory bandwidth requirement is to exploit the reuse of streams held in the SRF.…”

Section: Reusing Streams (mentioning)

confidence: 99%

Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Trans. Archit. Code Optim.

et al. 2008

Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse exploitation is only attempted for simple stream references, those whose start and end are known. Compiler analysis, from outside of stream processors, does not directly enable the consideration of other more complex stream references. In this article, we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: lemmas identifying the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) depicting the identified reuse, and the optimization of the SRG for our transformation. Both the reuse between the whole sequences accessed by stream references and between partial sequences is exploited in the article. In particular, partial reuse and its treatment are quite new and have never, to the best of our knowledge, appeared in scalar and vector processing. At the same time, reusing streams increases the pressure on the SRF, and this presents a problem of which reuse should be exploited within limited SRF capacity. We extend our analysis to achieve this objective. Finally, we implement our techniques based on the StreamC/KernelC compiler that has been optimized with the best existing compilation techniques for stream processors. Experimental results show a resultant speed-up of 1.14 to 2.54 times using a range of benchmarks.

A preliminary version of this article entitled Exploiting Loop-Dependent Stream Reuse for Stream Processors appeared in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. This extended version makes the following new contributions over the previous paper: (i) it analyzes the SRF pressure brought by the stream reuse exploration; (ii) it selects the streams to reuse such that as many memory accesses are eliminated as possible within the limited SRF capacity by modeling the task as a knapsack problem; (iii) it uses a greedy approximation algorithm to find a solution, which is independent of stream processor architecture, to improve SRF utilization; (iv) it changes the unrolling factor of the loop from the least common multiple of all sub-RGs' unrolling factors to the maximum of them.
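The abstract describes choosing which reuse to exploit within limited SRF capacity. The listing does not reproduce the article's actual algorithm, so the following is only a minimal sketch of a generic greedy knapsack heuristic, assuming each reuse candidate has a known SRF footprint and a known number of memory accesses it would eliminate. The names `ReuseCandidate`, `srf_words`, `accesses_saved`, and `select_reuse` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseCandidate:
    """Hypothetical model of one stream whose reuse could be exploited."""
    name: str
    srf_words: int        # assumed SRF space needed to keep the stream resident
    accesses_saved: int   # assumed number of memory accesses eliminated


def select_reuse(candidates, srf_capacity):
    """Greedily pick candidates by accesses saved per SRF word.

    This is a textbook density-greedy approximation for the knapsack
    problem, not necessarily the heuristic used in the article.
    """
    usable = [c for c in candidates if c.srf_words > 0]
    ranked = sorted(
        usable,
        key=lambda c: c.accesses_saved / c.srf_words,
        reverse=True,
    )
    chosen, used = [], 0
    for c in ranked:
        # Keep a candidate only if it still fits in the remaining SRF space.
        if used + c.srf_words <= srf_capacity:
            chosen.append(c)
            used += c.srf_words
    return chosen


if __name__ == "__main__":
    demo = [
        ReuseCandidate("a", srf_words=4, accesses_saved=8),
        ReuseCandidate("b", srf_words=3, accesses_saved=9),
        ReuseCandidate("c", srf_words=5, accesses_saved=5),
    ]
    for c in select_reuse(demo, srf_capacity=8):
        print(c.name, c.srf_words, c.accesses_saved)
```

A density-ordered greedy pass is simple and needs no architecture-specific parameters beyond the capacity, but it is not optimal in general; an exact solution would need dynamic programming or search.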


Implementing and Optimizing a Data-Intensive Hydrodynamics Application on the Stream Processor

Lecture Notes in Computer Science