Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software-controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse exploitation is attempted only for simple stream references, those whose start and end are known. Compiler analyses developed outside the stream-processor domain do not directly enable the consideration of other, more complex stream references. In this article, we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: lemmas identifying the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) depicting the identified reuse, and the optimization of the SRG for our transformation. The article exploits reuse both between the whole sequences accessed by stream references and between partial sequences. In particular, partial reuse and its treatment are quite new and have never, to the best of our knowledge, appeared in scalar or vector processing. At the same time, reusing streams increases the pressure on the SRF, which raises the problem of which reuse to exploit within the limited SRF capacity. We extend our analysis to address this problem. Finally, we implement our techniques in the StreamC/KernelC compiler, which has been optimized with the best existing compilation techniques for stream processors. Experimental results show speedups of 1.14 to 2.54 times across a range of benchmarks.

A preliminary version of this article, entitled "Exploiting Loop-Dependent Stream Reuse for Stream Processors," appeared in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. This extended version makes the following new contributions over the previous paper: (i) it analyzes the SRF pressure brought by the stream reuse exploration; (ii) it selects the streams to reuse such that as many memory accesses as possible are eliminated within the limited SRF capacity, by modeling the task as a knapsack problem; (iii) it uses a greedy approximation algorithm, which is independent of the stream processor architecture, to find a solution and improve SRF utilization; (iv) it changes the unrolling factor of the loop from the least common multiple of all sub-RGs' unrolling factors to the maximum of them.

11:2 • X. Yang et al.
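The choice of which streams to keep in the limited SRF can be viewed as a 0/1 knapsack problem solved greedily. The following is only a minimal sketch of that idea, not the algorithm of the article: the `ReuseCandidate` type, its fields `saved_accesses` and `srf_words`, the function `select_reuse`, and the ordering by saved accesses per SRF word are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReuseCandidate:
    """Hypothetical description of one reusable stream (not from the article)."""
    name: str
    saved_accesses: int  # memory accesses eliminated if the stream stays in the SRF
    srf_words: int       # SRF space the stream occupies while it is kept


def select_reuse(candidates: List[ReuseCandidate],
                 srf_capacity: int) -> List[ReuseCandidate]:
    """Greedy 0/1 knapsack heuristic (assumed ordering: benefit per SRF word).

    Candidates are visited in decreasing order of saved accesses per SRF word,
    and each one is kept only if it still fits in the remaining capacity.
    """
    for cand in candidates:
        if cand.srf_words <= 0:
            raise ValueError(f"{cand.name}: srf_words must be positive")

    ordered = sorted(
        candidates,
        key=lambda c: c.saved_accesses / c.srf_words,
        reverse=True,
    )

    chosen: List[ReuseCandidate] = []
    remaining = srf_capacity
    for cand in ordered:
        if cand.srf_words <= remaining:
            chosen.append(cand)
            remaining -= cand.srf_words
    return chosen


if __name__ == "__main__":
    demo = [
        ReuseCandidate("a", saved_accesses=60, srf_words=10),
        ReuseCandidate("b", saved_accesses=100, srf_words=20),
        ReuseCandidate("c", saved_accesses=120, srf_words=30),
    ]
    picked = select_reuse(demo, srf_capacity=50)
    print([c.name for c in picked], sum(c.saved_accesses for c in picked))
```

In this made-up input the heuristic keeps `a` and `b` (160 accesses saved), whereas keeping `b` and `c` would save 220. A greedy choice of this kind is therefore an approximation, traded for low cost and independence from any particular SRF size.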