A general algorithm for tiling the register level

Jimenez, M Manuel; Llaberia, J.M.; Fernandez, A.; Morancho, Enric

doi:10.1145/277830.277859

Cited by 5 publications

(6 citation statements)

References 19 publications

(31 reference statements)


“…In particular, he only allows one inner loop to have bounds that are affine function of only one iteration variable of tiled loops. Our previous work [14] extends that of [4] [5] by allowing several loops to have affine bounds of multiple tiled loops iteration variables. Moreover, he does not compare their performance results to hand-optimized codes.…”

Section: Related Work (mentioning)

confidence: 98%

“…In previous work [13] [14][15], we proposed a compiler technique able to automatically optimize numerical codes that define non-rectangular iteration spaces. In this paper we want to show that this compiler technique can rival hand-optimized codes.…”

Section: Motivation (mentioning)

confidence: 99%

“…In non-rectangular loop nests, the action of fully unrolling the loops is far from being trivial due to the irregular nature of the iteration space. In [14], we proposed a compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes (and not only simple rectangular shapes) and we showed it to be effective [12].…”

Section: Automatically-optimized Codes (mentioning)

confidence: 99%

“…Now, when multilevel tiling includes the register level, another difficulty appears. The complexity and the amount of code (number of loop nests) generated by our register tiling technique [14] both depend on the number of bounds of the loops that have to be fully unrolled (the innermost loops after multilevel tiling). Therefore, it is critical to compute exact bounds and avoid the generation of redundant bounds.…”

Section: Automatically-optimized Codes (mentioning)

confidence: 99%

“…When the iteration space is non-rectangular, the bounds of the element loops do not determine a constant number of iterations and, consequently, it is not possible to directly fully unroll those loops. In [14][15] we show how index set splitting [25] can be used repeatedly to divide the tiled iteration space into partitions, such that, in each of these partitions the maximum number of loops that provide data reuse at the register level (element loops) can be fully unrolled. The idea consists on taking the loop nest resulting from the iteration space tiling phase and breaking it into several loop nests (partitions), one of which will traverse all (and only) core-tiles while the remaining loop nests will traverse the different types of boundary-tiles.…”

Section: ISS Phase (mentioning)

confidence: 99%
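
The index set splitting idea in the statement above can be illustrated with a small sketch. This is not the authors' algorithm or generated code. It is a hand-written example under stated assumptions: a lower-triangular matrix-vector product stands in for the loop nest, the register tile is fixed at 2x2, and the names `trmv_ref` and `trmv_tiled` are hypothetical. The tiled version splits the iteration space into one partition that visits only core tiles (fully unrolled, with `x0`, `x1`, `y0`, `y1` held in scalars) and separate partitions for the boundary tiles.

```c
#include <assert.h>
#include <stddef.h>

/* Reference nest: y = L * x, L lower triangular, row-major n x n.
   The inner bound j <= i makes the iteration space non-rectangular. */
void trmv_ref(size_t n, const double *L, const double *x, double *y) {
    assert(n == 0 || (L && x && y));
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j <= i; j++)
            acc += L[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Same computation after 2x2 tiling and index set splitting (illustrative). */
void trmv_tiled(size_t n, const double *L, const double *x, double *y) {
    assert(n == 0 || (L && x && y));
    size_t n2 = n - n % 2;              /* rows covered by full-height tiles */
    for (size_t ii = 0; ii < n2; ii += 2) {
        double y0 = 0.0, y1 = 0.0;      /* accumulators kept in registers */
        /* Partition 1: core tiles. jj + 1 < ii, so all four points of the
           tile lie inside the triangle and the element loops are unrolled. */
        for (size_t jj = 0; jj < ii; jj += 2) {
            double x0 = x[jj], x1 = x[jj + 1];   /* each reused twice */
            y0 += L[ii * n + jj] * x0 + L[ii * n + jj + 1] * x1;
            y1 += L[(ii + 1) * n + jj] * x0 + L[(ii + 1) * n + jj + 1] * x1;
        }
        /* Partition 2: diagonal boundary tile (jj == ii), three points only. */
        y0 += L[ii * n + ii] * x[ii];
        y1 += L[(ii + 1) * n + ii] * x[ii]
            + L[(ii + 1) * n + ii + 1] * x[ii + 1];
        y[ii] = y0;
        y[ii + 1] = y1;
    }
    /* Partition 3: bottom boundary, one leftover row when n is odd. */
    if (n2 < n) {
        double acc = 0.0;
        for (size_t j = 0; j <= n2; j++)
            acc += L[n2 * n + j] * x[j];
        y[n2] = acc;
    }
}
```

Each partition has loop bounds that give a constant trip count for the unrolled element loops, which is what makes full unrolling possible. More partitions appear as the tile shape and the number of loop bounds grow, which matches the concern about code size raised in the citing paper.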


On the performance of hand vs. automatically optimized numerical codes

Jimenez, Llaberia, Fernandez

Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)


In this paper, we compare automatic-optimized codes against hand-optimized codes. The automatic-optimized codes have been generated using our own developed tool that implements compiler techniques proposed in our previous work. Our compiler techniques focus on applying multilevel tiling to non-rectangular loop nests. This type of loop nests are commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in [8]. Results will show how compiler technology can make it possible for non-rectangular loop nests to achieve as high performance as hand-optimized codes on modern microprocessors.

Motivation

Existing compiler technology is oriented mostly towards simple numerical codes containing loop nests that describe rectangular iteration spaces [4][20][22][24]. This is understandable since transformations are easy to apply on this type of loop nests. However, several linear algebra algorithms also contain complex loop nests defining non-rectangular iteration spaces and current commercial compilers are unable to restructure and optimize these types of codes.

This fact has led many programmers to restructure their algorithms by hand to perform well on particular architectures, a situation that has led to machine-specific programs. Additionally, manufacturers have tried to minimize the complexity of writing optimized codes by providing numerical libraries that attain high performance under their particular machine. The BLAS3 library [10], for example, provides a set of standard linear algebra operations. On top of the BLAS standard interface, higher level library packages such as LAPACK [2] have been built. However, not all applications can take advantage of these libraries and there are many situations in which none of the routines provided can specifically solve the task at hand.

We believe that restructuring a code should be the job of the compiler. Compilers should handle the machine-specific details required to attain high performance on each particular architecture.

To illustrate how current commercial compilers achieve poor performance on non-rectangular loop nests, Fig. 1 shows the performance (in Mflop/s) obtained by the linear algebra problems SGEMM and STRMM, varying the problem size. SGEMM consists of a very simple rectangular loop nest, performing a rectangular matrix multiply while STRMM consists of a non-rectangular loop nest, performing also a matrix multiply but with one of the matrices being triangular. The circle curves show the performance obtained if we directly compile the codes using the f77 compiler with maximum level of optimization. The triangle curves show the performance obtained if we call the vendor-optimized BLAS3 library [10] to perform the operations. We can see how in non-rectangular loop nests (STRMM) current compilers achieve poor performance compared with the hand-optimized code provided by the BLAS3 library. By contrast, i...
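
The difference between the two loop nests mentioned in the abstract can be seen in a minimal sketch. The functions below are simplified stand-ins written for illustration, not the BLAS SGEMM or STRMM interfaces (which take scaling factors, side/transpose flags and leading dimensions, and in the case of STRMM update the matrix in place). The names `gemm_nest` and `trmm_nest`, the row-major layout and the square `n x n` shapes are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Rectangular nest: C += A * B. Every loop bound is independent of the
   other loop indices, so the iteration space is an n x n x n box. */
void gemm_nest(size_t n, const float *A, const float *B, float *C) {
    assert(n == 0 || (A && B && C));
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Non-rectangular nest: C += A * B with A lower triangular. The bound of
   the k loop depends on i, so the iteration space is a triangular prism
   and entries of A above the diagonal are never read. */
void trmm_nest(size_t n, const float *A, const float *B, float *C) {
    assert(n == 0 || (A && B && C));
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k <= i; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

When the upper triangle of `A` holds zeros, both nests produce the same `C`, but the second performs roughly half the multiply-adds. The affine bound `k <= i` is what prevents a compiler from unrolling the tiled inner loops by a fixed count without further transformation.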


Dynamic Voltage and Frequency Scaling for Scientific Applications

Hsu, Kremer 2003

Languages and Compilers for Parallel Computing


Buffer and Register Allocation for Memory Space Optimization

Bouchebaba, Girodias, Coelho, et al. 2007

J VLSI Sign Process Syst Sign Im


In today's embedded systems, memory hierarchy is rapidly becoming a major factor in terms of power, performance and area. This is especially true for embedded multimedia applications using temporary multi-dimensional arrays that are typically used to store intermediate results during multimedia processing. In this paper, we propose a new technique that optimizes the use of the cache and the registers. It consists of combining buffer and register allocation to reduce the size of the temporary arrays. First, we use the concept of live data to replace each array by a smaller buffer. Then we replace references to these buffers by registers. The buffer allocation step keeps only useful data in memory, and the register allocation step takes advantage of data reuse in inner loops. Codes considered in this paper are multimedia applications structured as a sequence of loop nests. The experiments are run in a Unix environment and on the StepNP simulator (MPSoC platform of STMicroelectronics). They show that our technique yields a significant reduction in the number of data cache and TLB misses.
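
A toy example may help picture the effect described in this abstract. It is not the paper's technique, only a hand-written sketch under assumptions: a one-dimensional two-stage pipeline, hypothetical function names `smooth_full` and `smooth_regs`, and a producer stage that simply doubles its input. In the first version a full temporary array carries data between two loop nests. In the second, only the two values of the temporary that are live at any time are kept, and they sit in scalars that a compiler can hold in registers.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a sequence of two loop nests linked by an n-element temporary. */
void smooth_full(size_t n, const int *in, int *tmp, int *out) {
    assert(n == 0 || (in && tmp && out));
    for (size_t i = 0; i < n; i++)
        tmp[i] = 2 * in[i];                  /* producer nest */
    for (size_t i = 1; i < n; i++)
        out[i] = tmp[i] + tmp[i - 1];        /* consumer nest */
}

/* After shrinking: the consumer only needs tmp[i] and tmp[i-1], so the
   array collapses to two live values held in scalars. out[0] is left
   untouched, exactly as in the baseline. */
void smooth_regs(size_t n, const int *in, int *out) {
    assert(n == 0 || (in && out));
    if (n == 0)
        return;
    int prev = 2 * in[0];                    /* plays the role of tmp[i-1] */
    for (size_t i = 1; i < n; i++) {
        int cur = 2 * in[i];                 /* plays the role of tmp[i] */
        out[i] = cur + prev;
        prev = cur;
    }
}
```

An intermediate form would keep a two-element circular buffer in memory before promoting it to scalars. The sketch skips that step for brevity.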
