In this paper, we compare automatically optimized codes against hand-optimized codes. The automatically optimized codes were generated with a tool we developed that implements compiler techniques proposed in our previous work. These techniques focus on applying multilevel tiling to non-rectangular loop nests. Loop nests of this type are commonly found in linear algebra algorithms, which are typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in [8]. Our results show how compiler technology can make it possible for non-rectangular loop nests to achieve performance as high as that of hand-optimized codes on modern microprocessors.
Motivation

Existing compiler technology is oriented mostly towards simple numerical codes containing loop nests that describe rectangular iteration spaces [4][20][22][24]. This is understandable, since transformations are easy to apply to this type of loop nest. However, several linear algebra algorithms also contain complex loop nests that define non-rectangular iteration spaces, and current commercial compilers are unable to restructure and optimize these types of codes.

This fact has led many programmers to restructure their algorithms by hand to perform well on particular architectures, which in turn has produced machine-specific programs. Additionally, manufacturers have tried to reduce the complexity of writing optimized codes by providing numerical libraries that attain high performance on their particular machines. The BLAS3 library [10], for example, provides a set of standard linear algebra operations. On top of the BLAS standard interface, higher-level library packages such as LAPACK [2] have been built. However, not all applications can take advantage of these libraries, and there are many situations in which none of the routines provided solves the task at hand. We believe that restructuring a code should be the job of the compiler. Compilers should handle the machine-specific details required to attain high performance on each particular architecture.

To illustrate how poorly current commercial compilers perform on non-rectangular loop nests, Fig. 1 shows the performance (in Mflop/s) obtained for the linear algebra problems SGEMM and STRMM as the problem size varies. SGEMM consists of a very simple rectangular loop nest that performs a rectangular matrix multiply, while STRMM consists of a non-rectangular loop nest that also performs a matrix multiply, but with one of the matrices being triangular. The circle curves show the performance obtained if we directly compile the codes using the f77 compiler with the maximum level of optimization.
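To make the difference between the two kinds of iteration space concrete, the following minimal C sketch contrasts a rectangular loop nest with a non-rectangular one. It is an illustration under stated assumptions, not the code measured in Fig. 1: the function names `matmul_rect` and `matmul_lower_tri` are hypothetical, the matrices are square and stored in row-major order, and the triangular variant accumulates into a separate output matrix rather than following the in-place BLAS STRMM interface.

```c
#include <stddef.h>

/* Rectangular iteration space: every loop bound depends only on n.
 * Computes C += A * B for n-by-n row-major matrices. */
void matmul_rect(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Non-rectangular iteration space: A is assumed lower triangular, so
 * only elements a[i][k] with k <= i are referenced. The bound of the
 * innermost loop depends on the outer index i, which makes the
 * iteration space a triangular prism instead of a cube. */
void matmul_lower_tri(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k <= i; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

In the second nest the bound `k <= i` couples the innermost loop to the outermost one. This dependence between loop bounds is what prevents a transformation designed for rectangular nests from being applied directly.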
The triangle curves show the performance obtained if we call the vendor-optimized BLAS3 library [10] to perform the operations. We can see that, for non-rectangular loop nests (STRMM), current compilers achieve poor performance compared with the hand-optimized code provided by the BLAS3 library. By contrast, i...