The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, can be written efficiently given a highly optimized general matrix-multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently supported using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations available today), supports a large variety of performance-critical routines.

Linear algebra is rich in operations that are highly optimizable, in the sense that a highly tuned code may run orders of magnitude faster than a naively coded routine. These optimizations are platform specific, however, such that an optimization for one computer architecture may actually cause a slowdown on another. To address this problem, a standard API of performance-critical linear algebra kernels was created, called the Basic Linear Algebra Subprograms (BLAS) [1][2][3][4][5], which provides such kernels as matrix multiply, triangular solve, etc. Given this API, the traditional method of achieving high-performance linear algebra routines called on the high-performance community to produce hand-optimized implementations for each new architecture of interest. This is a painstaking process, typically requiring many man-months of effort by personnel highly trained in both linear algebra and computational optimization. The incredible pace of hardware evolution makes this approach untenable in the long run, particularly when one considers that the many software layers (e.g., operating systems, compilers, etc.) that also affect these optimizations are changing at similar, but independent, rates.

A new paradigm is therefore needed for producing highly efficient routines in the modern age of computing, and our own project, Automatically Tuned Linear Algebra Software (ATLAS) [6][7][8][9][10], represents an implementation of such a set of new techniques. We call this paradigm 'Automated Empirical Optimization of Software', or AEOS. An AEOS-enabled package such as ATLAS provides many ways of performing the required operations, and uses empirical timings to choose the best method for a given architecture. Thus, if written generally enough, an AEOS-aware package can automatically adapt to a new computer architecture in a matter of hours, rather than requiring the months or even years of highly trained professionals' time dictated by traditional methods.

Today, ATLAS-tuned libraries are among the most widely used BLAS implementations in existence. They are used in problem-solving environments such as MAPLE, MATLAB, and Octave, in compilers such as Absoft Pro Fortran, and in a wide variety of operating systems, including OS X, FreeBSD, and most versions of Linux. Finally, the ATLAS BLAS are used in a host of software.