SUMMARY: This paper describes the LINPACK Benchmark and some of its variations commonly used to assess the performance of computer systems. Aside from the LINPACK Benchmark suite, the TOP500 and the HPL codes are presented. The latter is frequently used to obtain results for TOP500 submissions. Information is also given on how to interpret the results of the benchmark and how the results fit into the performance evaluation process.
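The benchmark's core task can be sketched in a few lines: solve a dense n-by-n system Ax = b and report the rate using the standard 2/3 n^3 + 2 n^2 operation count. The sketch below is illustrative only, using NumPy's LU-based solver rather than the HPL code itself; the function name and the residual scaling shown are our assumptions, not part of the benchmark specification.

```python
import time
import numpy as np

def linpack_like(n=500, seed=0):
    # Illustrative LINPACK-style measurement (not HPL): solve a dense
    # n x n system and report Gflop/s using the conventional operation
    # count 2/3 n^3 + 2 n^2 for LU factorization plus triangular solves.
    rng = np.random.default_rng(seed)
    A = rng.random((n, n))
    b = rng.random(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU with partial pivoting + solves
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    # Scaled residual, in the spirit of the benchmark's correctness check.
    resid = np.linalg.norm(A @ x - b) / (
        np.linalg.norm(A) * np.linalg.norm(x) * n * np.finfo(float).eps)
    return flops / elapsed / 1e9, resid

gflops, resid = linpack_like(300)
```

A small scaled residual (order one) indicates the solve was numerically sound; the Gflop/s figure is what the benchmark reports.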
The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently supported using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations today), supports a large variety of performance-critical routines.

Linear algebra is rich in operations that are highly optimizable, in the sense that a highly tuned code may run multiple orders of magnitude faster than a naively coded routine. However, these optimizations are platform specific, such that an optimization for one computer architecture may actually cause a slow-down on another. To handle this problem, a standard API of performance-critical linear algebra kernels was created, called the Basic Linear Algebra Subprograms (BLAS) [1][2][3][4][5], which provides linear algebra kernels such as matrix multiply, triangular solve, etc. Given this API, the traditional method of achieving high-performance linear algebra routines called on the high-performance community to produce hand-optimized routines for each new architecture of interest. This is a painstaking process, typically requiring many person-months of personnel highly trained in both linear algebra and computational optimization. The incredible pace of hardware evolution makes this technique untenable in the long run, particularly when one considers that the many software layers involved (e.g. operating systems, compilers, etc.), which also affect these optimizations, are changing at similar but independent rates.

A new paradigm is therefore needed for the production of highly efficient routines in the modern age of computing, and our own project, Automatically Tuned Linear Algebra Software (ATLAS) [6][7][8][9][10], represents an implementation of such a set of new techniques. We call this paradigm 'Automated Empirical Optimization of Software', or AEOS. In an AEOS-enabled package such as ATLAS, the package provides many ways of performing the required operations and uses empirical timings to choose the best method for a given architecture. Thus, if written generally enough, an AEOS-aware package can automatically adapt to a new computer architecture in a matter of hours, rather than requiring months or even years of highly trained professionals' time, as dictated by traditional methods.

Today, ATLAS-tuned libraries are among the most widely used BLAS libraries in existence. They are used in problem-solving environments such as MAPLE, MATLAB and Octave, in compilers such as Absoft Pro Fortran, and in a wide variety of operating systems, including OS X, FreeBSD, and most versions of Linux. Finally, ATLAS BLAS are used in a host of softwa...
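The empirical-selection idea at the heart of AEOS can be illustrated with a toy search: provide several implementations of the same kernel, time each on the current machine, and keep the fastest. Everything below is our own minimal sketch in pure Python with two trivial matrix-multiply variants; ATLAS's actual code generation and search are far more elaborate.

```python
import time

def naive_matmul(A, B):
    # Candidate 1: straightforward triple loop over lists of lists.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

def transposed_matmul(A, B):
    # Candidate 2: transpose B first so the inner products walk
    # contiguous rows.
    Bt = list(map(list, zip(*B)))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt]
            for row in A]

def _timed(f, A, B):
    t0 = time.perf_counter()
    f(A, B)
    return time.perf_counter() - t0

def pick_fastest(candidates, A, B, trials=3):
    # AEOS-style empirical selection (illustrative only): run each
    # candidate a few times on this machine, keep the best time.
    best, best_time = None, float("inf")
    for f in candidates:
        t = min(_timed(f, A, B) for _ in range(trials))
        if t < best_time:
            best, best_time = f, t
    return best
```

Whichever candidate wins is, by construction, the right choice for this particular architecture and software stack, which is exactly the adaptivity the abstract describes.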
This article discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon System.
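The block cyclic distribution mentioned above maps each global matrix index to an owning process and a local index; the 2-D distribution applies the same 1-D mapping independently to rows and columns. A 1-D version can be sketched as follows; the function name and signature are our own, not ScaLAPACK's API.

```python
def block_cyclic_owner(i, nb, p):
    # Map global index i to (owning process, local index) under a 1-D
    # block cyclic distribution with block size nb over p processes.
    # Blocks are dealt out round-robin: block 0 to process 0, block 1
    # to process 1, ..., block p to process 0 again, and so on.
    block = i // nb                 # which global block i falls in
    proc = block % p                # round-robin owner of that block
    local_block = block // p        # how many of my blocks precede it
    local_index = local_block * nb + (i % nb)
    return proc, local_index
```

For example, with block size 2 over 2 processes, global indices 0..7 land on processes 0,0,1,1,0,0,1,1, and each process stores its blocks contiguously.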
One of the main obstacles to the efficient solution of scientific problems is the problem of tuning software, both to the available architecture and to the user problem at hand. We describe approaches for obtaining tuned high-performance kernels and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse BLAS kernels and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas; the cases described can be considered a proof of concept.
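As a concrete instance of the sparse kernels mentioned, a sparse matrix-vector product over the common CSR (compressed sparse row) layout can be sketched as follows. This is our own illustration of the operation itself, not the tuned generated kernels the abstract refers to.

```python
def csr_matvec(indptr, indices, data, x):
    # y = A @ x for a sparse matrix A stored in CSR form:
    #   indptr[r]..indptr[r+1] delimit row r's entries,
    #   indices[k] is the column of the k-th stored value data[k].
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]
        y[row] = s
    return y
```

Tuned variants of this kernel typically differ in blocking and storage format choices, which is precisely why empirical selection among them pays off.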