Abstract. This article briefly describes the history of floating-point arithmetic, the development and features of IEEE standards for such arithmetic, and desirable features of new implementations of floating-point hardware, and discusses work in progress aimed at making decimal floating-point arithmetic widely available across many architectures, operating systems, and programming languages.
WHAT IS FLOATING-POINT ARITHMETIC?

Floating-point arithmetic is a technique for storing and operating on numbers in a computer where the base, range, and precision of the number system are usually fixed by the computer design. Conceptually, a floating-point number has a sign, an exponent, and a significand (the older term mantissa is now deprecated), allowing a representation of the form

    (-1)^sign * significand * base^exponent.

The base point in the significand may be at the left, or after the first digit, or at the right. The point and the base are implicit in the representation: neither is stored.

The sign can be compactly represented by a single bit, and the exponent is most commonly a biased unsigned bit field, although some historical architectures used a separate exponent sign and an unbiased exponent. Once the sizes of the sign and exponent fields are fixed, all of the remaining storage is available for the significand, although in some older systems part of this storage is unused, and usually ignored.

On modern systems, the storage order is conceptually sign, exponent, and significand, but addressing conventions on byte-addressable systems (the big-endian versus little-endian theologies) can alter that order, and some historical designs reordered them, and sometimes split the exponent and significand fields into two interleaved parts. Except when the low-level storage format must be examined by software, such as for binary data exchange, these differences are handled by hardware, and are rarely of concern to programmers.

The data size is usually closely related to the computer word size. Indeed, the venerable Fortran programming language mandates a single-precision floating-point format occupying the same storage as an integer, and a double-precision format occupying exactly twice the space. This requirement is heavily relied on by Fortran software for array dimensioning, argument passing, and in COMMON and EQUIVALENCE statements for storage alignment and layout.
Some vendors later added support for a third format, called quadruple-precision, occupying four words. Wider formats have yet to be offered by commercially viable architectures, although we address this point later in this article.

Floating-point arithmetic can be contrasted with fixed-point arithmetic, where there is no exponent field, and it is the programmer's responsibility to keep track of where the base point lies. Address arithmetic, and signed and unsigned