4-1  FLOATING-POINT NUMBERS - GENERAL VIEW
******************************************

The real number system
----------------------
Scientific and engineering calculations are performed in the
REAL NUMBER SYSTEM, a highly abstract mathematical construct.

A real number is by definition a special infinite set of rational
numbers (integer fractions) - the so-called Dedekind cuts or an
equivalent formulation. The arithmetical operations are defined
between such sets and are a natural extension of the arithmetic
of rational numbers.

The real numbers have wonderful properties:

  1) There is no lower or upper bound; in simple language,
     they go from minus infinity to plus infinity.

  2) Infinite density - there is a real number between any
     two distinct real numbers.

  3) A lot of algebraic axioms are satisfied, e.g. the
     'field axioms'.

  4) Completeness - they contain all their 'limit points'
     (the limit of every converging sequence is also 'real').

  5) They are ordered.

Many of these properties are not satisfied by computer arithmetic;
see the chapter on errors in floating-point computations for a short
review of the properties that stay true in floating-point arithmetic.

To crunch a lot of numbers quickly, computers need a fixed-size
representation of real numbers, so that the hardware can perform
the arithmetical operations efficiently.

The problems arising from using a fixed-size representation are
the subject of the following chapters.


Finite number systems are discrete
----------------------------------
If you use a fixed-size representation, let's say N binary digits
(BITS) long, you have at most 2**N bit-patterns, and so at most
2**N representable numbers.

Such a finite set has to be bounded - it has a largest number and
a smallest number. We already have one problem: our computations
must not exceed these bounds.
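
The bounds of a fixed-size representation can be observed directly.
The following Python sketch is only an illustration; it assumes that
the interpreter's 'float' is a 64-bit IEEE number (true on common
platforms, but not guaranteed), and the names N, largest and
overflowed are chosen here for the example.

```python
import math
import sys

N = 64                          # assumed size in bits of a Python float
print(2 ** N)                   # upper limit on the number of bit-patterns

largest = sys.float_info.max    # largest finite representable number
print(largest)

# Exceeding the bound does not give a larger number; on IEEE hardware
# the result overflows to the special value "infinity".
overflowed = largest * 2.0
print(math.isinf(overflowed))
```
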
In every bounded segment there are infinitely many real numbers,
but we have at most 2**N available bit-patterns, so many real
numbers will have to be represented by one bit-pattern.

Of course, one bit-pattern can't represent many numbers equally
well: it will represent one of them exactly, and the others will
be misrepresented.

We call the numbers that can be represented exactly FLOATING-POINT
NUMBERS (FPN); the term 'real numbers' will be reserved for the
mathematical constructs.


Roundoff errors are unavoidable
-------------------------------
Before we begin to study actual representations of real numbers,
let us develop a little further an idea mentioned in the previous
section.

We said that in a finite number system many real numbers have to
be represented by one bit-pattern, and that the bit-pattern
represents exactly only one of them. In other words, many real
numbers are 'rounded off' to that one bit-pattern.

This 'rounding off' may occur whenever we enter a real number into
the computer (except in the rare case that we enter an exactly
representable number).

The same 'rounding off' may occur whenever we perform an
arithmetical operation. The result of an arithmetical operation
usually has more binary digits than its operands, and has to be
converted to one of the 'allowed' bit-patterns.

To make this more concrete, let's have an example using base 10
real numbers, and suppose that only two-digit mantissas are allowed
(the fractional parts may have only 2 decimal digits):

    0.12E+02 + 0.34E+00 = 12.00E+00 + 0.34E+00
                        = 12.34E+00
                      ==> 0.12E+02

This example is a bit artificial and incompletely defined (in our
fixed representation only the size of the fractional part was
specified; the exponents were left unspecified), but the idea is
clear: computer arithmetic has to replace almost every number and
temporary result by a rounded form.
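
The two-digit example above can be reproduced with Python's standard
'decimal' module, which lets us choose the number of significant
digits. This is only a sketch of the idea: the rounding rule
(round half up) is an assumption made here, and the names
two_digits, entered, x, y and total are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# A toy decimal number system that keeps only 2 significant digits.
two_digits = Context(prec=2, rounding=ROUND_HALF_UP)

# Rounding on input: 0.3456 is not representable with 2 digits.
entered = two_digits.create_decimal("0.3456")
print(entered)

# Rounding after an operation: the exact sum 12.34 needs 4 digits,
# so it is rounded back to 2 digits and the 0.34 is lost entirely.
x = two_digits.create_decimal("12")
y = two_digits.create_decimal("0.34")
total = two_digits.add(x, y)
print(total)
```

Both operands are exactly representable here, yet the sum is not;
the result equals the first operand, as in the hand calculation.
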
Instead of computing:

    X + Y

we will really compute:

    round(round(X) + round(Y))

The function 'round' can't be specified in general; it depends on
the representation and the floating-point arithmetical algorithms
we use. See the chapter 'radix conversion and rounding' for more
information.

A possible implementation of round() for decimal floating-point
numbers (represented in radix 10) is:

    e = INT(LOG10(X) + 1.0)

               INT(X * (10**(p-e)) + 0.5)
    round(X) = --------------------------
                       10**(p-e)

Here 'e' is the number of decimal digits before the decimal point
of X, and the parameter 'p' is the number of decimal digits in the
representation. As written, the formula assumes X >= 0.1, because
INT truncates toward zero.

Note that multiplying and dividing by (10**n) are just shifts of
the decimal point, and not error-generating arithmetic operations.

Such seemingly complicated formulas can be implemented efficiently
(in radix 2) in hardware, or reduced to a very small micro-code
program executed by the CPU.

In the following sections we will see that roundoff is an endless
source of errors, some of them unexpectedly large.

By the way, the distinction between real and floating-point numbers
can be summarized symbolically in our new notation by:

    FPN = round(REAL)


A little basic theory
---------------------
Every real number x can be written in the form:

    x = f * (2 ** e)

where 'e' is an integer called the EXPONENT, and 'f' is a binary
fraction called the MANTISSA.

For nonzero x, the mantissa may satisfy one of the normalization
conditions:

    1   <= |f| < 2      (IEEE)
    1/2 <= |f| < 1      (DEC)

The mantissa is then said to be NORMALIZED.

The IEEE normalization condition is equivalent to the requirement
that the MOST SIGNIFICANT BIT (MSB) of the mantissa is 1. The DEC
condition requires the two most significant bits to be 0,1.

On IBM 360, IBM 370 and Nova (Data General) computers, the base of
the exponent was 16 (it gives a larger range at the cost of
precision):

    x = f * (16 ** e)

The normalization condition was that the first HEX digit of the
fraction was not equal to 0, i.e. the first 4 binary digits were
not all 0.
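
The two binary normalization conditions can be tried out with
Python's standard math.frexp(), which happens to return a mantissa
in the range 1/2 <= |f| < 1, i.e. the DEC convention. The sketch
below converts that result to the IEEE convention as well; the
function names normalize_dec and normalize_ieee are made up for
this illustration, and x is assumed to be nonzero.

```python
import math

def normalize_dec(x):
    """Return (f, e) with x == f * 2**e and 1/2 <= |f| < 1 (DEC style)."""
    return math.frexp(x)

def normalize_ieee(x):
    """Return (f, e) with x == f * 2**e and 1 <= |f| < 2 (IEEE style)."""
    f, e = math.frexp(x)
    # Doubling f is exact in radix 2; it only shifts the binary point.
    return f * 2.0, e - 1

for x in (10.0, -0.15625, 1.0):
    print(x, normalize_dec(x), normalize_ieee(x))
```

The two results describe the same number; only the position of the
binary point, and therefore the exponent, differs by one.
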
The advantages of normalizing floating-point numbers are:

  1) The representation is unique: there is exactly one way
     to write a nonzero real number in such a form.

  2) It's easy to compare two normalized numbers: you test
     the sign, exponent and mantissa separately.

  3) In a normalized form, a fixed-size mantissa uses all
     the 'digit cells' to store significant digits.

  4) The IEEE and DEC normalization conditions make the
     representation always start with a 1-bit; this bit can
     be omitted, and its place used for data. The omitted
     bit is called the "hidden bit".

The normalized representation is used in almost all floating-point
implementations; 'denormalized numbers' are used only to minimize
accuracy loss due to underflow (see next chapter).

As with rounding, we have to normalize after arithmetical
operations, because the result is not normalized in general.


Floating-point numbers in practice
----------------------------------
In our finite machines we can keep only a finite number of the
binary digits of 'f' and 'e', let's say 'm' and 'n' digits
respectively.

The vendor predetermines a few combinations of 'm' and 'n', usually
one or two combinations that the hardware executes efficiently, and
maybe one more that gives better precision.

The following table compares some floats used in practice. The
REAL*n notation is a common extension to FORTRAN; 'n' is the number
of bytes used in the representation.

The representation radix, the size (in bits) of the various parts
composing the floating-point number, and the exponent bias are
given.

If normalized mantissas are used, the number of bits in the
fraction part is counted without the "hidden bit", so the sizes
here are "physical" rather than "logical".
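
The difference between the "physical" and the "logical" size can be
checked on a running system. The Python sketch below assumes that
the interpreter's 'float' is an IEEE REAL*8 (which is usual but not
guaranteed); the names logical_bits, physical_bits, exact and
rounded are chosen here for illustration.

```python
import sys

logical_bits = sys.float_info.mant_dig   # mantissa bits, hidden bit included
physical_bits = logical_bits - 1         # bits actually stored in the fraction
print(logical_bits, physical_bits)

# With 53 logical bits every integer up to 2**53 is exact, but 2**53 + 1
# needs 54 bits and is rounded to a neighbouring representable number.
exact = float(2 ** 53)
rounded = float(2 ** 53 + 1)
print(exact == rounded)
```
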
Table of float types (incomplete)
=================================

Float name           Radix  Sign  Exponent  Fraction  Bias   Value
----------           -----  ----  --------  --------  -----  -----
IBM 370:
* REAL*4             16     1     7         24        64     0.f * 16**(e-64)
* REAL*8             16     1     7         56        64
VAX:
* REAL*4 (F_FLOAT)   2      1     8         23        128    0.1f * 2**(e-128)
* REAL*8 (D_FLOAT)   2      1     8         55        128    0.1f * 2**(e-128)
* REAL*8 (G_FLOAT)   2      1     11        52        1024   0.1f * 2**(e-1024)
* REAL*16 (H_FLOAT)  2      1     15        112       16384  0.1f * 2**(e-16384)
Cray:
  Single precision   2      1     15        48        16384
  Double precision   2      1     15        96        16384
IEEE:
* REAL*4             2      1     8         23        127    1.f * 2**(e-127)
    extended         2      1     11+       31+
* REAL*8             2      1     11        52        1023   1.f * 2**(e-1023)
    extended         2      1     15+       63+
  REAL*10            2      1     15        64        16383
Intel (IEEE):
* Short real         2      1     8         23        127    1.f * 2**(e-127)
* Long real          2      1     11        52        1023   1.f * 2**(e-1023)
  Temp real          2      1     15        64        16383  0.f * 2**(e-16384)
MIL 1750A:
  REAL*4                    None  8         24        None   f * 2**e
  REAL*8                    None  ?         ??        None
HP 21MX:
Varian:
Honeywell:

Remarks:

  1) Formats that use a sign bit (all except MIL 1750A) use
     the sign convention:  0 = +,  1 = -
     MIL 1750A uses a 2's complement mantissa with a 2's
     complement exponent.

  2) '*' in the first column means that normalized mantissas
     are used. Note that on IBM 370 the first hexadecimal
     digit of the fraction (4 bits) couldn't be zero.


An important note
-----------------
The next chapter will provide a detailed example that will make
the abstract concepts clearer.

To simplify our discussion, we will give an incomplete treatment
of this highly technical subject, with no proofs.

Readers interested in a deeper treatment of these subjects are
referred to:

    Goldberg, David
    What Every Computer Scientist Should Know about
    Floating-Point Arithmetic
    ACM Computing Surveys
    Vol. 23 #1, March 1991, pp. 5-48


+---------------------------------------------------------------------+
|  SUMMARY                                                            |
|  =======                                                            |
|  1) x = f * (2 ** e)     1 <= |f| < 2     e is an integer           |
|  2) There are a lot of float types                                  |
|  3) IEEE/REAL*4 = 1 Sign bit, 8 exponent bits, 23 mantissa bits     |
+---------------------------------------------------------------------+
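
The IEEE REAL*4 layout in the summary (1 sign bit, 8 exponent bits,
23 mantissa bits, bias 127) can be verified by taking a number
apart and putting it together again. The Python sketch below uses
the standard 'struct' module; the function names real4_fields and
real4_value are hypothetical, and the reconstruction handles
normalized numbers only (stored exponent neither 0 nor 255).

```python
import struct

def real4_fields(x):
    """Split x, stored as an IEEE REAL*4, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF          # 23 fraction bits, hidden bit not stored
    return sign, exponent, fraction

def real4_value(sign, exponent, fraction):
    """Rebuild a normalized number as 1.f * 2**(e-127)."""
    mantissa = 1.0 + fraction / 2.0 ** 23   # put the hidden bit back
    return (-1.0) ** sign * mantissa * 2.0 ** (exponent - 127)

fields = real4_fields(-6.5)
print(fields)
print(real4_value(*fields))
```

For -6.5 = -1.625 * 2**2 the stored exponent is 2 + 127 = 129, and
the fraction field holds 0.625 scaled by 2**23.
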