4-9 MINIMIZING FLOATING-POINT ERRORS ************************************* A few remarks ------------- o Some practical methods for checking the severity of floating-point errors can be found in the chapter: 'practical issues'. o The chapter on 'FORTRAN pitfalls' discusses various programming practises that may amplify floating-point (and other) errors, and it is very important to avoid them. o Note that an interval/stochastic arithmetic package is not just a diagnostic tool for FP errors, a result without an error estimation is not very useful, as errors can never be eliminated completely in experimental data and computation. Carefully written programs -------------------------- This term was probably coined by Sterbenz (see bibliography below), and means programs that are numerically correct. this is not easy to achieve as the following example will show. Sterbenz discusses the implementation of a Fortran FUNCTION that returns the average of two REAL numbers, the specifications for the routine are: o The sign must be always correct. o The result should be as close as possible to (x+y)/2 and stay within a predefined bound. o min(x,y) <= average(x,y) <= max(x,y) o average(x,y) = average(y,x) o average(x,y) = 0 if and only if x = -y unless an underflow occurred. o average(-x,-y) = -average(x,y) o An overflow should never occur. o An underflow should never occur, unless the mathematical average is strictly less than the smallest representable real number. Even a simple task like this, requires considerable knowledge to program in a good way, there are 4 (at least) possible average formulas: 1) (x + y) / 2 2) x/2 + y/2 3) x + ((y - x) / 2) 4) y + ((x - y) / 2) Sterbenz have a very interesting discussion on choosing the most appropriate formulas, he also consider techniques like scaling up the input variables if they are small. Grossly oversimplifying, we have: Formula #1 may raise an overflow if x,y have the same sign #2 may degrade accuracy, but is safe from overflows #3,4 may raise an overflow if x,y have opposite signs We will use formulas #1,3,4 according to the signs of the input numbers: real function average (x, y) real x, y, zero, two, av1, av2, av3, av4 logical samesign parameter (zero = 0.0e+00, two = 2.0e+00) av1(x,y) = (x + y) / two av2(x,y) = (x / two) + (y / two) av3(x,y) = x + ((y - x) / two) av4(x,y) = y + ((x - y) / two) if (x .ge. zero) then if (y .ge. zero) then samesign = .true. else samesign = .false. endif else if (y .ge. zero) then samesign = .false. else samesign = .true. endif endif if (samesign) then if (y .ge. x) then average = av3(x,y) else average = av4(x,y) endif else average = av1(x,y) endif return end Programming using exception handling ------------------------------------ Computing the average of two numbers may serve as an example for the system-dependent technique of writing faster numerical code using exception handling. Most of the time the formula (x + y)/2 is quite adequate, the FUNCTION above is needed essentially to avoid overflow, so the following scheme may be used: call reset_overflow_flag result = (x + y) / 2.0 call check_overflow_flag(status) if (status .eq. .true.) result = average(x,y) In this way the "expansive" call to the average routine may be eliminated in most cases. Of course, two system-dependent calls were added for reseting and checking the overflow flag, but this may be still worth it if the ratio between the two algorithms is large enough (which is not the case here). Using REAL*16 (QUAD PRECISION) ------------------------------ This is the most simple solution on machines that supports this data type. REAL*16 takes more CPU time than REAL*8/REAL*4, but introduces very small roundoff errors, and has a huge range. Performance cost of different size floats is VERY machine dependent (see the performance chapter). A crude example program: program reals real*4 x4, y4, z4 real*8 x8, y8, z8 real*16 x16, y16, z16 x4 = 1.0e+00 y4 = 0.9999999e+00 z4 = x4 - y4 write(*,*) sqrt(z4) x8 = 1.0d+00 y8 = 0.9999999d+00 z8 = x8 - y8 write(*,*) sqrt(z8) x16 = 1.0q+00 y16 = 0.9999999q+00 z16 = x16 - y16 write(*,*) sqrt(z16) end Normalization of equations -------------------------- Floating-point arithmetic is best when dealing with numbers with magnitudes of the order of 1.0, the 'representation density' is not maximal but we are in the 'middle' of the range. Usually you can decrease the range of numbers appearing in the computation, by transforming the system of units, so that you get dimensionless equations. The diffusion equation will serve as an example, I apologize for the horrible notation: Ut = K * Uxx Where the solution is U(X,T), the lowercase letters denote the partial derivatives, and K is a constant. Let: L be a typical length in the problem U0 a typical value of U Substitute: X' = X / L U' = U / U0 Then: Ux = Ux' / L Uxx = Ux'x' / (L*L) Substitute in the original equation: (U' * U0)t = (K / (L*L)) * (U' * U0)x'x' ((L * L) / K) U't = U'x'x' Substitute: T' = (K * T) / (L * L) And you get: U't' = U'x'x' With: X' = X / L U' = U / U0 T' = (K * T) / (L * L) Multi-precision arithmetic -------------------------- That is a really bright idea, you can simulate floating-point numbers with very large sizes, using character strings (or other data types), and create routines for doing arithmetic on these giant numbers. Of course such software simulated arithmetic will be slow. By the way, the function overloading feature of Fortran 90, makes using multi-precision arithmetic packages with existing programs easy. Two free packages are "mpfun" and "bmp" (Brent's multiple precision), which are available from Netlib. Using special tricks -------------------- A good example are the following tricks for summing a series. The first is sorting the numbers and adding them in ascending order. An example program: program rndof integer i real sum sum = 0.0 do i = 1, 10000000, 1 sum = sum + 1.0 / real(i) end do write (*,*) 'Decreasing order: ', sum sum = 0.0 do i = 10000000, 1, -1 sum = sum + 1.0 / real(i) end do write (*,*) 'Increasing order: ', sum end There is no need here for sorting, as the series is monotonic. Executing 2 * 10**7 iterations will take some CPU seconds, but the result is very illuminating. Another way (though not as good as doubling the precision) is using the Kahan Summation Formula. Suppose the series is stored in an array X(1:N) SUM = X(1) C = 0.0 DO J = 2, N Y = X(J) - C T = SUM + Y C = (T - SUM) - Y SUM = T ENDDO Yet another method is using Knuth's formula. The recommended method is sorting and adding. Another example is using the standard formulae for solving the quadratic equation (real numbers are written without mantissa to enhance readability): a*(x**2) + b*x + c = 0 (a .ne. 0) When b**2 is much larger than abs(4*a*c), the discriminat is nearly equal to abs(b), and we may get "catastrophic cancellation". Multiplying and dividing by the same number we get alternative formulae: -b + (b**2 - 4*a*c)**0.5 -2 * c x1 = ------------------------ = ----------------------- 2*a b + (b**2 - 4*a*c)**0.5 -b - (b**2 - 4*a*c)**0.5 2 * c x2 = ------------------------ = ------------------------ 2*a -b + (b**2 - 4*a*c)**0.5 If "b" is much larger than "a*c", use one of the standard and one of the alternative formulae. The first alternative formula is suitable when "b" is positive, the other when it's negative. Using integers instead of floats -------------------------------- See the chapter: "The world of integers". Manual safeguarding ------------------- You can check manually every dangerous arithmetic operation, special routines may be constructed to perform arithmetical operations in a safer way, or get an error message if this cannot be done. Hardware support ---------------- IEEE conforming FPUs can raise an exception whenever a roundoff was performed on an arithmetical result. You can write an exception handler that will report the exceptions, but as the result of most operations may have to be rounded, your program will be slowed down, and you will get huge log files. Rational arithmetic ------------------- Every number can be represented (possibly with an error) as a quotient of two integers, the dividend and divisor can be kept along the computation, without actually performing the division. See Knuth for technical details. It seems this method is not used. Bibliography ------------ An excellent article on floating-point arithmetic: David Goldberg What Every Computer Scientist Should Know about Floating-Point arithmetic ACM Computing Surveys Vol. 23 #1 March 1991, pp. 5-48 An old but still useful book: Sterbenz, Pat H. Floating-Point Computation Prentice-Hall, 1974 ISBN 0-13-322495-3 An old classic presented in a mathematical rigorous way (oouch!): Donald E. Knuth The Art of Computer Programming Volume II, sections 4.2.1 - 4.2.3 Addison-Wesley, 1969 The Silicon Graphics implementation of the IEEE standard, republished later in another issue of Pipeline: How a Floating Point Number is represented on an IRIS-4D Pipeline July/August 1990 The homepage of Prof. William Kahan, the well-known expert on floating-point arithmetic: http://http.cs.berkeley.edu/~wkahan/ A short nice summary on floating-point arithmetic: CS267: Supplementary Notes on Floating PointReturn to contents page