4-9  MINIMIZING FLOATING-POINT ERRORS
 *************************************

 A few remarks
 -------------

 o  Some practical methods for checking the severity of floating-point
    errors can be found in the chapter: 'practical issues'.

 o  The chapter on 'FORTRAN pitfalls' discusses various programming 
    practises that may amplify floating-point (and other) errors, 
    and it is very important to avoid them. 

 o  Note that an interval/stochastic arithmetic package is not just a 
    diagnostic tool for FP errors, a result without an error estimation 
    is not very useful, as errors can never be eliminated completely
    in experimental data and computation.



 Carefully written programs 
 --------------------------
 This term was probably coined by Sterbenz (see bibliography below), 
 and means programs that are numerically correct.  this is not easy
 to achieve as the following example will show.

 Sterbenz discusses the implementation of a Fortran FUNCTION that 
 returns the average of two REAL numbers, the specifications for the 
 routine are:

   o  The sign must be always correct.
   o  The result should be as close as possible to (x+y)/2 
      and stay within a predefined bound.

   o  min(x,y)  <=  average(x,y)  <=  max(x,y)
   o  average(x,y)  =  average(y,x)
   o  average(x,y)  =  0  if and only if  x = -y  unless an 
      underflow occurred.
   o  average(-x,-y)  =  -average(x,y)

   o  An overflow should never occur.
   o  An underflow should never occur, unless the mathematical 
      average is strictly less than the smallest representable 
      real number.

 Even a simple task like this, requires considerable knowledge to 
 program in a good way, there are 4 (at least) possible average 
 formulas:

   1)   (x + y) / 2
   2)   x/2 + y/2
   3)   x + ((y - x) / 2)
   4)   y + ((x - y) / 2)

 Sterbenz have a very interesting discussion on choosing the most
 appropriate formulas, he also consider techniques like scaling up 
 the input variables if they are small.  Grossly oversimplifying, 
 we have:

 Formula #1   may raise an overflow if x,y have the same sign
         #2   may degrade accuracy, but is safe from overflows 
         #3,4 may raise an overflow if x,y have opposite signs 

 We will use formulas #1,3,4 according to the signs of the input 
 numbers:


      real function average (x, y)
      real	x, y, zero, two, av1, av2, av3, av4
      logical	samesign
      parameter	(zero = 0.0e+00, two = 2.0e+00)
      av1(x,y) = (x + y) / two
      av2(x,y) = (x / two) + (y / two)
      av3(x,y) = x + ((y - x) / two)
      av4(x,y) = y + ((x - y) / two)

      if (x .ge. zero) then
        if (y .ge. zero) then 
          samesign = .true.
        else
          samesign = .false.
        endif
      else
        if (y .ge. zero) then 
          samesign = .false.
        else
          samesign = .true.
        endif
      endif

      if (samesign) then
        if (y .ge. x) then
          average = av3(x,y)
        else
          average = av4(x,y)
        endif
      else
        average = av1(x,y)
      endif

      return
      end



 Programming using exception handling
 ------------------------------------
 Computing the average of two numbers may serve as an example for
 the system-dependent technique of writing faster numerical code 
 using exception handling. 

 Most of the time the formula  (x + y)/2  is quite adequate, 
 the FUNCTION above is needed essentially to avoid overflow, 
 so the following scheme may be used:

      call reset_overflow_flag
      result = (x + y) / 2.0
      call check_overflow_flag(status)
      if (status .eq. .true.) result = average(x,y)

 In this way the "expansive" call to the average routine may be 
 eliminated in most cases.  Of course, two system-dependent calls
 were added for reseting and checking the overflow flag, but this 
 may be still worth it if the ratio between the two algorithms is
 large enough (which is not the case here). 



 Using REAL*16 (QUAD PRECISION)
 ------------------------------
 This is the most simple solution on machines that supports this 
 data type.  REAL*16 takes more CPU time than REAL*8/REAL*4, 
 but introduces very small roundoff errors, and has a huge range.

 Performance cost of different size floats is VERY machine dependent
 (see the performance chapter). 

 A crude example program:

      program reals
      real*4        x4,  y4,  z4
      real*8        x8,  y8,  z8
      real*16       x16, y16, z16

      x4 = 1.0e+00
      y4 = 0.9999999e+00
      z4 = x4 - y4
      write(*,*) sqrt(z4)

      x8 = 1.0d+00
      y8 = 0.9999999d+00
      z8 = x8 - y8
      write(*,*) sqrt(z8) 

      x16 = 1.0q+00
      y16 = 0.9999999q+00
      z16 = x16 - y16
      write(*,*) sqrt(z16)

      end



 Normalization of equations
 --------------------------
 Floating-point arithmetic is best when dealing with numbers with
 magnitudes of the order of 1.0, the 'representation density' is not
 maximal but we are in the 'middle' of the range.

 Usually you can decrease the range of numbers appearing in the 
 computation, by transforming the system of units, so that you get 
 dimensionless equations.

 The diffusion equation will serve as an example, I apologize for 
 the horrible notation:

    Ut = K * Uxx     

 Where the solution is U(X,T), the lowercase letters denote the 
 partial derivatives, and K is a constant.

 Let:
       L  be a typical length in the problem
       U0 a typical value of U 

 Substitute:

       X' = X / L     U' = U / U0

 Then:

       Ux  = Ux' / L
       Uxx = Ux'x' / (L*L)

 Substitute in the original equation:

       (U' * U0)t = (K / (L*L)) * (U' * U0)x'x'

       ((L * L) / K) U't = U'x'x'

 Substitute:

       T' = (K * T) / (L * L)

 And you get:

       U't' = U'x'x'

       With:   X' = X / L     U' = U / U0     T' = (K * T) / (L * L)



 Multi-precision arithmetic
 --------------------------
 That is a really bright idea, you can simulate floating-point numbers
 with very large sizes, using character strings (or other data types),
 and create routines for doing arithmetic on these giant numbers.
 Of course such software simulated arithmetic will be slow.

 By the way, the function overloading feature of Fortran 90, makes using
 multi-precision arithmetic packages with existing programs easy.

 Two free packages are "mpfun" and "bmp" (Brent's multiple precision),
 which are available from Netlib.



 Using special tricks
 --------------------
 A good example are the following tricks for summing a series.
 The first is sorting the numbers and adding them in ascending order.

 An example program:


      program rndof
      integer       i
      real          sum

      sum = 0.0
      do i = 1, 10000000, 1
        sum = sum + 1.0 / real(i)
      end do
      write (*,*) 'Decreasing order: ', sum

      sum = 0.0
      do i = 10000000, 1, -1
        sum = sum + 1.0 / real(i)
      end do
      write (*,*) 'Increasing order: ', sum

      end


 There is no need here for sorting, as the series is monotonic.
 Executing 2 * 10**7 iterations will take some CPU seconds, but the 
 result is very illuminating.


 Another way (though not as good as doubling the precision) is using 
 the Kahan Summation Formula.

 Suppose the series is stored in an array X(1:N)

      SUM = X(1)
      C = 0.0
      DO J = 2, N
        Y = X(J) - C
        T = SUM + Y
        C = (T - SUM) - Y
        SUM = T
      ENDDO


 Yet another method is using Knuth's formula.  The recommended method 
 is sorting and adding.


 Another example is using the standard formulae for solving the 
 quadratic equation (real numbers are written without mantissa
 to enhance readability):

   a*(x**2) + b*x + c = 0     (a .ne. 0)

 When  b**2  is much larger than  abs(4*a*c), the discriminat is
 nearly equal to  abs(b), and we may get "catastrophic cancellation".
 Multiplying and dividing by the same number we get alternative 
 formulae: 

          -b + (b**2 - 4*a*c)**0.5              -2 * c
   x1  =  ------------------------  =  -----------------------
                  2*a                  b + (b**2 - 4*a*c)**0.5

          -b - (b**2 - 4*a*c)**0.5               2 * c
   x2  =  ------------------------  =  ------------------------
                  2*a                  -b + (b**2 - 4*a*c)**0.5

 If "b" is much larger than "a*c", use one of the standard and one 
 of the alternative formulae.  The first alternative formula is 
 suitable when "b" is positive, the other when it's negative.



 Using integers instead of floats
 --------------------------------
 See the chapter: "The world of integers".



 Manual safeguarding
 -------------------
 You can check manually every dangerous arithmetic operation, 
 special routines may be constructed to perform arithmetical 
 operations in a safer way, or get an error message if this
 cannot be done.



 Hardware support
 ----------------
 IEEE conforming FPUs can raise an exception whenever a roundoff
 was performed on an arithmetical result. 

 You can write an exception handler that will report the exceptions,
 but as the result of most operations may have to be rounded, your 
 program will be slowed down, and you will get huge log files.



 Rational arithmetic
 -------------------
 Every number can be represented (possibly with an error) as a 
 quotient of two integers, the dividend and divisor can be kept 
 along the computation, without actually performing the division.
 See Knuth for technical details. 

 It seems this method is not used.



 Bibliography
 ------------

 An excellent article on floating-point arithmetic:

     David Goldberg
     What Every Computer Scientist Should Know about 
       Floating-Point arithmetic
     ACM Computing Surveys
     Vol. 23 #1  March 1991, pp. 5-48
    
 An old but still useful book:

     Sterbenz, Pat H.
     Floating-Point Computation
     Prentice-Hall, 1974
     ISBN 0-13-322495-3

 An old classic presented in a mathematical rigorous 
 way (oouch!):

     Donald E. Knuth
     The Art of Computer Programming
     Volume II, sections 4.2.1 - 4.2.3
     Addison-Wesley, 1969

 The Silicon Graphics implementation of the IEEE standard, 
 republished later in another issue of Pipeline:

     How a Floating Point Number is represented on an IRIS-4D
     Pipeline
     July/August 1990

 The homepage of Prof. William Kahan, the well-known expert
 on floating-point arithmetic:

      http://http.cs.berkeley.edu/~wkahan/ 

 A short nice summary on floating-point arithmetic:

      CS267: Supplementary Notes on Floating Point 




Return to contents page