/***********************************************************************

Date: Sun, 05 Dec 1999 15:40:47 -0800
From: Vaughan Pratt <pratt@cs.stanford.edu>

Compiled with gcc, no optimization.

TIMINGS
                                   user system  elapsed   CPU
200 MHz Ultrasparc  Solaris 2.5.1  5.82  22.68  0:28.61  99.6%
300 MHz Ultrasparc  Solaris 2.6    4.30  14.97  0:19.28  99.9%
450 MHz Pentium-II  Linux RH5.1    2.20   0.00  0:02.23  98.6%
550 MHz K-7 Athlon  Linux RH6.0    0.82   0.00  0:01.80  45.5%

One factor in the difference is that x86 floating point registers are
80 bits.  However, without optimization the arithmetic is done in main
memory (in effect) so that denormalization sets in at 2^-1023 (the
11-bit exponent of 64-bit floating point) rather than 2^-16383 (the
15-bit exponent of the x86's 80-bit floating point).  This is why all
four machines ended up with i = 1075.
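
A minimal sketch of where the count 1075 comes from, assuming IEEE 754
double with gradual underflow: the smallest denormal is
2^(DBL_MIN_EXP - DBL_MANT_DIG) = 2^-1074, and one further halving rounds
to zero.  The names halvings_to_zero and predicted_halvings are
illustrative and not part of the benchmark; volatile is used to push
every result through a 64-bit memory slot, roughly as unoptimized code
does.

```c
#include <float.h>

// Count how many times 1.0 can be halved before it becomes zero.
// volatile forces each intermediate result to be stored as a 64-bit
// double, so wider x87 registers cannot postpone the underflow.
static int halvings_to_zero(void)
{
    volatile double x = 1.0;
    int i = 0;

    while (x != 0.0) {
        x = x / 2;
        i++;
    }
    return i;
}

// Count predicted from the format parameters in <float.h>:
// (DBL_MANT_DIG - DBL_MIN_EXP) halvings reach the smallest denormal,
// and one more halving rounds to zero.
static int predicted_halvings(void)
{
    return DBL_MANT_DIG - DBL_MIN_EXP + 1;
}
```

With a 53-bit significand and DBL_MIN_EXP of -1021, both functions
evaluate to 53 + 1021 + 1 = 1075.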

With -O1 through -O4 on the x86, the arithmetic takes place in the
registers, so i goes to 16446.  I haven't looked at the code to see what
gcc does with y in the transition to the second loop, but with -O1 the
program takes zero time, indicating that y is saved to memory between the
two loops, thereby clearing it to 0, while with -O2 through -O4 it again
takes a second or two, which presumably indicates that x=2e6*y uses the
y left behind in the register rather than the y from main memory.
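
The figure 16446 follows from the parameters of the 80-bit format: a
64-bit significand and a minimum exponent of -16381 give
64 + 16381 + 1 = 16446 halvings from 1 to zero.  The sketch below checks
this in long double, on the assumption that the compiler maps long
double to the x87 extended format (common on x86, but not guaranteed
elsewhere).  The function names are illustrative only.

```c
#include <float.h>

// Count halvings of 1.0 until the value becomes zero, carried out in
// long double.  volatile keeps each result in the declared format.
static int halvings_to_zero_ld(void)
{
    volatile long double x = 1.0L;
    int i = 0;

    while (x != 0.0L) {
        x = x / 2;
        i++;
    }
    return i;
}

// Prediction from <float.h> for whatever format long double has here.
// For x87 extended precision this is 64 - (-16381) + 1 = 16446.
static int predicted_halvings_ld(void)
{
    return LDBL_MANT_DIG - LDBL_MIN_EXP + 1;
}
```

Where long double is the same as double, both functions give the double
count instead, so the comparison still holds.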

As an aside, I find this roller-coaster dependence on optimization level
sucky, but as David's correspondent points out, the ISVs have more
important things to worry about than what purists consider right.

Vaughan Pratt

------------------------------------------------------------------------
Additional result by Nelson H. F. Beebe <beebe@math.utah.edu>

Vendor/Model     O/S               user system  elapsed   CPU

xxx MHz Apple    Rhapsody 5.5      0.180u 0.010s 0:00.18 105.5% (cc -g)
PowerMac G3                        0.090u 0.010s 0:00.08 125.0% (cc -O1)
				   0.080u 0.020s 0:00.08 125.0% (cc -O2)
				   0.080u 0.020s 0:00.08 125.0% (cc -O3)
				   0.080u 0.020s 0:00.08 125.0% (cc -O4)


# NB: Output is: 1023 2.22507e-308 (i.e., flush-to-zero without gradual underflow):
466MHz DEC Alpha OSF/1 4.0g        0.029u 0.006s 0:00.04 50.0% (c89 -g)
                                   0.020u 0.005s 0:00.07 28.5% (c89 -O1)
                                   0.023u 0.005s 0:00.03 66.6% (c89 -O2)
                                   0.021u 0.006s 0:00.03 66.6% (c89 -O3)
                                   0.021u 0.009s 0:00.03 66.6% (c89 -O4)


# NB: Output is: 1075 4.94066e-324:
466MHz DEC Alpha OSF/1 4.0g        0.155u 27.103s 0:27.30 99.8% (c89 -ieee -g)
				   0.235u 26.904s 0:27.15 99.9% (c89 -ieee -O1)
				   0.254u 26.896s 0:27.23 99.6% (c89 -ieee -O2)
				   0.332u 26.865s 0:27.22 99.8% (c89 -ieee -O3)
				   0.387u 27.050s 0:27.45 99.9% (c89 -ieee -O4)
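
The two sets of Alpha runs above differ in whether gradual underflow is
in effect.  A small run-time probe, sketched here on the assumption of
an IEEE 754 double format, tells the two modes apart: halving DBL_MIN
gives a nonzero denormal under gradual underflow and zero under
flush-to-zero.  The name has_gradual_underflow is illustrative.

```c
#include <float.h>

// Return 1 if halving the smallest normal double yields a nonzero
// (denormal) result, or 0 if the result is flushed to zero.
// volatile keeps the compiler from folding the division at compile
// time, so the answer reflects the run-time floating-point mode.
static int has_gradual_underflow(void)
{
    volatile double smallest_normal = DBL_MIN;
    volatile double half = smallest_normal / 2;

    return half != 0.0;
}
```

A build for which this returns 0 would be expected to print
1023 2.22507e-308 from the benchmark rather than 1075 4.94066e-324.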


# NB: For +O3 and +O4, the compiler optimized away the final loop:
99 MHz           HP-UX 10.01       12.23u 0.03s 0:12.32  99.5% (c89 -g)
HP-9000/735			   12.01u 0.03s 0:12.07  99.7% (c89 -O)
				   12.17u 0.03s 0:12.25  99.5% (c89 +O1)
				   12.01u 0.03s 0:12.09  99.5% (c89 +O2)
				   0.01u  0.03s 0:00.04 100.0% (c89 +O3)
				   0.01u  0.03s 0:00.04 100.0% (c89 +O4)


# NB: Output correct for -g, but get "16446 0" for all -On
600 MHz Intel    GNU/Linux         1.650u 0.000s 0:01.65 100.0% (gcc -g)
Pentium III      2.2.12-20smp      0.000u 0.000s 0:00.00   0.0% (gcc -O1)
                 (Redhat 6.1)      1.190u 0.000s 0:01.19 100.0% (gcc -O2)
		                   1.190u 0.000s 0:01.19 100.0% (gcc -O3)
		                   1.190u 0.000s 0:01.19 100.0% (gcc -O4)


# NB: Output correct for -g, but get "16446 0" for all -On.  Here, cc
# == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
600 MHz Intel    GNU/Linux         1.650u 0.000s 0:01.65 100.0% (cc -ffloat-store -g)
Pentium III      2.2.12-20smp      1.680u 0.000s 0:01.68 100.0% (cc -ffloat-store -O1)
                 (Redhat 6.1)      0.830u 0.000s 0:00.83 100.0% (cc -ffloat-store -O2)
				   0.820u 0.010s 0:00.83 100.0% (cc -ffloat-store -O3)
				   0.830u 0.000s 0:00.83 100.0% (cc -ffloat-store -O4)


# NB: cc == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
300 MHz Intel    GNU/Linux         3.550u 0.020s 0:03.67 97.2% (cc -ffloat-store -g)
Pentium II MMX   2.2.5-22          3.560u 0.030s 0:03.81 94.2% (cc -ffloat-store -O1)
                 (Redhat 6.0)      1.670u 0.050s 0:01.81 95.0% (cc -ffloat-store -O2)
		                   1.730u 0.030s 0:01.82 96.7% (cc -ffloat-store -O3)
				   1.710u 0.000s 0:01.75 97.7% (cc -ffloat-store -O4)
				   1.650u 0.010s 0:01.87 88.7% (cc -ffloat-store -O5)


xxx MHz IBM      AIX 4.2           0.320u 0.020s 0:00.37 91.8%  (c89 -g)
RS/6000 43P		           0.170u 0.010s 0:00.17 105.8% (c89 -O1)
                                   0.170u 0.010s 0:00.17 105.8% (c89 -O2)
                                   0.150u 0.020s 0:00.16 106.2% (c89 -O3)


33MHz Motorola   NeXT Mach 3.3     1.093u 271.342s 5:08.20 88.3% (gcc -g)
68040			           0.952u 128.427s 2:11.70 98.2% (gcc -O1)
				   1.265u 127.940s 2:11.80 98.0% (gcc -O2)
				   0.843u 128.065s 2:18.11 93.3% (gcc -O3)
				   1.078u 128.140s 2:11.82 98.0% (gcc -O4)


150 MHz SGI      IRIX 5.3          8.762u 20.656s 0:29.48  99.7% (cc -ansi -g)
Challenge L                        8.818u 14.902s 0:23.26 101.9% (cc -ansi -O1)
MIPS R4400			   5.512u 12.547s 0:17.55 102.8% (cc -ansi -O2)
				   5.516u 12.564s 0:17.70 102.0% (cc -ansi -O3)


# NB: For -O2 and -O3, the compiler optimized away the final loop:
180 MHz SGI      IRIX 6.5          0.115u 0.006s 0:00.12  91.6%   (c89 -g)
Origin 200                         0.126u 0.006s 0:00.12 100.0%   (c89 -O1)
MIPS R10000			   0.003u 0.006s 0:00.00   0.0%   (c89 -O2)
				   0.003u 0.006s 0:00.00   0.0%   (c89 -O3)


400 MHz Sun      Solaris 2.7       2.23u 11.24s 0:13.55  99.4% (c89 -g)
UltraSPARC                         1.95u 11.27s 0:13.28  99.5% (c89 -O1)
Enterprise 5500                    2.08u 11.45s 0:13.53 100.0% (c89 -O2)
                                   1.99u 11.30s 0:13.31  99.8% (c89 -O3)
				   2.22u 11.09s 0:13.33  99.8% (c89 -O4)
				   1.96u 11.36s 0:13.34  99.8% (c89 -O5)

***********************************************************************/

#include <stdio.h>
#include <stdlib.h>

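/*
 * Perform 2 million denormalized floating-point subtractions.
 *
 * The first loop halves x (from 1) and y (from 2) until x underflows to
 * zero; with IEEE 754 gradual underflow this leaves y equal to the
 * smallest positive denormal.  The second loop counts x down from
 * 2e6 * y to zero in steps of y, so every subtraction has denormal
 * operands and a denormal or zero result.
 */

int
main(void)
{
    int i;
    double x, y;

    for (x = 1, y = 2, i = 0; x; x /= 2, y /= 2, i++)
        ;   /* empty body: all the work is in the loop header */
    (void)printf("%d %g\n", i, y); /* Sanity check: expect 1075 4.94066e-324 */
    for (x = 2e6 * y; x > 0; x -= y)
        ;   /* empty body: these are the subtractions being timed */
    return (EXIT_SUCCESS);
}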
