/***********************************************************************
Date: Sun, 05 Dec 1999 15:40:47 -0800
From: Vaughan Pratt

Compiled with gcc, no optimization.

TIMINGS                            user  system  elapsed    CPU
200 MHz Ultrasparc Solaris 2.5.1   5.82   22.68  0:28.61  99.6%
300 MHz Ultrasparc Solaris 2.6     4.30   14.97  0:19.28  99.9%
450 MHz Pentium-II Linux RH5.1     2.20    0.00  0:02.23  98.6%
550 MHz K-7 Athlon Linux RH6.0     0.82    0.00  0:01.80  45.5%

One factor in the difference is that x86 floating point registers are
80 bits.  However without optimization the arithmetic is done in main
memory (in effect) so that denormalization sets in at 2^-1023 (the
11-bit exponent of 64-bit floating point) rather than 2^-16383 (the
15-bit exponent of the x86's 80-bit floating point).  This is why all
four machines ended up with i = 1075.

With -O1 through -O4 on the x86, the arithmetic takes place in the
registers so i goes to 16446.  I haven't looked at the code to see
what gcc does with y in the transition to the second loop, but with
-O1 the program takes zero time indicating that y is saved between the
two loops thereby clearing it to 0, while with -O2 through -O4 it
again takes a second or two, which presumably indicates that x=2e6*y
uses the y left behind in the register rather than the y from main
memory.

As an aside I find this roller-coaster dependence on optimization
level sucky, but as David's correspondent points out, the ISV's have
more important things to worry about than what purists consider right.

Vaughan Pratt

------------------------------------------------------------------------

Additional result by Nelson H. F. Beebe

Vendor/Model     O/S              user   system elapsed    CPU

xxx MHz Apple    Rhapsody 5.5   0.180u   0.010s 0:00.18 105.5%  (cc -g)
PowerMac G3                     0.090u   0.010s 0:00.08 125.0%  (cc -O1)
                                0.080u   0.020s 0:00.08 125.0%  (cc -O2)
                                0.080u   0.020s 0:00.08 125.0%  (cc -O3)
                                0.080u   0.020s 0:00.08 125.0%  (cc -O4)

# NB: Output is: 1023 2.22507e-308 (i.e., flush-to-zero without gradual underflow):
466MHz DEC Alpha OSF/1 4.0g     0.029u   0.006s 0:00.04  50.0%  (c89 -g)
                                0.020u   0.005s 0:00.07  28.5%  (c89 -O1)
                                0.023u   0.005s 0:00.03  66.6%  (c89 -O2)
                                0.021u   0.006s 0:00.03  66.6%  (c89 -O3)
                                0.021u   0.009s 0:00.03  66.6%  (c89 -O4)

# NB: Output is: 1075 4.94066e-324:
466MHz DEC Alpha OSF/1 4.0g     0.155u  27.103s 0:27.30  99.8%  (c89 -ieee -g)
                                0.235u  26.904s 0:27.15  99.9%  (c89 -ieee -O1)
                                0.254u  26.896s 0:27.23  99.6%  (c89 -ieee -O2)
                                0.332u  26.865s 0:27.22  99.8%  (c89 -ieee -O3)
                                0.387u  27.050s 0:27.45  99.9%  (c89 -ieee -O4)

# NB: For +O3 and +O4, the compiler optimized away the final loop:
99 MHz           HP-UX 10.01    12.23u    0.03s 0:12.32  99.5%  (c89 -g)
HP-9000/735                     12.01u    0.03s 0:12.07  99.7%  (c89 -O)
                                12.17u    0.03s 0:12.25  99.5%  (c89 +O1)
                                12.01u    0.03s 0:12.09  99.5%  (c89 +O2)
                                 0.01u    0.03s 0:00.04 100.0%  (c89 +O3)
                                 0.01u    0.03s 0:00.04 100.0%  (c89 +O4)

# NB: Output correct for -g, but get "16446 0" for all -On
600 MHz Intel    GNU/Linux      1.650u   0.000s 0:01.65 100.0%  (gcc -g)
Pentium III      2.2.12-20smp   0.000u   0.000s 0:00.00   0.0%  (gcc -O1)
                 (Redhat 6.1)   1.190u   0.000s 0:01.19 100.0%  (gcc -O2)
                                1.190u   0.000s 0:01.19 100.0%  (gcc -O3)
                                1.190u   0.000s 0:01.19 100.0%  (gcc -O4)

# NB: Output correct for -g, but get "16446 0" for all -On.  Here, cc
# == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
600 MHz Intel    GNU/Linux      1.650u   0.000s 0:01.65 100.0%  (cc -ffloat-store -g)
Pentium III      2.2.12-20smp   1.680u   0.000s 0:01.68 100.0%  (cc -ffloat-store -O1)
                 (Redhat 6.1)   0.830u   0.000s 0:00.83 100.0%  (cc -ffloat-store -O2)
                                0.820u   0.010s 0:00.83 100.0%  (cc -ffloat-store -O3)
                                0.830u   0.000s 0:00.83 100.0%  (cc -ffloat-store -O4)

# NB: cc == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
300 MHz Intel    GNU/Linux      3.550u   0.020s 0:03.67  97.2%  (cc -ffloat-store -g)
Pentium II MMX   2.2.5-22       3.560u   0.030s 0:03.81  94.2%  (cc -ffloat-store -O1)
                 (Redhat 6.0)   1.670u   0.050s 0:01.81  95.0%  (cc -ffloat-store -O2)
                                1.730u   0.030s 0:01.82  96.7%  (cc -ffloat-store -O3)
                                1.710u   0.000s 0:01.75  97.7%  (cc -ffloat-store -O4)
                                1.650u   0.010s 0:01.87  88.7%  (cc -ffloat-store -O5)

xxx MHz IBM      AIX 4.2        0.320u   0.020s 0:00.37  91.8%  (c89 -g)
RS/6000 43P                     0.170u   0.010s 0:00.17 105.8%  (c89 -O1)
                                0.170u   0.010s 0:00.17 105.8%  (c89 -O2)
                                0.150u   0.020s 0:00.16 106.2%  (c89 -O3)

33MHz Motorola   NeXT Mach 3.3  1.093u 271.342s 5:08.20  88.3%  (gcc -g)
68040                           0.952u 128.427s 2:11.70  98.2%  (gcc -O1)
                                1.265u 127.940s 2:11.80  98.0%  (gcc -O2)
                                0.843u 128.065s 2:18.11  93.3%  (gcc -O3)
                                1.078u 128.140s 2:11.82  98.0%  (gcc -O4)

150 MHz SGI      IRIX 5.3       8.762u  20.656s 0:29.48  99.7%  (cc -ansi -g)
Challenge L                     8.818u  14.902s 0:23.26 101.9%  (cc -ansi -O1)
MIPS R4400                      5.512u  12.547s 0:17.55 102.8%  (cc -ansi -O2)
                                5.516u  12.564s 0:17.70 102.0%  (cc -ansi -O3)

# NB: For -O2 and -O3, the compiler optimized away the final loop:
180 MHz SGI      IRIX 6.5       0.115u   0.006s 0:00.12  91.6%  (c89 -g)
Origin 200                      0.126u   0.006s 0:00.12 100.0%  (c89 -O1)
MIPS R10000                     0.003u   0.006s 0:00.00   0.0%  (c89 -O2)
                                0.003u   0.006s 0:00.00   0.0%  (c89 -O3)

400 MHz Sun      Solaris 2.7     2.23u   11.24s 0:13.55  99.4%  (c89 -g)
UltraSPARC                       1.95u   11.27s 0:13.28  99.5%  (c89 -O1)
Enterprise 5500                  2.08u   11.45s 0:13.53 100.0%  (c89 -O2)
                                 1.99u   11.30s 0:13.31  99.8%  (c89 -O3)
                                 2.22u   11.09s 0:13.33  99.8%  (c89 -O4)
                                 1.96u   11.36s 0:13.34  99.8%  (c89 -O5)
***********************************************************************/

#include <stdio.h>
#include <stdlib.h>

/* Perform 2 million denormalized floating point subtractions */

int
main()
{
    int i;
    double x, y;

    /* Halve x until it underflows to zero; y trails x by one halving. */
    for (x = 1, y = 2, i = 0; x; x /= 2, y /= 2, i++)
        ;
    (void)printf("%d %g\n", i, y); /* Sanity check: expect 1075 4.94066e-324 */

    /* y is now the smallest denormal; count down from 2e6 * y in steps of y. */
    for (x = 2e6 * y; x > 0; x -= y)
        ;

    return (EXIT_SUCCESS);
}
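
/***********************************************************************
Editorial sketch, not part of the original test program and kept inside
a comment so that this file still compiles unchanged.

The iteration counts quoted in the header (1075 when the arithmetic goes
through 64-bit memory, 16446 when it stays in 80-bit x87 registers) can
be predicted from the <float.h> parameters, assuming a binary format
with gradual underflow and round-to-nearest-even: the smallest
denormal is 2^(MIN_EXP - MANT_DIG), and one further halving rounds to
zero.  The function names below are hypothetical, chosen only for this
illustration, and the values 64 and -16381 are the x87 extended-format
LDBL_MANT_DIG and LDBL_MIN_EXP.

```c
#include <assert.h>
#include <float.h>

// Predicted number of halvings of 1.0 before the value becomes zero,
// assuming gradual underflow and round-to-nearest-even.  The arguments
// follow the <float.h> conventions (e.g. DBL_MANT_DIG, DBL_MIN_EXP).
int halvings_to_zero(int mant_dig, int min_exp)
{
    assert(mant_dig > 0 && min_exp < 0);
    // The smallest denormal is 2^(min_exp - mant_dig); halving it gives 0.
    return mant_dig - min_exp + 1;
}

// Prediction for this platform's double type.
int predicted_halvings_double(void)
{
    return halvings_to_zero(DBL_MANT_DIG, DBL_MIN_EXP);
}

// Measured count for double.  The volatile qualifier forces every
// intermediate result through a 64-bit memory slot, which is the effect
// the header attributes to the unoptimized build.
int measured_halvings_double(void)
{
    volatile double x = 1.0;
    int i = 0;

    while (x != 0.0) {
        x = x / 2;
        i++;
    }
    return i;
}
```

Calling halvings_to_zero(53, -1021) reproduces the 1075 seen in the
sanity check, and halvings_to_zero(64, -16381) reproduces the 16446
reported for the optimized x86 builds.  On a flush-to-zero machine such
as the DEC Alpha without -ieee, measured_halvings_double() would be
expected to disagree with the prediction, since the gradual-underflow
assumption no longer holds.
***********************************************************************/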