WWW http://www.math.utah.edu/~beebe

OpenMP: overview and resource guide

Last updates: Tue May 8 19:16:06 2001    Fri Nov 12 15:26:10 2004    Thu Nov 13 18:30:20 2008    Mon Mar 1 16:28:36 2010

OpenMP is a relatively new (1997) development in parallel computing. It is a language-independent specification of multithreading, and implementations are available from several vendors.

OpenMP is implemented as comments or directives in Fortran, C, and C++ code, so that its presence is invisible to compilers lacking OpenMP support. Thus, you can develop code that will run everywhere, and when OpenMP is available, will run even faster.

The OpenMP Consortium maintains a very useful Web site at http://www.openmp.org/, with links to vendors and resources.

There is an excellent overview of the advantages of OpenMP over POSIX threads (pthreads) and PVM/MPI in the paper OpenMP: A Proposed Industry Standard API for Shared Memory Processors, also available in HTML and PDF. This is a must-read if you are getting started in parallel programming. It contains two simple examples programmed with OpenMP, pthreads, and MPI.

The paper also gives a very convenient tabular comparison of OpenMP directives with Silicon Graphics parallelization directives.

OpenMP can be used on uniprocessor and multiprocessor systems with shared memory. It can also be used in programs that run on homogeneous or heterogeneous distributed memory environments, which are typically supported by systems like Linda, MPI , and PVM, although the OpenMP part of the code will only provide parallelization on those processors providing shared memory.

In distributed memory environments, the programmer must manually partition data between processors, and make special library calls to move the data back and forth. While that kind of code can also be used in shared memory systems, OpenMP is much simpler to program. Thus, you can start parallelization of an application using OpenMP, and then later add MPI or PVM calls: the two forms of parallelization can peacefully coexist in your program.

An extensive bibliography on multithreading, including OpenMP, is available at http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading. MPI and PVM are covered in a separate bibliography: http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm

OpenMP benchmark: computation of pi

This simple benchmark for the computation of pi is taken from the paper above. Its read statement has been modified to read from stdin instead of the non-redirectable /dev/tty, and an extra final print statement has been added to show an accurate value of pi.

Follow this link for the source code, a shell script to run the benchmark, a UNIX Makefile, and a small awk program to extract the timing results for inclusion in tables like the ones below.

Here is a table of compiler options needed to enable OpenMP directives during compilation:
    Vendor        Compiler    Option
    Compaq/DEC    f90         -omp
    Compaq/DEC    f95         -omp
    IBM           xlf90_r     -qsmp=omp -qfixed
    IBM           xlf95_r     -qsmp=omp -qfixed
    PGI           pgf77       -mp
    PGI           pgf90       -mp
    PGI           pgcc        -mp
    PGI           pgCC        -mp
    SGI           f77         -mp

Once you have compiled with OpenMP support, the executable may still not run multithreaded, unless you preset an environment variable that defines the number of threads to use. On most of the above systems, this variable is called OMP_NUM_THREADS. This has no effect on the IBM systems; I'm still trying to find out what is expected there.

When the Compaq/DEC benchmark below was run, there was one other single-CPU-bound process on the machine, so we should expect to have only 3 available CPUs. As the number of threads increases beyond the number of available CPUs, we should expect a performance drop, unless those threads have idle time, such as from I/O activity. For this simple benchmark, the loop is completely CPU bound. Evidently, 3 threads make almost perfect use of the machine, at a cost of only two simple OpenMP directives added to the original scalar program.

Plot of Compaq/DEC Alpha 4100-5/466 speedup
Compaq/DEC Alpha 4100-5/466: Four 466MHz CPUs
100,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                   8.310           1.000
            2                   4.030           2.062
            3                   2.780           2.989
            4                   2.130           3.901
            5                   3.470           2.395
            6                   2.930           2.836
            7                   2.520           3.298
            8                   2.280           3.645

Plot of Intel Pentium-III/600 speedup
Intel Pentium III: Two 600 MHz CPUs
100,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                   6.210           1.000
            2                   3.110           1.997
            3                   4.000           1.552
            4                   4.390           1.415

Plot of SGI Origin-200 speedup
SGI Origin 200: Four 195MHz R10000 CPUs
100,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                  28.61            1.000
            2                  14.33            1.997
            3                   9.61            2.977
            4                   7.63            3.750
            5                   9.79            2.922
            6                   9.80            2.919
            7                   9.85            2.905
            8                  13.15            2.176

The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.

The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread is reduced, and eventually, communication and scheduling overhead dominates computation. Consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency, the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to ever find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.

Plot of SGI Origin 2000 speedup
SGI Origin 2000: Sixty-four 195MHz R10000 CPUs
300,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                  32.651           1.000
            2                  16.348           1.997
            3                  10.943           2.984
            4                   8.272           3.947
            5                   7.178           4.549
            6                   5.794           5.635
            7                   4.927           6.627
            8                   4.446           7.344
            9                   4.021           8.120
           10                   3.577           9.128
           11                   3.409           9.578
           12                   3.021          10.808
           13                   2.928          11.151
           14                   2.645          12.344
           15                   2.493          13.097
           16                   2.414          13.526
           17                   2.208          14.788
           18                   2.170          15.047
           19                   2.051          15.920
           20                   2.051          15.920
           21                   2.082          15.683
           22                   1.791          18.231
           23                   1.824          17.901
           24                   2.457          13.289
           25                   2.586          12.626
           26                   3.134          10.418
           27                   5.200           6.279
           28                   5.454           5.987
           29                   3.431           9.516
           30                   2.427          13.453
           31                   3.021          10.808
           32                   2.418          13.503
           33                   5.092           6.412
           34                   7.601           4.296
           35                   8.790           3.715
           36                   6.369           5.127
           37                   6.232           5.239
           38                   5.588           5.843
           39                   6.470           5.047
           40                   7.166           4.556
           41                   6.218           5.251
           42                   7.450           4.383
           43                   6.298           5.184
           44                   6.475           5.043
           45                  15.411           2.119
           46                   7.466           4.373
           47                   8.293           3.937
           48                   6.872           4.751
           49                   8.884           3.675
           50                   8.006           4.078
           51                   9.614           3.396
           52                  25.223           1.294
           53                  10.789           3.026
           54                  32.958           0.991
           55                  35.816           0.912
           56                  36.213           0.902
           57                   8.301           3.933
           58                  11.487           2.842
           59                  71.526           0.456
           60                  10.361           3.151
           61                  52.518           0.622
           62                  33.081           0.987
           63                  32.493           1.005
           64                  95.322           0.343
Plot of Compaq AlphaServer ES40 DEC6600/500 speedup
Compaq AlphaServer ES40 DEC6600/500
(4 EV6 21264 CPUs, 500 MHz, 4GB RAM)
OSF/1 4.0F
1,000,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                  26.470           1.000
            2                  13.260           1.996
            3                   8.840           2.994
            4                   6.650           3.980
            5                   8.080           3.276
            6                   6.770           3.910
            7                   6.850           3.864
            8                   6.670           3.969
            9                   7.200           3.676
           10                   7.130           3.712
           11                   7.120           3.718
           12                   6.690           3.957
           13                   7.180           3.687
           14                   7.300           3.626
           15                   7.170           3.692
           16                   6.710           3.945
Plot of Compaq AlphaServer ES40 Sierra/667 speedup
Compaq AlphaServer ES40 Sierra/667
(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM)
100,000,000 iterations
    Number of threads   Wallclock Time (sec)   Speedup
            1                   2.500           1.000
            2                   1.600           1.562
            3                   1.300           1.923
            4                   1.500           1.667
            5                   2.000           1.250
            6                   2.000           1.250
            7                   1.800           1.389
            8                   1.200           2.083
            9                   1.500           1.667
           10                   1.900           1.316
           11                   1.900           1.316
           12                   1.900           1.316
           13                   3.200           0.781
           14                   2.400           1.042
           15                   1.900           1.316
           16                   2.200           1.136
           17                   1.900           1.316
           18                   1.800           1.389
           19                   2.100           1.190
           20                   1.600           1.562
           21                   2.600           0.962
           22                   1.500           1.667
           23                   1.800           1.389
           24                   1.600           1.562
           25                   1.500           1.667
           26                   2.100           1.190
           27                   1.800           1.389
           28                   1.700           1.471
           29                   2.200           1.136
           30                   2.400           1.042
           31                   2.100           1.190
           32                   2.500           1.000
           33                   2.500           1.000
           34                   1.900           1.316
           35                   1.800           1.389
           36                   2.500           1.000
           37                   1.600           1.562
           38                   1.600           1.562
           39                   2.200           1.136
           40                   2.500           1.000
           41                   2.200           1.136
           42                   1.500           1.667
           43                   3.100           0.806
           44                   2.400           1.042
           45                   2.500           1.000
           46                   2.400           1.042
           47                   2.500           1.000
           48                   1.600           1.562
           49                   3.300           0.758
           50                   2.200           1.136
           51                   2.600           0.962
           52                   3.200           0.781
           53                   2.400           1.042
           54                   1.800           1.389
           55                   3.000           0.833
           56                   4.900           0.510
           57                   1.800           1.389
           58                   2.700           0.926
           59                   3.100           0.806
           60                   2.700           0.926
           61                   3.600           0.694
           62                   3.000           0.833
           63                   2.300           1.087
           64                   3.700           0.676
Sun SPARC Enterprise T5240
(two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM)
Solaris 10
10^8 iterations
Plot of Sun SPARC Enterprise T5240 speedup
10^9 iterations
Plot of Sun SPARC Enterprise T5240 speedup
10^10 iterations
Plot of Sun SPARC Enterprise T5240 speedup
Test machine for benchmarking (vendor withheld)
(4 CPUs, 16 threads/CPU) GNU/Linux
Plot of test machine speedup