
Begin Parallel Programming With OpenMP

Improve program performance using OpenMP on multi-core systems

Introduction

Parallel processing has always been an interesting way to improve program performance. Lately, more computers ship with multiple processors or multi-core CPUs, making parallel processing available to the masses. However, having multiple processing units in the hardware does not make existing programs take advantage of them: programmers must take the initiative and add parallel processing capabilities to their programs to fully utilize the available hardware. OpenMP is a set of programming APIs consisting of several compiler directives and a library of support functions. It was first developed for use with Fortran, and it is now available for C and C++ as well.

Types of Parallel Programming

Before we begin with OpenMP, it is important to know why we need parallel processing. In a typical case, sequential code executes in a single thread, which runs on a single processing unit. Thus, if a computer has 2 or more processors (or 2 cores, or 1 processor with HyperThreading), only a single processor will be used for execution, and the other processing power is wasted. Rather than letting the other processor sit idle (or process threads from other programs), we can use it to speed up our algorithm.

Parallel processing can be divided into two groups: task-based and data-based.

Task-based: Divide different tasks among different CPUs to be executed in parallel. For example, a printing thread and a spell-checking thread running simultaneously in a word processor. Each thread is a separate task.

Data-based: Execute the same task, but divide the workload on the data over several CPUs. For example, to convert a color image to grayscale, we can convert the top half of the image on the first CPU while the lower half is converted on the second CPU (or spread across as many CPUs as you have), thus finishing in a fraction of the time. (A short sketch of both styles follows this list.)
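As a rough illustration of the two styles, here is a minimal sketch using OpenMP directives (the setup needed to compile them is covered later in this article); print_document, check_spelling and ToGray are hypothetical placeholder functions rather than any real API, and the sections directive itself is not covered further here:

// Task-based: two different tasks run in parallel, one per section.
#pragma omp parallel sections
{
    #pragma omp section
    print_document();      // hypothetical printing task

    #pragma omp section
    check_spelling();      // hypothetical spell-checking task
}

// Data-based: the same task applied to different parts of the data.
#pragma omp parallel for
for(int i = 0; i < numPixels; i++)
    gray[i] = ToGray(rgba[i]);   // hypothetical per-pixel conversion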
There are several methods to do parallel processing:

Use MPI: Message Passing Interface - MPI is best suited for systems with multiple processors and distributed memory, for example a cluster of computers, each with its own local memory. You can use MPI to divide the workload across the cluster and merge the results when it is finished. Available with the Microsoft Compute Cluster Pack.

Use OpenMP: OpenMP is suited for shared-memory systems like the ones on our desktop computers, that is, systems with multiple processors that all share a single memory subsystem. Using OpenMP is just like writing your own smaller threads, but letting the compiler do the work. Available in Visual Studio 2005 Professional and Team Suite.

Use SIMD intrinsics: Single Instruction Multiple Data (SIMD) has been available on mainstream processors in the form of Intel's MMX, SSE, SSE2 and SSE3, Motorola's (or IBM's) AltiVec, and AMD's 3DNow!. SIMD intrinsics are primitive functions that parallelize data processing at the CPU register level. For example, adding two unsigned chars occupies a whole register even though the data type is only 8 bits wide, leaving the remaining 24 bits filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) and operate on them in parallel at the register level. Available in Visual Studio 2005 using SIMD intrinsics, or with the Visual C++ Processor Pack for Visual C++ 6.0.
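As a side illustration only (separate from the OpenMP material, and using the 128-bit SSE2 intrinsics rather than MMX), a single intrinsic can add 16 unsigned chars at once:

#include <emmintrin.h>   // SSE2 intrinsics

// Adds 16 bytes of a to 16 bytes of b in one instruction (wrap-around on overflow).
void AddBytes16(const unsigned char* a, const unsigned char* b, unsigned char* out)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi8(va, vb));
}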



Getting Started

Setting Up Visual Studio

To use OpenMP with Visual Studio 2005, you need:

Visual Studio 2005 Professional or better
A multi-processor or multi-core system to see the speed improvement
An algorithm to parallelize!

Visual Studio 2005 implements the OpenMP 2.0 standard; the runtime lives in the vcomp.dll dynamic library (located under Windows\WinSxS\ in the (ia64/amd64/x86)_Microsoft.VC80.(Debug)OpenMP.(*) directories).

To use OpenMP, you must do the following:

Include <omp.h>
Enable the OpenMP compiler switch in the Project Properties
That's all!
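The switch in question is /openmp (in Visual Studio 2005 it is found under Configuration Properties > C/C++ > Language > OpenMP Support; on GCC the equivalent is -fopenmp). A minimal sketch to verify the setup, using only standard OpenMP runtime functions:

#include <omp.h>
#include <stdio.h>

int main()
{
    // Each thread in the team prints its own index and the team size.
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}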



Understanding the Fork-and-Join Model

OpenMP uses the fork-and-join parallelism model. In fork-and-join, parallel threads are created and branched out (forked) from a master thread to execute an operation, and they remain only until the operation has finished; then all the threads are destroyed (joined), leaving only the master thread.

The process of splitting and joining the threads, including the synchronization of the end result, is handled by OpenMP.
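As a minimal sketch of the model (setup, do_parallel_work and report are hypothetical placeholders for your own code):

setup();                    // runs on the master thread only

#pragma omp parallel        // fork: a team of threads is created here
{
    do_parallel_work();     // hypothetical work, executed by every thread in the team
}                           // join: implicit barrier; the team is destroyed

report();                   // back to the single master thread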


How Many Threads Do I Need?

A typical question is: how many threads do I actually need? Are more threads better? How do I control the number of threads in my code? Is the number of threads related to the number of CPUs?

The number of threads required to solve a problem is generally limited to the number of CPUs you have. As the fork-and-join model above shows, whenever threads are created, a little time is taken to create them and, later, to join the end results and destroy them. When the problem is small and the number of CPUs is less than the number of threads, the total execution time will be longer (slower), because more time is spent creating and switching between threads (due to preemptive scheduling) than actually solving the problem. Whenever a thread context is switched, data must be saved to and loaded from memory, and this takes time.

The rule is simple: since all the threads will be executing the same operation (and hence have the same priority), 1 thread per CPU (or core) is sufficient. The more CPUs you have, the more threads you can create.

Most compiler directives in OpenMP use the OMP_NUM_THREADS environment variable to determine the number of threads to create. You can control the number of threads with the following functions:


// Get the number of processors in this system
int iCPU = omp_get_num_procs();

// Now set the number of threads
omp_set_num_threads(iCPU);

Of course, you can put any value for iCPU in the code above (if you do not want to call omp_get_num_procs), and you can call omp_set_num_threads as many times as you like for different parts of your code, for maximum control. If omp_set_num_threads is not called, OpenMP will use the OMP_NUM_THREADS environment variable.
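For per-region control there is also the num_threads clause, which overrides the current setting for a single directive; a small sketch (not from the original article, with Process as a hypothetical per-element function):

// Run this particular loop with exactly 2 threads,
// regardless of omp_set_num_threads or OMP_NUM_THREADS.
#pragma omp parallel for num_threads(2)
for(int i = 0; i < n; i++)
    Process(i);   // hypothetical per-element work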

Parallel for Loop

Let us start with a simple parallel for loop. The following code converts a 32-bit color (RGBA) image to an 8-bit grayscale image.


// pDest is an unsigned char array of size width * height
// pSrc is an unsigned char array of size width * height * 4 (32-bit)
// To avoid floating point operations, all floating point weights
// in the original grayscale formula
// have been changed to integer approximations

// Use the parallel for pragma to make a parallel for loop
omp_set_num_threads(threads);

#pragma omp parallel for
for(int z = 0; z < height*width; z++)
{
    pDest[z] = (pSrc[z*4 + 0]*3735 + pSrc[z*4 + 1]*19234 + pSrc[z*4 + 2]*9797) >> 15;
}

The #pragma omp parallel for directive will parallelize the for loop according to the number of threads set. The following is the performance gained for a 3264x2488 image on a 1.66GHz Core Duo system (2 cores).


Thread(s) : 1 Time 0.04081 sec
Thread(s) : 2 Time 0.01906 sec
Thread(s) : 4 Time 0.01940 sec
Thread(s) : 6 Time 0.02133 sec
Thread(s) : 8 Time 0.02029 sec

As you can see, by executing the problem using 2 threads on the dual-core CPU, the time is cut by half. However, as the number of threads is increased further, performance does not improve, because of the extra time spent forking and joining the additional threads.
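One way to reproduce this kind of measurement is with omp_get_wtime(), which returns elapsed wall-clock time in seconds; a sketch, assuming pSrc and pDest are already allocated as described above:

double t0 = omp_get_wtime();

#pragma omp parallel for
for(int z = 0; z < height*width; z++)
{
    pDest[z] = (pSrc[z*4 + 0]*3735 + pSrc[z*4 + 1]*19234 + pSrc[z*4 + 2]*9797) >> 15;
}

double t1 = omp_get_wtime();
printf("Thread(s) : %d Time %.5f sec\n", omp_get_max_threads(), t1 - t0);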

Parallel double for Loop

The same problem above (converting color to grayscale) can also be written with a double for loop, like this:


for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;

In this case, there are two solutions.

Solution 1

We have made the inner loop parallel using the parallel for directive. As you can see, when 2 threads are used, the execution time actually increases! This is because, for every iteration of y, a fork-join operation is performed, which in the end contributes to the increased execution time.


for(int y = 0; y < height; y++)
{
    #pragma omp parallel for
    for(int x = 0; x < width; x++)
    {
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
    }
}



Thread(s): 1 Time 0.04260 sec
Thread(s): 2 Time 0.05171 sec

Solution 2

Instead of making the inner loop parallel, the outer loop is the better choice. Here, another piece of syntax is introduced: the private clause. The private clause directs the compiler to give each thread its own copy of the listed variables, so the threads do not interfere with each other through a single shared copy. Here, you can see that the execution time is indeed reduced by half.


int x = 0;
#pragma omp parallel for private(x)
for(int y = 0; y < height; y++)
{
    for(x = 0; x < width; x++)
    {
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
    }
}



Thread(s) : 1 Time 0.04039 sec
Thread(s) : 2 Time 0.02020 sec
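An equivalent way to avoid sharing x, assuming a C++ compiler, is simply to declare it inside the parallel region; variables declared inside the loop body are automatically private to each thread, so no private clause is needed:

#pragma omp parallel for
for(int y = 0; y < height; y++)
{
    for(int x = 0; x < width; x++)   // x is declared inside, so each thread has its own
    {
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
    }
}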

Get to Know the Number of Threads

At any time inside a parallel region, we can obtain the number of OpenMP threads currently running, and the index of the calling thread within that team, with the following functions:

int omp_get_num_threads();  // number of threads in the current team
int omp_get_thread_num();   // index of the calling thread, from 0 to team size - 1
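For example, the thread index can be used to split work by hand instead of relying on a parallel for; a small sketch in which ProcessRow is a hypothetical per-row function:

#pragma omp parallel
{
    int tid  = omp_get_thread_num();    // this thread's index
    int nthr = omp_get_num_threads();   // size of the team

    // Each thread handles every nthr-th row.
    for(int y = tid; y < height; y += nthr)
        ProcessRow(y);                  // hypothetical per-row work
}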

Summary

By using OpenMP, you can gain performance on multi-core systems essentially for free, with little coding beyond a line or two. There is no excuse not to use OpenMP. The benefits are there, and the coding is simple.

There are several other OpenMP directives and clauses, such as firstprivate, critical sections and reductions, which are left for another article.



Points of Interest

You can also perform similar operations using Windows threads, but that would be more difficult to implement because of issues such as synchronization and thread management.

Additional Notes

The original RGB to grayscale formula is given as Y = 0.299 R + 0.587 G + 0.114 B.
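The integer weights used in the code are these coefficients scaled by 2^15 = 32768, which the final >> 15 then divides back out: 0.299 × 32768 ≈ 9797, 0.587 × 32768 ≈ 19234, and 0.114 × 32768 ≈ 3735. (The listing multiplies the byte at offset 0 by 3735 and the byte at offset 2 by 9797, which suggests the source pixels are stored in BGRA channel order; this is an observation about the code, not a statement from the original article.)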


From http://www.codeproject.com/Articles/19065/Begin-Parallel-Programming-With-OpenMP