Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I thought about which factors would influence the static scheduling overhead in OpenMP. In my opinion it is influenced by:

  • CPU performance
  • specific implementation of the OpenMP run-time library
  • the number of threads

But am I missing further factors? Maybe the size of the tasks, ...?

And furthermore: is the overhead linearly dependent on the number of iterations? In that case I would expect that, with static scheduling and 4 cores, the overhead increases linearly with 4*i iterations. Correct so far?

EDIT: I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead and time spent in synchronisation and critical section overhead.



1 Answer

You need to separate the overhead for OpenMP to create a team/pool of threads from the overhead for each thread to work on its own set of iterations in a for loop.

Static scheduling is easy to implement by hand (which is sometimes very useful). Let's consider the two most important static schedules, schedule(static) and schedule(static,1), and then compare them to schedule(dynamic,chunk).

#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);

is equivalent to (but not necessarily equal to)

#pragma omp parallel
{
    /* each thread works on its own contiguous block of iterations */
    int start = omp_get_thread_num()*N/omp_get_num_threads();
    int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
    for(int i=start; i<finish; i++) foo(i);
}

and

#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);

is equivalent to

#pragma omp parallel
{
    /* round-robin: thread k handles iterations k, k+nthreads, k+2*nthreads, ... */
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for(int i=ithread; i<N; i+=nthreads) foo(i);
}

From this you can see that it's quite trivial to implement static scheduling and so the overhead is negligible.

On the other hand if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand it's more complicated:

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    /* atomically fetch-and-increment the shared iteration counter */
    #pragma omp atomic capture
    i = cnt++;
    if(i>=N) break;
    foo(i);
}

The atomic capture clause requires OpenMP >= 3.1. If you want to do this with OpenMP 2.0 (e.g. with MSVC) you need to use critical instead, like this

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp critical   
    i = cnt++;
    if(i>=N) break;
    foo(i);
} 

Here is an equivalent to schedule(dynamic,chunk) (I have not optimized this using atomic accesses):

int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        /* grab the next chunk of iterations under mutual exclusion */
        #pragma omp critical
        {
            start = cnt;
            finish = cnt+chunk < N ? cnt+chunk : N;
            cnt += chunk;
        }
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}

Clearly the atomic (or critical) accesses to a shared counter cause more overhead than static scheduling, where each thread computes its bounds once. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead: one synchronized counter update is amortized over chunk iterations.

