The joy of OpenMP

I was hand-optimizing a large F90 codebase a month ago and found various instances where PARALLEL DO was not cutting it. For plain vanilla nested loops it is ideal, of course. In reality, when hand-optimizing code, there is usually redundancy in the form of calculations repeated in the innermost loop, and these are best moved to the outer loops. But then one must be careful to avoid race conditions: temporaries hoisted out of the inner loop typically have to be made PRIVATE to each thread. While this is entirely doable, and best checked with a thread error detector like Helgrind, performance and indeed scaling with cores can be poor, depending on memory access patterns. A much better and more scalable way to parallelize is to divide the work among the threads by hand inside a PARALLEL region, e.g.:

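!$OMP PARALLEL PRIVATE(iam,nthreads,chunk,istart,iend) SHARED(A,B,C)

nthreads = omp_get_num_threads()
iam = omp_get_thread_num()
! Ceiling division, so that every one of the N rows is assigned to some thread
chunk = (N+nthreads-1)/nthreads
istart = iam*chunk + 1
iend = min((iam+1)*chunk,N)
! Assumes the first argument of MatrixMult is the number of rows in the block,
! i.e. iend-istart+1; threads left with no rows (istart > iend) skip the call
IF (istart <= iend) CALL MatrixMult(iend-istart+1, M, N, A(istart:iend,1:M), B, C(istart:iend,1:N))

!$OMP END PARALLEL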


Without knowing the exact logic of the MatrixMult routine, one can see that the input matrix A is divided among the threads in chunks of rows, and so is the output C. The beauty of wrapping the parallel region around the routine is also encapsulation: the routine itself can contain horrible logic, dependencies between loops, etc., as long as each call writes only to its own rows of C.
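To make the pattern concrete, here is a minimal, self-contained sketch. It rests on a few assumptions that are not from the original codebase: A is N x M, B is M x N and C is N x N; the first argument of MatrixMult is the number of rows in the block; and the body of MatrixMult is a hypothetical stand-in (a naive triple loop) for whatever the real routine does. The sizes are made up for illustration.

```fortran
program chunked_matmul
  use omp_lib
  implicit none
  ! Illustrative sizes only: A is N x M, B is M x N, C is N x N
  integer, parameter :: N = 7, M = 3
  real :: A(N,M), B(M,N), C(N,N)
  integer :: iam, nthreads, chunk, istart, iend, i, j

  ! Fill the inputs with simple, reproducible values
  do j = 1, M
    do i = 1, N
      A(i,j) = real(i + j)
    end do
  end do
  do j = 1, N
    do i = 1, M
      B(i,j) = real(i - j)
    end do
  end do
  C = 0.0

  !$omp parallel private(iam, nthreads, chunk, istart, iend) shared(A, B, C)
  nthreads = omp_get_num_threads()
  iam      = omp_get_thread_num()
  chunk    = (N + nthreads - 1) / nthreads   ! ceiling of N/nthreads
  istart   = iam*chunk + 1
  iend     = min((iam + 1)*chunk, N)
  ! Threads left without rows (more threads than chunks) do nothing
  if (istart <= iend) then
    call MatrixMult(iend - istart + 1, M, N, A(istart:iend, 1:M), B, &
                    C(istart:iend, 1:N))
  end if
  !$omp end parallel

  ! Compare against the intrinsic as a sanity check
  print *, 'max abs difference from matmul: ', maxval(abs(C - matmul(A, B)))

contains

  ! Hypothetical stand-in for the real routine: Z = X * Y on one block of rows.
  ! Assumed-shape dummies let the strided row sections be passed without copies.
  subroutine MatrixMult(nrows, ninner, ncols, X, Y, Z)
    integer, intent(in) :: nrows, ninner, ncols
    real, intent(in)    :: X(:,:), Y(:,:)
    real, intent(out)   :: Z(:,:)
    integer :: ir, jc, k
    do jc = 1, ncols
      do ir = 1, nrows
        Z(ir,jc) = 0.0
        do k = 1, ninner
          Z(ir,jc) = Z(ir,jc) + X(ir,k)*Y(k,jc)
        end do
      end do
    end do
  end subroutine MatrixMult

end program chunked_matmul
```

With gfortran this should build with something like gfortran -fopenmp chunked_matmul.f90, and the thread count can be varied through OMP_NUM_THREADS. Each thread reads all of B but writes only rows istart to iend of C, which is why no synchronization is needed inside the region.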

