I was doing some hand optimization of a large f90 codebase a month ago and found various instances where PARALLEL DO was not cutting it. For plain vanilla nested loops it is ideal. In reality, though, when hand optimizing code there is usually redundancy in the form of repeated calculations in the innermost loop, and these are best hoisted to outer loops. But then one must be careful to avoid race conditions. While this is entirely doable, and best checked with a race detector like Helgrind, performance, and indeed scaling with cores, can be poor depending on memory access patterns. A much better and more scalable way to parallelize is via work-sharing constructs, e.g.:
    !$OMP PARALLEL PRIVATE(iam, nthreads, chunk, istart, iend) SHARED(A, B, C)
      nthreads = omp_get_num_threads()
      iam      = omp_get_thread_num()
      chunk  = (N + nthreads - 1) / nthreads   ! ceiling(N/nthreads) rows per thread
      istart = iam*chunk + 1
      iend   = min((iam + 1)*chunk, N)
      ! Fortran bounds are inclusive, so this thread owns iend - istart + 1 rows
      CALL MatrixMult(iend - istart + 1, M, N, A(istart:iend, 1:M), B, C(istart:iend, 1:N))
    !$OMP END PARALLEL
Without knowing the exact logic of the MatrixMult routine, one can see that the input matrix A is divided among threads in chunks of rows, and hence so is the output C. The beauty of placing the work-sharing construct around the routine call is also encapsulation: the routine itself can contain horrible logic, dependencies between loops, etc., and the parallelization stays correct as long as each thread writes only to its own rows of C.