code profile w/ bash/awk/octave

A highly useful task in computing is profiling code by sections, in order to find bottlenecks, particularly with regards to scaling across cores. While many profilers will do this to some degree, there’s no substitute for rolling your own. I like to use awk and octave commands within a bash script to accomplish this. For instance a recent example involved f90 code utilizing !$OMP pragmas. The codebase was significant and well written, however it consisted of > 150 subroutines and additional loops in drivers/main; inserting timing statements and parsing by hand was out of the question. The first step was to gather some baseline numbers eg.,

nohup ./Execute > foo0.txt &
nohup ./Execute > foo2.txt &
nohup ./Execute > foo4.txt &
nohup ./Execute > foo8.txt &

Implicit to each Execute script was environmental variables eg., for setting CPU affinity and of course OMP_NUM_THREADS. The former is an interesting topic in it’s own right, there’s an interesting article here. Binding OpenMP threads to physical processing units can provide significant gains as well- but I digress. The next step in profiling is to determine subroutine names:

// determine subroutine names

awk 'BEGIN{j=0;} /CALL/ {for(i=1;i<=NF;i++){if($i=="CALL"){print j,NR,$(i+1)}}j++;}' src/source.f90 | sed 's/(/ /' | awk '{print $1,$2,$3}' > subroutines.txt

In this manner, a mapping between a literal name in the source and a numerical position is created, for later use in creating statistics. The next step is to place some timing statements around subroutine names, and declare the variables used by the timing functions:

awk 'BEGIN{j=0} \!/CALL/ {print $0}; /CALL/ {print "call date_and_time(values=time_array_0)";
print "start_time = time_array_0(6)*60 + time_array_0(7) + 0.001*time_array_0(8)"; print $0;
print "call date_and_time(values=time_array_1)";
print "finish_time=time_array_1(6)*60 + time_array_1(7) + 0.001*time_array_1(8)";
print "write(6, \47 ( 8x,1a,1f16.6 ) \47 ) \47 TIME/CALL No. : ",j,"\47,1000*(finish_time-start_time)"; j++}' source.f90 > foo.txt

awk '{print $0}; /IMPLICIT NONE/ { print "INTEGER :: time_array_0(8), time_array_1(8)";
print "REAL :: start_time, finish_time";}' src/foo.txt > src/foo.f90

In this manner we have taken the original source and produced a file which after building will provide times and subroutine numbers throughout execution. It remains then to copy the original source away out of reach, swap in the new source and rebuild:

//cp and rebuild
cp source.f90 source.f90_
mv foo.f90 source.f90
cd ..
make clean

// create records as before
nohup ./Execute > fooSub8_5.txt &

//filter reports for time steps no. & subroutine no. & time
awk 'BEGIN{j=0} /tsteps/ {j++}; /TIME\/CALL/ {print j,$4,$5}' fooSub0_5.txt > resSub0_5.txt

In this example I’ve kept track also of time iterations performed by the code; records 4 and 5 have the timing and subroutine call number information. Using Octave this information is readily examined; in this case each subroutine had multiple invocations in a single main execution. It was therefore useful to bin the data and observe how the times changed with increased cores, in order to find bottlenecks and target problematic code. Below is an example of using hist to bin single core time data in octave.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s