Hagedoorn Construction

November 25, 2009

I’ve been writing GPU code for Kirchhoff Time Migration, a technique applied in seismic imaging. Among the tools used by geophysicists and other workers in oil + gas to perform seismic imaging, the most ubiquitous is probably KTM. This method is directly preceeded by the seismic image construction technique developed primarily by Hagedoorn, although both approaches essentially represent Greens functions solutions to the scalar wave equation. Data surveys are constructed using an array of pulsed sources and corresponding receivers. Data recorded at receiver locations (referred to as ‘traces’) generally correspond to a particular source-receiver pair, and in the simplest case source and receiver are coincident. An image into the earth is constructed by first calculating the traveltime for a given src/rec position and image point, using these values as well as the one way traveltime and velocity at the depth in question. This traveltime is then used as an index into the trace data, to find the amplitude point (energy) which contributes to the image point. Many points from different traces constructively and destructively interfere to create the subsequent image. The initial approach developed by Hagedoorn used little more than a compass and ruler to perform this imaging task, which is to quote the vernacular ‘old school’ and provides great pedagogical insight into KTM. There is a fantastic article by Bleistein here, and below is an example of a Hagedoorn construction performed in octave; the surface is the superposition of the arcs, like figure 2 of the article.



UNIX redirect,pipes & reconfigurable data processing

November 12, 2009

In various sciences one is often invested in processing voluminous data which follows a given format, with many examples following (hopefully) the same format. A typical data file might have a line of environment variables, followed by raw data, which repeats in columns and/or rows eg.,

4,23,45,56,78
12,23,45,56,67,78,89,87,…,72
32,34,23,21,23,56,43,23,…,34
34,54,32,45,89,76,54,98,…,58
67,67,88,32,34,21,22,97,…,51

A typical section of C code which would use this might look like:

float config[8];
scanf(“%f,%f,%f,%f,%f”,&config[1],&config[2],&config[3],
&config[4],&config[5]);

printf(“%f,%f,%f,%f,%f\n”,config[1],config[2],config[3],
config[4],config[5]);

int RECORDS = (int) config[1];
int i;
if ( RECORDS < 100){
int SIZE = sizeof(float)*RECORDS;

float *deltaCSMean = malloc(SIZE);
float *deltaCSSigma = malloc(SIZE);
float *CQMean = malloc(SIZE);
float *CQSigma = malloc(SIZE);
float *etaMean = malloc(SIZE);
float *etaSigma = malloc(SIZE);
float *brdF2 = malloc(SIZE);
float *brdF1 = malloc(SIZE);
float *amplitude = malloc(SIZE);

for (i=0; i< RECORDS; i++){
scanf(“%f,%f,%f,%f,%f,%f,%f,%f,%f”,&deltaCSMean[i],
&deltaCSSigma[i],&CQMean[i],&CQSigma[i],
&etaMean[i],&etaSigma[i],
&brdF2[i],&brdF1[i],&amplitude[i]);
printf(“%f,%f,%f,%f,%f,%f,%f,%f,%f\n”,deltaCSMean[i],
deltaCSSigma[i],CQMean[i],CQSigma[i],
etaMean[i],etaSigma[i],brdF2[i],brdF1[i],amplitude[i]);
}
}

Using redirect a simple invocation of executable foo using input foo1.txt and output foo2.txt would of course be:

./foo < foo1.txt > foo2.txt

which is fine until one needs different data from the file. Supposing in this example data repeats along columns, then one can use awk and a pipe to reshape the data input, removing the need to edit and recompile the C source for different data subsets:

gawk ‘BEGIN{FS=”,”; print “4,12,23,44,56″; getline;} {for(i=lowBnd;i<upBnd;i++){printf(“%f,”,$i) }; printf(“%f\n”,$upBnd) }’ foo1.txt | ./foo > foo2.txt

where lowBnd and upBnd correspond to the data column limits in the input file. Note that the environment variables are written during the BEGIN block, allowing the specification of (for instance) a different job for the new data set.


tutorial from Nvidia GTC

October 16, 2009

My talk is attached in *pdf. Overall the conference felt like a rock’n'roll show, my first non-academic meeting of this nature. Thoroughly enjoyable although a little difficult to tell what people are doing since it’s mostly proprietary work. The announcement of fermi was certainly exciting; ~8x times the double prec performance, ECC and many more cores/memory…

Talk outline:

  • Company Background

  • CUDA accelerates Geophysics:
    • Data Processing w/ Linear Algebra

    • Sparse Matrices
    • Seismic Imaging:Kirchhoff Time Migration
      • Frequency Filtering

      • Traveltime/Weight Calculations
      • GPU Migration
  • Summary

nvidiaSummit0909


CUDA + OCTAVE

September 30, 2009

I’ve been experimenting with CUDA and OCTAVE; there is at least one company who have produced GPU enabled MEX functions. The big difficulty is of course that there is no support for internal floats within OCTAVE (afaik) and similarly with Matlab. However if one can leave the data and work with it on the device for some time, then there are only two explicit conversions btwn float <-> double needed. Or you could sacrifice some performance in CUDA in return for using doubles. At any rate here’s an example Makefile, happy experimenting. For this example I gutted the matrix Mul example from the CUDA sdk; the wrapper *oct source code (*cc, really C++ with octave extensions) contains an extern C section which references the cuda kernel (*cu). Don’t forget to indent instructions under ‘all’ with a single tab for make.


#! /usr/bin/env make
#make file for octfile/cuda
#Mac OSX 10.5.8 intel core 2 duo
#cuda include/lib
CUDA_INC_PATH=/usr/local/cuda/include
CUDA_LIB_PATH=/usr/local/cuda/lib
#octfile compiler
CC=mkoctfile
#basic flags
CFLAGS= -I$(CUDA_INC_PATH)
LDFLAGS= -L$(CUDA_LIB_PATH) -lcudart -lcuda

all:
$(CC) $(CFLAGS) -c cudaMatrixMul.cc -o cudaMatrixMul.o
nvcc  -c matMul_kernel.cu -o matMul_kernel.o -Wall
$(CC) $(LDFLAGS) cudaMatrixMul.o matMul_kernel.o -o cudaMatrixMul.oct

#clean:
rm -f  cudaMatrixMul.o matMul_kernel.o


CUDA/GPU install

April 15, 2009

assuming you’re installing GPU + CUDA from scratch, here’s the steps
for a linux i686:

1. Find and download appropriate driver from Nvidia. With the old GPU card still in place, switch to terminal ‘ctrl-alt-F1′ and as root stop the X-server by issuing ‘init 3′. Then install via ’sh NV*run’, answering yes to all the prompts. Issue ‘init 5′ and shut down… when done, replace video card

2. Reboot and install the CUDA toolkit, don’t forget to ‘chmod +x cud*run’ before issuing ’sh cud*run’. After answering yes to all the prompts & installing, add to .bash_profile:

PATH=$PATH:/usr/local/cuda/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib

export PATH
export LD_LIBRARY_PATH

3. Install the CUDA SDK, you should be good to go, but may also need to disable SELinux (eg., temporarily: ‘echo 0 > /selinux/enforce’)


Expression Templates in C++

April 15, 2009

High performance C++ has traditionally been a misnomer, particularly with regards to linear algebra classes and templates. Handcoded C and FORTRAN are the usual approaches taken in the physical sciences, but these lack the portability, maintainability and overall elegance of (object oriented) C++. It was brought to my attention recently that there are really clever ways of using the compiler for performance hacks in C++. Amongst these methods are expression templates by Todd Veldhuizen et al.

The difficulty in vanilla C++ array classes is the overhead associated with operations on arrays,  eg., addition and multiplication. Temporary objects are created, requiring precious memory bandwidth and unnecessary loops.  Expression templates is an ingenious way around this, by essentially nesting templates, and in conjunction with overloaded operators, creating parse trees at compile time, which effectively inline expressions. The following is pseudocode from Todd Veldhuizen’s notes:

struct plus{}; //overloaded operator
class Array{}; //the class

//X is a node in the parse tree; method
//works b/c class can take itself as template
//parameter

template<typename Left, typename Op, typename Right>
class X{};

//overloaded operator ‘+’ for use in nesting templated
//expressions

template<class T>
X<T, plus, Array> operator+(T,Array)
{

return X<T,plus, Array>()

}

If A,B,C represent class instances, then “D=A+B+C” looks like:

D=A+B+C;
=X<Array,plus,Array>() + C;
X<X<Array,plus,Array>,plus,Array>();

which gives performance comparable to handcoded C. For more information see the article.


CUDA part I

April 10, 2009

NMR can provide unparalleled insight into local atomic structural details, although in the solid state interpretation of lineshapes is hindered by anisotropic broadening, eg., the attached MQMAS spectra of two distinct chemical sites. The situation grows much worse for disordered materials where parameters take on a stochastic nature. I’ve developed some HPC software to enable the simulation of MQMAS spectra for these scenarios, and now my attention is turned to implementation using general purpose graphical processor unit (GPGPU) programming. For the paltry sum of 70USD you can have a little super computer of your own in the form the NVIDIA GeForce 8400 GPU. I installed today and have started playing…

foo


CP2k install howto

April 8, 2009

There is the possibility that I’ll be using CP2k in the near future, so I’ve been getting better acquainted. By all accounts this is a distribution that makes you work for its love, which is fine b/c I’m not needy. It also represents a new domain for me, mixed Gaussian/plane waves and molecular dynamics, I look forward to what it can do. In the meantime, Lady and Gentleman, I give you 5 READMEs and 7 hours later the complete steps for a full install on a fresh Linux i686 OS


0.0 Get Coffee

1.0 Install g95:

# wget -O – http://ftp.g95.org/g95-x86-linux.tgz | tar xvfz -

//check g95-install/bin for binary name, make symbolic link, eg.,

# ln -s /root/g95-install/bin/i686-unknown-linux-gnu-g95 /bin/g95

2.0 Install Lapack:

# wget -O – http://www.netlib.org/lapack/lapack.tgz | tar xvfz -

# cd lap*

//cp & edit the make.inc file to use g95 (edit FORTRAN & LOADER lines)

# cp make.inc.example make.inc
# vim make.inc

//make

# make lib

2.1 Lunch

3.0 Install & run ATLAS:

//get file from sourceforge & gunzip etc, cd into ATLAS,
//make a dir

# mkdir MyObj
# cd MyObj

//run configure specifying compiler and path to LAPACK eg.,

# /root/ATLAS/configure -C if /bin/g95 –with-netlib-lapack=/root/lapack-3.2/lapack_LINUX.a

//run all variants of make (will take quite a while :) )

# make
# make check
# make ptcheck
# make time
# make install

3.1 Coffee

4. Install FFTW3:

# wget -O – ftp://ftp.fftw.org/pub/fftw/fftw-3.2.1.tar.gz | tar xvfz -
# cd fft*
# ./configure
# make
# make install

5. Install cp2k

//download eg., via cvs:

# cvs -d :pserver:anonymous@cvs.cp2k.berlios.de:/cvsroot/cp2k co cp2k

// cd into cp2k/arch, create/edit arch file, eg., for Linux 686 w/ g95, serial;

# cp Linux-i686-g95.sopt foo.sopt

// edit with vim, making sure order of libraries is lapack/blas/cblas/atlas:

LIBS = /usr/local/atlas/lib/liblapack.a /usr/local/atlas/lib/libf77blas.a /
usr/local/atlas/lib/libcblas.a /usr/local/atlas/lib/libatlas.a

// cd into cp2k/makefiles,

# make ARCH=foo VERSION=sopt
# make install

// if build succesful, should have executable in ~/cp2k/exe/foo, make link eg.,

# ln -s /root/cp2k/exe/foo/cp2k.sopt /bin/cp2k

5.1 Dinner

5.2 Coffee

6. Watch tv for a bit

7. Edit & run ~cp2k/tools/do_regtest in future


emergence via connectedness

March 19, 2009

Cellular Automata have proved fertile research for at least 50 years, providing models for everything ranging from crystal growth to fluid dynamics. Discrete rules are applied at each time step to evolve a particular cellular state; ordered patterns emerge from random initial configurations. The example given here using the connected components labeling (CCL) algorithm demonstrates how topology also can give rise to order from randomness.

in_test2 out_test1

CCL in OCTAVE:

lab=1; freq=1; x=0; y=0;
% main loop(s)

for i=2:c-1
for j=2:d-1

if ( a(i,j)!=0 & label(i,j)==0)

x=i;
y=j;

label(i,j)=lab;
while (x!=1) & (y!=1) & (x!=c) & (y!=d)

[mm,x,nn,y]=pop_s(x,y);

if (a(mm-1,nn-1)==1 && label(mm-1,nn-1)==0 )

[x,y]=push_s(mm-1,x,nn-1,y);
label(mm-1,nn-1)=lab;
freq(lab)++;
end

if (a(mm-1,nn)==1 && label(mm-1,nn)==0)

[x,y]=push_s(mm-1,x,nn,y);
label(mm-1,nn)=lab;
freq(lab)++;
end

if (a(mm-1,nn+1)==1 && label(mm-1,nn+1)==0)

[x,y]=push_s(mm-1,x,nn+1,y);
label(mm-1,nn+1)=lab;
freq(lab)++;
end

if (a(mm,nn-1)==1 && label(mm,nn-1)==0)

[x,y]=push_s(mm,x,nn-1,y);
label(mm,nn-1)=lab;
freq(lab)++;
end

if (a(mm,nn+1)==1 && label(mm,nn+1)==0)

[x,y]=push_s(mm,x,nn+1,y);
label(mm,nn+1)=lab;
freq(lab)++;
end

if (a(mm+1,nn-1)==1 && label(mm+1,nn-1)==0)

[x,y]=push_s(mm+1,x,nn-1,y);
label(mm+1,nn-1)=lab;
freq(lab)++;
end

if (a(mm+1,nn)==1 && label(mm+1,nn)==0)

[x,y]=push_s(mm+1,x,nn,y);
label(mm+1,nn)=lab;
freq(lab)++;
end

if (a(mm+1,nn+1)==1 && label(mm+1,nn+1)==0)

[x,y]=push_s(mm+1,x,nn+1,y);
label(mm+1,nn+1)=lab;
freq(lab)++;
end

end %endwhile

lab++;
freq=[freq, 1];

end %endif

end
end


CPMG simulation/modeling

February 23, 2009

A script for simulating the CPMG sequence using the density matrix, with Gauss/Gauss envelope/spikelet broadening, for a forthcoming paper.


function [tot, gbb,t]=d_cpmg(N,tau,n,delta,r,rr)

%cpmg experiment evolution wjb 02/09
%simple/ideal pulse sequence:

%90y-n*[-180x-]

%N time steps
%tau
%n 180 loops
%delta chem shift
%r envelope gauss br^2
%rr spikelet gauss br^2

%matrix for I_x & I_y

a=[0 1/2; 1/2 0]; b=[0 -i/2; i/2 0];

%time step & initial rho

t=tau/(N-1); rho=a;

for k=1:n
sig(1)=trace(rho*a); sigi(1)=trace(rho*b);

for j=2:N/2

%iterate; free precession for tau/2

rho = [exp(-i*t*delta) 0; 0 exp(i*t*delta)]*rho*[exp(i*t*delta) 0; 0 exp(-i*t*delta)];

sig(j)=trace(rho*a); sigi(j)=trace(rho*b);

end

%apply 180x

rho = [0 exp(i*pi/2); exp(i*pi/2) 0]*rho*[0 exp(-i*pi/2); exp(-i*pi/2) 0];

%iterate; free precession for tau/2

sig(N/2+1)=trace(rho*a); sigi(N/2+1)=trace(rho*b);

for j=2:N/2

rho = [exp(-i*t*delta) 0; 0 exp(i*t*delta)]*rho*[exp(i*t*delta) 0; 0 exp(-i*t*delta)];

sig(j+N/2)=trace(rho*a); sigi(j+N/2)=trace(rho*b);
end

if k>1

tot=[tot (sig+i*sigi)];

else

tot=(sig+i*sigi);
end

end

tt=-tau/2:t:t*(N-1); gb=exp(-rr.*tt.^2); gb = [gb(N/2+1:N) gb(1:N/2)]; gbb=gb;

for i=1:n-1

gbb=[gbb gb];

end

t=0:t:t*(n*N-1); gbb=gbb.*exp(-r.*t.^2);


cpmg1