Category: numerical methods

tttaocp

Here’s a short series of articles I wrote while serving as parallel computing SIG leader at Schlumberger, entitled “Travels Through The Art of Computer Programming”. The series was inspired by reading through Knuth’s famous TAOCP volumes and covers various topics: an RNG based on the unpredictability of the GPU warp scheduler, topics in polynomial and number theory, Padé approximants, encryption, graph algorithms, memory and CPU performance, blockchain, distributed networks, REST and more.

new GPU book

Numerical Computations with GPUs comes out later this year; Pierre-Yves and I contributed a chapter on LU & QR decomposition (the latter using Givens rotations) for batches of dense matrices. We saw some impressive performance improvements for specific problem sizes. QR in particular will benefit from CUDA 6 and the availability of the fast, overflow-safe reciprocal hypotenuse function rhypot(x,y); more details here.
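
The Givens coefficients are where rhypot earns its keep; here is a minimal Octave sketch of the idea (my own illustration, not code from the chapter), emulating rhypot with 1/hypot:

% Compute Givens rotation coefficients c, s so that [c s; -s c]*[a; b] = [r; 0].
% On the GPU the 1/hypot(a,b) below becomes a single rhypot(a,b) call (CUDA 6+),
% avoiding overflow/underflow in the intermediate a^2 + b^2.
function [c, s] = givens_coeffs(a, b)
  if b == 0
    c = 1; s = 0;
  else
    t = 1 / hypot(a, b);
    c = a * t;
    s = b * t;
  end
end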

Nvidia GTC 2013

Really enjoying meeting up with old friends and new, at what has become my favorite conference; a wonderful combination of great science, engineering, design and gaming, all enabled by the venerable GPU. Here’s our talk for tomorrow, on using GPUs to accelerate key algorithms in modeling the hydraulic fracturing process by EM methods. We focus specifically on batched LU and QR decomposition; special thanks and credit to Pierre-Yves for pushing LU over to LUP and deriving some terrific insights through his optimizations.
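
For anyone unfamiliar with the batched setting, here is a tiny Octave sketch of what it means (my illustration only, not the talk’s CUDA code): many small, independent dense factorizations, which on the GPU map naturally onto thread blocks.

% Factor a batch of small dense matrices with partial pivoting (LUP): P*A = L*U.
batch = 256; n = 32;
A = rand(n, n, batch);
L = zeros(n, n, batch); U = L; P = L;
for k = 1:batch
  [L(:,:,k), U(:,:,k), P(:,:,k)] = lu(A(:,:,k));
end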

Nvidia GTC 2013 Accepted Talks

I’m fortunate to work with some very talented people 🙂 Work supported in part by a generous donation of 50 M2070 Nvidia Tesla cards from HP.


Accelerating and Scaling Lanczos Diagonalization with GPGPU

W. J. Brouwer, Sreejith G.J., F. Spiga

Lanczos diagonalization (LD) is an important algorithm for calculating eigenvalues and eigenvectors of large matrices, used in many applications. An example performed countless times throughout the world on a daily basis is latent semantic analysis of documents for search and retrieval. This presentation details work devoted to exploiting the massive parallelism and scalability of GPUs in order to enhance LD for key aspects of condensed matter physics. One significant application area is the diagonalization of the Hamiltonian for the large, dense matrices encountered in studies of the fractional quantum Hall effect. A second application discussed in this work is the Self Consistent Field (SCF) cycle of a Density Functional Theory (DFT) code, Quantum Espresso. Initial results are promising, demonstrating an 18x speedup on GPU over an optimized CPU implementation. Further, the use of MPI in conjunction with NVIDIA GPUDirect allows for scaling to all GPUs across the cluster used in this work.

Estimating Fracture Size in Gas Shale Formations using Smart Proppants and GPGPU

M. K. Hassan, W. J. Brouwer, R. Mittra

This presentation considers a system containing specially engineered particles called smart proppants, which have the potential to serve as sensors for estimating the effective length of fractures in a gas shale formation, a significant factor in predicting the yield of a reservoir. The proppants are suspended and randomly distributed in a background medium, the fracturing fluid. A Monte Carlo method is used to generate the properties and positions of smart proppants within a volume that approximates the fracture zone. Modeling the particles as dipoles, one can construct a large and unwieldy matrix equation, simplified by the application of characteristic basis functions (CBF). The CBF method applies LU decomposition and SVD-based methods to matrix sub-blocks in order to produce subsequent solutions. This presentation will discuss the overall application and solution method, as well as the results of using GPU implementations of the key algorithms, which provide an effective 30-40x performance improvement over a single CPU.
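
To make the sub-block step concrete, here is a hedged Octave sketch of the general CBF recipe (my own illustration with made-up sizes and threshold, not the authors’ code): factor a self-interaction sub-block with LU, solve it against a set of excitations, then keep the dominant left singular vectors as the characteristic basis functions.

% Stand-in self-interaction sub-block and excitation set (random, for illustration only).
Zii = rand(200) + 200*eye(200);    % diagonally dominant so the solve behaves
E   = rand(200, 32);
[L, U, P] = lu(Zii);               % LU factorization of the sub-block
J = U \ (L \ (P * E));             % responses to the excitations
[Uj, S, ~] = svd(J, 0);            % economy-size SVD of the responses
k = sum(diag(S) / S(1,1) > 1e-3);  % truncation threshold (assumed)
CBF = Uj(:, 1:k);                  % characteristic basis functions for this block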

Large Scale Lanczos

This is an important algorithm for finding eigenvalues of large matrices, e.g., for latent semantic indexing in search. I’m currently porting this to GPU for a colleague’s research into topological excitons in the FQHE (utilizing diagonalization of the Hamiltonian); the matrices involved have side N > 1e6. The Lanczos algorithm is relatively simple, although it does require restarting/re-orthogonalization throughout. The implementation uses, among others, the sgemv routine from Intel MKL and/or cuBLAS; initial results are promising, with a 3x speedup on GPU over 8 CPU threads.
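
For reference, a bare-bones Octave sketch of the iteration (my own illustration; the production code restarts, and swaps the mat-vec for sgemv on MKL or cuBLAS):

% Build the m x m tridiagonal T whose eigenvalues approximate the extremal
% eigenvalues of the symmetric matrix A; V holds the Lanczos vectors.
function [T, V] = lanczos(A, m)
  n = size(A, 1);
  V = zeros(n, m);
  alpha = zeros(m, 1); beta = zeros(m, 1);
  v = randn(n, 1); V(:,1) = v / norm(v);
  vprev = zeros(n, 1); b = 0;
  for j = 1:m
    w = A * V(:,j) - b * vprev;          % the mat-vec: the step worth porting to GPU
    alpha(j) = w' * V(:,j);
    w = w - alpha(j) * V(:,j);
    w = w - V(:,1:j) * (V(:,1:j)' * w);  % full re-orthogonalization (costly but simple)
    b = norm(w); beta(j) = b;
    vprev = V(:,j);
    if j < m, V(:,j+1) = w / b; end
  end
  T = diag(alpha) + diag(beta(1:m-1), 1) + diag(beta(1:m-1), -1);
end

eig(T) then yields Ritz values approximating the extremal eigenvalues of A.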


Death by Tensor

Of all the objects in Math that might kill a man, tensors have surely been lethal on more than one occasion. Sure, it’s not a traditional way to check out, but I’ll bet it happens.

I don’t see a whole lot out there in the GNU/open-source space for manipulating tensors; there’s something for MATLAB from Sandia with a scary-looking license. For when I get to production C/C++, there is a beautiful package put together by Landry at Utah, using expression templates. However, nothing seemed to fit when initially coding up the Arghavani et al. integration algorithm for shape memory alloys in Octave. Most of the terse theory was relatively easy to code up; I used a block matrix representation for the Cartesian tensors, ostensibly of rank 2, 4 and 6. E.g., the indices for rank 6 look like:

which in theory has good cache/locality properties; indices cycle increasingly fast (i.e., with smaller period) as one moves to smaller tiles. There are well-known approaches for contraction over indices occurring twice in an expression, i.e., using traditional matrix multiplication. I found it easiest to work in the space of the resultant, e.g., expanding rank 2 -> rank 4 before an operation involving a rank-4 object, etc. All was going well till I came to equation A9, specifically:

which, as you can see if you squint hard enough dear reader, involves contraction over multiple variables and, worse, concerns more than two elements. After my eyes quit bleeding, I resolved to stick with the block matrix approach; loops are far too slow when scripting, regardless of representation. I introduced a third dimension for summation, and moved tiles around to suit the contraction at a given step. I used a number of temporary variables for ‘ease of reading the code’, a misnomer if ever I saw/wrote one. Anyhoo, the result overall was code which runs ~20 times faster than loops, and is written in terms of higher-level objects, making the task of ultimately writing the C++ much easier. In theory.

a = rand(3,3);
d = rand(3,3);
b = kron(eye(3,3),rand(3,3));
c = kron(eye(3,3),rand(3,3));
f = eye(3,3);

bb = rand(3,3,3,3);
cc = rand(3,3,3,3);

%brief ndim test
%I removed all the nested loops 04/13, it was a little tl;dr

tic;    % time the vectorized approach (toc reported at the end)

% expand 2nd  rank tensor U_im
%put m tiles in 3rd d
%i index has period 27 w/ matrix row index

foo(:,:,1)=kron(a(:,1),ones(9,27));
foo(:,:,2)=kron(a(:,2),ones(9,27));
foo(:,:,3)=kron(a(:,3),ones(9,27)); 

% expand 4th rank tensor G_mnpq
%put m tiles in 3rd d
%n index has period 27 w/ matrix column index
%p index has period 3 w/ block matrix row index
%q index has period 3 w/ block matrix column index

foo2(:,:,1)=[repmat(b(1:3,1:3),9,3), repmat(b(1:3,4:6),9,3), repmat(b(1:3,7:9),9,3)];
foo2(:,:,2)=[repmat(b(4:6,1:3),9,3), repmat(b(4:6,4:6),9,3), repmat(b(4:6,7:9),9,3)];
foo2(:,:,3)=[repmat(b(7:9,1:3),9,3), repmat(b(7:9,4:6),9,3), repmat(b(7:9,7:9),9,3)]; 

%expand 4th rank tensor Z_rmpq
%put m tiles in 3rd d
%r index has period 9 w/ block matrix row index
%p index has period 3 w/ block matrix row index
%q index has period 3 w/ block matrix column index

foo3(:,:,1)=repmat([repmat(c(1:3,1:3),1,9); repmat(c(4:6,1:3),1,9); repmat(c(7:9,1:3),1,9)],3,1);
foo3(:,:,2)=repmat([repmat(c(1:3,4:6),1,9); repmat(c(4:6,4:6),1,9); repmat(c(7:9,4:6),1,9)],3,1);
foo3(:,:,3)=repmat([repmat(c(1:3,7:9),1,9); repmat(c(4:6,7:9),1,9); repmat(c(7:9,7:9),1,9)],3,1); 

%expand 2nd rank tensor C_ms
%put m tiles in 3rd d
%s index has period 3 w/ block matrix column index

foo4(:,:,1)=repmat(kron(d(1,:),ones(27,3)),1,3);
foo4(:,:,2)=repmat(kron(d(2,:),ones(27,3)),1,3);
foo4(:,:,3)=repmat(kron(d(3,:),ones(27,3)),1,3);

%CONTRACT m; U_im * G_mnpq * Z_rmpq * C_ms ->  F_inrspq

F = sum(foo.*foo2.*foo3.*foo4,3);

%clear foo foo2 foo3 foo4

%reshape F_inrspq to suit contraction w/ n
%put n tiles in 3rd d

foo5(:,:,1) = F(:,1:9);
foo5(:,:,2) = F(:,10:18);
foo5(:,:,3) = F(:,19:27); 

%CONTRACT n; U_nj * F_inrspq = G_ijrspq

G = [(foo5(:,:,1).*a(1,1) + foo5(:,:,2).*a(2,1) + foo5(:,:,3).*a(3,1)), ...    %j==1 tile
    (foo5(:,:,1).*a(1,2) + foo5(:,:,2).*a(2,2) + foo5(:,:,3).*a(3,2)), ...    %j==2 tile
    (foo5(:,:,1).*a(1,3) + foo5(:,:,2).*a(2,3) + foo5(:,:,3).*a(3,3)) ];      %j==3 tile
%reshape G_ijrspq to suit contraction w/ r
%put r tiles in 3rd d; r has period 9 w/ row index

foo6(:,:,1) = G(1:3,:);            %r==1 tiles
foo6(:,:,2) = G(4:6,:);            %r==2 tiles
foo6(:,:,3) = G(7:9,:);            %r==3 tiles
foo6(:,:,4) = G(10:12,:);        %r==1 tiles
foo6(:,:,5) = G(13:15,:);        %r==2 tiles
foo6(:,:,6) = G(16:18,:);        %r==3 tiles
foo6(:,:,7) = G(19:21,:);        %r==1 tiles
foo6(:,:,8) = G(22:24,:);        %r==2 tiles
foo6(:,:,9) = G(25:27,:);        %r==3 tiles

%CONTRACT r; U_pr * G_ijrspq = H_ijpspq

H = [(foo6(:,:,1).*a(1,1) + foo6(:,:,2).*a(1,2) + foo6(:,:,3).*a(1,3)); ...    %p==1 tiles
    (foo6(:,:,1).*a(2,1) + foo6(:,:,2).*a(2,2) + foo6(:,:,3).*a(2,3)); ...    %p==2 tiles
    (foo6(:,:,1).*a(3,1) + foo6(:,:,2).*a(3,2) + foo6(:,:,3).*a(3,3)); ...    %p==3 tiles
    (foo6(:,:,4).*a(1,1) + foo6(:,:,5).*a(1,2) + foo6(:,:,6).*a(1,3)); ...    %p==1 tiles
    (foo6(:,:,4).*a(2,1) + foo6(:,:,5).*a(2,2) + foo6(:,:,6).*a(2,3)); ...    %p==2 tiles
    (foo6(:,:,4).*a(3,1) + foo6(:,:,5).*a(3,2) + foo6(:,:,6).*a(3,3)); ...    %p==3 tiles
    (foo6(:,:,7).*a(1,1) + foo6(:,:,8).*a(1,2) + foo6(:,:,9).*a(1,3)); ...    %p==1 tiles
    (foo6(:,:,7).*a(2,1) + foo6(:,:,8).*a(2,2) + foo6(:,:,9).*a(2,3)); ...    %p==2 tiles
    (foo6(:,:,7).*a(3,1) + foo6(:,:,8).*a(3,2) + foo6(:,:,9).*a(3,3))];       %p==3 tiles

%reshape H_ijpspq to suit contraction w/ s
%put s tiles in 3rd d; s has period 9 w/ row index

foo7(:,:,1) = H(:,1:3);            %s==1 tiles
foo7(:,:,2) = H(:,4:6);            %s==2 tiles
foo7(:,:,3) = H(:,7:9);            %s==3 tiles
foo7(:,:,4) = H(:,10:12);        %s==1 tiles
foo7(:,:,5) = H(:,13:15);        %s==2 tiles
foo7(:,:,6) = H(:,16:18);        %s==3 tiles
foo7(:,:,7) = H(:,19:21);        %s==1 tiles
foo7(:,:,8) = H(:,22:24);        %s==2 tiles
foo7(:,:,9) = H(:,25:27);        %s==3 tiles

%CONTRACT s; U_sq * H_ijpspq = I_ijpqpq

I = [(foo7(:,:,1).*a(1,1) + foo7(:,:,2).*a(2,1) + foo7(:,:,3).*a(3,1)), ...    %q==1 tiles
    (foo7(:,:,1).*a(1,2) + foo7(:,:,2).*a(2,2) + foo7(:,:,3).*a(3,2)), ...    %q==2 tiles
    (foo7(:,:,1).*a(1,3) + foo7(:,:,2).*a(2,3) + foo7(:,:,3).*a(3,3)), ...    %q==3 tiles
    (foo7(:,:,4).*a(1,1) + foo7(:,:,5).*a(2,1) + foo7(:,:,6).*a(3,1)), ...    %q==1 tiles
    (foo7(:,:,4).*a(1,2) + foo7(:,:,5).*a(2,2) + foo7(:,:,6).*a(3,2)), ...    %q==2 tiles
    (foo7(:,:,4).*a(1,3) + foo7(:,:,5).*a(2,3) + foo7(:,:,6).*a(3,3)), ...    %q==3 tiles
    (foo7(:,:,7).*a(1,1) + foo7(:,:,8).*a(2,1) + foo7(:,:,9).*a(3,1)), ...    %q==1 tiles
    (foo7(:,:,7).*a(1,2) + foo7(:,:,8).*a(2,2) + foo7(:,:,9).*a(3,2)), ...    %q==2 tiles
    (foo7(:,:,7).*a(1,3) + foo7(:,:,8).*a(2,3) + foo7(:,:,9).*a(3,3))];       %q==3 tiles

%CONTRACT p; I_ijpqpq = J_ijqq
%CONTRACT q; J_ijqq = output_ij

temp = f;                      % assumed: identity mask for the repeated sub-index ('temp' is otherwise undefined)
foo8 = kron(ones(9,9),temp).*I;
J = [sum(foo8(1:9,:)); sum(foo8(10:18,:)); sum(foo8(19:27,:))];

output = [sum(J(:,1:9),2), sum(J(:,10:18),2), sum(J(:,19:27),2)];

disp(['time vector : ',num2str(toc)]);
if exist('testg','var')        % testg comes from the removed loop-based reference version
  disp(['error btwn loops and vector apprch :',num2str(sum(sum(output-testg)))]);
end

SpMV solvers I

Back to reservoir simulation, which I’ve enjoyed previously; lots of physics and numerics. The heart of the problem is the linear solver: after discretization of the partial differential equations and linearization, one has a rather neat-looking matrix equation A.x = b, where A is the coefficient matrix, b the RHS vector and x the unknowns. Rather trivial for a small non-singular matrix, but for a practical system with >1e7 non-zero elements and matrices of side ~1e6, one is faced with a direct or iterative solution; explicitly forming inv(A) is out of the question. Consideration must also be given to storage format (e.g., compressed sparse row) and memory usage; direct methods tend to blow up (like n^3) due to fill-in when factoring, among other reasons. Iterative methods scale better but are more susceptible to numerical problems, and their behavior is a strong function of the matrix structure. As I trawl through the literature and iterate through possible solvers, a few collections stand out, e.g., LIS, ILUPACK, and those in the Intel MKL. Octave has useful formats, functions and solvers, including a nice polymorphic (adaptive) direct solver. What follows is a little bit ‘o’ script for converting output from the simulation from CSR to Octave’s sparse format. Too bad the direct method doesn’t scale well; this worked nicely…


# read simulator output
fid     =fopen('matrix_data.txt','r');
# unpack CSR arrays: column indices, row pointers, values, then the RHS
[cA,b]  =fscanf(fid,"%f,",Inf);
[rA,b]  =fscanf(fid,"%f,",Inf);
[dA,b]  =fscanf(fid,"%e,",Inf);
[rhs,b] =fscanf(fid,"%e,",Inf);
fclose(fid);
# matrix reconstruction A
rInd    =zeros(size(cA));
rInd(rA(1:end-1),:)     = 1;
#skip zero rows
rInd(rA(1:end-1),:) +=((rA(2:end)-rA(1:end-1))==0);
rInd = cumsum(rInd);
#store in octave sparse
sp = sparse(rInd, cA, dA);
#rhs vector
b=rhs;
#direct method
tic; x = b' / sp; timeA = toc;
#lu decomp for precond
tic; [l,u,p,q]=lu(sp); timeB=toc;
spy(u);
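
As a natural next step (not benchmarked here), I’d reach for a preconditioned iterative solver; below is a hedged sketch using Octave’s ilu() and gmres() on the same sp and b, with the restart length, tolerance and iteration cap picked arbitrarily.

#ILU(0) preconditioner plus restarted GMRES (parameters are placeholders)
[li, ui] = ilu(sp);                                  # default 'nofill' incomplete LU
tic;
[xi, flag, relres] = gmres(sp, b, 30, 1e-8, 200, li, ui);
timeC = toc;
printf("gmres flag %d, relres %g, time %g\n", flag, relres, timeC);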

tutorial from Nvidia GTC

My talk is attached as a PDF. Overall the conference felt like a rock’n’roll show, my first non-academic meeting of this nature. Thoroughly enjoyable, although a little difficult to tell what people are doing since it’s mostly proprietary work. The announcement of Fermi was certainly exciting; ~8x the double-precision performance, ECC, and many more cores and memory…

Talk outline:

  • Company Background
  • CUDA accelerates Geophysics:
    • Data Processing w/ Linear Algebra
    • Sparse Matrices
    • Seismic Imaging: Kirchhoff Time Migration
      • Frequency Filtering
      • Traveltime/Weight Calculations
      • GPU Migration
  • Summary

nvidiaSummit0909

Split-Step Algorithm

This is a fast method for the solution of parabolic PDEs, relying on the FFT implementation of the Fourier transform to speed things up. Here it’s applied to an acoustic/seismic problem, following the development of Kuperman/Jackson in “Imaging of Complex Media with Acoustic and Seismic Waves”. Consider first the Helmholtz equation for a point source at r’, z’:
eq1_ssa
where G is the Green’s function and K the wavenumber (a function of the frequency omega and sound speed c). Assuming azimuthal symmetry, G may be expressed as a product of two functions:
eq2_ssa
and similarly K, now a product of a (constant) K0 and index of refraction n. Substituting into the Helmholtz equation gives two PDEs:
eq3_ssa
The first PDE has Bessel functions as solutions; taking the asymptotic outgoing Hankel function solution and substituting it into the second, with the narrow-angle approximation (second derivative of psi with respect to r much smaller than the first derivative wrt r), one finds for the second (parabolic) PDE:
eq4_ssa
where chi is the Fourier transform of psi. Assuming the variation of n is insignificant, in the wave-space domain the PDE and its solution are:
eq5_ssa
Finally, the inverse FT gives the field as a function of depth at the new range (Delta r = r - r0, with r0 the boundary value):
eq6_ssa
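
Putting the pieces together, here is a hedged Octave sketch of the narrow-angle split-step march, psi(r+dr) = exp(i*K0*(n^2-1)*dr/2) * IFFT{ exp(-i*kz^2*dr/(2*K0)) * FFT{psi(r)} }; the frequency, sound speed, grid and Gaussian starter field below are all assumed values for illustration, not numbers from the post.

% split-step Fourier march of the narrow-angle parabolic equation in range
Nz = 512; dz = 1.0;            % depth grid (assumed)
dr = 5.0;  Nr = 200;           % range step and number of steps (assumed)
z  = (0:Nz-1)' * dz;
K0 = 2*pi*50/1500;             % reference wavenumber, 50 Hz in 1500 m/s water (assumed)
n  = ones(Nz,1);               % index of refraction profile (constant here)
kz = 2*pi/(Nz*dz) * [0:Nz/2-1, -Nz/2:-1]';       % FFT vertical wavenumbers
psi = exp(-((z - 100).^2)/(2*10^2));             % Gaussian starter field (assumed)
for step = 1:Nr
  chi = fft(psi) .* exp(-1i * kz.^2 * dr/(2*K0));       % diffraction, wavenumber domain
  psi = ifft(chi) .* exp(1i * K0 * (n.^2 - 1) * dr/2);  % refraction, depth domain
end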

Custom Electric Potential

I would like an electric potential with specific qualities, namely one quadratic in two dimensions. On the one hand, I could try a little thought experiment as to which boundary gives the desired properties. Realistically though, on the length scale I need it, the odds of producing such a boundary are zilch. It’s much easier to think in terms of the multipole expansion. With this in mind, and knowing that copper wire along a cylinder makes for an easy setup, and that at least four terms (one per wire) are required for a quadratic potential, the following should work: V(x=0,z=a)=k, V(x=0,z=-a)=k, V(x=-a,z=0)=-k, V(x=a,z=0)=-k, with each wire running along y. Assuming long wires, the potential of all four in the x/z plane is:

eq1_efg

To confirm this has the desired behavior near the origin (and hopefully over some reasonable distance), I turn to the multivariable Taylor expansion. Example partial derivatives:

eq2_efg

Substituting into the expansion:
eq3_efg

does indeed reveal that V ~ (2/a^2)(z^2 - x^2), as desired.

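As a quick sanity check (my own sketch, not the post’s figure), one can superpose the log potentials of four long line charges, positive at (0, ±a) and negative at (±a, 0), and compare the result near the origin with the quadratic form above; the charge value and grid below are arbitrary.

% numerical check of the quadrupole-like behavior near the origin
a = 1; lam = 1;                                 % arbitrary units
V = @(x,z) lam * ( -log(hypot(x, z - a)) - log(hypot(x, z + a)) ...
                   + log(hypot(x - a, z)) + log(hypot(x + a, z)) );
[x, z] = meshgrid(linspace(-0.2*a, 0.2*a, 41));
Vnum  = V(x, z);
Vquad = (2*lam/a^2) * (z.^2 - x.^2);            % leading term of the expansion
disp(max(abs(Vnum(:) - Vquad(:))));             % small next to max(abs(Vquad(:))) ~ 0.08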