Category: algorithms

p2t moves to AWS

I’m excited to say that the p2t backend is officially running on AWS; here are some notes describing the important steps and a few lessons learned. Adrian Arnet is doing a magnificent job with the frontend, and my fabulous wife is leading organizational efforts. On track for a beta test in March!


SCSSC conference

I was privileged to speak at the Southern California Simulations in Science conference this week, held at UCSB in beautiful Santa Barbara. I had some great interactions and was really impressed by the level of work presented in the poster session. Overall a very enjoyable day; a big thanks to Paul, Fuz and Burak for the invitation!

Nvidia Jetson

Finally splashed out on the TK1 dev kit; color me impressed! The buyer is first greeted by a neat-looking, minimalist cardboard box, which contains the board, power supply and a micro-USB cord for flashing the device. I also picked up a serial-USB cord for debugging, as well as a USB hub, from the friendly local Fry's.

Booted rapidly, then decided to install a more recent OS; at the time of writing, found an excellent description with links here. Flashing the device took less than 30 seconds, a new record I think. Shortly thereafter, installed OpenCV optimized specifically for the Tegra, as well as a few other things via the Nvidia JetPack. After installing a few other dependencies by way of sudo apt-get, I built and ran the plot2txt software using a key benchmark, an image with >2M pixels. All data series were extracted in just under 2 seconds, only slightly slower than on an Intel i7. The high relative performance in this case has much to do with OpenCV, which can take advantage of the CUDA cores. I tested a number of other benchmarks & libraries, including DGEMM by way of ATLAS for ARM, and found at least 6 GFLOP/s in many cases. I also find the Jetson to be incredibly stable versus several other dev kits/ARM offerings; it will happily run at full tilt all day long, barely breaking a sweat, courtesy of the superior design and generous fan.
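For anyone wanting to turn a DGEMM timing into a GFLOP/s figure like the one above, here is a minimal sketch of the arithmetic. It uses the conventional count of 2·m·n·k floating point operations for an (m×k) by (k×n) multiply; the function names and the problem size in the usage note are my own illustrations, not the sizes I actually benchmarked.

```python
def dgemm_flops(m, n, k):
    """Conventional operation count for C = A @ B with A (m x k), B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Sustained rate in GFLOP/s for a multiply that took `seconds`."""
    return dgemm_flops(m, n, k) / seconds / 1e9
```

As a hypothetical example, a 1000×1000×1000 multiply that finishes in about a third of a second works out to roughly 6 GFLOP/s.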

For more, check out the Nvidia Developer Zone/embedded computing; you most definitely will not be disappointed!

new GPU book

Numerical Computations with GPUs comes out later this year; Pierre-Yves and I were able to contribute a chapter on LU & QR decomposition (the latter using Givens rotations) for batches of dense matrices. We saw some impressive performance improvements for specific problem sizes. QR will benefit particularly from CUDA 6 and the availability of the fast/safe reciprocal hypotenuse function rhypot(x,y); more details here.
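To show why a reciprocal hypotenuse helps, here is a rough sketch of Givens-based QR in plain Python. This is not the batched CUDA code from the chapter: it handles one small matrix stored as a list of rows, and 1.0 / math.hypot(a, b) stands in for CUDA's rhypot(a, b). The function names givens and qr_givens are made up for the illustration.

```python
import math


def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to eliminate
    # Stand-in for CUDA's rhypot(a, b) = 1 / sqrt(a*a + b*b):
    # one reciprocal hypotenuse, then two multiplies instead of two divides.
    rh = 1.0 / math.hypot(a, b)
    return a * rh, b * rh


def qr_givens(A):
    """QR-factor a small dense matrix (list of rows); returns (Q, R) with A = Q R."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j below the diagonal, working from the bottom row up.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Rotate rows i-1 and i of R.
            for k in range(n):
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of Q.
            for k in range(m):
                u, v = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * u + s * v
                Q[k][i] = -s * u + c * v
            R[i][j] = 0.0  # clear rounding residue in the eliminated entry
    return Q, R
```

Each rotation needs the pair (c, s) = (a/r, b/r) with r = hypot(a, b), so a single fast reciprocal hypotenuse replaces a square root and two divisions per rotation, which adds up when many small matrices are processed in a batch.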

HPC Essentials 0

The notes from my last talk at PSU for the foreseeable future, delivered to the Math department during colloquium a few weeks back. It is a kind of prequel to the HPC Essentials series: I take a simple Kriging process through the steps of turning it into an example of high performance computation. Lots of information, perhaps too much 🙂

cluster profiler

We’ve been working on a method to effectively monitor, and in some senses profile, all relevant processes running on one or more systems. An alpha has been released on GitHub, and Pierre-Yves is working on a powerful Flume + Solr component for search. The readme follows; more to come.


This code comprises the clpr_d daemon for the Cluster Profiler project, an Orwellian attempt to develop time series and statistics for all running processes on a single system or many systems. Process data is gathered or clustered according to process birthdate (rounded to the minute) and uid. The daemon uses several threads to work on a boost::multi_index data structure containing the acquired process data.

The main thread reads from a named pipe specified in key_defines.h; the data is produced by running pidstat and redirecting its output, e.g. in an external shell loop. Appropriately formatted pidstat output may be produced from this forked code; the utility was originally written by Sebastien Godard. An example usage is the following, which removes process data from root and redirects to a fifo in the bin directory of this distribution:

pidstat -d -u -h -l -r -w -U -v | grep -v root > bin/clpr_input

An archiving thread works asynchronously to write to port 80 over TCP/IPv4 whenever queried, in the format specified by the ostream operator for clpr_proc_db, the wrapper around boost::multi_index. Process data acquired from the fifo is ‘blobbed’ together by the reader, and statistics are developed on the fly. A logging class also writes a log file periodically, with the filename specified in key_defines.h. Top utilization statistics are reported in the log file, using the multiple search/sort indices of boost::multi_index. Finally, a manager thread periodically monitors the size of the database, trimming/deleting entries according to timestamps and a maximum size.
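The grouping and trimming described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the clpr_d code: the daemon itself uses a boost::multi_index container in C++, and every name here (Blob, ProcDB, the cpu_pct field) is hypothetical. It also assumes that "rounded to the minute" means truncating the birth time to the start of its minute.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    """Running statistics for one (birth minute, uid) group (hypothetical fields)."""
    samples: int = 0
    cpu_sum: float = 0.0
    cpu_max: float = 0.0
    last_seen: int = 0

    def mean_cpu(self):
        return self.cpu_sum / self.samples if self.samples else 0.0


class ProcDB:
    """Toy stand-in for the process database, keyed by (birth minute, uid)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.blobs = {}

    @staticmethod
    def key(birth_epoch, uid):
        # Assumption: truncate the birth time (seconds) to the start of its minute.
        return (birth_epoch - birth_epoch % 60, uid)

    def add(self, birth_epoch, uid, cpu_pct, now):
        """Fold one sample into its group, updating statistics on the fly."""
        blob = self.blobs.setdefault(self.key(birth_epoch, uid), Blob())
        blob.samples += 1
        blob.cpu_sum += cpu_pct
        blob.cpu_max = max(blob.cpu_max, cpu_pct)
        blob.last_seen = now

    def trim(self):
        """Drop the least recently updated groups until max_size is respected."""
        excess = len(self.blobs) - self.max_size
        if excess > 0:
            stale = sorted(self.blobs, key=lambda k: self.blobs[k].last_seen)
            for k in stale[:excess]:
                del self.blobs[k]

    def top(self, n):
        """Groups with the highest peak CPU, as a log report might list them."""
        ranked = sorted(self.blobs.items(), key=lambda kv: kv[1].cpu_max, reverse=True)
        return ranked[:n]
```

A plain dict needs a fresh sort for every ordering, which is the cost boost::multi_index avoids by maintaining several sorted indices over the same records at once.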

Keep in mind this daemon can be a security risk, and can be both compute- and I/O-intensive. It has been designed with Flume + Solr in mind; for the overall project, Flume is used to query various instances of clpr_d, and Solr is used for indexing and search on records – WJB 03/14

Install the fork of pidstat specified above, then ‘make’ this distribution, specifying compile and linker paths for boost as needed.

avpipe alpha

While I nut out the details of working with the new open source codec for H.264 from Cisco, I’ve gone ahead and released the code for the aforementioned AVI processing application, tentatively dubbed avpipe, on GitHub. Performance is fairly good, although memory management needs attention at some point 🙂