Category: Machine Learning


We’re very excited to be attending Collision in New Orleans next year as participants in the alpha program! Fingers crossed plot2txt is available on AWS and soft-launched by then 🙂


SCSSC conference

I was privileged to speak at the Southern California Simulations in Science conference this week, held at UCSB in beautiful Santa Barbara. I had some great interactions and was really impressed by the level of work presented in the poster session. Overall a very enjoyable day; a big thanks to Paul, Fuz and Burak for the invitation!

my talk

Nvidia Jetson

Finally splashed out on the TK1 dev kit; color me impressed! The buyer is first greeted by a neat-looking, minimalist cardboard box, which contains the board, a power supply and a micro-USB cord for flashing the device. I also picked up a serial-to-USB cord for debugging, as well as a USB hub, from the friendly local Fry’s.

Booted rapidly, then decided to install a more recent OS; at the time of writing, found an excellent description with links here. Flashing the device took less than 30 seconds, a new record I think. Shortly thereafter, installed OpenCV optimized specifically for the Tegra, as well as a few other things via the Nvidia JetPack. After installing a few other dependencies by way of sudo apt-get, I built and ran the plot2txt software on a key benchmark, an image with >2M pixels. All data series were extracted in just under 2 seconds, only slightly slower than on an Intel i7. The high relative performance in this case has much to do with OpenCV, which can take advantage of the CUDA cores. I tested a number of other benchmarks and libraries, including DGEMM by way of ATLAS for ARM, and found at least 6 GFlop/s in many cases. I also find the Jetson to be incredibly stable compared with several other dev kits and ARM offerings; it will happily run at full tilt all day long, barely breaking a sweat, courtesy of the superior design and generous fan.
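For anyone wanting to produce a comparable GFlop/s figure on their own board, here is a minimal sketch. It is not the ATLAS benchmark used above: it assumes NumPy is installed and simply times a double-precision matrix product, which NumPy typically hands to whatever BLAS it was built against. The function name dgemm_gflops and the matrix size are illustrative choices only.

```python
import time

import numpy as np


def dgemm_gflops(n=512, repeats=3):
    """Estimate double-precision matrix-multiply throughput in GFlop/s."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))  # float64 by default
    b = rng.random((n, n))
    a @ b  # warm-up run so one-off start-up costs are not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    flops = 2.0 * n**3  # conventional operation count for an n x n product
    return flops / max(best, 1e-9) / 1e9  # guard against a zero timer reading


if __name__ == "__main__":
    print(f"{dgemm_gflops():.2f} GFlop/s")
```

The best of several repeats is reported rather than the mean, since background activity can only make a run slower.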

For more, check out the Nvidia Developer Zone/embedded computing; you most definitely will not be disappointed!

vale alta vista

Wow, the death of two great tech trailblazers in the span of one week; what next, sharknado? When I was an undergraduate in the mid-90s, there was nothing quite like exploring the 57 documents available on the internet using AltaVista and arguing the validity of string theory with other nerds. Thanks be to heaven I met a girl and got married.



materials informatics

Attached is a recent poster from the Snowbird machine learning conference, using support vector machines to learn the complex mapping between NMR spectra and underlying structure. By using machine learning to solve this inverse problem, the hope is of course to present spectra for new and interesting materials to the ML network and have it infer the underlying structure automatically, without the need for intensive ab initio calculations and experimental simulation.
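To make the inverse-problem idea concrete, here is a toy sketch, not the method from the poster. It swaps the support vector machines for kernel ridge regression (a simpler kernel method) and uses entirely synthetic "spectra": a single Gaussian peak whose position stands in for a structural parameter. All names (toy_spectrum, rbf_kernel, fit) and all parameter values are assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

AXIS = np.linspace(0.0, 1.0, 101)  # toy spectral axis, arbitrary units


def toy_spectrum(position, width=0.1):
    """Single Gaussian peak; its position plays the role of the structure."""
    return np.exp(-((AXIS - position) ** 2) / (2.0 * width**2))


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def fit(spectra, targets, gamma=0.1, ridge=1e-6):
    """Kernel ridge regression: solve (K + ridge*I) alpha = y."""
    k = rbf_kernel(spectra, spectra, gamma)
    alpha = np.linalg.solve(k + ridge * np.eye(len(spectra)), targets)
    # The returned function maps new spectra (one per row) to estimates
    return lambda new: rbf_kernel(new, spectra, gamma) @ alpha


# Forward direction: structure -> spectrum (cheap here, expensive in reality)
train_params = np.linspace(0.1, 0.9, 17)
train_spectra = np.array([toy_spectrum(p) for p in train_params])

# Inverse direction: learn spectrum -> structure from the training pairs
predict = fit(train_spectra, train_params)
unseen = toy_spectrum(0.475)[None, :]  # a spectrum not in the training set
estimate = predict(unseen)[0]
print(f"estimated parameter: {estimate:.3f}")
```

In the real problem the forward direction is the expensive ab initio step, which is why a learned inverse map is attractive.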

APS March talk


Workers in various scientific disciplines seek to develop chemical models for extended and molecular systems. The modeling process revolves around the gradual refinement of model assumptions, through comparison of experimental and computational results. Solid-state Nuclear Magnetic Resonance (NMR) is one such experimental technique, providing great insight into chemical order over angstrom length scales. However, interpretation of spectra for complex materials is difficult, often requiring intensive simulations. Similarly, working forward from the model in order to produce experimental quantities via ab initio calculations is computationally demanding. The work involved in these two significant steps, compounded by the need to iterate back and forth, drastically slows the discovery process for new materials. There is thus great motivation for the derivation of structural models directly from complex experimental data, the subject of this work. Using solid-state NMR experimental datasets, in conjunction with ab initio calculations of measurable NMR parameters, a network of machine learning kernels is trained to rapidly yield structural details on the basis of input NMR spectra. Results for an environmentally relevant material will be presented, along with directions for future work.

aps talk

Mining Spectra + Molecules

I’ve been working with colleagues and collaborators from the UK to mine NMR spectra and corresponding molecular structures from documents. The aim is to create an XML/CML database to give researchers unprecedented access to information, useful in (for instance) drug discovery. At this stage, I have focused on writing algorithms for the extraction of molecules to *.svg, and NMR data to *.txt. The latter is then refined and processed, first to determine peak positions. The data is then optimally fit using mixture models, and peak lists are created automatically using standard and novel algorithms.
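As a rough illustration of the first step only (determining peak positions), here is a minimal sketch in plain Python. It is not the extraction code described above and does not attempt the mixture-model fit: it simply reports local maxima above a threshold. The spectrum is synthetic, with made-up Lorentzian peaks at 3.0 and 7.0, and the names lorentzian and pick_peaks are hypothetical.

```python
def lorentzian(x, center, width, height):
    """Lorentzian line shape, a common model for an NMR peak."""
    return height * width**2 / ((x - center) ** 2 + width**2)


def pick_peaks(xs, ys, threshold):
    """Return (x, y) for each local maximum with intensity above threshold."""
    peaks = []
    for i in range(1, len(ys) - 1):
        if ys[i] > threshold and ys[i - 1] < ys[i] >= ys[i + 1]:
            peaks.append((xs[i], ys[i]))
    return peaks


# Synthetic two-peak spectrum sampled every 0.01 units on a 0..10 axis
xs = [i * 0.01 for i in range(1001)]
ys = [lorentzian(x, 3.0, 0.1, 1.0) + lorentzian(x, 7.0, 0.2, 0.5) for x in xs]

peaks = pick_peaks(xs, ys, threshold=0.1)
for position, intensity in peaks:
    print(f"peak at {position:.2f} with intensity {intensity:.3f}")
```

The positions found this way would be natural starting guesses for the centers of the mixture components in the subsequent fit.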

Forecasting w/ RSS feeds + SVM + Wavelets

You know the words, but probably haven’t seen them in the same sentence. Simply put, I’m going to wax lyrical about the causal relationship between news and certain types of stock index, illustrated thus:

There are lots of assumptions here, including the efficiency and transparency of the market, and the assumption that the investor doesn’t have inside knowledge. I also assume that the investor is only informed this way; he/she has no conception of real intrinsic value until such time as he/she is informed via company reports and the like. So a good candidate index would be a tech stock, where the actual commodity might be ambiguous and the index is swayed far more by opinion than by real worth. To summarize, the figure relates to a public company whose index is a strong function of investor feedback from news. We would like to exploit this fact, for the class of companies for which this might be true, by using machine learning to parse news and generate a signal which is some function of the company’s index. We are relying on the power of the written word to influence events far into the future.

For example, consider a fictitious tech consultancy (offering the somewhat ambiguous commodity of ‘useful information’) that features regularly in your favorite tech RSS feed. Let’s naïvely assign a value of +1 to a positive statement and -1 to a negative statement. You may read at alternate times:

That’s preposterous, what they offer is useful
That’s useful, what they offer is preposterous

Each statement conveys both a positive and a pejorative message, directed at different parties, yet the two are composed of the same words. So in classifying a statement using machine learning on your PS3 Yellow Dog cluster, we must come up with both a useful dictionary and a means to encode context, even before assigning value to words. A naïve assignment of value to words gives each statement a sum total of 0, even though they convey drastically different opinions. Further, frequency can be useful; consider:

That’s preposterous, what they offer is very, very useful

Obviously ‘very’ is used to provide emphasis, and therefore the frequency of words also has a bearing. Consider finally the statement:

CEO Joe Blo will have neurosurgery on June 11

Since the company is basically selling information, the CEO’s power to provide information is going to change drastically after June 11. Thus there is some uncertainty, and we would expect sentiment to sway between negative and positive in the interim, i.e., words may take a ‘value’ anywhere in the continuum between -1 and +1.
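The naïve scoring of the example statements above can be sketched in a few lines of Python. The two-word lexicon, the single intensifier and the boost of 0.5 per ‘very’ are all arbitrary assumptions chosen to mirror the examples, not a recommended scheme.

```python
import re

# Hypothetical lexicon: +1 for a positive word, -1 for a negative one
LEXICON = {"useful": 1.0, "preposterous": -1.0}
INTENSIFIERS = {"very"}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text):
    """Sum word values, ignoring word order and emphasis."""
    return sum(LEXICON.get(word, 0.0) for word in tokenize(text))


def emphasized_score(text, boost=0.5):
    """Scale a valued word by the run of intensifiers directly before it."""
    total = 0.0
    pending = 0  # intensifiers seen since the last ordinary word
    for word in tokenize(text):
        if word in INTENSIFIERS:
            pending += 1
            continue
        total += LEXICON.get(word, 0.0) * (1.0 + boost * pending)
        pending = 0
    return total


first = "That's preposterous, what they offer is useful"
second = "That's useful, what they offer is preposterous"
third = "That's preposterous, what they offer is very, very useful"

print(naive_score(first), naive_score(second))  # both sum to 0.0
print(naive_score(third), emphasized_score(third))  # 0.0 versus 1.0
```

The first two statements are indistinguishable under this scheme, which is exactly the shortcoming that an encoding of context is meant to address.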

Finally, some measure of the reliability or impact of the news source proves helpful in weighting this source against others. This ‘impact factor’ could be determined simply from website volume, or from something more complicated like a Markov chain ranking algorithm. To summarize then, in order to classify text and generate ‘signal’ from an RSS feed, at the very least we require:

  • a dictionary
  • a means to encode word:
    • value
    • context
    • frequency
  • the impact of the news source
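One possible way to turn the list above into code is sketched below, as an assumption-laden illustration rather than a finished design. Word counts capture frequency, adjacent word pairs (bigrams) are used as a crude stand-in for context, and per-source scores are combined with an impact-weighted mean. The function names and the example impact values are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def unigrams(text):
    """Word frequencies; blind to word order."""
    return Counter(tokens(text))


def bigrams(text):
    """Frequencies of adjacent word pairs; a crude encoding of context."""
    words = tokens(text)
    return Counter(zip(words, words[1:]))


def weighted_signal(items):
    """Impact-weighted mean of (score, impact) pairs, one pair per source."""
    total_impact = sum(impact for _, impact in items)
    if total_impact == 0:
        return 0.0
    return sum(score * impact for score, impact in items) / total_impact


first = "That's preposterous, what they offer is useful"
second = "That's useful, what they offer is preposterous"

print(unigrams(first) == unigrams(second))  # True: same words, same counts
print(bigrams(first) == bigrams(second))    # False: the pairings differ

# A positive story from a high-impact source outweighs a negative one
# from a low-impact source (impact values here are made up).
print(weighted_signal([(1.0, 3.0), (-1.0, 1.0)]))  # 0.5
```

The bigram counts are what separate the two example statements: one contains the pair (is, useful), the other (is, preposterous).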

How we classify the very large amount of information produced in this manner to produce useful signal is a trickier topic requiring a little geometry and perhaps Bayes. To get the creative juices bubbling, here are a couple of lines of code to play with in bash:

# date | cut -c12-19 > foo.txt

# curl --silent '<feed URL>' | awk '{for (i=1;i<=NF;i++) { if ($i=="is") {print NR,i}}}' >> foo.txt
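The awk one-liner prints the line number and word position of every occurrence of "is". For readers who prefer Python, the same logic is sketched below on a string that is assumed to have been downloaded already, so no network access is involved. The function name word_positions and the sample text are illustrative.

```python
def word_positions(text, target="is"):
    """Return 1-based (line, word) positions, like awk's NR and field index."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # str.split() with no argument splits on runs of whitespace,
        # matching awk's default field splitting
        for field_no, word in enumerate(line.split(), start=1):
            if word == target:
                hits.append((line_no, field_no))
    return hits


sample = "what they offer is useful\nthat is preposterous"
print(word_positions(sample))  # [(1, 4), (2, 2)]
```

As with the awk version, the match is on whole whitespace-separated fields, so "is," with a trailing comma would not be counted.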

Next time I’ll provide a more rigorous example and show how we can produce useful signal from an ensemble of SVMs. (In the meantime you may want to check out SVM Light.) Last but not least, I’ll go over ideas/objects from Hilbert space including completeness and wavelets, and how we can ultimately use some math with our signal for robust time series prediction.

ED: go easy with curl, you don’t want to come across as a robot and get your IP/domain banned 🙂

Norm Conserving Pseudopotentials I

Excellent description by Eberhard Engel here

I’m in the process of creating a large number of NCPPs for the calculation of magnetic properties using GIPAW. The end goal is to try, in conjunction with machine learning and NMR, to do structure determination for complicated materials.

This is somewhat related to the work presented at Microsoft eScience, with the slightly different goal of starting from a large database of calculated values and, via clustering and comparison with NMR simulations, backing out chemical structures.