I’ve been working with colleagues and collaborators from the UK to mine NMR spectra and the corresponding molecular structures from documents. The aim is to create an XML/CML database that gives researchers unprecedented access to information useful in, for instance, drug discovery. At this stage, I have focused on writing algorithms that extract molecules to *.svg and NMR data to *.txt. The latter is then refined and processed, first to determine peak positions. The data is then optimally fit using mixture models, and peak lists are created automatically using standard and novel algorithms.

Mining Spectra + Molecules
February 3, 2009
Forecasting w/ RSS feeds + SVM + Wavelets
January 29, 2009
You know the words, but probably haven’t seen them in the same sentence. Simply put, I’m going to wax lyrical about the causal relationship between news and certain types of stock index, illustrated thus:

There are lots of assumptions here, including the efficiency and transparency of the market, and the assumption that the investor doesn’t have inside knowledge. I also assume that the investor is informed only in this way: he or she has no conception of real intrinsic value until informed via company reports and the like. So a good candidate index would be a tech stock, where the actual commodity might be ambiguous and the index is driven more by opinion than by real worth. To summarize, the figure relates to a public company whose index is a strong function of investor feedback from news. We would like to exploit this fact, for the class of companies for which it might be true, by using machine learning to parse news and generate a signal which is some function of the company’s index. We are relying on the power of the written word to influence events far into the future.
For example, consider a fictitious tech consultancy (offering the somewhat ambiguous commodity of ‘useful information’) who features regularly in your favorite tech RSS feed. Let’s naïvely assign a value of +1 to a positive statement, -1 to a negative statement. You may read at alternate times:
That’s preposterous, what they offer is useful
That’s useful, what they offer is preposterous
Both statements convey positive and pejorative messages directed at different parties, yet they are composed of the same words. So in classifying a statement using machine learning on your PS3 yellow dog cluster, we must come up with both a useful dictionary and a means to encode context, even before assigning value to words. A naïve assignment of value to words gives each statement a sum total of 0, even though they convey drastically different opinions. Further, frequency can be useful. Consider:
That’s preposterous, what they offer is very, very useful
Obviously ‘very’ is used to provide emphasis and therefore frequency of words has bearing also. Consider finally the statement:
CEO Joe Blo will have neurosurgery on June 11
Since the company is basically selling information, the CEO’s power to provide information is going to change drastically after 06/11. Thus there is some uncertainty, and we would expect sentiment to sway between negative and positive in the interim, i.e., words may take a ‘value’ anywhere in the continuum between -1 and +1.
Finally, some measure of the reliability or impact of the news source proves helpful in weighting this source against others. This ‘impact factor’ could be determined simply from website volume, or from something more complicated like a Markov Chain ranking algorithm. To summarize then, in order to classify text and generate ‘signal’ from an RSS feed, at the very least we require:
- a dictionary
- a means to encode word:
  - value
  - context
  - frequency
- the impact of the news source
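As a minimal sketch of the requirements listed above, the Python below scores the example statements. Everything in it is an assumption made for illustration: the two-word dictionary `WORD_VALUES`, the `INTENSIFIERS` table and its 1.5 multiplier, and the `impact` weight are hypothetical, and context is deliberately not encoded, which is why the two ‘preposterous/useful’ statements still come out equal.

```python
import re

# Hypothetical word values in [-1, +1]; a real dictionary would be learned or curated.
WORD_VALUES = {"useful": 1.0, "preposterous": -1.0}
# Assumed emphasis words: each occurrence scales the next valued word.
INTENSIFIERS = {"very": 1.5}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text):
    """Sum word values, ignoring context and frequency."""
    return sum(WORD_VALUES.get(word, 0.0) for word in tokenize(text))


def emphasized_score(text, impact=1.0):
    """Sum word values, letting repeated intensifiers amplify the next
    valued word, then weight the total by the source's impact factor."""
    total, boost = 0.0, 1.0
    for word in tokenize(text):
        if word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]
        elif word in WORD_VALUES:
            total += boost * WORD_VALUES[word]
            boost = 1.0  # emphasis is consumed by the word it modifies
    return impact * total


a = "That's preposterous, what they offer is useful"
b = "That's useful, what they offer is preposterous"
c = "That's preposterous, what they offer is very, very useful"

print(naive_score(a), naive_score(b))      # both statements cancel to zero
print(emphasized_score(c))                 # frequency of 'very' tips the balance
print(emphasized_score(c, impact=0.5))     # same text from a less trusted source
```

The naive score cannot tell statements `a` and `b` apart, which is the point made earlier about context; the emphasized score only adds frequency and source weighting on top.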
How we classify the very large amount of information produced after this manner to produce useful signal is a trickier topic requiring a little geometry and perhaps Bayes. To get the creative juices bubbling, here’s a couple of lines of code to play with in bash:
# date | cut -c12-19 > foo.txt
# curl --silent 'http://rss.slashdot.org/Slashdot/slashdot' | awk '{for (i=1;i<=NF;i++) { if ($i=="is") {print NR,i}}}' >> foo.txt
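For anyone who would rather experiment offline, here is a rough Python equivalent of the awk step, run on a hard-coded sample instead of the live feed. The function name `find_word` and the sample text are made up for illustration; the (line, field) pairs mirror awk’s `NR` and field index, both counted from 1.

```python
def find_word(text, target="is"):
    """Return (line_number, field_number) pairs where a whitespace-separated
    field equals target exactly; both numbers are 1-based, as in awk."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for field_no, field in enumerate(line.split(), start=1):
            if field == target:
                hits.append((line_no, field_no))
    return hits


sample = "what they offer is useful\nno match here\nthis is what it is"
print(find_word(sample))
```

Like the awk version, this matches whole fields only, so ‘is,’ with trailing punctuation would not count.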
Next time I’ll provide a more rigorous example and show how we can produce useful signal from an ensemble of SVMs. (In the meantime you may want to check out SVM Light.) Last but not least, I’ll go over ideas/objects from Hilbert space, including completeness and wavelets, and how we can ultimately use some math with our signal for robust time series prediction.
ED: go easy with curl, you don’t want to come across as a robot and get your IP/domain banned
Norm Conserving Pseudopotentials I
January 14, 2009
Excellent description by Eberhard Engel here
I’m in the process of creating a large number of NCPPs for the calculation of magnetic properties using GIPAW. The end goal is to try, in conjunction with machine learning and NMR, to do structure determination for complicated materials.
This is somewhat related to the work presented at M’soft eScience, with the slightly different goal of starting from a large database of calculated values and, via clustering and comparison to NMR simulations, backing out chemical structures.
ChemXSeer Collaboratory
October 24, 2008
A presentation made to the CEKA meeting at Penn State on 10/24/08 about the collaboratory project I participate in. It covers Knowledge Discovery, Machine Learning, Databases, Web 2.0, etc.
Press for Image Proc Work
September 26, 2008
In an unusual twist, a few news sources are reporting on the aforementioned image proc work at Penn State: New Scientist, Naked Scientists and L’Atelier.
I’m flattered and more than a little surprised.
ED: also showed up in ACM tech news
JCDL image extraction paper/talk
September 16, 2008
Here’s the paper abstract, links to paper/talk follow:
“Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to automatically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning to extract information from 2-D plot images and store them in a database. The proposed algorithm identifies a 2-D plot and extracts the axis labels, legend and the data points from the 2-D plot. We also segregate overlapping shapes that correspond to different data points. We demonstrate performance of individual algorithms, using a combination of generated and real-life images.”
Posted by bbrouwer