DocEng09 paper (almost done)

April 7, 2009

Abstract: For at least fifty years, liquid state Nuclear Magnetic Resonance (NMR) has served as an important analytical technique in studying local atomic bonding information. Thus, a vast amount of data of interest to the chemist and crystallographer resides in archived documents, containing liquid state NMR spectra and accompanying molecular structures. These structures are determined on the basis of chemical shift information from spectra, using well established empirical rules. The combined wealth of information represented visually in the spectra and molecules precludes straightforward inclusion in a traditional database. Given its value to the researcher, work by this group is being dedicated to automatic extraction of spectral and molecular information from documents, for conversion to the Chemical Markup Language (CML) and incorporation into a database. Preliminary results are presented here, as well as details of future work.

Keywords: Information Extraction, Chemical Markup Language, Nuclear Magnetic Resonance, Computer Assisted Structure Determination, PostScript

doceng09fig


Building a distributed PSE

February 17, 2009

Over the last couple of years, I’ve been invested in building a distributed problem solving environment (PSE). I offer a simplified application here in the hopes that the general ideas can be useful to someone.

Part of my motivation for releasing this is my wife’s iPhone, with its addictive applications. On the downside, screen real estate is seriously limited, and unless I convert to Mac I can’t get my hands on the SDK. However, coupling the iPhone’s portability with a personal server running traditional applications makes for a cheap and powerful PSE. Computational tasks and data storage can take place server side, with the iPhone serving as a very pleasing web portal.

pse_fig

There are plenty of applications of these ideas to the sciences, for someone with a modicum of talent and vision. For the purposes of this example I’m going to use free financial data, and the computational task will be options pricing under the Black-Scholes model for European contracts. I’ve taken a data source at random, and I have no idea if there is a European-style options contract available. Also realize that there is potentially unlimited risk associated with options, and it is widely held that the traditional theory is at best naïve and at worst downright wrong, particularly when using historical data to calculate volatility. So use this example under advisement… and as always, please be careful with curl; don’t get yourself banned :)

  1. Set up MySQL + Apache + PHP on a server
  2. Create a database + table on the server:
    //invoke mysql (as root)
    >mysql -u root -p
    //create database
    >create database tech_stock;
    //extend permission to user bill with password *****
    >GRANT ALL ON tech_stock.* TO bill@localhost IDENTIFIED BY "*****";
    //quit
    >quit;
    //as user, invoke mysql as before; select database
    >mysql tech_stock -u bill -p
    //create a table
    >create table tech_A( n_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, date VARCHAR(8), open FLOAT,
    high FLOAT, low FLOAT, close FLOAT, volume FLOAT);
    //check it exists
    >DESCRIBE tech_A;
    //logout
    >quit;

  3. Write a shell script to grab data:
    #!/bin/sh
    # data update/download wjb 02/08
    # build today's date for the enddate parameter
    Month=$(date +%b)
    Day=$(date +%d)
    Year=$(date +%Y)
    str1="http://finance.google.com/finance/historical?cid=659815&startdate=Feb+10%2C+2008&enddate="
    str2="%2C+${Year}&output=csv"
    # quote the URL so the shell leaves & and % alone
    curl "${str1}${Month}+${Day}${str2}" > tech_A.csv
    exit #cheerio

  4. Make an HTML portal:
    <html><head><title>PSE Demo</title></head>
    <body style="font-family:verdana">
    <form action="main.php" method="post" enctype="multipart/form-data">
    <font size="5" style="color:blue">Data Processing Portal</font><br>
    <table style="background-color:blue">
    <tr><td><font color="white">Password:</font> </td>
    <td><input type="password" name="pswd" /> </td></tr>
    <tr><td><font color="white">Email: </font></td>
    <td><input type="text" name="email" /> </td></tr>
    <tr><td><input type="submit" name="cmdupload" value="Submit" /></td></tr>
    </table>
    </form>
    <br>
    <font size="1">
    Collaboratory for SDE data modeling <br>
    </font>
    </body></html>
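
The computational task named above, Black-Scholes pricing of a European contract with volatility estimated from historical closes, can be sketched in a few lines of Python. This is only an illustration, not the server-side code of this PSE: the function names, the 252 trading days per year, the no-dividend assumption and the sample numbers are all assumptions of mine.

```python
import math
import statistics


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def historical_volatility(closes, periods_per_year=252):
    """Annualized volatility from closing prices (oldest first).

    Assumes daily closes and 252 trading days per year.
    """
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return statistics.stdev(log_returns) * math.sqrt(periods_per_year)


def black_scholes(spot, strike, rate, sigma, years, kind="call"):
    """Black-Scholes price of a European call or put (no dividends assumed)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * years) / (
        sigma * math.sqrt(years)
    )
    d2 = d1 - sigma * math.sqrt(years)
    discounted_strike = strike * math.exp(-rate * years)
    if kind == "call":
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)


# Made-up closing prices, standing in for the 'close' column of tech_A.
closes = [10.0, 10.2, 10.1, 10.4, 10.3]
sigma = historical_volatility(closes)
print(black_scholes(spot=closes[-1], strike=10.5, rate=0.03, sigma=sigma, years=0.25))
```

In the setup above, the closes would come from the tech_A table rather than a hard-coded list, and the caveat about historical volatility applies in full.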

more to come…


Mining Spectra + Molecules

February 3, 2009

I’ve been working with colleagues and collaborators from the UK to mine NMR spectra and corresponding molecular structures from documents. The aim is to create an XML/CML database to give researchers unprecedented access to information, useful in (for instance) drug discovery. At this stage, I have focused on writing algorithms that extract molecules to *.svg files and NMR data to *.txt files. The latter are then refined and processed, first to determine peak positions. The data are then fit using mixture models, and peak lists are created automatically using standard and novel algorithms.
spec_db
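
To make the "determine peak positions" step concrete, here is a minimal sketch of the simplest possible peak picker: a local-maximum search above a threshold. It is not the algorithm used in this work (which goes on to mixture-model fitting); the function name, the threshold and the sample spectrum are assumptions for illustration only.

```python
def find_peaks(intensities, threshold=0.0):
    """Return indices of local maxima whose height exceeds threshold.

    A flat-topped peak is reported once, at its first point.
    """
    peaks = []
    for i in range(1, len(intensities) - 1):
        left, mid, right = intensities[i - 1], intensities[i], intensities[i + 1]
        if mid > threshold and mid > left and mid >= right:
            peaks.append(i)
    return peaks


# Made-up spectrum: chemical shifts in ppm and intensities, not real data.
shifts = [7.4, 7.3, 7.2, 7.1, 7.0, 6.9, 6.8]
intensities = [0.1, 0.9, 0.2, 0.1, 0.6, 0.6, 0.1]

peak_list = [(shifts[i], intensities[i]) for i in find_peaks(intensities, threshold=0.5)]
print(peak_list)  # [(7.3, 0.9), (7.0, 0.6)]
```

A peak list like this would only be a starting guess for the fitting stage; noisy data would need smoothing or a more careful criterion first.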


Forecasting w/ RSS feeds + SVM + Wavelets

January 29, 2009

You know the words, but probably haven’t seen them in the same sentence. Simply put, I’m going to wax lyrical about the causal relationship between news and certain types of stock index, illustrated thus:
diag_wave3
There are lots of assumptions here, including the efficiency and transparency of the market, and that the investor doesn’t have inside knowledge. I also assume that the investor is informed only in this way: he/she has no conception of real intrinsic value until informed via company reports and the like. So a good candidate index would be a tech stock, where the actual commodity might be ambiguous and the index is swayed more by opinion than by real worth. To summarize, the figure relates to a public company whose index is a strong function of investor feedback from news. We would like to exploit this fact, for the class of companies for which this might be true, by using machine learning to parse news and generate a signal which is some function of the company’s index. We are relying on the power of the written word to influence events far into the future.

For example, consider a fictitious tech consultancy (offering the somewhat ambiguous commodity of ‘useful information’) that features regularly in your favorite tech RSS feed. Let’s naïvely assign a value of +1 to a positive statement and -1 to a negative statement. You may read at alternate times:

That’s preposterous, what they offer is useful
That’s useful, what they offer is preposterous

Both statements convey positive and pejorative messages directed at different parties, yet are composed of the same words. So in classifying a statement using machine learning on your PS3 Yellow Dog cluster, we must come up with both a useful dictionary and a means to encode context, even before assigning value to words. A naïve assignment of value to words gives each statement a sum total of 0, even though they convey drastically different opinions. Further, frequency can be useful; consider:

That’s preposterous, what they offer is very, very useful

Obviously ‘very’ is used to provide emphasis, so the frequency of words also has a bearing. Consider finally the statement:

CEO Joe Blo will have neurosurgery on June 11

Since the company is basically selling information, the CEO’s power to provide information is going to change drastically after 06/11. Thus there is some uncertainty, and we would expect sentiment to sway between negative and positive in the interim, i.e., words may take a ‘value’ in the continuum between -1 and +1.
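
The naïve word-value scheme discussed above is easy to try out. The sketch below uses a two-word toy dictionary (my assumption, as are the function names and the rule that each ‘very’ adds one unit of weight to the next valued word) to show that both orderings of the example sentence score 0, while the emphasized version does not.

```python
import re

# Toy dictionary: these entries and their values are assumptions.
WORD_VALUES = {"useful": 1, "preposterous": -1}


def tokenize(statement):
    """Lower-case the statement and split it into words."""
    return re.findall(r"[a-z']+", statement.lower())


def naive_score(statement):
    """Sum per-word values, ignoring context and word order."""
    return sum(WORD_VALUES.get(word, 0) for word in tokenize(statement))


def emphasis_score(statement, intensifier="very"):
    """Like naive_score, but each intensifier adds 1 to the next word's weight."""
    total, boost = 0, 0
    for word in tokenize(statement):
        if word == intensifier:
            boost += 1
        else:
            total += WORD_VALUES.get(word, 0) * (1 + boost)
            boost = 0
    return total


a = "That's preposterous, what they offer is useful"
b = "That's useful, what they offer is preposterous"
c = "That's preposterous, what they offer is very, very useful"

print(naive_score(a), naive_score(b))  # 0 0
print(emphasis_score(c))               # 2
```

Neither function encodes context, which is exactly the gap the two example statements expose.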

Finally, some measure of the reliability or impact of the news source proves helpful in weighting this source against others. This ‘impact factor’ could be simply determined from website volume, or from something more complicated like a Markov Chain ranking algorithm. To summarize then, in order to classify text and generate ‘signal’ from an RSS feed, at the very least we require:

  • a dictionary
  • a means to encode each word’s:
    • value
    • context
    • frequency
  • the impact of the news source
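
One simple way to combine the last ingredient with the others is an impact-weighted average of statement values. This is a sketch under my own assumptions (the function name, the weighted-mean form and the sample numbers are not from any particular method); a real ‘impact factor’ might come from site traffic or a ranking algorithm, as noted above.

```python
def feed_signal(items):
    """Impact-weighted mean of statement values.

    items: iterable of (value, impact) pairs, with value in [-1, 1]
    and impact >= 0. Returns 0.0 when there is no impact to weight by.
    """
    items = list(items)
    total_impact = sum(impact for _, impact in items)
    if total_impact == 0:
        return 0.0
    return sum(value * impact for value, impact in items) / total_impact


# Made-up example: a positive story from a high-impact source and a
# negative story from a low-impact one.
print(feed_signal([(1.0, 3.0), (-1.0, 1.0)]))  # 0.5
```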

How we classify the very large amount of information produced in this manner to produce useful signal is a trickier topic requiring a little geometry and perhaps Bayes. To get the creative juices bubbling, here are a couple of lines of code to play with in bash:

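# timestamp (HH:MM:SS, assuming the default date format) for this sample
date | cut -c12-19 > foo.txt

# line number and field position of every "is" in the feed
curl --silent 'http://rss.slashdot.org/Slashdot/slashdot' | awk '{for (i=1;i<=NF;i++) { if ($i=="is") {print NR,i}}}' >> foo.txt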

Next time I’ll provide a more rigorous example and show how we can produce useful signal from an ensemble of SVMs. (In the meantime you may want to check out SVM Light). Last but not least, I’ll go over ideas/objects from Hilbert space, including completeness and wavelets, and how we can ultimately use some math with our signal for robust time series prediction.

ED: go easy with curl, you don’t want to come across as a robot and get your IP/domain banned :)


Norm Conserving Pseudopotentials I

January 14, 2009

Excellent description by Eberhard Engel here

I’m in the process of creating a large number of norm-conserving pseudopotentials (NCPPs) for the calculation of magnetic properties using GIPAW. The end goal is to do structure determination for complicated materials, in conjunction with machine learning and NMR.

This is somewhat related to the work presented at Microsoft eScience, with the slightly different goal of starting from a large database of calculated values and, via clustering and comparison to NMR simulations, backing out chemical structures.

ml_mqmas


Microsoft eScience presentation 12/08/08

December 18, 2008

ChemXSeer Collaboratory

October 24, 2008

A presentation made to the CEKA meeting at Penn State on 10/24/08 about the collaboratory project I participate in… covers Knowledge Discovery, Machine Learning, Databases, Web 2.0, etc.

chemxseer talk


simple grid scheduling algo

September 30, 2008

Grids are becoming increasingly popular, but after you remove some fancy protocols and security measures they are still composed of distributed machines. If you have a fairly ‘dirty’ grid, i.e., very heterogeneous and not endowed with MPI etc., some thought must be given to how data is divided when performing data decomposition. Suppose you would like to divide M identical tasks amongst N processors. If (taking into account network latency etc.) one can order the time it takes to perform a single task as a1 < a2 < .. < ai < .. < aN, then a naive division of labor for each node would be:

But this overlooks possible data collision and latency involved in loading buffers at the manager, so take this into account with an extra term, giving the tasks for a single node as:

where b is less than one, and since the total number of tasks handled by the N processors must be M, this implies the relationship between r and b:

r may be determined from experiment, and b follows.
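
The formulas for this post did not survive, so the sketch below is only one plausible reading of the naive division described above: each node i gets a share of the M tasks in inverse proportion to its per-task time a_i, with the shares rounded so they still sum to M. The function name and the largest-remainder rounding are my assumptions, and the correction involving r and b is not attempted here.

```python
def divide_tasks(task_times, total_tasks):
    """Split total_tasks among nodes in inverse proportion to task_times.

    task_times[i] is the (positive) time node i needs for a single task.
    Returns a list of integer task counts that sums to total_tasks.
    """
    weights = [1.0 / t for t in task_times]
    total_weight = sum(weights)
    shares = [total_tasks * w / total_weight for w in weights]

    # Round down, then hand the leftover tasks to the nodes with the
    # largest fractional parts (largest-remainder rounding).
    counts = [int(s) for s in shares]
    leftover = total_tasks - sum(counts)
    by_fraction = sorted(range(len(shares)), key=lambda i: counts[i] - shares[i])
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Made-up timings: node 0 is twice as fast as node 1, four times node 2.
print(divide_tasks([1.0, 2.0, 4.0], 7))  # [4, 2, 1]
```

With this split every node finishes at roughly the same time, which is the point of ordering the a_i in the first place.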


Press for Image Proc Work

September 26, 2008

In an unusual twist, a few news sources are reporting on the aforementioned image processing work at Penn State: New Scientist, Naked Scientists and L’Atelier

I’m flattered and more than a little surprised :)

ED: also showed up in ACM tech news


JCDL image extraction paper/talk

September 16, 2008

Here’s the paper abstract, links to paper/talk follow:

“Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to automatically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning to extract information from 2-D plot images and store them in a database. The proposed algorithm identifies a 2-D plot and extracts the axis labels, legend and the data points from the 2-D plot. We also segregate overlapping shapes that correspond to different data points. We demonstrate performance of individual algorithms, using a combination of generated and real-life images.”

paper
talk