Mining Spectra + Molecules

I’ve been working with colleagues and collaborators from the UK to mine NMR spectra and the corresponding molecular structures from documents. The aim is to create an XML/CML database that gives researchers unprecedented access to this information, which is useful in (for instance) drug discovery. At this stage, I have focused on writing algorithms that extract molecules to *.svg and NMR data to *.txt. The NMR data is then refined and processed, first to determine peak positions. It is then optimally fit using mixture models, and peak lists are created automatically using standard and novel algorithms.
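
As a rough illustration of the peak-position step, here is a minimal sketch in Python. It assumes Lorentzian line shapes, a synthetic two-peak spectrum, and a plain local-maximum picker; the function names and all parameter values are hypothetical and are not the actual extraction code.

```python
def lorentzian(x, center, width, amplitude):
    """Lorentzian line shape; `width` is the half-width at half-maximum."""
    return amplitude * width ** 2 / ((x - center) ** 2 + width ** 2)


def mixture(x, components):
    """Mixture model: sum of (center, width, amplitude) Lorentzian components."""
    return sum(lorentzian(x, c, w, a) for c, w, a in components)


def pick_peaks(xs, ys, threshold):
    """Return (position, height) for every local maximum above `threshold`."""
    peaks = []
    for i in range(1, len(ys) - 1):
        if ys[i] > threshold and ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]:
            peaks.append((xs[i], ys[i]))
    return peaks


# Synthetic spectrum (made-up values): two peaks on a 0.00 to 5.00 ppm grid.
components = [(1.20, 0.02, 1.0), (3.65, 0.02, 0.6)]
xs = [i * 0.01 for i in range(501)]
ys = [mixture(x, components) for x in xs]

peaks = pick_peaks(xs, ys, threshold=0.1)
for position, height in peaks:
    print(f"peak at {position:.2f} ppm, height {height:.3f}")
```

The picked positions would then serve as starting values for fitting the mixture model, which refines centers, widths and amplitudes together.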



  1. Thijs

    would it be possible to find a database representation that allows for finding quick matches of *mixtures* of molecules?
    SELECT “this beer peak pattern” FROM database;

    28 rows:
    ID | weight | name
    1  | 93.0%  | H2O
    3  |  6.7%  | C2H5OH
    4  |  0.3%  | ….

    finding superpositions is probably easiest in the Fourier domain, right? And the chosen peak representation *is* in the Fourier domain, right?

  2. bbrouwer

    yes, totally possible! the total spectrum would be the weighted sum of the individual spectra.

    at my present employer we are using a particular method to index the XML documents of chemical structures with their corresponding NMR spectra. the n-tuples used in the index correspond to the chemical shifts and peak amplitudes.

    In the case of your mixture, using substructure/fragment search, candidate structures would be returned based on their individual n-tuples being non-orthogonal to the mixture n-tuple…

  3. Thijs

    sounds logical…
    Decomposing mixtures into components can indeed be built as a separate layer on top of a pattern database with a search structure. I’m sure there are some cool algos for that. I think you’re saying that you would decompose all the patterns into orthogonal fragments, right? …and match those ‘non-overlapping signatures’. That’s smart!
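
The mixture idea from this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each spectrum is a non-negative intensity vector over shared chemical-shift bins, the library entries (`compound_a` to `compound_d`) and their values are made up, and the weights are fitted with a simple multiplicative-update scheme for non-negative least squares rather than with the indexing method mentioned above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def screen_candidates(library, mixture):
    """Keep library entries whose vector is non-orthogonal to the mixture."""
    return {name: vec for name, vec in library.items() if dot(vec, mixture) > 0.0}


def fit_weights(candidates, mixture, iterations=2000, eps=1e-12):
    """Non-negative least squares by multiplicative updates.

    Minimises ||sum_j w_j * s_j - mixture||^2 with w_j >= 0, assuming all
    intensities are non-negative. Convergence can be slow for weights that
    should end up at zero.
    """
    names = list(candidates)
    vecs = [candidates[n] for n in names]
    gram = [[dot(u, v) for v in vecs] for u in vecs]   # S^T S
    target = [dot(u, mixture) for u in vecs]           # S^T m
    w = [1.0] * len(names)
    for _ in range(iterations):
        denom = [dot(row, w) for row in gram]
        w = [wj * t / (d + eps) for wj, t, d in zip(w, target, denom)]
    return dict(zip(names, w))


# Hypothetical library: intensity per chemical-shift bin (made-up values).
library = {
    "compound_a": [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    "compound_b": [0.0, 0.0, 1.0, 0.3, 0.0, 0.0],
    "compound_c": [0.0, 0.5, 0.0, 0.0, 1.0, 0.2],
    "compound_d": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

# The mixture spectrum is the weighted sum of the individual spectra.
observed = [0.7 * a + 0.3 * b
            for a, b in zip(library["compound_a"], library["compound_b"])]

candidates = screen_candidates(library, observed)
weights = fit_weights(candidates, observed)
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

Here `compound_d` shares no bins with the mixture and is screened out, while `compound_c` overlaps in one bin and passes the screen but is fitted with a weight close to zero. The weights for `compound_a` and `compound_b` come out close to 0.7 and 0.3.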

