Forecasting w/ RSS feeds + SVM + Wavelets

January 29, 2009

You know the words, but you probably haven’t seen them in the same sentence. Simply put, I’m going to wax lyrical about the causal relationship between news and certain types of stock index, illustrated thus:
[Figure: diag_wave3]
There are lots of assumptions here, including an efficient and transparent market and an investor without inside knowledge. I also assume that the investor is informed only this way: he or she has no conception of real intrinsic value until informed via company reports and the like. So a good candidate index would be a tech stock, where the actual commodity might be ambiguous and the index is driven more by opinion than by real worth. To summarize, the figure relates to a public company whose index is a strong function of investor feedback from news. We would like to exploit this fact, for the class of companies for which it might be true, by using machine learning to parse news and generate a signal which is some function of the company’s index. We are relying on the power of the written word to influence events far into the future.

For example, consider a fictitious tech consultancy (offering the somewhat ambiguous commodity of ‘useful information’) that features regularly in your favorite tech RSS feed. Let’s naïvely assign a value of +1 to a positive word and -1 to a negative one. You may read at alternate times:

That’s preposterous, what they offer is useful
That’s useful, what they offer is preposterous

Both statements convey a positive and a pejorative message, directed at different parties, yet they are composed of the same words. So in classifying a statement using machine learning on your PS3 Yellow Dog cluster, we must come up with both a useful dictionary and a means to encode context, even before assigning value to words. A naïve assignment of value to words gives each statement a sum total of 0, even though they convey drastically different opinions. Frequency can be useful too. Consider:

That’s preposterous, what they offer is very, very useful

Obviously ‘very’ is used to provide emphasis, so the frequency of words also has a bearing. Consider finally the statement:

CEO Joe Blo will have neurosurgery on June 11

Since the company is basically selling information, the CEO’s power to provide information is going to change drastically after June 11. Thus there is some uncertainty, and we would expect sentiment to sway between negative and positive in the interim, i.e., words may take a ‘value’ anywhere in the continuum between -1 and +1.
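The naïve scoring discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-word dictionary, the function names and the `boost` parameter are all hypothetical, and context is deliberately not encoded, which is exactly why the first two statements come out identical.

```python
import re

# Hypothetical toy dictionary: word -> value in [-1, +1].
LEXICON = {"useful": 1.0, "preposterous": -1.0}
# Hypothetical emphasis words: each occurrence boosts the next valued word.
INTENSIFIERS = {"very"}

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def naive_score(text):
    """Bag-of-words score: sum of word values, ignoring order and context."""
    return sum(LEXICON.get(word, 0.0) for word in tokenize(text))

def emphasis_score(text, boost=0.5):
    """Same sum, but every preceding 'very' scales up the next valued word."""
    total, emphasis = 0.0, 0
    for word in tokenize(text):
        if word in INTENSIFIERS:
            emphasis += 1
        elif word in LEXICON:
            total += LEXICON[word] * (1.0 + boost * emphasis)
            emphasis = 0
    return total

first = "That's preposterous, what they offer is useful"
second = "That's useful, what they offer is preposterous"
third = "That's preposterous, what they offer is very, very useful"

print(naive_score(first), naive_score(second))  # both sums are zero
print(emphasis_score(third))                    # 'very, very' tips the balance
```

Both orderings score 0 because a plain sum cannot tell who is being praised and who is being insulted; only the frequency of ‘very’ moves the third statement off zero.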

In addition, some measure of the reliability or impact of the news source proves helpful in weighting this source against others. This ‘impact factor’ could be determined simply from website traffic, or from something more complicated like a Markov chain ranking algorithm. To summarize then, in order to classify text and generate ‘signal’ from an RSS feed, at the very least we require:

  • a dictionary
  • a means to encode word:
    • value
    • context
    • frequency
  • the impact of the news source
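As one possible reading of the ‘impact factor’ item, here is a minimal PageRank-style power iteration over a link graph of news sources. It is a sketch, not the method used in this post: the source names, the link graph and the damping value of 0.85 are assumptions made purely for illustration.

```python
def rank_sources(links, damping=0.85, iterations=100):
    """Rank sources by the stationary distribution of a random surfer.

    links maps each source to the list of sources it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {s: 1.0 / n for s in nodes}
    for _ in range(iterations):
        # Every source receives the same small 'teleport' share.
        new_rank = {s: (1.0 - damping) / n for s in nodes}
        for s in nodes:
            targets = links.get(s, [])
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A source with no outgoing links spreads its rank evenly.
                for t in nodes:
                    new_rank[t] += damping * rank[s] / n
        rank = new_rank
    return rank

# Hypothetical link graph between three feeds.
links = {"feed_a": ["feed_c"], "feed_b": ["feed_c"], "feed_c": ["feed_a"]}
impact = rank_sources(links)
for source in sorted(impact, key=impact.get, reverse=True):
    print(source, round(impact[source], 3))
```

The resulting weights sum to one, so they could be used directly to scale the signal from each source.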

How we classify the very large amount of information produced in this manner into a useful signal is a trickier topic, requiring a little geometry and perhaps Bayes. To get the creative juices bubbling, here are a couple of lines of code to play with in bash:

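# write a time stamp (HH:MM:SS) to foo.txt
date +%T > foo.txt

# append the line number and field position of every occurrence of the word "is"
curl --silent 'http://rss.slashdot.org/Slashdot/slashdot' | awk '{for (i=1;i<=NF;i++) { if ($i=="is") {print NR,i}}}' >> foo.txt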

Next time I’ll provide a more rigorous example and show how we can produce useful signal from an ensemble of SVMs. (In the meantime you may want to check out SVM Light.) Last but not least, I’ll go over ideas and objects from Hilbert space, including completeness and wavelets, and how we can ultimately use some math with our signal for robust time series prediction.

ED: go easy with curl, you don’t want to come across as a robot and get your IP/domain banned :)


SDE scripts

January 19, 2009

Mostly a reworking of Fortran code from Kloeden & Platen into Matlab/Octave. Includes important things like Karhunen-Loève expansions, Stratonovich integrals for higher order methods, Itô summation, etc., and also a Markov Chain Monte Carlo example.

SDE scripts
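To give a flavour of the Itô versus Stratonovich distinction mentioned above, here is a small Python sketch (not taken from the scripts themselves, which are Matlab/Octave). It approximates the integral of W dW along one simulated Wiener path; the function names, step count and seed are assumptions for illustration.

```python
import math
import random

def brownian_increments(n_steps, t_end, seed=0):
    """Independent N(0, dt) increments of a Wiener process on [0, t_end]."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

def ito_and_stratonovich(increments):
    """Sum W dW with the integrand taken at the left end point (Ito)
    and averaged over each step (Stratonovich)."""
    w = 0.0
    ito = strat = quad_var = 0.0
    for dw in increments:
        ito += w * dw                 # left end point
        strat += (w + 0.5 * dw) * dw  # average of the two end points
        quad_var += dw * dw           # accumulated quadratic variation
        w += dw
    return ito, strat, quad_var, w

dW = brownian_increments(n_steps=100_000, t_end=1.0)
ito, strat, quad_var, w_end = ito_and_stratonovich(dW)
print("Ito sum         :", ito, " vs (W_T^2 - T)/2 =", (w_end**2 - 1.0) / 2)
print("Stratonovich sum:", strat, " vs W_T^2/2 =", w_end**2 / 2)
```

The two sums differ by half the quadratic variation, which approaches T/2 as the step size shrinks; that gap is the reason the choice of integral matters for higher order schemes.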


Norm Conserving Pseudopotentials I

January 14, 2009

Excellent description by Eberhard Engel here

I’m in the process of trying to create a large number of NCPPs for the calculation of magnetic properties, using GIPAW. The end goal is to try, in conjunction with machine learning and NMR, to do structure determination for complicated materials.

This is somewhat related to the work presented at Microsoft eScience, with the slightly different goal of starting from a large database of calculated values and, via clustering and comparison to NMR simulations, backing out chemical structures.

[Figure: ml_mqmas]


Padé Approximants

January 14, 2009

You may have a Taylor series which is slowly convergent or downright divergent, in which case you might want to try a Padé approximant. This is essentially a method which approximates a function by the ratio of two polynomials, of orders N and M. A particularly useful representation is in continued fractions; the (N+1)th member of the Padé sequence:
[Equation image: pade_1]
for J >= 0 is given by:
[Equation image: pade_2]
Bender & Orszag give an algorithm for the constants c in their book; I’ve worked out the first few by hand for you. There’s no such thing as a free lunch: you might find a converged value for your function, but it’s at the expense of nasty analysis in finding progressively higher orders of c.
[Equation image: pade_3]
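For a hands-on feel, here is a minimal Python sketch that builds a Padé approximant directly from Taylor coefficients. Note that it uses the standard linear-system construction rather than the continued-fraction algorithm of Bender & Orszag described above, and the function names and the choice of ln(1+x) as a test case are my own assumptions.

```python
import math
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Raises StopIteration if the system is singular.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def pade(a, L, M):
    """Return (p, q), the numerator and denominator coefficients of the
    [L/M] approximant to sum a[k] x^k, normalised so that q[0] = 1.
    Needs at least L + M + 1 Taylor coefficients."""
    a = [Fraction(x) for x in a]
    coef = lambda i: a[i] if i >= 0 else Fraction(0)
    rows = range(L + 1, L + M + 1)
    A = [[coef(k - j) for j in range(1, M + 1)] for k in rows]
    b = [-coef(k) for k in rows]
    q = [Fraction(1)] + (solve(A, b) if M else [])
    p = [sum(q[j] * coef(k - j) for j in range(min(k, M) + 1))
         for k in range(L + 1)]
    return p, q

def horner(coeffs, x):
    """Evaluate a polynomial with coefficients in ascending order."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Taylor coefficients of ln(1+x) up to x^4.
a = [0, 1, Fraction(-1, 2), Fraction(1, 3), Fraction(-1, 4)]
p, q = pade(a, 2, 2)
x = Fraction(2)  # outside the radius of convergence of the series
print("Taylor:", float(horner(a, x)))
print("Pade  :", float(horner(p, x) / horner(q, x)))
print("exact :", math.log(3))
```

At x = 2 the truncated Taylor series gives -4/3, while the [2/2] approximant (x + x²/2)/(1 + x + x²/6) gives 12/11 ≈ 1.0909, against ln 3 ≈ 1.0986.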


Fiery Furnaces/Widow City

January 13, 2009

I always thought Pitchfork’s review was a little terse. This is a thoroughly playable and pleasant album, particularly tracks 3, 6 and 8. It’s in heavy rotation this week on my playlist(s).

widow city



Glorious Gawk part I

January 10, 2009

It’s frequently helpful to eyeball a structure while going through iterations of an ab initio calculation, sometimes even building up an animation for a dynamic calculation. One very simple way, without going through a more intensive application, is to parse a structure file with Gawk/Awk to create input for POV-Ray, which is then easily rendered. An example follows; simply change the patterns (/foo/) and the fields ($n) to suit your file.

ED: POV-Ray uses a left-handed coordinate system


# Script to write POV-Ray code from atomic co-ordinates.
# Usage: gawk -f thisscript.awk structure_file
# Output goes to scsul.pov; awk truncates the file on the first write only,
# later writes in the same run are appended.

function texture(name, pigment) {
    # Declare a named texture; all of them share the same finish.
    printf "#declare %s = texture {\n", name > out
    printf "  pigment { color %s }\n", pigment > out
    print  "  finish { ambient 0.7 diffuse 0.5 reflection 0.01 }\n}" > out
}

function sphere(x, y, z, radius, tex) {
    # One atom, drawn as a sphere with the given texture.
    printf "sphere { <%s, %s, %s>, %s\n  texture { %s }\n}\n", x, y, z, radius, tex > out
}

BEGIN {
    out = "scsul.pov"
    print "#include \"colors.inc\"" > out
    print "#include \"textures.inc\"" > out
    print "camera {\n  location <-2, 18, -5>\n  look_at <-2, 0, -5>\n  angle 45\n}" > out
    print "plane {\n  y, -100\n  texture {\n    pigment { color rgb <1, 1, 1> }" > out
    print "    finish { diffuse 0.4 ambient 2 phong 0 phong_size 0 reflection 0 }\n  }\n}" > out
    print "#declare a = 8.7;\n#declare c = 9.0;" > out
    texture("Green",  "rgb <0.2, 0.8, 0.2>")
    texture("Blue",   "rgb <0.2, 0.2, 0.8>")
    texture("Yellow", "Yellow")
    texture("Red",    "rgb <0.8, 0.2, 0.2>")
}

# Fields $9 and $8 are swapped on purpose (see the ED note on handedness).
/S1/ { sphere($7, $9, $8, 0.15, "Yellow") }
/Sc/ { sphere($7, $9, $8, 0.3,  "Green") }
/O/  { sphere($7, $9, $8, 0.6,  "Red") }

END {
    print "light_source { <5, 40, 5> color White }" > out
}