I'm using TimescaleDB to manage 450GB of stocks and options data from Massive (formerly polygon.io), and I've been having LLM agents iterate over academic research, backtesting each idea to see if anything actually improves trading.
It's an addictive slot machine: I pull the lever, the dials spin, and I hope for the sound of a jackpot. 999 out of 1000 winning models win because of look-ahead bias, which makes them look great on paper while being genuinely bad models. For example, one didn't convert timestamps from UTC to US Eastern, so five hours of future knowledge got baked into the model. Another used `SELECT DISTINCT`, which effectively picked an arbitrary row from a 0-5 hour window, meaning up to five hours of future knowledge got baked in. That one was somehow related to Timescale hypertables.
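For anyone curious what the timezone fix looks like in practice, here's a minimal pandas sketch; the `trades` frame, column names, and the assert-style guard are my own illustration, not the actual pipeline:

```python
# Minimal sketch of the timezone trap, assuming pandas and a hypothetical
# `trades` DataFrame whose "ts" column holds naive UTC timestamps.
import pandas as pd

trades = pd.DataFrame(
    {"ts": pd.to_datetime(["2024-03-08 14:30:00", "2024-03-08 21:00:00"]),
     "price": [178.2, 180.1]}
)

# Wrong: treating naive UTC timestamps as exchange-local time silently shifts
# every bar ~5 hours into the future relative to the exchange clock.
# Right: localize to UTC first, then convert to the exchange's timezone
# (tz_convert also handles the EST/EDT daylight-saving switch for free).
trades["ts"] = trades["ts"].dt.tz_localize("UTC").dt.tz_convert("America/New_York")

# Cheap look-ahead guard: no feature row may be stamped after the decision time.
decision_time = pd.Timestamp("2024-03-08 16:00:00", tz="America/New_York")
assert (trades["ts"] <= decision_time).all(), "future data leaked into features"
```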
Now I'm applying the VIX formula to TSLA options trades, to see whether research papers about trading with VIX carry over to a single name like TSLA.
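For reference, the CBOE calculation backs out a model-free variance from a strip of OTM options. Here's a single-expiry sketch (the real VIX interpolates two expiries to a constant 30 days, which I'm skipping, and every quote below is made up):

```python
# Hedged sketch of the CBOE-style variance calculation on one expiry of a
# chain. Assumes at least two strikes, sorted ascending, with OTM mid quotes.
import math

def vix_style_vol(strikes, mids, forward, k0, t_years, r):
    """strikes: sorted OTM strikes; mids: mid quotes Q(K_i);
    forward: forward price F; k0: first strike at or below F."""
    total = 0.0
    n = len(strikes)
    for i, (k, q) in enumerate(zip(strikes, mids)):
        # dK_i: half the gap between neighbors; one-sided at the edges.
        if i == 0:
            dk = strikes[1] - strikes[0]
        elif i == n - 1:
            dk = strikes[-1] - strikes[-2]
        else:
            dk = (strikes[i + 1] - strikes[i - 1]) / 2.0
        total += (dk / k**2) * math.exp(r * t_years) * q
    var = (2.0 / t_years) * total - (1.0 / t_years) * (forward / k0 - 1.0) ** 2
    return 100.0 * math.sqrt(var)

# Made-up quotes: OTM puts below the forward, OTM calls above it.
strikes = [200.0, 210.0, 220.0, 230.0, 240.0]
mids    = [1.50, 3.00, 6.50, 3.20, 1.40]
print(vix_style_vol(strikes, mids, forward=222.0, k0=220.0,
                    t_years=30 / 365, r=0.05))  # ~28, a VIX-style number
```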
Whatever the case, I've learned a lot about working with LLM agents and time-series data, and very little about actually trading equities and derivatives.
(I did 100% beat SPY with a train/out-of-sample test, though not by much. I'll likely share it here in a couple weeks. It automates trading on Robinhood, which is pretty cool.)
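For anyone wondering what "train/out-of-sample test" means concretely here, a minimal chronological split with an embargo gap looks something like the sketch below; the `bars` frame and function name are hypothetical, not the actual harness:

```python
# Chronological train / out-of-sample split with an embargo gap, so nothing
# the model trains on overlaps (or immediately abuts) the test window.
import pandas as pd

def chrono_split(bars: pd.DataFrame, split: str, embargo: str = "1D"):
    """bars: DataFrame with a DatetimeIndex; split: cutoff date string."""
    cutoff = pd.Timestamp(split, tz=bars.index.tz)
    train = bars[bars.index < cutoff]
    test = bars[bars.index >= cutoff + pd.Timedelta(embargo)]
    return train, test
```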
Fun fact: some vendor data is a subset of the full tape, with outlying prints trimmed, so stop-loss conditions that get tripped in the real world never trip against your dataset. Cross-check data from multiple sources.
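A toy illustration of the failure mode (all prices invented):

```python
# A 95 stop trips on the raw tape because of one outlying print, but never
# trips on the vendor's trimmed series, so the backtest silently skips a
# fill that would have happened live.
raw_tape = [100.0, 99.0, 94.8, 99.5]   # includes the outlying print
trimmed  = [100.0, 99.0, 99.5]         # vendor removed the outlier
stop = 95.0
print(any(p <= stop for p in raw_tape))  # True  (real world: stopped out)
print(any(p <= stop for p in trimmed))   # False (backtest: position survives)
```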
Relatable. If I had a dollar for every time I ran into issues with time zones, that would be a profitable strategy in and of itself.
I tried exactly this, loading polygon.io data into TimescaleDB, and it was very inefficient.
Ended up using ClickHouse instead: much smaller on disk, and much faster on every metric I measured.
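For flavor, here's a tick-table sketch along the lines of what makes ClickHouse so compact: a MergeTree ordered by (symbol, time) with per-column codecs. The table, columns, and connection details are mine, not from my actual setup:

```python
# Hypothetical tick-data schema via the clickhouse-connect client. The
# ORDER BY key plus per-column codecs (Gorilla for floats, Delta for
# monotonic-ish ints) is where the on-disk savings over a row store
# typically come from.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)
client.command("""
    CREATE TABLE IF NOT EXISTS trades (
        symbol LowCardinality(String),
        ts     DateTime64(9, 'America/New_York'),
        price  Float64 CODEC(Gorilla, ZSTD),
        size   UInt32  CODEC(Delta, ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (symbol, ts)
""")
```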
Nice. I played with this a bit. Agents are very good at Rust and CUDA, so massively parallelizing compute for things like options chains may give you an edge. You may also find it hard to get a connection with latency low enough that, once you factor in the other delays, you still have an edge. So one approach is to accept that as a hobbyist you can't compete on lowest latency, and instead compete on two other fronts: the most effective algorithm, and the ability to massively parallelize on a consumer GPU what would take others longer to calculate. A toy version of that idea is sketched below.
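As an example of "price the whole chain at once", here's the vectorized numpy prototype you'd then port to Rust/rayon or a CUDA kernel; every input is made up:

```python
# Black-Scholes call prices for an entire strike ladder in one vectorized
# pass. This numpy version is the prototype for a Rust or CUDA port.
import numpy as np
from scipy.stats import norm

def bs_call(spot, strikes, t, r, sigma):
    """Vectorized Black-Scholes calls across an array of strikes."""
    d1 = (np.log(spot / strikes) + (r + 0.5 * sigma**2) * t) / (sigma * np.sqrt(t))
    d2 = d1 - sigma * np.sqrt(t)
    return spot * norm.cdf(d1) - strikes * np.exp(-r * t) * norm.cdf(d2)

strikes = np.arange(100.0, 400.0, 2.5)  # a TSLA-sized ladder, 120 strikes
prices = bs_call(250.0, strikes, t=30 / 365, r=0.05, sigma=0.6)
```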
Best of luck. Super fun!
PS: A follow-up: there was a post here a few days ago about a research breakthrough where they literally just had the agent iterate on a single planning doc over and over. I think pushing chain of thought with SOTA foundation models is fertile ground, and it may lead to an algorithmic breakthrough if you start from some solid academic research.