OP here. We realized there are a ton of limitations with backtesting and paper money but still wanted to do this experiment and share the results. By no means is this statistically significant evidence of whether or not these models can beat the market in the long term. But wanted to give everyone a way to see how these models think about and interact with the financial markets.
> Grok ended up performing the best while DeepSeek came close to second.
I think you mean "DeepSeek came in a close second".
Cool experiment.
I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
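To make that concrete, here is a minimal sketch of the simplest version: a single-factor (CAPM-style) regression of portfolio returns on market returns. The intercept is the abnormal return (alpha) and the slope is the market loading (beta). Everything here is an assumption for illustration: the function name `capm_alpha_beta` is hypothetical, the data is synthetic, and the risk-free rate is taken as zero. A fuller analysis would add more factors (size, value, momentum) and report standard errors.

```python
import numpy as np

def capm_alpha_beta(portfolio_returns, market_returns, risk_free=0.0):
    """OLS of excess portfolio returns on excess market returns.

    Returns (alpha, beta) per period. Hypothetical helper, not from the experiment.
    """
    rp = np.asarray(portfolio_returns, dtype=float) - risk_free
    rm = np.asarray(market_returns, dtype=float) - risk_free
    # Design matrix: a column of ones (intercept = alpha) and the market factor (slope = beta).
    X = np.column_stack([np.ones_like(rm), rm])
    coef, *_ = np.linalg.lstsq(X, rp, rcond=None)
    alpha, beta = coef
    return alpha, beta

# Synthetic example (assumed data): a portfolio that is just 1.5x the market.
rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, 250)   # roughly 250 trading days of made-up daily returns
portfolio = 1.5 * market                 # leveraged market exposure, no stock-picking skill

alpha, beta = capm_alpha_beta(portfolio, market)
print(f"alpha per day: {alpha:.6f}, beta: {beta:.2f}")
```

In this toy case the portfolio beats the market whenever the market is up, yet alpha comes out at essentially zero and beta at 1.5. That is the distinction I mean: high raw returns from a high loading on a rising market are not the same thing as alpha.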
These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!
LLMs are handy tools but no more. Even Qwen3-30B heavily quantised will make a passable effort at translating some Latin to English. It can whip up small games in a single prompt and much more, and with care can deliver seriously decent results, but so can my drill driver! That model only needs a £500 second-hand GPU, which is impressive to me. Also GPT-OSS etc.
Yes, you can dive in with the bigger models that need serious hardware and they seem miraculous. A colleague had to recently "force" Claude to read some manuals until it realised it had made a mistake about something and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.
I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?
> But wanted to give everyone a way to see how these models think…
Think? What exactly did “it” think about?
You should redo this with human controls. By a weird coincidence, I have sufficient free time.