> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading
So the results are meaningless - if the backtest period falls inside the models' training data, these LLMs effectively have foresight over the market outcomes.
> We time segmented the APIs to make sure that the simulation isn’t leaking the future into the model’s context.
I wish they would explain what this actually means.
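My guess is it means every data feed is gated on the simulated clock, so a query at simulated time t can only ever return records timestamped at or before t. A rough sketch of that idea (the class and the data are made up, not from the article):

```python
from datetime import datetime, timezone

class TimeSegmentedFeed:
    """Hypothetical sketch: gate a data source on the simulation clock
    so the model's context can never contain the future."""

    def __init__(self, records):
        # records: list of (timestamp, payload), sorted by timestamp
        self.records = sorted(records, key=lambda r: r[0])

    def query(self, sim_now: datetime):
        # Return only records at or before the simulated "now".
        return [payload for ts, payload in self.records if ts <= sim_now]

# Dummy data for illustration only.
feed = TimeSegmentedFeed([
    (datetime(2024, 1, 2, tzinfo=timezone.utc), {"AAPL": 100.0}),
    (datetime(2024, 6, 3, tzinfo=timezone.utc), {"AAPL": 120.0}),
])

# At a simulated date in March, the June record is invisible to the model.
context = feed.query(datetime(2024, 3, 1, tzinfo=timezone.utc))
```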
That's only a problem if they were trained on data more recent than the start of the 8-month backtest window.
Not sure how sound the analysis is, but they apparently did think of that:
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
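Whether they got the dates right is another question, but the gate itself is just a date comparison, something like this (the cutoff dates below are made-up placeholders, not the models' actual cutoffs):

```python
from datetime import date

# Illustrative cutoff dates only -- the real ones vary by model and version.
TRAINING_CUTOFFS = {
    "gpt-5": date(2024, 10, 1),
    "claude": date(2024, 11, 1),
}

def backtest_window_is_unseen(model: str, backtest_start: date) -> bool:
    # The entire backtest must start after the model's training cutoff,
    # otherwise it may have memorized the market outcomes.
    return backtest_start > TRAINING_CUTOFFS[model]

assert backtest_window_is_unseen("gpt-5", date(2025, 1, 1))
```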