Hacker News

We gave 5 LLMs $100K to trade stocks for 8 months

192 points by cheeseblubber yesterday at 11:08 PM | 162 comments

Comments

bcrosby95yesterday at 11:20 PM

> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.

naetyesterday at 11:47 PM

I used to work for a brokerage API geared at algorithmic traders, and in my anecdotal experience many strategies seem to work well when backtested on paper but, for various reasons, can end up flopping when actually executed in the real market. Even testing a strategy with real-time paper trading can turn out differently than testing on the actual market, where other parties are also viewing your trades and making their own responses. The post did list some potential disadvantages of backtesting, so they clearly aren't totally in the dark on it.

Deepseek did not sell anything, but did well with holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently so not surprising that it performed well. Seems like they only get to "trade" once per day, near the market close, so it's not really a real time ingesting of data and making decisions based on that.

What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.

Nevermarkyesterday at 11:47 PM

Just one run per model? That isn't backtesting. I mean technically it is, but "testing" implies producing meaningful measures.

Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...

100 independent runs of each model over 10 time intervals with very different market behavior would produce meaningful results. Like actually credible, meaningful means and standard deviations.

This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.
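To make the "means and standard deviations" point concrete, here is a minimal sketch of summarizing many independent runs. The `simulate_run` function is a hypothetical stand-in for one full backtest, and the return distribution it draws from is an assumption for illustration only, not data from the experiment.

```python
import random
import statistics


def simulate_run(rng: random.Random) -> float:
    """Hypothetical stand-in for one full backtest run.

    Returns a total return as a fraction. The distribution is an
    assumption for illustration, not a measurement.
    """
    return rng.gauss(0.10, 0.15)


def summarize(returns: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(returns)
    stdev = statistics.stdev(returns)
    stderr = stdev / len(returns) ** 0.5
    return mean, stdev, stderr


rng = random.Random(0)  # fixed seed so the sketch is repeatable
runs = [simulate_run(rng) for _ in range(100)]
mean, stdev, stderr = summarize(runs)
print(f"mean={mean:.3f} stdev={stdev:.3f} stderr={stderr:.3f}")
```

With a single run per model there is no standard deviation to report at all, which is the core of the objection.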

dash2yesterday at 11:20 PM

There's also this thing going on right now: https://nof1.ai/leaderboard

Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.

dhosektoday at 1:10 AM

I wouldn’t trust any backtest with these models. Try doing a real-time test over 8 months and see what happens then. I’d also be suspicious of anything that doesn’t take actual costs into account.

cheeseblubberyesterday at 11:30 PM

OP here. We realized there are a ton of limitations with backtest and paper money but still wanted to do this experiment and share the results. By no means is this statistically significant on whether or not these models can beat the market in the long term. But wanted to give everyone a way to see how these models think about and interact with the financial markets.

sethops1yesterday at 11:13 PM

> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading

So the results are meaningless - these LLMs have the advantage of foresight over historical data.

lvspifftoday at 1:20 AM

I set up real-life accounts with E*TRADE and Fidelity: the E*TRADE one uses their auto portfolio, and at Fidelity I have an advisor for retirement. I also did a basket portfolio, using MS365 with Grok 5 plus various articles and strategies to pick a set of 5 ETFs that would perform similarly to the exposure of my other two.

So far this year all are beating the S&P percentage-wise (though only by <1%), but the AI basket is doing the best, or at least on par with my advisor, and it’s getting to the point where E*TRADE's auto investment strategy, at least, isn’t worth it. It's been an interesting battle to watch as each rebalances at varying times while I put more funds into each, and some have solid gains whose profits get moved to more stable areas. This is only with a few thousand dollars in each account other than retirement, but it's still fun to see things play out this year.

In other words, though, I'm not surprised at all by the results. AI still isn't something to day trade with, but it is helpful for researching your desired long-term risk exposure, IMO.

buredorannayesterday at 11:27 PM

Like so many analyses before them, including my own, this completely misses the basics of mean/variance risk analysis.

We need to know the risk adjusted return, not just the return.
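As a rough illustration of risk-adjusted return, here is a sketch of an annualized Sharpe ratio computed from daily returns. The sample return series, the zero risk-free rate, and the 252-trading-day convention are assumptions for the example, not figures from the article.

```python
import math
import statistics


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)


# Two made-up series with the same average daily return but different volatility.
steady = [0.001, 0.002, 0.001, 0.002]
volatile = [0.030, -0.027, 0.030, -0.027]

print(sharpe_ratio(steady), sharpe_ratio(volatile))
```

Both series average 0.15% per day, but the steadier one scores far higher, which is exactly the information a raw return figure hides.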

copypaperyesterday at 11:55 PM

>Each model gets access to market data, news APIs, company financials...

The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...

I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.

I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.

xnxyesterday at 11:28 PM

Spoiler: They did not use real money or perform any actual trades.

hoerzutoday at 12:43 AM

I built a tool for backtesting LLMs on Polymarket. You can try it with live data, without signing up, at: https://timba.fun

btbuildemtoday at 1:33 AM

It turns out DeepSeek only made BUY trades (not a single SELL in the history in their live example) -- so basically, buy & hold strategy wins, again.

regnulltoday at 2:17 AM

I'm working on a project where you can run your own experiment (or use it for real trading): https://portfoliogenius.ai. Still a bit rough, but most of the main functionality works.

client4today at 1:07 AM

The obvious next question is: does the AI on cocaine outperform? https://pihk.ai/

dehrmanntoday at 1:05 AM

Is it just prompting LLMs with "I have $100k to invest. Here are all publicly traded stocks and a few stats on them. Which stocks should I buy?" And repeat daily, rebalancing as needed?

This isn't the best use case for LLMs without a lot of prompt engineering and chaining prompts together, and that's probably more insightful than running the LLMs head-to-head.

mlmonkeyyesterday at 11:32 PM

> We were cautious to only run after each model’s training cutoff dates for the LLM models

Grok is constantly training and/or it has access to websearch internally.

You cannot backtest LLMs. You can only "live" test them going forward.

hoerzutoday at 12:42 AM

How many trades? What's the z-score?
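One common reading of "z-score" here is the mean excess return divided by its standard error, testing whether the average differs from zero. A minimal sketch under that assumption follows; the input would be per-period returns minus the benchmark's, and with few observations this is really a t-statistic.

```python
import math
import statistics


def z_score(excess_returns):
    """Mean excess return divided by its standard error (null hypothesis: mean is 0)."""
    n = len(excess_returns)
    if n < 2:
        raise ValueError("need at least two observations")
    stdev = statistics.stdev(excess_returns)
    if stdev == 0:
        raise ValueError("zero variance; z-score is undefined")
    stderr = stdev / math.sqrt(n)
    return statistics.mean(excess_returns) / stderr


# Made-up excess returns versus a benchmark, for illustration only.
print(z_score([0.01, 0.03]))
```

Because the standard error shrinks with the square root of the sample size, a handful of trades rarely gives a score large enough to rule out luck.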

1a527dd5yesterday at 11:38 PM

Time.

That has been the best way to get returns.

I set up a 212 account when I was looking to buy our first house. I bought in tiny chunks, in industries I was comfortable with and knowledgeable about. Over the years I built up a nice portfolio.

Anyway, long story short: I forgot about the account. We moved in, got a dog, had children.

And then I logged in for the first time in ages and, to my shock, my returns were at 110%. I'd done nothing. It's bizarre and perplexing.

cedwstoday at 12:04 AM

Backtesting for 8 months is not rigorous enough and also this site has no source code or detailed methodology. Not worth the click.

halzmyesterday at 11:25 PM

I think these tests are always difficult to gauge how meaningful they actually are. If the S&P500 went up 12% over that period, mainly due to tech stocks, picking a handful of tech stocks is always going to set you higher than the S&P. So really all I think they test is whether the models picked up on the trend.

I'm more surprised that Gemini managed to lose 10%. I wish they actually mentioned what the models invested in and why.

refactor_mastertoday at 1:28 AM

Should have done GME stocks only. Now THAT would’ve been interesting to see how much they’d end up losing on that.

Just riding a bubble up for 8 months with no consequences is not an indicator of anything.

XCSmetoday at 1:17 AM

If it's backtesting on data older than the model, then the strategy can have lookahead bias, because the model might already know what big events will happen that can influence the stock markets.
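A minimal sketch of one guard against this kind of lookahead bias: trim the backtest window so it starts strictly after the model's training cutoff. The function name and the dates are hypothetical, and this does not cover models that keep training or can search the web during the test.

```python
from datetime import date, timedelta


def leak_free_window(backtest_start, backtest_end, training_cutoff):
    """Return the part of the window strictly after the cutoff, or None if nothing is left."""
    first_safe_day = training_cutoff + timedelta(days=1)
    start = max(backtest_start, first_safe_day)
    if start > backtest_end:
        return None
    return start, backtest_end


# Hypothetical dates for illustration only.
window = leak_free_window(date(2025, 1, 1), date(2025, 8, 31), date(2025, 3, 31))
print(window)
```

If the function returns `None`, the whole backtest period falls inside the training data and the run tells you nothing about foresight.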

wowamittoday at 1:12 AM

Is finding the right stocks to invest in an LLM problem? Language models aren't the right fit, I would presume. It would also be insightful to compare this with traditional ML models.

XenophileJKOyesterday at 11:56 PM

So.. I have been using an LLM to make 30 day buy and hold portfolios. And the results are "ok". (Like 8% vs 6% for the S&P 500 over the last 90 days)

What you ask the model to do is super important. Just like writing or coding.. the default "behavior" is likely to be "average".. you need to be very careful about what you are asking for.

For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).

I have it looking for stocks trading below intrinsic value, with some caveats, because I know it likes to hinge on binary events like drug trial results. I also have it look at the correlation between the positions and make sure they don't share the same macro vulnerability.

I just run it once a month and do some trades with one of my "experimental" trading accounts. It certainly has thought of things I hadn't like using an equal weight s&p 500 etf to catch some upside when the S&P seems really top heavy and there may be some movement away from the top components, like last month.

digitcatphdyesterday at 11:34 PM

Backtesting is a complete waste in this scenario. The models already know the best outcomes and are biased towards it.

Benderyesterday at 11:47 PM

This experiment was also performed with a fish [1] though it was only given $50,000. Spoiler, the fish did great vs wall street bets.

[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]

luccabztoday at 12:55 AM

we should:

1. train with a cutoff date at ~2006

2. simulate information flow (financial data, news, earnings, ...) day by day

3. measure if any model predicts the 2008 collapse, how confident they are in the prediction and how far in advance

mikewarottoday at 12:25 AM

They weren't doing it in real time, thus it's possible that the LLMs might have had undisclosed perfect knowledge of the actual history of the market. Only a real-time study is going to eliminate this possibility.

parpfishyesterday at 11:20 PM

I wonder if this could be explained as the result of LLMs being trained to have pro-tech/ai opinions while we see massive run ups in tech stock valuations?

It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming

Genegotoday at 1:07 AM

When I see stuff like this, I feel like rereading the Incerto by Taleb just to refresh and sharpen my bullshit senses.

itaketoday at 12:34 AM

Model output is non-deterministic.

Did they make 10 calls per decision and then choose the majority? or did they just recreate the monkey picking stocks strategy?
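A sketch of the "10 calls, then majority" idea, to show how sampling noise could be reduced per decision. The `ask_model` function is a hypothetical stub that returns random answers; a real version would call an LLM API, and nothing here reflects how the experiment was actually run.

```python
import random
from collections import Counter


def ask_model(rng: random.Random) -> str:
    """Hypothetical stub for one non-deterministic model call."""
    return rng.choice(["BUY", "BUY", "HOLD", "SELL"])


def majority_decision(decisions):
    """Return the most common decision and the share of calls that agreed with it.

    Ties go to whichever tied decision appeared first in the list.
    """
    counts = Counter(decisions)
    top, top_count = counts.most_common(1)[0]
    return top, top_count / len(decisions)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
votes = [ask_model(rng) for _ in range(10)]
decision, agreement = majority_decision(votes)
print(decision, agreement)
```

A low agreement share is itself useful information: it suggests the decision is close to a coin flip.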

chongliyesterday at 11:20 PM

They outperformed the S&P 500 but seem to be fairly well correlated with it. Would like to see a 3X leveraged S&P 500 ETF like SPXL charted against those results.
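For a rough idea of what such a comparison line would look like, here is a sketch that compounds a daily-rebalanced leveraged position from index daily returns. It assumes an idealized 3x product with no fees, borrowing costs, or tracking error, so it is not a model of SPXL itself, and the sample returns are made up.

```python
def leveraged_growth(daily_returns, leverage=3.0):
    """Growth of 1 unit when each day's return is `leverage` times the index return."""
    value = 1.0
    for r in daily_returns:
        value *= 1 + leverage * r
    return value


# Made-up index path: up 10% one day, down 10% the next.
index_path = [0.10, -0.10]
print(leveraged_growth(index_path, leverage=1.0))  # unleveraged index
print(leveraged_growth(index_path, leverage=3.0))  # idealized 3x daily
```

Daily compounding means the leveraged result is not simply three times the index's total return, so it is worth charting rather than estimating.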

iLoveOncallyesterday at 11:29 PM

Since it's not included in the main article, here is the prompt:

> You are a stock trading agent. Your goal is to maximize returns.

> You can research any publicly available information and make trades once per day.

> You cannot trade options.

> Analyze the market and provide your trading decisions with reasoning.

>

> Always research and corroborate facts whenever possible.

> Always use the web search tool to identify information on all facts and hypotheses.

> Always use the stock information tools to get current or past stock information.

>

> Trading parameters:

> - Can hold 5-15 positions

> - Minimum position size: $5,000

> - Maximum position size: $25,000

>

> Explain your strategy and today's trades.

Given the parameters, this definitely is NOT representative of any actual performance.

I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.

As an example, Deepseek made only 21 trades, which were all buys, and which were all because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.

gwdyesterday at 11:21 PM

The summary to me is here:

> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

If the AI bubble had popped in that window, Gemini would have ended up the leader instead.

867-5309today at 1:56 AM

GPT-5 was released 4 months ago..

mempkotoday at 3:10 AM

The stats are abysmal. What's the MDD compared to the S&P 500? What is the Sortino? What are the confidence intervals for all the stats? Number of trades? So many questions....
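For reference, a minimal sketch of the two statistics named above: maximum drawdown (MDD) from an equity curve and a Sortino ratio from per-period returns. The sample numbers, the zero target return, and the 252-period annualization are assumptions for illustration, not results from the experiment.

```python
import math
import statistics


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Mean excess return over downside deviation, annualized."""
    excess = [r - target for r in returns]
    # Downside deviation counts only returns below the target.
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside == 0:
        raise ValueError("no returns below target; Sortino ratio is undefined")
    return statistics.mean(excess) / downside * math.sqrt(periods_per_year)


print(max_drawdown([100, 120, 90, 130]))  # made-up equity curve
print(sortino_ratio([0.02, -0.01]))       # made-up daily returns
```

Unlike the Sharpe ratio, the Sortino ratio does not penalize upside volatility, which matters for portfolios that mostly went up over the period.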

hsuduebc2today at 2:34 AM

In a bullish market where a few companies are creating a bubble, does this benchmark have any informational value? Wouldn't it be better to run this on seemingly random intervals in past years?

dogmayoryesterday at 11:34 PM

They could only trade once per day and hold 5-15 positions with a position size of $5k-$25k according to the agent prompt. Limited to say the least.
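To show how narrow those limits are, here is a sketch of a checker for the quoted constraints (5-15 positions, each between $5,000 and $25,000). The function name and the ticker labels are hypothetical; only the numeric limits come from the prompt.

```python
MIN_POSITIONS, MAX_POSITIONS = 5, 15
MIN_SIZE, MAX_SIZE = 5_000, 25_000


def check_portfolio(positions):
    """Return a list of constraint violations for a {ticker: dollar value} mapping."""
    problems = []
    if not MIN_POSITIONS <= len(positions) <= MAX_POSITIONS:
        problems.append(
            f"{len(positions)} positions, expected {MIN_POSITIONS}-{MAX_POSITIONS}"
        )
    for ticker, value in positions.items():
        if not MIN_SIZE <= value <= MAX_SIZE:
            problems.append(
                f"{ticker}: ${value:,.0f} is outside ${MIN_SIZE:,}-${MAX_SIZE:,}"
            )
    return problems


# Hypothetical tickers: five equal $20,000 positions from a $100K account.
portfolio = {f"TICKER{i}": 20_000 for i in range(5)}
print(check_portfolio(portfolio))
```

With $100K of starting capital, an equal split across the maximum of 15 positions is about $6,667 each, barely above the minimum size, so there is little room for fine-grained weighting.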

tiffaniyesterday at 11:46 PM

What was the backtesting method? Was walk-forward testing involved? There are different ways to backtest.

_alternator_today at 12:13 AM

Wait, they didn’t give them real money. They simulated the results.

IncreasePoststoday at 1:26 AM

Just picking tech stocks and winning isn't interesting unless we know the thesis behind picking the tech stocks.

Instead, maybe a better test would be to give it 100 medium cap stocks, have it continually balance its portfolio among those 100 stocks, and then test the performance.

stuffntoday at 1:02 AM

Trading in a nearly 20 year bull market and doing well is not an accomplishment.

darepublictoday at 1:46 AM

So in other words I should have listened to the YouTube brainrot and asked ChatGPT for my trades. Sigh.

jacktheturtleyesterday at 11:29 PM

This is really dumb, because the models themselves, like markets, are non-deterministic. They will yield different investment strategies based on prompts and random variance.

This is a really dumb measurement.

dismalafyesterday at 11:57 PM

Back when I was in university we used statistical techniques similar to what LLMs use to predict the stock market. It's not a surprise that LLMs would do well over this time period. The problem is that when the market turns and bucks trends they don't do so well, you need to intervene.

apical_dendriteyesterday at 11:24 PM

Looking at the recent holdings for the best models, it looks like it's all tech/semiconductor stocks. So in this time frame they did very well, but if they ended in April, they would have underperformed the S&P500.
