
aucisson_masque last Thursday at 10:53 AM (2 replies)

Last time I used Chatbot Arena, I was the one asking the LLM questions, so I made my own benchmark. There weren't any predefined questions.

How could Musk's LLM train on data that does not yet exist?


Replies

JKCalhoun last Thursday at 3:25 PM

That's true. You can head over to lmarena.ai and pit it against other LLMs yourself. I only tried two prompts but was surprised at how well it did.

There are "leaderboards" there that provide more anecdotal data points than my two.

HenryBemis last Thursday at 11:49 AM

That. I have only used ChatGPT, and I remember asking legacy 4 to write some code. I asked o3 the same question when it came out, and then I compared the code. o3 was 'better': more precise, more detailed, less 'crude'. Now, don't get me wrong, crude worked fine. But when I wanted to do v1.1 and v1.2, o3 nailed it every time, while legacy 4 was simply bad and full of errors.

With that said, I assume that every 'next' version of each engine is using my 'prompts' to train, so each new version has the benefit of having already processed my initial v1.0 and then v1.1 and then v1.2. So it is somewhat 'unfair', because for "ChatGPT v2024" my v1.0 is brand new, while for "ChatGPT v2027" my v1.0, v1.1, and v1.2 are already in the training dataset.

I haven't used Grok yet; perhaps it's time to pause that OpenAI payment, give Elon some $$$, and see how it works 'for me'.