tkgally • 01/20/2025 • 6 replies

Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.

I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)

By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.

Replies

torginus • 01/20/2025

o1 is impressive. I tried feeding it some of the trickier problems I have solved over the past few months (ones that involved nontrivial algorithmic challenges), and it managed to solve all of them, usually coming up with slightly different solutions than I did, which was great.

However, what I found odd was that it formulated the solution in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.

Once I managed to follow its reasoning, I understood that what it came up with could essentially be explained in two sentences of plain English.

On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.

rcpt • 01/20/2025

I found that reasoning models were good for CAD. I can ask for OpenSCAD code to produce some kind of shape and then add to it.

starfezzy • 01/20/2025

Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Prompts like, "Give me five odd numbers that don't have the letter 'e' in their spelling," or "How many 'r's are in the word strawberry?"

I suspect the breakthrough that enables solving trivial questions won't itself be trivial.
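
For reference, both prompts can be checked mechanically. Here is a minimal Python sketch, assuming standard English number spellings; `spell` is a hypothetical helper written for illustration and only covers 1 to 999. It suggests the first prompt is a trick question, at least in that range.

```python
# Hypothetical helper tables for spelling out integers in English.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell(n: int) -> str:
    """Spell out an integer in the range 1..999 (illustrative only)."""
    if not 0 < n < 1000:
        raise ValueError("spell() only handles 1..999")
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)


# Prompt 2: count the letter 'r' in "strawberry".
r_count = "strawberry".count("r")

# Prompt 1: odd numbers below 1000 whose spelling has no letter 'e'.
odd_without_e = [n for n in range(1, 1000, 2) if "e" not in spell(n)]

print(r_count)        # 3
print(odd_without_e)  # []
```

Every odd number ends in one, three, five, seven, nine, or one of the odd teens, all of which contain an "e", so the list comes back empty.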

thefourthchime • 01/21/2025

I completely agree: for my day-to-day use, o1 isn't needed. I only use it for complicated solutions involving code.

synergy20 • 01/20/2025

A dumb question: how did you use DeepSeek, e.g. R1?
