Ask HN: How are thinking efforts implemented?

57 points • by simianwords • last Sunday at 12:38 PM • 19 comments • view on HN

Claude and ChatGPT have thinking efforts where you can tune the amount of thinking allowed.

Like low, medium, high, xhigh and so on.

But are they different models underneath? Or same model with different parameter?

The reason I ask is because, if I change the effort param mid conversation in Claude code, I get a warning suggesting I’m breaking the cache.

I don’t think this happens in Codex because when I change the effort, the responses are still quick.

Comments

pyentropy • last Sunday at 3:48 PM

Take a look at the harmony repo which specifies the internal OpenAI format - the effort level is specified in the context after the <|start|> tag - https://github.com/openai/harmony

Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.

➕ show 1 reply

aabdi • last Sunday at 2:33 PM

Different models do slight variants.

Usually it’s done in post training to enforce behavior based on prompt. Ie. System prompt with thinking:max or low or wtv.

Enforcement then goes via constrained decoding, checking for think token start and end with max lengths, or other variations

bjourne • last Sunday at 6:23 PM

LLMs work by generating the most likely continuation to a prompt. But they can also generate multiple likely continuations. This create multiple branches which in turn can generate even more branches. The LLM can then evaluate the branches, prune the unpromising ones, and merge the best ones. More branches means more tokens, means more effort.

➕ show 1 reply

__patchbit__ • last Sunday at 1:50 PM

At a guess. May be associated with token length context window. Down selecting is consistent with warning message, forcing cutoff to context window. The technical term cache being a synonym. Increasing the headroom for more "thinking" should allow the implementation to access more resources without warning about the cache breaking.

sometimelurker • last Sunday at 4:35 PM

they use multitoken prediction behind the scenes, that might interact with the CoT in a strange way. maybe for different thinking modes they have different MTP models? if so thats interesting

➕ show 1 reply

Yahyaaa • last Monday at 1:29 PM

Usually it’s not a different model, it’s the same model with different inference-time settings. “Thinking effort” typically changes the compute budget and decoding behavior (how many steps, how much exploration, sometimes internal planning loops).

Some stacks also tie it to orchestration layers or system/prompt signals, which is why it can look inconsistent across products

shanewei • last Sunday at 1:12 PM

[dead]

alt Hacker News

Ask HN: How are thinking efforts implemented?

Comments