> the estimates
It doesn't estimate.
It generates tokens that read like estimates associated with the context in its training material.
What would you expect the generator to output instead?
It generates tokens by estimating what the next token is going to be.
Sure it cannot think like a human, but given it's input, it should give a good statistical answer (approximating not of how long it actually takes, but what a human would say how long it takes).
I think people are continuing to view these systems as pure LLMs - when that ship sailed 6+ months ago. Between being able to review memory, using agent harnesses and sub agents and skills to go out and discover information - modern systems (Codex, Claude Code, Cursor) - use LLMs - but the LLM is only a small component of it. Compare what you get from sending a request to a chatbot like ChatGPT - to what you can from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF" - that's a very 2025 way of looking at these systems.
Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.
you might like the stuff in my work of oh my pi, its a test bed for my ideas around making these tools more reliable. hoping to maybe have a native ui iter of the real thing that this is a test bed for this summer.
https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b...
Obviously there isn't a hidden corpus of logs of coding chatbot assistants that has been accumulating over the years, but these coding chatbot assistants output tokens that resemble how we all imagined a coding chatbot assistant would have operated had it existed in the first place to end up in a corpus. "Training material" includes supervised fine-tuning, preference training, RLHF, and so on, so that certain outputs (like these timeline estimates) may really have been decided (at some level of conscious awareness) by product teams.