
martin-t · today at 1:55 AM

This just shows that the models (not AI, statistical models of text used without consent) are not that smart; it's the tooling around them that allows using these models as a heuristic for a brute-force search of the solution space.

Just last week, I prompted (not asked, it is not sentient) Claude to generate (not tell me or find out, or any other anthropomorphization) an answer to whether I need to call Dispose on objects passed to me by two different libraries for industrial cameras. Because these are industrial libraries, most people using them don't post their code publicly, which means the models have poor statistical coverage of these topics.

The LLM generated a response which triggered the tooling around it to perform dozens of internet searches and then, based on my initial prompt, the search results, and lots of intermediate tokens ("thinking"), generated a reply which said that yes, I need to call Dispose in both cases.

It was phrased authoritatively and confidently.

So I tried it: one library segfaulted, the other threw an exception on a later call. I performed my own internet search (a single one) and immediately found documentation from one of the libraries clearly stating I don't need to call Dispose. The other library, being much more poorly documented, didn't mention this explicitly but had examples which didn't call Dispose.
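
For illustration (the class names below are made up, not either vendor's actual API), the whole question comes down to which of these two ownership models the library uses:

    using System;

    // Self-contained sketch of the two ownership models behind the question;
    // the class names are invented and are not either camera vendor's API.
    class CallerOwnedFrame : IDisposable
    {
        // Model 1: the library transfers ownership, so the caller must Dispose.
        public void Dispose() => Console.WriteLine("caller released the buffer: ok");
    }

    class LibraryOwnedFrame : IDisposable
    {
        // Model 2: the library keeps ownership (e.g. a reused internal buffer);
        // an outside Dispose invalidates state the library still relies on.
        private bool _disposed;
        public void Dispose() => _disposed = true;

        public void Read()
        {
            if (_disposed)
                throw new ObjectDisposedException(nameof(LibraryOwnedFrame),
                    "the library still needed this; a native SDK may segfault instead");
            Console.WriteLine("read: ok");
        }
    }

    class OwnershipDemo
    {
        static void Main()
        {
            new CallerOwnedFrame().Dispose();      // fine: ownership was ours

            var shared = new LibraryOwnedFrame();
            shared.Dispose();                      // wrong: ownership was never ours
            try { shared.Read(); }                 // fails on a later call, like what I saw
            catch (ObjectDisposedException e) { Console.WriteLine("later call failed: " + e.Message); }
        }
    }

Which model applies is exactly the kind of fact that is stated once in the vendor's documentation and nowhere else on the public internet.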

I am sure that if I had used LLMs "properly" and "agentically", they would have triggered the tooling around them to build and execute the code, gotten the same results as me much faster, and then equally authoritatively and confidently stated that I don't need to call Dispose.

This is not thinking. It's a form of automation but not thinking and not intelligence.


Replies

TonyStr · today at 8:46 AM

Yes, I think you are spot on. I've been toying with Claude Code recently to counter my own bias against agentic coding. It will confidently create a broken project, run it, read the error messages, fix it, run it, read the error messages, and keep going until it runs. I used it to create a Firefox addon, which meant that it invoked me very frequently to validate its output. This was much more tedious than letting it work on problems that it could validate with the console. It also kinda sucks at googling and looking up documentation.

AI "reasoning" in it's current state is a hack meant to overcome the problem of contextual learning[0]. It somewhat works given enough time and good automatic tooling. When this problem is solved, I think we will see a significant boost in productivity from these tools. In it's current state, I'm not convinced that they are worth my time (and money).

[0] - https://hy.tencent.com/research/100025?langVersion=en

logicprog · today at 9:02 AM

> I am sure that if I had used LLMs "properly" and "agentically", they would have triggered the tooling around them to build and execute the code, gotten the same results as me much faster, and then equally authoritatively and confidently stated that I don't need to call Dispose.

Yes, usually my agents directly read the source code of libraries that don't have much good documentation or information in their training data, and/or create test programs as minimal viable examples and compile and run them themselves to see what happens; it's quite useful.

But you're right overall; LLMs placed inside agents are essentially providing a sort of highly steerable, plausible prior for a genetic algorithm that automatically solves problems and does automation tasks. It's not as brute force as a classic genetic algorithm, but it can't always one-shot things; there is sometimes an element of guess-and-check. In my experience, though, that element usually takes no more iterations than it would take me to figure something out (2-3 on average), although sometimes it needs more iterations than I would have on simple problems, and other times far fewer on harder ones, or vice versa.
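
As a rough sketch of that guess-and-check loop (every name here is a made-up stand-in, not a real agent framework):

    using System;
    using System.Collections.Generic;

    // Toy guess-and-check loop: the LLM proposes, the tooling verifies.
    // GenerateCandidate and CompileAndRun are stand-ins, not real APIs.
    class AgentLoopSketch
    {
        // Stand-in for the LLM: propose a candidate given the task and past errors.
        static string GenerateCandidate(string task, IReadOnlyList<string> errors) =>
            $"// attempt #{errors.Count + 1} for: {task}";

        // Stand-in for the tooling: build/run the candidate and report failures.
        // Here it arbitrarily "passes" the third attempt to show the feedback loop.
        static (bool ok, string error) CompileAndRun(string candidate) =>
            candidate.Contains("#3") ? (true, "") : (false, "build failed");

        static void Main()
        {
            var errors = new List<string>();
            string candidate = GenerateCandidate("grab one frame", errors);

            for (int attempt = 1; attempt <= 5; attempt++)    // bounded, not guaranteed to converge
            {
                var (ok, error) = CompileAndRun(candidate);
                if (ok)
                {
                    Console.WriteLine($"converged after {attempt} attempt(s)");
                    return;
                }
                errors.Add(error);                            // feed the failure back into the prior
                candidate = GenerateCandidate("grab one frame", errors);
            }
            Console.WriteLine("gave up");
        }
    }

The point is that a good prior makes attempt #1 land close enough that the loop rarely needs many iterations.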

Isamu · today at 2:31 AM

> brute force search of the solution space

“Brute force” is mostly what makes it all work, and it is what is most disappointing to me currently: the brute force necessary to train an LLM, the vast quantity of text necessary to approach almost-human quality, the massive scale of data centers necessary to deploy these models, etc.

I am hoping this is a transitional period, and that LLMs can be used to create better models that rely more on finesse and less on brute force.

ozozozd · today at 5:15 AM

What a sober and accurate observation of the real capabilities of LLMs.

And it’s nothing to sneeze at because it allows me to stay in the terminal rather than go back and forth between the terminal and Google.
