embedding-shape • yesterday at 12:42 PM • 1 reply

I bet most of these issues are essentially system prompt/harness issues.

If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, used a model that is strong at following system prompts precisely, and had access to some basic tools, then it'd spend maybe a few minutes more, but the answer would have been way more accurate.
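As a rough illustration of the "validate with multiple sources" idea, here is a minimal sketch of the gating step such a harness might apply before anything reaches the user. The `Claim` type, the `filter_validated` helper, the source labels, and the threshold of two sources are all hypothetical; a real harness would also need the model and tools that gather the sources in the first place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A statement plus the distinct sources found to back it (hypothetical type)."""
    text: str
    sources: frozenset


def filter_validated(claims, min_sources=2):
    """Keep only claims backed by at least `min_sources` distinct sources."""
    return [c for c in claims if len(c.sources) >= min_sources]


# Made-up example data: one corroborated claim, one single-source claim.
claims = [
    Claim("X was released in 2019", frozenset({"vendor-docs", "changelog"})),
    Claim("X supports feature Y", frozenset({"forum-post"})),
]

validated = filter_validated(claims)
for claim in validated:
    print(claim.text)
```

Only the claim with two distinct sources survives; the single-source claim is held back rather than presented as fact.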

But no, Google wants "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of a "collection of evidence from reliable/semi-reliable sources" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize speed and convenience.

Replies

techpression • yesterday at 2:52 PM

Errors multiply, though; you might just get more plausible-sounding errors rather than actual facts.

Like when agent 1 says X, agent 2 verifies it as Y, and the answer to the original question ends up being some weird amalgamation, Z, with additional "this is really true" statements sprinkled on top.
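A back-of-the-envelope way to see how errors multiply, under the simplifying assumption that each agent step is independently correct with the same probability (the 0.9 figure is made up for illustration, not a measurement):

```python
def chain_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent steps."""
    return step_accuracy ** steps


# Hypothetical per-step accuracy of 0.9 across chains of increasing length.
for steps in (1, 2, 5, 10):
    print(steps, round(chain_accuracy(0.9, steps), 3))
```

Under these assumptions, two chained steps are already down to about 0.81, and ten steps fall below 0.35. Real agents are not independent, so this only shows the direction of the effect, not its size.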

I agree Google responses hurt more than they help, but I’ve also gotten identical outcomes from 40-minute self-reasoning Opus threads (it’s less common, obviously).
