j2kun • yesterday at 11:55 PM

How would one set up this sort of test? I certainly have example domains where LLMs routinely do poorly (for example, custom Bazel rules and workspaces), but what would constitute a "showcase" here?

Replies

lijok • yesterday at 11:58 PM

To change my mind, I’d be satisfied with a thorough description of the domain and, ideally, a theory on why it does poorly there. But we’re not talking about LLMs in general here; we’re talking about Opus 4.5 specifically.
