How would one set this sort of test up? I certainly have example domains where LLMs routinely do poorly (for example, custom Bazel rules and workspaces), but what would constitute a "showcase" here?
To change my mind, I’ll be satisfied with a thorough description of the domain and, ideally, a theory on why it does poorly in that domain. But we’re not talking about LLMs in general here; we’re talking about Opus 4.5 specifically.