The only "black box" here is Anthropic. At least an LLM's performance and consistency can be established by statistical methods.