There are some days when it performs staggeringly badly, well below its usual baseline.
But it’s impossible to actually determine whether the cause is model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and so performs worse?), system prompt and tool changes, fine-tunes and A/B tests, variance in top-p sampling…
There are too many variables and no hard evidence shared by Anthropic.