We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case where this changes the output, but only after 200 tokens, so the difference is unlikely to be detected by many benchmarks.
Does anyone have experience with higher-precision matmul and whether it is worthwhile?
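For concreteness, a minimal sketch of the kind of change meant here: inputs and outputs stay in f32, but the softmax normalizer is accumulated in f64. This is illustrative only, not gemma.cpp's actual code, and the function name is made up.

```cpp
// Illustrative sketch -- not gemma.cpp's actual implementation.
#include <algorithm>
#include <cmath>
#include <vector>

void SoftmaxWithF64Sum(std::vector<float>& x) {
  // Standard max-subtraction for numerical stability.
  const float max_val = *std::max_element(x.begin(), x.end());

  double sum = 0.0;  // f64 accumulator: the change in question.
  for (float& v : x) {
    v = std::exp(v - max_val);       // exponentiate in f32 as before...
    sum += static_cast<double>(v);   // ...but accumulate the sum in f64
  }

  // Normalize; the division happens once, so its precision matters less
  // than the accumulated sum above.
  const float inv_sum = static_cast<float>(1.0 / sum);
  for (float& v : x) v *= inv_sum;
}
```

The point is that the only extra cost is one f32-to-f64 widening per term in the reduction, while the rest of the pipeline is unchanged.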
Isn’t 200 tokens basically nothing? Did you mean to say 2000?