The implementation absolutely can influence the outputs.
If you have a sloppy implementation that somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, so the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0 (that value isn't representable in fp16), and you won't get close to the best approximation if you do it in a naive loop.
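A minimal sketch of the experiment above, using NumPy's `float16` (assuming NumPy is available; the exact values follow from fp16's 10-bit mantissa):

```python
import numpy as np

ones = np.ones(9999, dtype=np.float16)

# Naive sequential accumulation in fp16: integers up to 2048 are exact,
# but at 2048 the fp16 spacing becomes 2, so 2048 + 1 rounds back to
# 2048 (round-to-nearest-even) and the running sum stalls there forever.
naive = np.float16(0.0)
for x in ones:
    naive = np.float16(naive + x)
print(naive)  # 2048.0

# Accumulating in fp32 (exact for integers up to 2^24) and rounding once
# at the end yields 10000.0 -- the closest fp16 value to 9999, since the
# fp16 spacing in [8192, 16384) is 8.
wide = np.float16(ones.sum(dtype=np.float32))
print(wide)  # 10000.0
```

The naive fp16 loop doesn't just lose a little precision; it silently caps the sum at 2048, off by almost 80%, while a wider accumulator lands on the best possible fp16 answer.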
TIL, thanks.
I thought all current implementations accumulate into an fp32 instead of accumulating in fp16.