Remember that models on different inference platforms might not necessarily give exactly the same results, adding another axis of non-determinism to development. Things like quantization, custom model serving silicon, batching, or other inference optimizations might mean a model from the original provider performs differently from the hosted one :/
This paper isn't the exact same scenario, since it's an auditable open weight llama model, but shows the symptoms of this: https://arxiv.org/pdf/2410.20247
Anyone who has used gpt-x via openai vs microsoft has experienced this very clearly.