Fwiw: I didn't read the post carefully, this is just a passing by comment.
For my own use case I was trying to test the consistency of an evaluation process and found that injecting a UUID into the system prompt (busting the cache) made a material difference.
Without it, resubmitting the same inputs at close time intervals (e.g. 1, 5, or 30 min) would produce very consistent evaluations. Adding the UUID decreased consistency (showing the true evaluation consistency, not artificially improved by caching) and highlighted ambiguous evaluation criteria that were causing problems.
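For what it's worth, the trick is just a one-liner. A minimal sketch (the function name and the "run-id" label are my own, not anything provider-specific):

```python
import uuid

def cache_busted_prompt(system_prompt: str) -> str:
    # Prepend a random UUID so two otherwise-identical requests never
    # share a common prefix, defeating provider-side prompt caching.
    return f"run-id: {uuid.uuid4()}\n\n{system_prompt}"
```

Since provider caches typically key on a shared prefix, putting the UUID at the very start of the system prompt is what guarantees a cache miss; appending it at the end may still let the common prefix hit the cache.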
So I wonder how much prompt caching is a factor here. I suspect these LLM providers (all of them) are caching at several layers beyond just tokenization.