From experience with large-scale clusters: yeah, weird stuff happens. But it's very hard to set up a test cluster that is actually representative, and you can only do so much on a live cluster. Occasionally I have been able to find explanations for some of the weird behavior, but usually it's something like: here's a bug in Linux packet forwarding that was fixed in Linus's tree 15 years ago, but the fix apparently never got deployed to some router. So that router is just going to keep aggregating incoming packets because of large receive offload, and then drop them with "needs frag" because the aggregated packet is too big to forward. Sigh. (That's not exactly a cluster-scale issue, but it's the most relatable example of an investigation that comes to mind.)
You're pretty unlikely to get academic papers when the required setup involves 100M+ geographically dispersed clients. And it's going to be very hard for peers to reproduce your findings.