Small, subtle errors that are only exposed on certain execution paths could be one cause. You might place work onto the GPU differently depending on batch size, say, if you've found one kernel to be faster when batch_size < 1024 but another when batch_size > 1024. As the number of concurrent incoming requests grows, the batch size goes up and the code path flips. That's just one possibility; I'd guess there could be a multitude of reasons, as it's really hard to reason about until you sit with the data in front of you. vLLM has had bugs of this sort too, so it wouldn't surprise me.
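To make the batch-size-dispatch point concrete, here's a minimal, purely illustrative PyTorch sketch (the 1024 threshold and the split-K variant are my assumptions, not vLLM's actual code): two mathematically equivalent matmuls with different accumulation orders produce slightly different floating-point results, and if dispatch flips at a batch-size cutoff, those differences only show up under load.

```python
import torch

# Hypothetical cutoff, for illustration only -- not vLLM's real logic.
BATCH_THRESHOLD = 1024

def matmul_plain(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Single matmul: one accumulation order over the reduction dim.
    return x @ w

def matmul_split_k(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Same math, different schedule: split the reduction dimension
    # into partial products and sum them. The changed accumulation
    # order typically differs from matmul_plain at the ULP level.
    xs = x.chunk(4, dim=1)
    ws = w.chunk(4, dim=0)
    out = xs[0] @ ws[0]
    for xi, wi in zip(xs[1:], ws[1:]):
        out = out + xi @ wi
    return out

def forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Dispatch on batch size: the "large batch" path only runs under
    # high concurrency, so any drift in it hides until traffic spikes.
    if x.shape[0] <= BATCH_THRESHOLD:
        return matmul_plain(x, w)
    return matmul_split_k(x, w)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(512, 512)
    x = torch.randn(2048, 512)
    # Same inputs, same math, different kernels -- usually nonzero:
    diff = (matmul_plain(x, w) - matmul_split_k(x, w)).abs().max()
    print(f"max abs difference between paths: {diff}")
```

Run it and the printed difference is typically small but nonzero, which is exactly the kind of discrepancy that's invisible in light testing and only appears once request volume pushes batches past the cutoff.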