(Not the author but I work in real-time voice.) An open WebSocket doesn't really translate to actual GPU load, since connections spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket, assuming your GPU infra is sufficiently decoupled from your user-facing API code.
That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
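To make the batching part concrete, here's a minimal micro-batching sketch: a worker collects incoming requests for a short window (or until a size cap), then runs them through the model as one batched forward pass. Everything here is illustrative, not a real serving API; `run_model_batch` is a placeholder for the actual GPU call, and the queue/timeout values are arbitrary.

```python
import queue
import threading
import time

MAX_BATCH = 8        # cap on batch size
MAX_WAIT_S = 0.01    # max time (10 ms) to spend filling a batch

request_q = queue.Queue()

def run_model_batch(prompts):
    # Stand-in for a single batched GPU forward pass over all prompts.
    return [p.upper() for p in prompts]

def batcher(stop):
    while not stop.is_set():
        # Block briefly for the first request of a batch.
        try:
            batch = [request_q.get(timeout=0.05)]
        except queue.Empty:
            continue
        # Drain more requests until the batch is full or the window closes.
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        # One GPU call for the whole batch, then fan results back out.
        outputs = run_model_batch([p for p, _ in batch])
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)

# Usage: callers enqueue (prompt, reply_queue) and block on the reply.
stop = threading.Event()
threading.Thread(target=batcher, args=(stop,), daemon=True).start()

reply = queue.Queue()
request_q.put(("hello", reply))
result = reply.get(timeout=1)
stop.set()
```

The point is just that many idle-ish connections can share one device: the per-request latency cost is bounded by the batching window, while GPU utilization goes up with batch size.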
> that you can use to maximize throughput
While sometimes degrading the experience, a little or a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) This mostly matters for real-time workloads rather than batched/async stuff, of course.