If you're looking for low-latency, real-time transcription (or even voice-to-voice interaction) that handles German well, check out eboo.ai.
It's designed specifically for high-stakes conversational use cases like sysadmin instructions or IVR. It streams audio over WebRTC to keep latency low, so you can watch the transcript appear in real time as you speak.
The "speaking over each other" part is essentially a "barge-in" (full-duplex) problem, which eboo.ai handles quite gracefully. The system keeps listening and processing while it (or another person) is talking, so those ambiguities get resolved early, without the awkward "stop-start" of traditional recorders.
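To make the barge-in idea concrete, here is a minimal toy sketch in Python. It is not eboo.ai's API or implementation; the class name, the `energy_threshold` value, and the energy-based speech check are all assumptions made up for illustration. The only point it shows is that mic frames keep being processed while the agent is speaking, and that detected user speech interrupts the agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BargeInSession:
    """Toy full-duplex turn manager (hypothetical, for illustration only)."""

    energy_threshold: float = 0.1  # assumed voice-activity threshold
    agent_speaking: bool = False
    events: List[str] = field(default_factory=list)

    def start_agent_speech(self) -> None:
        # The agent begins playback; the mic stays open the whole time.
        self.agent_speaking = True
        self.events.append("agent_start")

    def on_mic_frame(self, energy: float) -> None:
        # Called for every incoming mic frame, even during agent playback.
        if energy < self.energy_threshold:
            return  # treat the frame as silence
        if self.agent_speaking:
            # User talked over the agent: stop playback (barge-in).
            self.agent_speaking = False
            self.events.append("barge_in")
        self.events.append("user_speech")


session = BargeInSession()
session.start_agent_speech()
for frame_energy in [0.02, 0.03, 0.5, 0.6]:
    session.on_mic_frame(frame_energy)
print(session.events)
```

A real system would use a proper voice-activity detector and echo cancellation rather than a raw energy threshold, but the control flow is the same: never stop listening, and let user speech pre-empt playback.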
Definitely worth a look if you want to move beyond the manual .WAV -> Gemini workflow.