I've been experimenting with more or less this on the existing ESP32-S3 (though streaming to a smartphone/PC rather than to a second ESP32).
Practical bandwidth limits are in the ~72kb/s range with Bluetooth and a custom wire protocol, and Opus voice-mode encoding can't run in real time beyond complexity 3; music-mode encoding can't run in real time at all. Maybe there's a more compute-friendly audio codec I'm not aware of, but as far as I know these chips just aren't quite powerful enough for high-quality music encoding, unfortunately. I'm hoping the S31 might be a somewhat better fit here (decent CPU boost plus better SIMD).
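For a rough sense of what that bandwidth figure means per packet, here's a back-of-envelope sketch. The assumptions are mine for illustration, not measurements: I'm reading "~72kb/s" as kilobits per second, using 20 ms audio frames, and charging a hypothetical 4-byte header per frame for the custom wire protocol. The function names are made up for this example.

```python
def frame_budget_bytes(link_bps: int, frame_ms: int, header_bytes: int = 0) -> int:
    """Largest codec payload (in bytes) that fits in one frame interval."""
    if link_bps <= 0 or frame_ms <= 0 or header_bytes < 0:
        raise ValueError("link_bps and frame_ms must be positive, header_bytes non-negative")
    # bits/s * ms / 1000 gives bits per frame; / 8 gives bytes per frame.
    total_bytes = link_bps * frame_ms // 8000
    return max(total_bytes - header_bytes, 0)


def payload_bitrate_bps(payload_bytes: int, frame_ms: int) -> int:
    """Codec bitrate (bits/s) implied by a per-frame payload size."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    return payload_bytes * 8000 // frame_ms


LINK_BPS = 72_000   # assumption: "~72kb/s" interpreted as kilobits per second
FRAME_MS = 20       # assumption: 20 ms audio frames
HEADER_BYTES = 4    # hypothetical per-frame header for the custom protocol

payload = frame_budget_bytes(LINK_BPS, FRAME_MS, HEADER_BYTES)
print(payload, payload_bitrate_bps(payload, FRAME_MS))  # 176 70400
```

Under those assumptions each frame has about 176 bytes for the encoder, i.e. a target bitrate of roughly 70 kbit/s. If the figure is actually kilobytes per second, multiply everything by eight.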
Latency is still a bit rough with BT overhead. There might be some new options with LE Audio on the S31, but I haven't found a way to get below ~80ms with the existing ESP32-S3 stack.
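To put the ~80ms in context, here's a small sketch that subtracts the codec's own algorithmic delay from a total latency figure to see what is left for the Bluetooth stack, buffering, and playback. It assumes the ~80ms is end-to-end and includes the codec's delay, which is my assumption for illustration. The 6.5 ms lookahead is the commonly cited Opus default outside restricted-low-delay mode; treat it as an input, not a measured value. The function name is made up.

```python
def remaining_latency_ms(total_ms: float, frame_ms: float, lookahead_ms: float) -> float:
    """Latency left over after the codec's frame size plus lookahead."""
    codec_ms = frame_ms + lookahead_ms
    if codec_ms > total_ms:
        raise ValueError("codec delay alone exceeds the total latency figure")
    return total_ms - codec_ms


TOTAL_MS = 80.0      # the ~80ms floor mentioned above (assumed end-to-end)
LOOKAHEAD_MS = 6.5   # assumption: default Opus encoder lookahead

for frame_ms in (10.0, 20.0):
    # Shorter frames leave more headroom but cost more packets per second.
    print(frame_ms, remaining_latency_ms(TOTAL_MS, frame_ms, LOOKAHEAD_MS))
```

With 20 ms frames that leaves about 53.5 ms outside the codec, so under these assumptions most of the latency is transport and buffering rather than encoding.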
tl;dr: high-quality voice is doable today with okay latency, music probably less so; maybe the S31 will be better.