Applications don't get the full 4GB of a 32-bit address space. The practical split between application and kernel was usually 1-3 or 2-2, with 3-1 being experimental and rendered moot by the switch to 64-bit. Nowadays, with VRAM almost as large as main RAM, you need the larger address space just to map a useful chunk of it.
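To put numbers on those splits, here is a minimal sketch of the arithmetic. It assumes the "user-kernel" reading of the notation (so 3-1 means 3GB for the application and 1GB for the kernel); the `split` helper is made up for illustration.

```python
GIB = 1 << 30
ADDRESS_SPACE = 1 << 32  # a 32-bit virtual address space covers 4 GiB


def split(user_gib, kernel_gib):
    """Return (user_bytes, kernel_bytes) for a user-kernel split given in GiB."""
    if (user_gib + kernel_gib) * GIB != ADDRESS_SPACE:
        raise ValueError("split must add up to the 4 GiB address space")
    return user_gib * GIB, kernel_gib * GIB


# Assumption: the first number is the application's share, the second the kernel's.
SPLITS = {"1-3": (1, 3), "2-2": (2, 2), "3-1": (3, 1)}

for name, (user_gib, kernel_gib) in SPLITS.items():
    user_bytes, _ = split(user_gib, kernel_gib)
    share = user_bytes / ADDRESS_SPACE
    print(f"{name}: user space ends at {user_bytes:#010x} ({share:.0%} of 4 GiB)")
```

Even the most generous split leaves a quarter of the address space unavailable to the application before fragmentation is considered.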
When you factor in memory fragmentation, you really only had a solid 0.75-1.5GB of space that could be kept continuously in use. That was starting to become a problem even when 32-bit was the only practical option. A lot of games saw a benefit from just having the larger address space, to the point that they ran better as 64-bit builds with only 4GB of RAM, despite the fatter 64-bit pointers.
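A toy illustration of why fragmentation bites: the total free address space can look healthy while the largest single hole is much smaller. The layout below is entirely invented (a hypothetical 2 GiB user space with a few scattered mappings), and `free_gaps` is an illustrative helper, not any real allocator's logic.

```python
GIB = 1 << 30
MIB = 1 << 20


def free_gaps(mapped, limit):
    """Yield (start, size) for each unmapped range below `limit`.

    `mapped` is a list of (start, size) regions, assumed non-overlapping.
    """
    cursor = 0
    for start, size in sorted(mapped):
        if start > cursor:
            yield cursor, start - cursor
        cursor = max(cursor, start + size)
    if cursor < limit:
        yield cursor, limit - cursor


# Hypothetical layout of a 2 GiB user address space (all numbers invented).
USER_LIMIT = 2 * GIB
mapped = [
    (0, 64 * MIB),                        # executable and static data
    (512 * MIB, 128 * MIB),               # heap arena
    (1 * GIB, 32 * MIB),                  # a library loaded mid-space
    (USER_LIMIT - 256 * MIB, 256 * MIB),  # stacks and other mappings
]

gaps = list(free_gaps(mapped, USER_LIMIT))
total_free = sum(size for _, size in gaps)
largest_gap = max(size for _, size in gaps)
print(f"free: {total_free // MIB} MiB, largest contiguous: {largest_gap // MIB} MiB")
# prints "free: 1568 MiB, largest contiguous: 736 MiB"
```

In this made-up layout about 1.5 GiB is nominally free, but no single allocation larger than 736 MiB can succeed.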
I believe that's an accident of the evolutionary path chosen with syscalls. If we'd instead gone with a ring buffer approach to making requests, you'd never need to partition the memory address space; the kernel has its own memory, userspace has its own, and you avoid the situation where the kernel is always mapped.
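For anyone who hasn't seen the ring buffer idea, here is a rough sketch of its shape: one ring carries requests to the kernel and another carries results back. This is a single-process toy under stated assumptions; `Ring`, `kernel_drain`, and the request tuple format are all hypothetical, and a real design would also need shared memory, memory barriers, and some way to wake the other side.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring buffer."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # index of the next entry to consume
        self.tail = 0  # index of the next free slot to fill

    def push(self, item):
        """Append an item; return False if the ring is full."""
        if self.tail - self.head == len(self.slots):
            return False
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None if the ring is empty."""
        if self.head == self.tail:
            return None
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item


def kernel_drain(submissions, completions, handlers):
    """Toy 'kernel' side: consume queued requests and post their results."""
    while (request := submissions.pop()) is not None:
        request_id, op, args = request
        completions.push((request_id, handlers[op](*args)))


# Toy 'userspace' side: queue two requests, then let the 'kernel' run.
sq, cq = Ring(8), Ring(8)
sq.push((1, "add", (2, 3)))
sq.push((2, "upper", ("hi",)))
kernel_drain(sq, cq, {"add": lambda a, b: a + b, "upper": str.upper})
results = [cq.pop(), cq.pop()]
print(results)
```

The point of the sketch is that the two sides only ever touch the rings, so neither needs the other's memory mapped into its own address space.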
It depends on the kernel architecture. 4G/4G kernels weren't the most common thing, but they weren't exactly rare in the grand scheme of things either. PowerPC macOS (and x86 macOS before Apple officially released Intel-based Mac hardware) was 4G/4G, for example. The way that works on x86 is that you reserve a couple of kernel pages mapped into both address spaces to do the page table swap on interrupts and syscalls. It's a little expensive, but less than you'd think, and not having the kernel and user space fight for virtual address space provided its own efficiencies that partially made up the difference. We've been moving back to that anyway with Kernel Page Table Isolation (KPTI) for Meltdown mitigation.
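A very loose model of that trampoline arrangement, in case it helps: two separate page tables that share only one entry page, with the swap happening on syscall entry and exit. Everything here (`AddressSpace`, `Cpu`, the `TRAMPOLINE` address) is invented for illustration; on real x86 hardware the swap is a CR3 reload done by code running on the shared page.

```python
TRAMPOLINE = 0xFFFFF000  # hypothetical address of the shared entry page


class AddressSpace:
    """A toy page table: virtual page address -> description of what is mapped."""

    def __init__(self, pages):
        self.pages = dict(pages)
        self.pages[TRAMPOLINE] = "trampoline"  # the one page mapped in both spaces


class Cpu:
    def __init__(self, user, kernel):
        self.user, self.kernel = user, kernel
        self.active = user  # start out running in user mode

    def read(self, page):
        """Look up a page through whichever page table is currently active."""
        return self.active.pages[page]

    def syscall(self, handler):
        # The entry code must be reachable before and after the swap.
        assert TRAMPOLINE in self.user.pages and TRAMPOLINE in self.kernel.pages
        self.active = self.kernel  # page table swap on entry
        try:
            return handler(self)
        finally:
            self.active = self.user  # swap back on return to user mode


# The same virtual address means different things on each side of the swap.
user = AddressSpace({0x1000: "user code"})
kernel = AddressSpace({0x1000: "kernel heap"})
cpu = Cpu(user, kernel)
seen_in_kernel = cpu.syscall(lambda c: c.read(0x1000))
seen_in_user = cpu.read(0x1000)
```

Because each side has the whole range to itself apart from the trampoline, neither has to give up virtual addresses to the other.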
And 3-1 wasn't really experimental. It was essentially always that way under Linux, and had been supported under Windows since the late 90s.