A lot of the time, once you get into multi-gigabit territory, the answer isn't "make the kernel faster," it's "stop doing it in the kernel."
You end up pushing the hot path out to userland where you can actually scale across cores (DPDK/netmap/XDP-style approaches), batch packets, and DMA straight to and from the NIC. The kernel becomes more of a control plane than a data plane.
PF/ALTQ is very much in the traditional in-kernel, per-packet model, so it hits those limits sooner.
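The "batch packets" point is the core of the argument: the win comes from amortizing a fixed per-call cost (syscall, doorbell write, lock) over a burst instead of paying it per packet. A toy cost-model sketch, not any real framework's API (the constants and function names are made up for illustration):

```python
# Toy sketch of the run-to-completion batch model that DPDK/netmap-style
# frameworks use: instead of paying a fixed cost (syscall, NIC doorbell,
# lock) once per packet, you pull a burst off the rx ring and pay it once
# per burst. Costs are arbitrary units, purely illustrative.

FIXED_COST = 100   # per-call overhead (kernel crossing, doorbell, lock)
PER_PKT_COST = 1   # actual per-packet processing work

def cost_per_packet_model(n_packets: int) -> int:
    # Traditional model: one kernel crossing per packet.
    return n_packets * (FIXED_COST + PER_PKT_COST)

def cost_batched_model(n_packets: int, burst: int = 32) -> int:
    # Batched model: one crossing per burst of up to `burst` packets.
    total = 0
    remaining = n_packets
    while remaining > 0:
        take = min(burst, remaining)
        total += FIXED_COST + take * PER_PKT_COST
        remaining -= take
    return total

if __name__ == "__main__":
    n = 10_000
    print(cost_per_packet_model(n))   # fixed cost paid 10,000 times
    print(cost_batched_model(n))      # fixed cost paid ~313 times
```

With these (made-up) numbers the batched loop is ~25x cheaper, which is the shape of the improvement, not a real benchmark; the actual ratio depends entirely on how expensive your per-packet fixed costs really are.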
> pushing the hot path out to userland where you can actually scale across cores
What sort of kernel do you have which can't scale across cores?
The big things to avoid are crossing the user/kernel divide and communication across cores.
Staying in the kernel is approximately as fast as bypassing it (caveats apply); for a packet filtering / smoothing use case, I don't think kernel bypass is needed. You probably want to tune NIC hashing (RSS) so that inbound traffic for a given shaping queue arrives on the same NIC rx queue, but you'd want that in a kernel-bypass setup as well. Userspace is certainly nicer during development, since it's easier to push changes, but in 2026 traffic shaping has pretty static requirements, and letting the kernel do all the work feels reasonable to me.
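On the NIC hashing point: RSS-capable NICs commonly pick the rx queue with a Toeplitz hash over the flow tuple, so every packet of a flow lands on the same queue (and hence, with pinned interrupts, the same core). A minimal sketch of that flow-to-queue mapping; the key below is the widely published default from Microsoft's RSS documentation, and the IPv4 4-tuple input layout is one common variant (real NICs vary in key, tuple selection, and indirection-table details):

```python
# Sketch of the Toeplitz hash many RSS NICs use to spread flows across
# rx queues. The property that matters for shaping: same 4-tuple ->
# same hash -> same queue, so one flow stays on one core.
import socket
import struct

# Widely published default 40-byte RSS key (Microsoft RSS docs).
RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(key: bytes, data: bytes) -> int:
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    data_int = int.from_bytes(data, "big")
    data_bits = len(data) * 8
    h = 0
    for i in range(data_bits):
        if (data_int >> (data_bits - 1 - i)) & 1:
            # XOR in the 32-bit window of the key starting at bit i.
            h ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return h

def rx_queue(src_ip: str, dst_ip: str, src_port: int,
             dst_port: int, n_queues: int = 8) -> int:
    # IPv4 4-tuple input: src addr, dst addr, src port, dst port.
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", src_port, dst_port))
    return toeplitz_hash(RSS_KEY, data) % n_queues
```

Note the mapping is stable but not symmetric with this key: the reply direction of a flow (src/dst swapped) can land on a different queue, which is why people use symmetric keys when both directions must hit the same core.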
OTOH, OpenBSD is pretty far behind the curve on SMP and all that (I think their PF now has support for SMP, but maybe it's still in development? I'd bet there's lots of room to reduce cross-core communication as well, but I haven't examined it). You can't pin userspace processes to CPUs, I doubt their kernel data structures are built to reduce cross-core communication, etc. Kernel bypass won't help as much as you'd hope, if it's even available, which it might not be, because you can't control where the userspace side runs to limit cross-core communication.