Note that you could also run them on AMD (and presumably Intel) dGPUs. e.g. I have a 32GB R9700, which is much cheaper than a 5090, and runs 27B dense models at ~20 t/s (or MoE models with 3-4B active at ~80t/s). I expect an Arc B70 would also work soon if it doesn't already, and would likely be the price/perf sweet spot right now.
My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now.
[0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume software bug where it's just not obeying the fan curve
I have 2x asrock R9700. One of the them was noticeably noisier than the other and eventually developed an annoying vibration while in the middle of its fan curve. Asrock replaced it under RMA.
Yup, I suppose that these smaller, dense models are in the lead wrt. fast inference with consumer dGPUs (or iGPUs depending on total RAM) with just enough VRAM to contain the full model and context. That won't give you anywhere near SOTA results compared to larger MoE models with a similar amount of active parameters, but it will be quite fast.