The fine-tuning overhead is definitely a factor, but for smaller shops the hard constraint is usually inference VRAM. Running a 32B model locally or on a rented GPU is surprisingly expensive if you aren't saturating it. Even at 4-bit quantization the weights alone are roughly 16-18GB, and once you add KV cache for useful context lengths and any batching you are looking at dual 3090s or an A6000 to get decent tokens per second. The $400 training cost is impressive, but the hosting bill is what actually kills the margin compared to per-token APIs.
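To put rough numbers on it, here's a back-of-envelope sketch. The architecture constants are assumptions (a Qwen-32B-style config: 64 layers, GQA with 8 KV heads, head dim 128) and the helper name and overhead figure are mine; actual usage varies with the runtime and quant format, but the shape of the problem is the same.

```python
# Rough back-of-envelope VRAM estimate for a dense 32B model at ~4-bit.
# Assumed config (Qwen-32B-ish): 64 layers, 8 KV heads (GQA), head dim 128.

def vram_estimate_gb(params_b=32, bits_per_weight=4.5,   # ~4-bit quant incl. scales/zero-points
                     layers=64, kv_heads=8, head_dim=128,
                     context_len=8192, batch=1, kv_bytes=2):  # fp16 KV cache
    # Quantized weights: params * bits / 8
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * batch * bytes
    kv_gb = 2 * layers * kv_heads * head_dim * context_len * batch * kv_bytes / 1e9
    overhead_gb = 1.5  # activations, CUDA context, fragmentation -- rough guess
    return weights_gb, kv_gb, weights_gb + kv_gb + overhead_gb

w, kv, total = vram_estimate_gb()
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{total:.1f} GB")
# ~18 GB of weights plus ~2 GB of KV at 8k context already lands around 22 GB,
# which crowds a single 24 GB 3090; longer context or any batching for throughput
# pushes you to ~48 GB, i.e. 2x3090 or an A6000.
```

And that's for batch size 1; the moment you want concurrent requests the KV cache term scales linearly, which is exactly the "not saturating it" problem.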