So you are suggesting building a full-featured package, which is a nontrivial effort compared to this bit of fun?
Vision models do a pretty decent job with spatial reasoning. It’s not there yet, but you’re dismissing some interesting work going on.