I've been having pretty good success with Unity as a 3D LLM tool. In addition to the isometric views, I've included a perspective mode that can focus on a list of GameObject IDs with a custom camera origin. The agent is required to send instructions along with each VLM request to condition how the view is interpreted, e.g. "How does ambient occlusion look in A vs B?".
The VLM is invoked as a nested operation within the tool call, not as part of the same user-level context. This allows analyzing a very large number of images without blowing the agent's token budget.
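A minimal sketch of that nesting pattern, with hypothetical `render_view` and `call_vlm` stand-ins for the Unity screenshot capture and the vision-model API (names are mine, not from any real library): the image bytes stay inside the tool handler, and only the short text verdict is returned to the agent's context.

```python
def render_view(object_ids, camera_origin):
    """Stand-in for a Unity-side screenshot of the given GameObjects (hypothetical)."""
    return b"\x89PNG..."  # raw image bytes, never shown to the agent

def call_vlm(image, instructions):
    """Stand-in for a vision-model API call (hypothetical)."""
    return f"VLM analysis ({len(image)} bytes) per: {instructions}"

def inspect_scene_tool(object_ids, camera_origin, instructions):
    """Tool handler: the VLM call is nested here, outside the agent's context."""
    image = render_view(object_ids, camera_origin)
    verdict = call_vlm(image, instructions)  # image consumed locally
    return verdict                           # only text re-enters the agent loop
```

The design choice is that the tool's return value is the only thing that lands in the agent's transcript, so each screenshot costs a sentence of tokens rather than an image's worth.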
I've observed that GPT5.4 can iteratively position the perspective camera and stop once it reaches subjectively interesting arrangements. I don't know how to quantify this, but it does seem to have some sense of world space.
I think much of it comes down to conditioning the vision model to "see" correctly, and a willingness to iterate many times.