Hacker News

bob1029 · today at 7:56 AM

I've been having pretty good success with Unity as a 3D LLM tool. In addition to the isometric views, I've included a perspective mode that can focus on a list of game object IDs with a custom camera origin. The agent is required to send instructions along with each VLM request in order to condition how the view is interpreted, e.g. "How does ambient occlusion look in A vs B?".
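A minimal sketch of what such a view request could look like, in Python rather than Unity's C#. Everything here (`PerspectiveViewRequest`, `build_request`, the field names) is a hypothetical illustration of the described contract, not the author's actual tool schema; the key point is that the conditioning instructions are mandatory.

```python
from dataclasses import dataclass

@dataclass
class PerspectiveViewRequest:
    object_ids: list[str]                       # game object IDs the camera should frame
    camera_origin: tuple[float, float, float]   # custom camera position in world space
    instructions: str                           # required conditioning text for the VLM

def build_request(object_ids, camera_origin, instructions):
    # Reject requests without instructions: the VLM always needs to be told
    # how to interpret the view it is about to see.
    if not instructions.strip():
        raise ValueError("VLM instructions are required with every view request")
    return PerspectiveViewRequest(object_ids, camera_origin, instructions)

req = build_request(["A", "B"], (0.0, 2.0, -5.0),
                    "How does ambient occlusion look in A vs B?")
```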

The VLM is invoked as a nested operation within a tool call, not as part of the same user-level context. This makes it possible to analyze a very large number of images without blowing the token budget.
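The nesting pattern can be sketched roughly as follows. `render_scene` and `call_vlm` are stand-ins for the Unity render step and the vision-model client, stubbed here for illustration; the essential property is that image bytes stay inside the tool handler and only a short text summary reaches the outer agent.

```python
# Hypothetical stand-ins: render_scene would capture a Unity camera render,
# call_vlm would invoke the vision model. Stubbed for illustration only.
def render_scene(object_ids, camera_origin) -> bytes:
    return b"\x89PNG..."  # placeholder image bytes

def call_vlm(image: bytes, prompt: str) -> str:
    return f"analysis of {len(image)}-byte render: {prompt}"

def view_tool(object_ids, camera_origin, instructions) -> str:
    # The VLM call is nested inside the tool: the outer agent receives only
    # this text summary, never the image, so per-view token cost stays flat
    # no matter how many renders are analyzed.
    png = render_scene(object_ids, camera_origin)
    return call_vlm(image=png, prompt=instructions)

result = view_tool(["A", "B"], (0.0, 2.0, -5.0),
                   "Compare ambient occlusion in A vs B")
```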

I've observed that GPT5.4 can iteratively position the perspective camera and stop once it reaches subjectively interesting arrangements. I don't know how to quantify this, but it does seem to have some sense of world space.
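One way to picture the iterate-and-stop behavior is a simple hill-climbing loop: propose small camera moves, score the resulting framing, and stop when no move improves it. This is purely a hypothetical sketch; `score_view` stands in for whatever judgment the model applies, and the stopping rule is my own assumption, not the author's mechanism.

```python
def score_view(origin):
    # Stand-in scoring function: pretend views closer to (0, 2, -5)
    # frame the objects of interest better.
    target = (0.0, 2.0, -5.0)
    return -sum((a - b) ** 2 for a, b in zip(origin, target))

def position_camera(start, step=1.0, max_iters=25):
    origin, best = start, score_view(start)
    for _ in range(max_iters):
        # Try a unit move along each axis in each direction; keep the best.
        candidates = [tuple(o + d * step if i == axis else o
                            for i, o in enumerate(origin))
                      for axis in range(3) for d in (-1, 1)]
        move = max(candidates, key=score_view)
        if score_view(move) <= best:
            break  # no move improves the framing: stop here
        origin, best = move, score_view(move)
    return origin

final = position_camera((5.0, 5.0, 5.0))
```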

I think much of it comes down to conditioning the vision model to "see" correctly, and a willingness to iterate many times.