Last night I tried out Opus 4.6 on a personal project involving animating Gaussian Splats, with the final result output as a video.
In the past, AI coding agents could usually reason about the code well enough to have a good chance of success, but I’d have to test manually: they were bad at “seeing” the output and characterizing it in a way that let them debug when things went wrong, and they would never check visual outputs unless I forced them to (probably because that didn’t work well during RL training).
Opus 4.6 correctly reasoned (on its own, I didn’t even think to prompt this) that it could “test” the output by grabbing the first, middle, and last frames, on the expectation that the first frame should be empty, the middle frame about half full of detail, and the final frame should resemble the input image. That alone wouldn’t have impressed me that much, but it then found and fixed a bug based on visual observation of a blurry final frame (we hadn’t run the NeRF training for enough iterations).
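(For the curious, the frame-grab check amounts to something like the rough Python sketch below. The filename, the use of OpenCV, and the pixel-variance heuristic are my own placeholders, not the agent's actual script; the real check was the model looking at the exported frames itself.)

    import cv2

    # Grab the first, middle, and last frames of the rendered video and dump
    # them for inspection. "render.mp4" and the stddev heuristic are
    # placeholders for illustration only.
    cap = cv2.VideoCapture("render.mp4")
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    for label, idx in [("first", 0), ("middle", n // 2), ("last", n - 1)]:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            print(f"{label}: failed to read frame {idx}")
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Expectation: near-zero detail in the first frame, partial detail in
        # the middle, full detail in the last. Pixel stddev is a crude proxy
        # for "how much is filled in"; actually judging the images is the
        # part the model now does on its own.
        print(f"{label} frame {idx}: pixel stddev = {gray.std():.1f}")
        cv2.imwrite(f"frame_{label}.png", frame)

    cap.release()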
In a sense this is an incremental improvement in the model’s capabilities, but in terms of what I can now use the model for, it’s huge. Previous models struggled to tokenize/interpret images beyond describing their contents in semantic terms, so they couldn’t iterate on visual feedback when the output was abstract or broken in an unusual way. Now that they can, I can set them on tasks like this unaided and have a reasonable probability that they’ll troubleshoot their own issues.
I understand your exhaustion at all the breathless enthusiasm, but every new model radically changes the game for another subset of users and tasks. You’re going to keep hearing that counterargument for a long time, and the worst part is that it’s going to be true even if it’s annoying.
Not surprising, since being able to "see" images effectively is key to unblocking the use of LLMs in web and app frontend work.