But nothing prevents LLMs from being RLed to do this, right?
But does training LLMs to be better at this improve their world model, or does it only make changes at the surface?
Estimate the calorie count of this door handle: https://m.youtube.com/watch?v=VDSzY52Mkrw&pp=0gcJCVACo7VqN5t...
Extreme example perhaps, but no, you can't just turn pixels into calories. Right now I'd be impressed if we could reliably estimate volume to within 30% from a photo, but even with that correct the contents of the food can easily be way off without visible sign.
Okay, so take the sandwich. There is no way to know what is in it by looking at it. No amount of optimisation will fix this.
I'm sure one could produce a CV model that was a lot better at guessing here than these LLMs are, but fundamentally it is still guessing.
Yes, something prevents LLMs from being RLed to do this: you can't see through something opaque to determine whether there's something high-calorie or low-calorie out of sight.
The problem itself is unsolvable given the data provided.
You could conceivably make it better at guessing, but its answers will inherently always be guesses that will sometimes be wildly off.
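To make the "unsolvable given the data" point concrete, here's a toy sketch (calorie numbers and the mayo/mustard split are made up for illustration). If two sandwiches are pixel-identical but differ in a hidden ingredient, the best possible estimator can only output the conditional mean, and its average error never goes to zero no matter how much you optimise:

```python
import random

random.seed(0)

# Toy model: two sandwiches look identical in a photo, but the hidden
# spread differs. Hypothetical calories: mustard ~520 kcal, mayo ~820 kcal.
hidden = [random.choice([520, 820]) for _ in range(10_000)]

# Since the photo carries zero information about the spread, the best
# possible (Bayes-optimal) guess is just the mean over both cases.
best_guess = sum(hidden) / len(hidden)

# Mean absolute error of that *best possible* estimator.
mae = sum(abs(c - best_guess) for c in hidden) / len(hidden)

print(round(best_guess), round(mae))  # the error floor is ~150 kcal
```

No amount of RL moves that ~150 kcal floor; better training only gets the model closer to the conditional mean, not past it.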