Hacker News

gwern | 11/08/2024 | 0 replies | view on HN

One of the things about offline imitation learning like OP's (or LLMs in general) is that the more important an error in the world model is, the faster it gets corrected. If you think you can teleport across a river, you'll make & execute plans which exploit that 'fact' right away to save a lot of time - and then immediately hit the large errors in that plan and observe a new trajectory which refutes an entire set of errors in your world model. Then you retrain, and the world model is that much more accurate. The new world model still contains errors, and you may try to exploit those right away too, and then you'll fix those too. So the errors get corrected once you're able to execute online with on-policy actions. The errors which never turn out to be relevant won't get fixed quickly - but then, why do you care?
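To make that loop concrete, here's a minimal toy sketch of the "plan under a wrong model, execute online, hit the error, retrain" cycle - all names (TRUE_STEP, the tabular model, plan_length_to_goal) are hypothetical illustrations I made up, not anything from OP:

```python
# Toy sketch: an agent plans with a learned transition model that wrongly
# believes it can cross a "river" cell, executes the plan online, observes
# the refuting transition, and retrains so that particular error vanishes.

# States 0..4 on a line; a "river" blocks the move from state 2 to 3.
# True dynamics: stepping right from 2 leaves you stuck at 2.
TRUE_STEP = {s: (s if s == 2 else min(s + 1, 4)) for s in range(5)}

# Offline-learned model: the offline data never covered state 2,
# so the model assumes you can simply walk (teleport) across.
model = {s: min(s + 1, 4) for s in range(5)}

def plan_length_to_goal(model, start=0, goal=4, max_steps=20):
    """Greedy rollout under the model; predicted number of steps to the goal."""
    s, steps = start, 0
    while s != goal and steps < max_steps:
        s = model[s]
        steps += 1
    return steps

for attempt in range(2):
    print(f"attempt {attempt}: model predicts {plan_length_to_goal(model)} steps to goal")
    # Execute the plan online and log the real, on-policy transitions.
    s, trajectory = 0, []
    for _ in range(20):
        s_next = TRUE_STEP[s]          # reality, not the model
        trajectory.append((s, s_next))
        if s_next == s:                # stuck at the river: the plan is refuted
            break
        s = s_next
    # "Retrain": overwrite the model wherever observation disagrees with it.
    for s_old, s_new in trajectory:
        model[s_old] = s_new
```

On the first attempt the model confidently predicts a 4-step path; executing it immediately exposes the river, the refuting transition gets folded back in, and on the second attempt the model no longer claims the shortcut exists. Errors the plan never touches (say, transitions from states the agent never visits) stay wrong - and stay irrelevant.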