jedberg • yesterday at 8:40 PM

WOPR used reinforcement learning, and could learn from its simulated mistakes. LLMs can't do that without some sort of RL harness. :)
