Eh... kinda. The RL in RLHF is a very different animal from the RL in a Waymo training pipeline. That's sort of obvious when you notice that the former can be done by anyone with some clusters and some talent, while the latter is so hard that even Waymo has a marked preference for operating in July in Chandler, AZ. Everyone else is busy explaining why they never really wanted Level 5 per se anyway: all brakes, no gas, if you will.
The Q-values (expected return sums) approximated by deep networks are famously unstable and ill-behaved under gradient descent in the general case, and it's not at all obvious that "point RL at it" is going to work at all. You get stability and convergence issues, you get stuck in local minima; it's hard and not yet a mastered art. Lots of "midway between alchemy and chemistry" vibes.
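To make the instability concrete, here's a minimal sketch (my illustration, not from the comment above) of the classic two-state counterexample attributed to Tsitsiklis and Van Roy: one weight, linear value features, zero rewards everywhere, and yet semi-gradient TD updates applied off-policy blow up instead of converging. The specific numbers are arbitrary.

```python
# Two states with linear features: V(s1) = w, V(s2) = 2w.
# Reward is always 0, so the true values are 0. But bootstrapped
# semi-gradient updates on the s1 -> s2 transition alone (an
# off-policy sampling pattern) make w diverge when gamma > 0.5.
alpha, gamma = 0.1, 0.99
w = 1.0
for _ in range(200):
    # TD target bootstraps from the same diverging estimate: r + gamma * V(s2)
    td_error = 0.0 + gamma * (2 * w) - w
    # semi-gradient step w.r.t. V(s1) = w (feature value 1)
    w += alpha * td_error * 1.0

print(w)  # grows without bound instead of converging to 0
```

Each step multiplies w by (1 + alpha * (2 * gamma - 1)), so the estimate grows geometrically: bootstrapping plus function approximation plus off-policy data is the "deadly triad" that makes deep Q-learning so fragile.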
The RL in RLHF is more like Learning to Rank in a newsfeed-optimization setting: it's (often) pairwise ranking over human preference labels, and those preferences are remarkably stable across raters. This phrasing is a little cheeky, but it gives the flavor: it's Instagram where the reward is "rate it professional and useful" instead of "keep clicking".
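A hedged sketch of what that pairwise-ranking step looks like: the reward-modeling phase of RLHF is commonly a Bradley-Terry loss over (preferred, rejected) response pairs, which is the same family of objective as newsfeed learning-to-rank. The features and data below are made up for illustration; real systems score text with a large network, not a two-weight linear model.

```python
# Toy Bradley-Terry reward model: learn w so that the human-preferred
# response in each pair scores higher than the rejected one.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w = [0.0, 0.0]  # linear reward model r(x) = w . features(x)

def reward(feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

# Each pair: (features of preferred response, features of rejected response).
# Entirely fabricated numbers.
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
]

lr = 0.5
for _ in range(100):
    for chosen, rejected in pairs:
        # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_c - r_r)
        p = sigmoid(reward(chosen) - reward(rejected))
        # gradient step on -log p: push w along (chosen - rejected)
        for i in range(len(w)):
            w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])

# The trained model now ranks the preferred response higher in each pair.
for chosen, rejected in pairs:
    assert reward(chosen) > reward(rejected)
```

Note there's no environment, no bootstrapping, no Q-function here: it's supervised ranking on stable human labels, which is a big part of why this step is so much better behaved than the Waymo-style RL problem.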
When the Bitter Lesson essay was published, it was contrarian, important, and above all aimed at an audience of expert practitioners. The Bitter Bitter Lesson of 2025 is that if it looks like you're in the middle of an exponential process, wait a year or two and the sigmoid will become clear; with the LLM stuff, we're already there. Opus 4 is taking 30 seconds on the biggest cluster billions can buy, and they've stripped off something like 90% of the correctspeak alignment to get that capability lift. We're hitting the wall.
Now, this isn't to say that AI progress is over; new stuff is coming out all the time. But "log scale and a ruler" math is marketing at this point: this was a sigmoid.
Edit: don't take my word for it. This is LeCun (who, I'll remind everyone, has a Turing Award) giving the Gibbs Lecture, the 10,000-foot mathematical view: https://www.youtube.com/watch?v=ETZfkkv6V7Y
I'm in agreement: RLHF won't lead to massively more intelligent beings than humans. But I said RL, not RLHF.