Hacker News

simonw · yesterday at 8:15 PM · 2 replies

I would love to know the answer to that question!

One guess: maybe running multiple different fine-tuning-style operations isn't actually that expensive - on the order of hundreds or thousands of dollars per run once you've trained the rest of the model.

I expect the majority of their evaluations are then automated, LLM-as-a-judge style. They presumably only manually test the best candidates from those automated runs.
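To make the LLM-as-a-judge idea concrete, here's a minimal sketch of how an automated evaluation pass over candidate checkpoints might look. All names (`JUDGE_PROMPT`, `rank_candidates`, the `call_judge` callable) are hypothetical, and the actual model call is left as a caller-supplied function rather than a real API:

```python
# Hypothetical sketch of LLM-as-a-judge candidate ranking.
# The judge model call is abstracted as `call_judge(prompt) -> str`.

JUDGE_PROMPT = """Rate the response to the prompt on a 1-5 scale.
Prompt: {prompt}
Response: {response}
Reply with a single integer."""


def parse_score(judge_output: str) -> int:
    """Extract the first integer in the 1-5 range from the judge's reply."""
    for token in judge_output.split():
        cleaned = token.strip(".")
        if cleaned.isdigit() and 1 <= int(cleaned) <= 5:
            return int(cleaned)
    raise ValueError("no score found in judge output")


def rank_candidates(prompt, candidates, call_judge):
    """Score each candidate checkpoint's response, best-first.

    `candidates` is a list of (checkpoint_name, response) pairs.
    """
    scored = []
    for name, response in candidates:
        reply = call_judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
        scored.append((parse_score(reply), name))
    return sorted(scored, reverse=True)
```

In practice you'd run this over many prompts and aggregate, then only hand the top few checkpoints to human evaluators, which is the workflow the comment is guessing at.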


Replies

ACCount37 · yesterday at 9:43 PM

That's sort of true. SFT isn't too expensive - the per-token cost isn't far off from that of pre-training, and the pre-training dataset is massive compared to any SFT dataset - although SFT data is much more expensive to obtain per token.

RL is more expensive than SFT, in general, but still worthwhile because it does things SFT doesn't.

Automated evaluation is massive too - benchmarks are used extensively, including ones where LLMs are judged by older "reference" LLMs.

Using AI feedback directly in training is something that's done increasingly often too, but it's a bit tricky to get it right, and results in a lot of weirdness if you get it wrong.
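One simple form of using AI feedback directly in training is rejection sampling: draw several responses, score them with a judge model, and keep the winner as new fine-tuning data. A toy sketch, with hypothetical names and both the sampler and the judge supplied by the caller:

```python
# Hypothetical sketch of AI-feedback-driven rejection sampling.
# `sample_fn(prompt) -> str` draws one response from the policy model;
# `judge_fn(response) -> number` is the AI judge's score.

def rejection_sample(prompt, sample_fn, judge_fn, n=4):
    """Draw n candidate responses, score each with the judge,
    and return the best one (e.g. to feed back into SFT data)."""
    responses = [sample_fn(prompt) for _ in range(n)]
    return max(responses, key=judge_fn)
```

This also illustrates the failure mode mentioned above: if the judge is miscalibrated, the selected data drifts toward whatever the judge over-rewards (length, sycophancy, and so on), and that bias compounds with each training round.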

Imnimo · yesterday at 9:38 PM

I guess I thought the pipeline was typically Pretraining -> SFT -> Reasoning RL, such that it would be expensive to test how changes to SFT affect the model you get out of Reasoning RL. Is it standard to do SFT as a final step?
