Hacker News

Tree Search Distillation for Language Models Using PPO

56 points | by at2005 | today at 12:51 AM | 3 comments

Comments

natufunu | today at 6:20 AM

Great post! I wonder why MCTS isn't more popular as a test-time compute harness. Did you compare the performance of MCTS (without distillation) against other methods (e.g. best-of-N) at the same compute budget?

supermdguy | today at 4:20 AM

> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me: it sounded like they were only doing MCTS at train time, then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?
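To make the compute-accounting point concrete, here is a hypothetical back-of-the-envelope sketch (all numbers and function names are illustrative, not from the post): MCTS as a test-time harness spends extra forward passes per emitted token on node expansions and rollouts, while a policy distilled from the search traces pays only one forward pass per emitted token at inference.

```python
# Illustrative compute accounting (all numbers hypothetical).

def mcts_forward_passes(tokens: int, expansions_per_token: int, rollout_len: int) -> int:
    # MCTS at test time: each emitted token triggers several node
    # expansions, and each expansion runs a rollout of some length.
    return tokens * expansions_per_token * rollout_len

def distilled_forward_passes(tokens: int) -> int:
    # A policy distilled from the search traces (e.g. via GRPO/PPO)
    # decodes directly: one forward pass per emitted token.
    return tokens

search_cost = mcts_forward_passes(tokens=100, expansions_per_token=8, rollout_len=16)
distilled_cost = distilled_forward_passes(tokens=100)
print(search_cost, distilled_cost)  # 12800 100
```

Under these toy numbers the search harness costs 128x more inference compute per sample than the distilled model, which is why comparing the two at equal compute (rather than equal samples) matters.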
