Hacker News

cadamsdotcom yesterday at 10:49 PM

Agreed for most cases.

Each Cursor rule is a byproduct of tons of work and probably contains lots that can be unpacked. Any research on that?


Replies

kevmo314 today at 4:48 AM

Yeah, at a very high level it's similar to an actor-critic reinforcement learning algorithm. The rule text is a value function and one could build a critic model that takes as input the rule text and the main model's (the actor's) output to produce a reward.
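
To make the shape of that concrete, here's a rough toy sketch of the critic's interface: rule text plus the actor's output in, scalar reward out. Everything in it is an assumption for illustration: the `require:` / `forbid:` rule format, the names `critic_reward` and `pick_best`, and the substring matching standing in for a learned critic model. It only does best-of-n selection, not an actual actor-critic training loop.

```python
from typing import List


def critic_reward(rule_text: str, actor_output: str) -> float:
    """Toy critic: fraction of rules the output satisfies (0.0 to 1.0).

    Hypothetical rule format, one per line:
        require: <substring that must appear>
        forbid: <substring that must not appear>
    A real critic would be a learned model, not substring checks.
    """
    checks: List[bool] = []
    for line in rule_text.splitlines():
        kind, _, token = line.partition(":")
        kind, token = kind.strip().lower(), token.strip()
        if not token:
            continue  # skip blank or malformed lines
        if kind == "require":
            checks.append(token in actor_output)
        elif kind == "forbid":
            checks.append(token not in actor_output)
    if not checks:
        return 0.0  # no usable rules, so no signal
    return sum(checks) / len(checks)


def pick_best(rule_text: str, candidates: List[str]) -> str:
    """Best-of-n: return the candidate output the critic scores highest."""
    return max(candidates, key=lambda c: critic_reward(rule_text, c))


rules = "require: const\nforbid: var"
best = pick_best(rules, ["var x = 1", "const x = 1"])
```

In a real setup the reward would feed a policy update for the actor rather than just picking a winner among samples.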

This is easier said than done, though, because this value function is so noisy that it's often hard to learn from. Whether a response (the model output) matches the value function (the Cursor rules) is also not that easy to grade. It's been easier to train chain-of-thought style reasoning since one can directly score it via the length of thinking.

This new paper covers some of the difficulties of language-based critic models: https://openreview.net/pdf?id=0tXmtd0vZG

Generally speaking, the algorithm and approach are not new. Being able to do it with a reasonable amount of compute is the new part.
