
kevmo314 · yesterday at 4:48 AM

Yeah, at a very high level it's similar to an actor-critic reinforcement learning algorithm. The rule text acts as a value function, and one could build a critic model that takes the rule text and the main model's (the actor's) output as input and produces a reward.

This is easier said than done, though, because this value function is so noisy that it's often hard to learn from. And whether a response (the model output) matches the value function (the Cursor rules) isn't even that easy to grade. Chain-of-thought-style reasoning has been easier to train since one can score it directly via the length of thinking.
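
To make the framing concrete, here's a minimal Python sketch of that actor-critic setup. Everything in it (actor_generate, critic_score, the reward scale) is a hypothetical stand-in for real model calls, not anything from the paper or from Cursor:

    # Minimal sketch of the actor-critic framing described above.
    # All names here are hypothetical stand-ins for real model calls.
    from dataclasses import dataclass

    @dataclass
    class Transition:
        rule_text: str   # the "value function" expressed as text
        prompt: str      # the user request given to the actor
        response: str    # the actor's (main model's) output
        reward: float    # the critic's judgment of rule adherence

    def actor_generate(prompt: str, rule_text: str) -> str:
        # Stand-in for the main model producing a response under a rule.
        return f"(model output for {prompt!r})"

    def critic_score(rule_text: str, response: str) -> float:
        # Stand-in for a critic model that reads the rule text and the
        # actor's response and emits a scalar reward in [0, 1]. In
        # practice this is itself a language model, and as noted above
        # its signal is noisy and hard to learn from.
        return 0.5  # placeholder judgment

    def collect_transition(prompt: str, rule_text: str) -> Transition:
        response = actor_generate(prompt, rule_text)
        return Transition(rule_text, prompt, response,
                          critic_score(rule_text, response))

The reward stream from collect_transition would then feed a standard policy-gradient update (e.g. PPO) on the actor; that training loop is omitted here.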

This new paper covers some of the difficulties of language-based critic models: https://openreview.net/pdf?id=0tXmtd0vZG

Generally speaking, the algorithm and the approach are not new. Being able to do it with a reasonable amount of compute is the new part.


Replies

cadamsdotcom · yesterday at 6:13 AM

The suggestion was even simpler: feed a reasoning model a prompt like “tell me a few reasons a user might’ve created this Cursor rule: {RULE_TEXT}”

Do that for a bunch of rules scraped from a bunch of repos, and you've got yourself a dataset for training a new model, or maybe for fine-tuning.
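
In code, the loop is roughly this. ask_reasoning_model and the sample rules are hypothetical placeholders for whatever LLM API and scraped corpus you'd actually use:

    import json

    def ask_reasoning_model(prompt: str) -> str:
        # Stand-in for a call to any reasoning-capable LLM API.
        return "(model's guessed motivations)"

    # In practice: Cursor rules scraped from public repos.
    scraped_rules = [
        "Always use TypeScript strict mode.",
        "Prefer functional components over class components.",
    ]

    dataset = []
    for rule_text in scraped_rules:
        prompt = ("tell me a few reasons a user might've created "
                  f"this Cursor rule: {rule_text}")
        dataset.append({"rule": rule_text,
                        "motivations": ask_reasoning_model(prompt)})

    # One JSON object per line: a rule paired with guessed motivations.
    with open("rule_motivations.jsonl", "w") as f:
        for row in dataset:
            f.write(json.dumps(row) + "\n")

Each line pairs a rule with the model's guessed motivations, which is exactly the kind of (input, rationale) data you'd want for fine-tuning.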
