Hacker News

trothamel yesterday at 3:39 PM

Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.


Replies

jmalicki yesterday at 3:55 PM

The most advanced training data is in the form of rubrics as rewards.

A human asks a question, then writes rubrics for judging the LLM's response. Because the rubrics describe criteria rather than grade one specific response, they can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.

https://arxiv.org/abs/2507.17746
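
To make the basic principle concrete, here is a toy sketch of scoring a response against a rubric. It is not the method from the linked paper: the names `RubricItem` and `rubric_reward`, the weights, and the example question are all made up for illustration, and the keyword checks are a stand-in for what would presumably be an LLM judge in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One human-written criterion (hypothetical structure)."""
    description: str
    weight: float
    # Stand-in for a judge: a real setup would likely ask a model
    # whether the response satisfies `description`.
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric items the response satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Rubric written once for the question "At what temperature does water boil?"
rubric = [
    RubricItem("States 100 C at sea level", 2.0, lambda r: "100" in r),
    RubricItem("Notes the effect of altitude", 1.0, lambda r: "altitude" in r.lower()),
]

# The same rubric scores whatever answer the model gives at any point in training.
old_answer = "Water boils at 100 C."
new_answer = "Water boils at 100 C at sea level, and lower at high altitude."

print(rubric_reward(old_answer, rubric))
print(rubric_reward(new_answer, rubric))
```

The point of the design is that the rubric is tied to the question, not to any one answer, so the reward can be recomputed for every new response without another round of human grading.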