Hacker News

cgorlla · yesterday at 6:20 PM

> yet, the way you described your method, it involves modifying internal model activations

It's a subtlety, but part of the method does work on API-based models. From the post:

"we combine this with a graph verification pipeline (which works on closed weight models)"

The graph based policy adjudication doesn't need access to the model weights.
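For intuition only, here is a minimal sketch of what output-side adjudication against a policy graph could look like. This is an illustrative assumption, not their actual pipeline: the graph structure, the naive topic extraction, and every name below are hypothetical. The point it demonstrates is that such a check consumes only the API's output text, never the weights.

```python
# Hypothetical sketch: adjudicating a closed-weight model's output
# against a policy graph, using only the returned text.
# All names and the graph itself are illustrative, not the actual pipeline.

# Policy graph: nodes are topics, edges are allowed topic transitions.
POLICY_GRAPH = {
    "billing": {"refunds", "invoices"},
    "refunds": {"billing"},
    "invoices": set(),
}

def extract_topics(output_text, known_topics):
    """Return known topics in the order they appear in the text.
    (Naive substring matching; a real pipeline would use an extractor model.)"""
    text = output_text.lower()
    found = [(text.find(t), t) for t in known_topics if t in text]
    return [t for _, t in sorted(found)]

def adjudicate(output_text):
    """True iff every consecutive topic transition follows a policy edge."""
    topics = extract_topics(output_text, POLICY_GRAPH.keys())
    for a, b in zip(topics, topics[1:]):
        if b not in POLICY_GRAPH.get(a, set()):
            return False
    return True
```

A verdict of `False` could then trigger whatever enforcement the policy specifies (block, rewrite, escalate) without ever touching model internals.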

> Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?

You could, via RFT or similar training on the outputs. But as is, it functions as a runtime layer on top of the model without affecting the underlying weights, so the benefit is that each customization doesn't create another model artifact.

> What exactly are you serving in the API?

It's the base policy configuration that produced the benchmark results, along with various personas to give users an idea of how uploading a custom policy would work.

For industry-specific deployments, we maintain additional base policies per vertical, so the API is meant to simulate that aspect of the platform.
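To make the serving side concrete, a hypothetical request shape might look like the following. The endpoint, field names, and persona id are my invention for illustration, not the actual API:

```python
import json
import urllib.request

# Hypothetical request shape: endpoint, field names, and persona id
# are illustrative assumptions, not the actual API.
payload = {
    "policy": "base",            # the base policy behind the benchmark results
    "persona": "support_agent",  # one of the demo personas
    "messages": [{"role": "user", "content": "Can I get a refund?"}],
}

req = urllib.request.Request(
    "https://api.example.com/v1/chat",  # placeholder URL
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # would return the policy-adjudicated completion
```

The idea being that swapping `"policy": "base"` for an uploaded custom policy, or a vertical-specific base policy, is the customization path described above.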


Replies

oersted · yesterday at 6:25 PM

> graph based policy adjudication

What do you mean by this? Does the method involve playing with output token probabilities? Or modifying the prompt? Or blocking bad outputs?

> how uploading a custom policy would work

Do you have more info on this? Is this something you offer already or something you are planning? How would policies be defined, as a prompt? As a dataset of examples?
