logoalt Hacker News

codethieflast Wednesday at 12:17 PM1 replyview on HN

> they observed instances where DGM attempted to manipulate its reward function through deceptive practices. One notable example involved the system fabricating the use of external tools - specifically, it generated fake logs suggesting it had run and passed unit tests, when in reality no tests were executed.

I have yet to read the paper and I know very little about the benchmarks the authors employed but why would they even feed logs produced by the agent into the reward function instead of objectively checking (outside the agent sandbox!) what the agent does & produces? I.e. let the agent run on some code base, take the final diff produced by the agent and run it through coding benchmarks?

Or, in case the benchmarks reward certain agent behavior (tool usage etc.) on the way to its goal of producing a high-quality diff, inspect processes spawned by the agent from outside the sandbox?


Replies

toughlast Wednesday at 2:57 PM

Ive seen claude 4 do this too when its context has lots of teats already and tool calling

imho the main issue is an llm no has real sense of what’s a real tool call vs just a log of it, the text logs are virtually identical, ao the Llm starts also predicting these inatrad of calling the tool to run tests

its kinda funny