This seems like a kind of odd test.
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
df = pd.read_csv(‘data.csv’)
df['new_column'] = df['index_value'] + 1
#there is no column ‘index_value’
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
I don't think this is odd at all. This situation will arise literally hundreds of times when coding some project. You absolutely want the agent - or any dev, whether real or AI - to recognize these situations and let you know when interfaces or data formats aren't what you expect them to be. You don't want them to just silently make something up without explaining somewhere that there's an issue with the file they are trying to parse.
IOW not a competent developer because they can't push back, not unlike a lot of incompetent devs.
I suspect 99% of coding agents would be able to say "hey wait, there's no 'index_value' column, here's the correct input.":
df['new_column'] = df.index + 1
The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning and even bad code wouldn't try an 'index_value' column.
Trying to follow invalid/impossible prompts by producing an invalid/impossible result and pretending its all good is a regression. I would expect a confident coder to point out the prompt/instruction was invalid. This test is valid, it highlights sycophantism