It would be interesting to get the agents to write code that preprocesses the logs and to build tools that analyse the outputs.
Maybe they are already doing this? Are there logs of the model's thinking?