> Flagship models usually do not do that without some convincing
Just a data point, but I’ve been having Claude do this regularly.
I think I was using GitHub Copilot when I had the experience that led me to this statement. I guess the experience of using LLMs can differ quite a bit depending on the model version and harness.
Same. I was having it debug a routine Python issue and it broke out Pympler and LLDB, and added a signal handler to dump stack traces.
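For anyone who hasn't seen the signal-handler trick, here is a rough sketch of one way to do it with the standard library's `faulthandler`. This is not the code the model actually wrote; the helper name `install_stack_dump_handler` and the choice of `SIGUSR1` are my own assumptions, and it only works on Unix-like systems.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(signum=signal.SIGUSR1, stream=sys.stderr):
    """Hypothetical helper: dump every thread's traceback when `signum` arrives.

    `stream` must be a real file object with a file descriptor, because
    faulthandler writes to the descriptor directly from the signal handler.
    """
    # all_threads=True prints the stack of every thread, not just the current one.
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum
```

Call it once at startup, then run `kill -USR1 <pid>` from another terminal whenever the process looks stuck; the tracebacks go to stderr and the process keeps running.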
Gemini Flash-Lite has been a decent reverse-engineering sidekick since 2.5 as well.