What? Yes, they do take shortcuts and use hacks. They change the test cases to make them pass. As the context gets longer, they get less reliable at following earlier instructions. I literally had Claude hallucinate nonexistent APIs and then admit “You caught me! I didn’t actually know, let me do a web search,” and even after the web search it still mixed in deprecated patterns and APIs, against my instructions.
I’m much more worried about the reliability of software produced by LLMs.