I think this lack of 'G' (generality, or modality) is the problem. A human visualizes this kind of problem (a little video plays in my head of taking a car to a car wash). LLMs don't do this; they 'think' only in text, not visually.
A proper AGI would need knowledge in the video, image, audio and text domains to work properly.