Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
Charitably, I suppose you could question why you would ever want to use text to command a machine in the world (simulated or not).
But I don't see how it's the wrong tool given the goal.
SOTA typically refers to achieving the best performance, not to using the trendiest thing regardless of performance. There is some subtlety here: at some point an LLM might give the best performance at this task, but that day is not today, so an LLM is not SOTA, just trendy. It's kinda like rewriting something in Rust and calling it SOTA because that's the trend right now. Hope that makes sense.