It's an artificial intelligence, not a small deterministic shell script. Stop comparing it to one. It has both new capabilities and new classes of failure mode. Those new failure modes are more like human failure modes than traditional symbolic logic failures.
We need to get better at building and using these systems by validating both their inputs and their outputs in more sophisticated ways. Acting surprised and denouncing them because they fail differently from more primitive systems misses the point.
They're stochastic by design. If we want deterministic results, we must pair the stochastic system with deterministic validators. It's trivial to do, and one day security experts will look back on the era when people skipped that step the same way we look back on '90s software that didn't validate user input at all.
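To make the "deterministic validator around a stochastic system" idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration: `make_flaky_model` is a hypothetical stand-in for a real model call, and the allowed actions and the 0 to 100 amount range are assumed rules, not anything from a real product.

```python
import json
import random
from typing import Callable, Optional

# Assumed policy for this sketch: only these actions may ever be executed.
ALLOWED_ACTIONS = {"refund", "noop"}


def make_flaky_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a stochastic model: output varies per call."""
    rng = random.Random(seed)
    candidates = [
        '{"action": "refund", "amount": 25}',  # well-formed and allowed
        '{"action": "refund", "amount": -5}',  # breaks the range rule
        '{"action": "delete_all"}',            # action not on the allowlist
        "Sure! Here is your JSON...",          # not JSON at all
    ]
    return lambda prompt: rng.choice(candidates)


def validate(raw: str) -> dict:
    """Deterministic check: the same input always gets the same verdict."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed")
    amount = data.get("amount", 0)
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ValueError("amount must be an integer")
    if not 0 <= amount <= 100:
        raise ValueError("amount out of range")
    return data


def guarded_call(
    prompt: str, model: Callable[[str], str], max_attempts: int = 5
) -> Optional[dict]:
    """Retry the stochastic model until its output passes validation."""
    for _ in range(max_attempts):
        try:
            return validate(model(prompt))
        except ValueError:
            continue  # reject and resample; nothing unvalidated gets through
    return None  # fail closed when no attempt passes


if __name__ == "__main__":
    print(guarded_call("refund order 123", make_flaky_model(seed=0)))
```

The model can still say anything it likes, but only output that passes `validate` ever reaches the caller, and when nothing passes the wrapper returns `None` instead of guessing. That is the same shape as validating user input: treat the stochastic part as untrusted and let deterministic code make the final call.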