There is a massive difference between an 'unsolved problem' and a problem solved 'the wrong way'. Yes, 99.7% is surprisingly good. But it did not detect the errors in its own output. And it should have.
Besides, we're all stuck on the 99.7% as if that's the across-the-board result, but it's a cherry-picked one:
"The best models (bases 24, 16 and 32) achieve a near-perfect accuracy of 99.7%, while odd-base models struggle to get past 80%."
I do think it is a very interesting thing to do with a model and it is impressive that it works at all.