Category error. You want 100% accuracy for an impossible problem. This is a famously unsolved conjecture; the only way to get the answer is to fully calculate it. The task was to make a guess and see how well it could do, and 99.7% is surprisingly good. If the task were to calculate, the LLM could write a Python program, just like I would have if asked to calculate the answer.
Category error.
The problem here is deterministic. *It must be for accuracy to even be measured*.
The model isn't trying to solve the Collatz conjecture; it is learning a pretty basic algorithm and then applying it a number of times. The instructions it needs to learn are:
    if x % 2 == 0:
        x //= 2        # even: halve
    else:
        x = 3 * x + 1  # odd: triple and add one
It also needs to learn to put that in a loop with a variable number of iterations, but the algorithm itself is static (see the sketch below).

On the other hand, the Collatz conjecture states that iterating C(x) (the above algorithm) reaches 1 for all x (where x ∈ Z+), meaning that eventually any input will collapse into the loop 1 -> 4 -> 2 -> 1 (or just terminate at 1). You can probably see we know this is true for at least an infinite set of integers...
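To make the first point concrete, here is a minimal sketch of the full procedure the model has to emulate (my own Python; the function name and the convention of counting every step are my choices, not necessarily what the paper uses):

    def collatz_steps(x: int) -> int:
        """Count how many iterations of the map it takes x to reach 1."""
        steps = 0
        while x != 1:
            if x % 2 == 0:
                x //= 2        # even: halve
            else:
                x = 3 * x + 1  # odd: triple and add one
            steps += 1
        return steps

For instance, collatz_steps(6) is 8 (6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1). The per-step rule is trivial; the part the model actually has to get right is how many iterations happen.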
Edit: I should note that there is a slight modification to this, though the model could get away with learning just this. Their variation is limited to odd numbers, and not all of them. For example, 9 can't be represented by (2^k)m - 1 (but 7 and 15 can). But you can see that there's still a simple algorithm and that the crux is determining the number of iterations. Regardless, this is still deterministic. They didn't use any integers > 2^71, and everything below that bound has been exhaustively checked: we absolutely know those sequences, and we absolutely know they all terminate at 1.
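To see why the sub-2^71 point matters, here's an illustrative sketch of the kind of finite check behind it (not the actual verification project, which used far more sophisticated arithmetic to reach ~2^71; the function name and early-exit trick are mine):

    def verify_up_to(limit: int) -> None:
        """Check that every n < limit eventually reaches 1 (a finite check, not a proof)."""
        for n in range(2, limit):
            x = n
            # Once the value drops below n it was already checked on an earlier
            # pass, so we can stop early. If a counterexample existed, this
            # inner loop would simply never terminate.
            while x >= n:
                x = x // 2 if x % 2 == 0 else 3 * x + 1
        print(f"all n < {limit} reach 1")

Everything the models were tested on sits inside a range where a check like this has already been run to completion.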
To solve the Collatz conjecture (and probably win a Fields Medal) you must do one of two things:
1) Provide a counter-example
2) Show that this holds for all n, which is an infinite set of numbers, so it strictly cannot be done by demonstration (see the formal statement below).
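Stated formally (my own phrasing of the standard statement, not a quote from the article), with C the map above:

    \forall n \in \mathbb{Z}^{+} \;\; \exists k \in \mathbb{N} : \; C^{(k)}(n) = 1

where C^{(k)} means applying C k times. Route 2 has to handle that universal quantifier over an infinite set, which is why no amount of checking individual values of n can settle it.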
There is a massive difference between an 'unsolved problem' and a problem solved 'the wrong way'. Yes, 99.7% is surprisingly good. But it did not detect the errors in its own output. And it should have.
Besides, we're all stuck on the 99.7% as if that's the across-the-board output, but that's a cherry-picked result:
"The best models (bases 24, 16 and 32) achieve a near-perfect accuracy of 99.7%, while odd-base models struggle to get past 80%."
I do think it is a very interesting thing to do with a model, and it is impressive that it works at all.