Nope, they encoded (effectively hand-compiled) a simple VM / WASM interpreter directly into the transformer weights; there is no training. You'd be forgiven for this misreading: early on they imply that their model is (in principle) trainable, but later they admit that the actual construction is not differentiable, and merely assert that a differentiable approximation "should" still work, with no discussion of what loss function or training data could score partially correct or incomplete program outputs.
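The non-differentiability objection is easy to make concrete. A toy sketch (my own illustration, not from the paper): a branch hard-coded into weights behaves like a step function, whose gradient is zero almost everywhere, so gradient descent gets no signal; a sigmoid relaxation restores gradients but no longer computes the exact branch the interpreter needs.

```python
import math

def hard_step(x):
    # A "compiled-in" exact branch: 1 if x >= 0 else 0.
    # Its derivative is 0 everywhere except the discontinuity.
    return 1.0 if x >= 0 else 0.0

def soft_step(x, temp=0.1):
    # Differentiable relaxation (sigmoid with temperature).
    # Gives gradients, but only approximates the branch's output.
    return 1.0 / (1.0 + math.exp(-x / temp))

def numeric_grad(f, x, eps=1e-6):
    # Central-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Away from the discontinuity the exact branch gives no learning signal:
print(numeric_grad(hard_step, 0.5))   # 0.0
# The relaxation gives a nonzero gradient, at the cost of exactness:
print(numeric_grad(soft_step, 0.5))   # small positive value
print(hard_step(0.5), soft_step(0.5))
```

Replacing every discrete interpreter operation with a relaxation like this is exactly the unproven step: the relaxed model is trainable in principle, but it is no longer the exact VM they encoded, and nothing in the paper shows the two would converge.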