If the model is trained to be an interpreter, does that mean the loss should reach 0 for it to be fully trained?
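For concreteness, here's a toy sketch of what I'm imagining (assuming a standard softmax + cross-entropy objective, where the target for each state is a single correct token):

```python
import torch
import torch.nn.functional as F

# Deterministic target: each input state has exactly one correct
# next token, so the "true" distribution is one-hot.
logits = torch.tensor([[10.0, 0.0, 0.0]])  # model strongly favors token 0
target = torch.tensor([0])                  # the single correct token

# Cross-entropy is -log p(correct token); with a softmax over finite
# logits, p < 1, so the loss gets arbitrarily small but never hits 0.
loss = F.cross_entropy(logits, target)
print(loss.item())  # ~9e-5, shrinking toward 0 as the logit gap grows
```

So maybe "fully trained" means the loss approaches 0 rather than exactly reaching it?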
Also, if its execution is purely deterministic, you probably don't need non-linearity in the layers, right?
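That is, could the stack just be linear maps? Something like this toy sketch (layer sizes are made up):

```python
import torch.nn as nn

# A purely linear stack with no activations between layers.
# (Hypothetical sizes; the question is whether a deterministic
# interpreter could get away with this.)
linear_only = nn.Sequential(
    nn.Linear(128, 256),
    nn.Linear(256, 256),  # no ReLU/GELU in between
    nn.Linear(256, 128),
)
```

(One thing I'm unsure about: without activations, the whole stack composes into a single linear map, so this would only work if each interpretation step is itself a linear function of the input.)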