Is the approach fundamentally limited to smaller models? Or could you theoretically train a model as powerful as the largest models, but much faster?