What's unusual about it? It seems pretty standard to train small models to validate an approach and then show that the training scales with model size up to 8B and 14B parameter models, which is what they did.