You took the simple path, embedding a smaller network into a larger one. What if you need to reduce the number of layers and/or the width of the hidden layers? How will you embed a larger network into a smaller one? As for the "addition of same layers": wouldn't the process of selecting which layers to add be considered training?
What if you still have to obtain the best result possible for a given coefficient/tokenization budget?
I think that my comment expresses the general case, while yours provides some exceptions.