Hacker News

thesz yesterday at 4:16 PM

Change layer size and you have to retrain. Change number of layers and you have to retrain. Change tokenization and you have to retrain.


Replies

dTal yesterday at 5:06 PM

None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized to 0, effectively embedding your smaller network in a larger one. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest, but only the first and last layers deal in tokens directly; everything in between operates on embeddings, so it's probably not impossible to retrain just those boundary layers while preserving the middle parts. (Rough sketch below.)

[0] https://news.ycombinator.com/item?id=47431671 https://news.ycombinator.com/item?id=47322887
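
A rough sketch of the width-growing and layer-duplication ideas, not anyone's actual method, assuming PyTorch modules (widen_linear and duplicate_layer are illustrative names):

    import copy
    import torch
    import torch.nn as nn

    # Embed a trained linear layer inside a wider one by zero-padding the new
    # rows/columns; on the original input subspace the enlarged layer computes
    # the same function as before.
    def widen_linear(old: nn.Linear, new_in: int, new_out: int) -> nn.Linear:
        new = nn.Linear(new_in, new_out, bias=old.bias is not None)
        with torch.no_grad():
            new.weight.zero_()
            new.weight[:old.out_features, :old.in_features] = old.weight
            if old.bias is not None:
                new.bias.zero_()
                new.bias[:old.out_features] = old.bias
        return new

    # Depth growth by duplication: repeat a middle block so the stack gets
    # deeper without touching any trained weights.
    def duplicate_layer(layers: nn.ModuleList, idx: int) -> nn.ModuleList:
        copied = copy.deepcopy(layers[idx])
        return nn.ModuleList(list(layers[:idx + 1]) + [copied] + list(layers[idx + 1:]))

In a real transformer the residual stream and layer norms complicate the zero-padding story, so treat this as the skeleton of the idea rather than a drop-in recipe.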

altruios yesterday at 4:49 PM

Hopefully we will find a way to make it so that minor changes don't require a full retrain. Training how to train, as a concept, comes to mind.

CamperBob2 yesterday at 5:17 PM

And yet the KL divergence after changing all that stuff remains remarkably similar between different models, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.

To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.
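A minimal sketch of the kind of comparison being described, assuming two stand-in models that map token ids to next-token logits over a shared vocabulary (model_a and model_b are placeholders, not real checkpoints):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def mean_next_token_kl(model_a, model_b, input_ids: torch.Tensor) -> float:
        # Predictive distributions of each model at every position.
        log_p = F.log_softmax(model_a(input_ids), dim=-1)
        log_q = F.log_softmax(model_b(input_ids), dim=-1)
        # KL(p || q) per position, averaged over batch and sequence.
        kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
        return kl.item()

Run that over a held-out corpus and you get one number per model pair; the observation above is that, across architectural choices, those numbers come out surprisingly close.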

Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.
