Hacker News

ACCount37 · today at 6:19 AM

Good question.

It might work; I've considered running a test like this. But it does demand certain things.

The subnetwork has to either be crafted as "gradient resistant" or remain frozen. Not all discovered or handcrafted circuits would survive gradient pressure as-is, especially the kind of gradients that fly around in early pre-training.
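A minimal sketch of the "remain frozen" option, in PyTorch. The `FrozenAdder` module and its internals are hypothetical stand-ins for a handcrafted circuit; the point is just that `requires_grad_(False)` keeps pre-training gradients from overwriting it while still letting gradients flow through it to the rest of the network:

```python
import torch
import torch.nn as nn

class FrozenAdder(nn.Module):
    """Hypothetical handcrafted circuit, frozen against gradient pressure."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        for p in self.parameters():
            p.requires_grad_(False)  # frozen: optimizer can never touch these

    def forward(self, x):
        return self.proj(x)

adder = FrozenAdder(16)
x = torch.randn(4, 16, requires_grad=True)
adder(x).sum().backward()

# Gradients pass *through* the frozen module to upstream activations,
# but none accumulate on the frozen weights themselves.
assert x.grad is not None
assert adder.proj.weight.grad is None
```

Note that freezing alone doesn't make the circuit "gradient resistant" in the stronger sense; the surrounding trainable weights can still learn to route around it.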

It has to be able to interface with the native representations that form in a real LLM during pre-training, which is not trivial. And this has to happen early enough in pre-training that gradients start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, the full network first needs to discover the subnetwork and start using it.
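One way the interfacing problem is sometimes handled (an assumption on my part, not something the comment specifies) is to wrap the frozen core in trainable read-in/read-out projections on the residual stream, so the host model can learn its own mapping into the circuit's representation and back. All names here are illustrative:

```python
import torch
import torch.nn as nn

class WrappedCircuit(nn.Module):
    """Frozen core behind trainable interface projections (hypothetical design)."""
    def __init__(self, d_model: int, d_circuit: int, frozen_core: nn.Module):
        super().__init__()
        self.read_in = nn.Linear(d_model, d_circuit)   # trainable interface
        self.core = frozen_core                        # frozen circuit
        self.read_out = nn.Linear(d_circuit, d_model)  # trainable interface

    def forward(self, h):
        # Residual form: the host network can ignore the circuit or adopt it.
        return h + self.read_out(self.core(self.read_in(h)))

frozen_core = nn.Linear(8, 8)
for p in frozen_core.parameters():
    p.requires_grad_(False)

block = WrappedCircuit(d_model=16, d_circuit=8, frozen_core=frozen_core)
out = block(torch.randn(2, 16))

# Only the interface layers are trainable; the core stays fixed.
trainable = [n for n, p in block.named_parameters() if p.requires_grad]
```

The residual connection matters for the "discovery" dynamic: at init the block is close to a no-op, so adopting the circuit is a gradual choice the gradients can make rather than a disruption they must absorb.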

And finally, it has to end up being used for what we want it to be used for. It's possible that an "addition primitive" structure would be co-opted for something else entirely if you put it into the training run early enough, when the LLM's native circuitry is still nonexistent.

Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
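The "spray and watch" setup could be sketched like this, assuming each frozen copy sits behind a learnable scalar gate initialized at zero: during pre-training, you'd track which gates grow as a proxy for which copies the network has discovered and started routing through. Everything here (module names, gate mechanism, sizes) is a hypothetical engineering choice, not the commenter's actual design:

```python
import copy
import torch
import torch.nn as nn

class GatedFrozenCopy(nn.Module):
    """One frozen copy of the template subnetwork, behind a learnable gate."""
    def __init__(self, template: nn.Module):
        super().__init__()
        self.core = copy.deepcopy(template)
        for p in self.core.parameters():
            p.requires_grad_(False)  # every copy stays frozen
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed

    def forward(self, h):
        return h + self.gate * self.core(h)  # exact identity at init

template = nn.Linear(16, 16)  # stand-in for the handcrafted subnetwork
copies = nn.ModuleList(GatedFrozenCopy(template) for _ in range(200))

# During training you'd log this per step: rising gates = discovery.
usage = [c.gate.abs().item() for c in copies]

h = torch.randn(3, 16)
identity_at_init = torch.allclose(copies[0](h), h)
```

In a real run the 200 copies would be spliced into the residual stream at different layers rather than collected in one `ModuleList`, and the synthetic addition problems in the data give the gates a reason to open.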


Replies

rao-v · today at 8:36 AM

+1 I’ve always had the feeling that training from randomly initialized weights, without seeding some substructure, is unnecessarily slowing down LLM training.

Similarly I’m always surprised that we don’t start by training a small set of layers, stack them and then continue.
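The stack-and-continue idea can be sketched as warm-starting a deeper model by duplicating a trained shallow stack (in the spirit of progressive stacking; the function below is illustrative, not a reference implementation):

```python
import copy
import torch
import torch.nn as nn

def grow_by_stacking(layers: nn.ModuleList) -> nn.ModuleList:
    """Double depth by appending a copy of each trained layer,
    instead of initializing the new layers from scratch."""
    return nn.ModuleList([*layers, *(copy.deepcopy(l) for l in layers)])

small = nn.ModuleList(nn.Linear(8, 8) for _ in range(2))  # "trained" stack
big = grow_by_stacking(small)  # 4 layers, top half warm-started from bottom

weights_copied = torch.equal(big[2].weight, big[0].weight)
independent_params = big[2].weight is not big[0].weight
```

The `deepcopy` matters: the stacked layers start with the same weights but are separate parameters, so they can diverge once training continues.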
