Did you try repeating the same mid-layer block more than once?
If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation.
It would be really interesting to see which of those regimes the model falls into.
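For anyone who wants to try this: a minimal sketch of the idea with a toy PyTorch stack (hypothetical layer sizes and indices, not the actual model under discussion). It runs the middle pair of layers an extra `mid_repeats` times and compares against the single-pass output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 6-layer stack as a stand-in for the real model.
# dropout=0.0 so repeated forward passes are deterministic.
d_model, n_layers = 32, 6
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0, batch_first=True)
    for _ in range(n_layers)
)

def forward(x, mid_repeats=1):
    # Re-run the "mid-layer block" (layers 2-3 here, chosen arbitrarily)
    # mid_repeats times; all other layers run once.
    for i, layer in enumerate(layers):
        reps = mid_repeats if i in (2, 3) else 1
        for _ in range(reps):
            x = layer(x)
    return x

x = torch.randn(1, 8, d_model)
base = forward(x, mid_repeats=1)
rep2 = forward(x, mid_repeats=2)
print(base.shape)  # shape is preserved; activations diverge with extra passes
```

Plotting some quality metric against `mid_repeats` would show which regime the model is in: smooth diminishing returns, or a cliff where the circuit breaks.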
Yes!
I tried that pretty early on, but it's basically never good. It's described in this section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...