Hacker News

BloodAndCode · today at 7:18 PM

Did you try repeating the same mid-layer block more than once?

If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation.

It would be really interesting to see which of those regimes the model falls into.
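To make the two regimes concrete, here's a toy numpy sketch (the names `mid_block` and `W` are illustrative; this is not the actual model from the post). If the repeated block behaves roughly like a contraction on the hidden state, each extra pass moves the representation less than the last, which is the diminishing-returns curve; a block that isn't contractive can instead diverge or oscillate when looped.

```python
import numpy as np

def mid_block(x, W):
    # Toy stand-in for one mid-layer block: a 1-Lipschitz nonlinearity
    # after a linear map. With spectral norm of W below 1, this map is
    # a contraction, so repeated application converges to a fixed point.
    return np.tanh(W @ x)

rng = np.random.default_rng(0)
d = 16
W = 0.1 * rng.standard_normal((d, d))   # small weights -> contraction
x = rng.standard_normal(d)

# Repeat the same block and track how much each pass changes the state.
deltas = []
for _ in range(5):
    x_next = mid_block(x, W)
    deltas.append(np.linalg.norm(x_next - x))
    x = x_next

print(deltas)  # each repeat changes the representation less than the last
```

With larger weights (say `2.0 * rng.standard_normal(...)`) the update norms no longer shrink, which is the "breaks the computation" regime.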


Replies

dnhkng · today at 7:22 PM

Yes!

I tried that pretty early on; it's basically never good. It's described in this section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...
