There's nothing "pretty conventional" about a neural memory mechanism that comes with such solid evidence of scalability and appealing performance characteristics.
If neural memory were conventional, GPT-4o's memory wouldn't be stored as plain text and prepended to prompts.
This paper reminds me of the Switch Transformer paper: it solidifies, expands on, and proves out an area of research that may well have a big impact on leading LLMs and the SOTA in AI.
Agreed, the concept of surprise is very cool.
There definitely is precedent: any parallelizably decodable, CABAC-derived neural compression algorithm has a flavor of this idea at its heart - intersperse statistical state throughout the token stream so the decoder can re-establish its state, and absorb novelty, at those points on the fly.
Taken to its extreme, where the ‘memory’ is descriptive enough to deterministically control the decoding, you get parallelism over the sequence for free as a consequence of associativity.
Similar techniques are used to make video codecs robust enough for low-latency reconnection when streaming over poor or changing networks, or to decompress JPEGs at >1 GB/s in parallel by exploiting restart (RST) markers that delimit independent, freshly initialized substreams.
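To make the JPEG case concrete, here is a minimal sketch (not a full decoder, and not tied to any particular library) of how restart markers enable the parallelism: the entropy-coded scan is cut at each RSTn marker (0xFFD0-0xFFD7), and because the decoder state is reset at every marker, each segment can be decoded independently. `decode_segment` is a hypothetical placeholder for the real Huffman/MCU decoding, which would also need the frame and scan headers.

```python
# Sketch: split a JPEG scan's entropy-coded data on restart markers and
# decode the resulting independent segments in parallel.
from concurrent.futures import ThreadPoolExecutor


def split_on_restart_markers(scan_bytes: bytes) -> list:
    """Cut the entropy-coded stream at each RSTn marker (0xFFD0-0xFFD7).

    Every segment after a marker starts with reset entropy-decoder state
    (DC predictors zeroed), which is what makes parallel decoding possible.
    Stuffed bytes (0xFF 0x00) are not markers and are left untouched.
    """
    segments, start, i = [], 0, 0
    while i < len(scan_bytes) - 1:
        if scan_bytes[i] == 0xFF and 0xD0 <= scan_bytes[i + 1] <= 0xD7:
            segments.append(scan_bytes[start:i])
            start = i + 2          # skip the 2-byte marker itself
            i += 2
        else:
            i += 1
    segments.append(scan_bytes[start:])
    return segments


def decode_segment(segment: bytes):
    # Hypothetical placeholder for decoding the MCUs in one restart interval.
    return len(segment)


def parallel_decode(scan_bytes: bytes):
    segments = split_on_restart_markers(scan_bytes)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode_segment, segments))
```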
That said, I do agree that this is a great paper and a real contribution to language models!
1991
> Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. This greatly facilitates downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses my neural knowledge distillation procedure of 1991 [UN0-UN2] to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods.
https://people.idsia.ch/~juergen/very-deep-learning-1991.htm...
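Here is a toy sketch of the gating idea in that quote, under my reading of it: a low-level predictor (standing in for the subconscious automatiser RNN) predicts the next token, and only the tokens it mispredicts - the surprising ones - are forwarded to the higher level (the chunker). A bigram counter replaces the RNN, and the later distillation of the chunker back into the automatiser is omitted; all names here are illustrative.

```python
# Toy version of the 1991 "history compressor" gating: forward only the
# tokens the lower-level predictor fails to predict.
from collections import defaultdict, Counter


class BigramPredictor:
    """Predicts the most frequent successor seen so far (RNN stand-in)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, prev):
        successors = self.counts[prev]
        return successors.most_common(1)[0][0] if successors else None

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1


def compress_history(tokens):
    """Return the subsequence of tokens the low-level predictor got wrong."""
    automatiser = BigramPredictor()
    surprising = []
    prev = None
    for tok in tokens:
        if automatiser.predict(prev) != tok:   # prediction failed -> surprise
            surprising.append(tok)             # forward to the chunker level
        automatiser.update(prev, tok)
        prev = tok
    return surprising


# A repetitive stream compresses well: once the pattern is learned,
# almost nothing is forwarded upward except the anomaly.
print(compress_history(list("abababababXabababab")))
```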
> the concept of surprise is very cool
Then you may be interested in Simplicity Theory:
https://simplicitytheory.telecom-paris.fr/
In particular this recent paper:

> Unexpectedness and Bayes’ Rule

> A great number of methods and of accounts of rationality consider at their foundations some form of Bayesian inference. Yet, Bayes’ rule, because it relies upon probability theory, requires specific axioms to hold (e.g. a measurable space of events). This short document hypothesizes that Bayes’ rule can be seen as a specific instance of a more general inferential template, that can be expressed also in terms of algorithmic complexities, namely through the measure of unexpectedness proposed by Simplicity Theory.
Source: https://cifma.github.io/Papers-2021/CIFMA_2021_paper_13.pdf
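For context, the central quantity there is unexpectedness, which (as I understand Dessalles' definitions on the Simplicity Theory site) is the gap between how hard the world has to work to generate a situation and how briefly the observer can describe it, with subjective probability decaying exponentially in that gap:

```latex
% Unexpectedness in Simplicity Theory (my paraphrase of the definitions):
%   C_w(s): generation complexity - shortest causal path for the "world"
%           to produce situation s
%   C(s):   description complexity - shortest description of s available
%           to the observer
U(s) = C_w(s) - C(s)
% Subjective (ex post) probability then falls off exponentially with U:
p(s) = 2^{-U(s)}
```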