I have a nonlinear attention mechanism that seems to improve data efficiency, but it's slow. I'm trying to learn the Python CuTe DSL to speed it up.
I'm also reading Principles and Practice of Deep Representation Learning, Or: A Mathematical Theory of Memory.