I have a transformer attention mechanism that seems to be more data-efficient than standard dot-product attention, and I'm trying to write a performant backward kernel for it.
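
To make the question concrete, here is a minimal sketch (assuming PyTorch) of the scaffold I mean by a hand-written backward: a `torch.autograd.Function` that saves what it needs in the forward and returns one gradient per input. The gradient math below is for plain scaled dot-product attention as a stand-in, since my actual scoring function differs; it is the structure, not the exact math, that I'm asking about.

```python
import math
import torch

class AttentionFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        scale = 1.0 / math.sqrt(q.shape[-1])
        scores = (q @ k.transpose(-2, -1)) * scale    # (..., Lq, Lk)
        probs = scores.softmax(dim=-1)
        out = probs @ v                               # (..., Lq, Dv)
        ctx.save_for_backward(q, k, v, probs)
        ctx.scale = scale
        return out

    @staticmethod
    def backward(ctx, d_out):
        q, k, v, probs = ctx.saved_tensors
        scale = ctx.scale
        d_v = probs.transpose(-2, -1) @ d_out         # dV = P^T dO
        d_probs = d_out @ v.transpose(-2, -1)         # dP = dO V^T
        # softmax backward: dS = P * (dP - rowsum(dP * P))
        d_scores = probs * (d_probs - (d_probs * probs).sum(dim=-1, keepdim=True))
        d_q = (d_scores @ k) * scale
        d_k = (d_scores.transpose(-2, -1) @ q) * scale
        return d_q, d_k, d_v

# quick numerical check against autograd (double precision for gradcheck)
q, k, v = (torch.randn(2, 4, 8, dtype=torch.double, requires_grad=True) for _ in range(3))
torch.autograd.gradcheck(AttentionFn.apply, (q, k, v))
```

This eager-mode version works but materializes the full attention matrix, which is exactly what I'd like the fused kernel to avoid.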