Hacker News

yorwb · today at 5:24 PM

The approach here is very bad for training, though: unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries. And if you try to fix that, e.g. with straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
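To make the non-differentiability concrete, here is a minimal single-query NumPy sketch (my own illustration, not code from the linked work): softmax weights vary smoothly with the scores, while average-hard weights are piecewise constant, i.e. an indicator of the argmax set, so their gradient with respect to queries and keys is zero almost everywhere.

```python
import numpy as np

def softmax_attention(q, K, V):
    # Smooth weights: differentiable in q and K everywhere.
    s = K @ q
    e = np.exp(s - s.max())          # subtract max for numerical stability
    w = e / e.sum()
    return w @ V

def average_hard_attention(q, K, V):
    # Attend uniformly over the keys that achieve the maximum score.
    # The weights are piecewise constant in q and K, so the gradient
    # w.r.t. them is zero almost everywhere (hence the need for tricks
    # like straight-through estimation during training).
    s = K @ q
    hard = (s == s.max()).astype(float)
    w = hard / hard.sum()
    return w @ V

K = np.array([[1., 0.], [0., 1.], [1., 0.]])
q = np.array([10., 0.])
V = np.array([[1., 2.], [3., 4.], [5., 6.]])

# Keys 0 and 2 tie at the top score, so average-hard attention
# returns the mean of rows 0 and 2 of V: [3., 4.].
print(average_hard_attention(q, K, V))
```

With a large score gap, softmax attention approaches the same output, but through weights that remain differentiable; average-hard attention jumps discontinuously whenever the argmax set changes.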