Hacker News

yorwb · today at 5:24 PM

The approach here is very bad for training, though: unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries. And if you try to fix that, e.g. with straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
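To make the non-differentiability concrete, here is a minimal single-query NumPy sketch (my own illustration, not code from the linked work): softmax weights vary smoothly with the scores, while average-hard weights are piecewise constant, i.e. an indicator of the argmax set, so their gradient with respect to queries and keys is zero almost everywhere.

```python
import numpy as np

def softmax_attention(q, K, V):
    # Smooth weights: differentiable in q and K everywhere.
    s = K @ q
    e = np.exp(s - s.max())          # subtract max for numerical stability
    w = e / e.sum()
    return w @ V

def average_hard_attention(q, K, V):
    # Attend uniformly over the keys that achieve the maximum score.
    # The weights are piecewise constant in q and K, so the gradient
    # w.r.t. them is zero almost everywhere (hence the need for tricks
    # like straight-through estimation during training).
    s = K @ q
    hard = (s == s.max()).astype(float)
    w = hard / hard.sum()
    return w @ V

K = np.array([[1., 0.], [0., 1.], [1., 0.]])
q = np.array([10., 0.])
V = np.array([[1., 2.], [3., 4.], [5., 6.]])

# Keys 0 and 2 tie at the top score, so average-hard attention
# returns the mean of rows 0 and 2 of V: [3., 4.].
print(average_hard_attention(q, K, V))
```

With a large score gap, softmax attention approaches the same output, but through weights that remain differentiable; average-hard attention jumps discontinuously whenever the argmax set changes.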