
jcarreiro · today at 4:47 PM · 3 replies

The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

I.e., the claim is that this method reproduces the results of conventional attention, up to Float16 numerical precision.
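To make the claim concrete, here is a minimal NumPy sketch of the idea: replace exp in the attention weights with its Taylor polynomial and compare the result to exact softmax attention. This is illustrative only, not the paper's actual algorithm (which presumably evaluates the Taylor terms without materializing the full score matrix); the function names, dimensions, and score scaling are assumptions.

```python
import math

import numpy as np

def softmax_attention(Q, K, V):
    # Exact softmax attention, used as the reference.
    S = Q @ K.T
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

def taylor_attention(Q, K, V, P=4):
    # Replace exp(s) with its Taylor polynomial of degree P, then
    # row-normalize as softmax would. Toy version for comparison only.
    S = Q @ K.T
    A = sum(S**p / math.factorial(p) for p in range(P + 1))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d = 64, 32
# Scale inputs so scores are modest; a low-degree Taylor polynomial
# of exp is only accurate for small score magnitudes (assumption
# about the regime the paper's normalization targets).
Q, K, V = (rng.standard_normal((n, d)) / d**0.5 for _ in range(3))

err = np.max(np.abs(taylor_attention(Q, K, V, P=4) - softmax_attention(Q, K, V)))
print(f"max elementwise error: {err:.1e}")
print(f"Float16 resolution:    {np.finfo(np.float16).eps:.1e}")  # ~9.8e-04
```

With scores of this magnitude, the P = 4 elementwise error lands in the same ballpark as Float16 resolution, which is the shape of the paper's claim.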


Replies

kristjansson · today at 7:06 PM

> approximately the same magnitude

And they really do mean that: their results scatter within about ±1 on log10 plots, i.e., an order of magnitude either way.

fheinsen · today at 5:13 PM

The method is more general: the GitHub repository's first example uses eight Taylor terms (P = 8).

energy123 · today at 5:29 PM

It converges on conventional attention as P increases.
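The convergence is easy to see at the level of a single score: the Taylor remainder for exp shrinks factorially in P, so the approximate weights approach the exact softmax weights as P grows. A quick sketch (the score value 1.5 is an arbitrary illustration):

```python
import math

# Truncation error of the degree-P Taylor polynomial of exp(s);
# the error drops by orders of magnitude with each pair of terms.
s = 1.5
for P in (2, 4, 6, 8):
    approx = sum(s**p / math.factorial(p) for p in range(P + 1))
    print(P, abs(approx - math.exp(s)))
```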