energy123 • today at 3:27 PM

> this is where the taylor expression would fail to represent the values well.

"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"

Replies

seanhunter • today at 3:49 PM

I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
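To make the elementwise-error metric from the quoted claim concrete, here is a rough NumPy sketch. It assumes the approximation simply truncates the power series of exp(score) to P terms, which may not match the paper's actual formulation; `taylor_attention`, `max_elementwise_error`, and the `scale` knob are hypothetical names introduced for illustration. It only measures elementwise error on random inputs and says nothing about end-to-end model quality, which is the metric the comment above asks for.

```python
import math
import numpy as np

def softmax_attention(q, k, v):
    # Conventional scaled dot-product attention (single head, no mask).
    s = q @ k.T / math.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def taylor_attention(q, k, v, num_terms=4):
    # ASSUMPTION: replace exp(s) with its first `num_terms` Taylor terms,
    # 1 + s + s^2/2! + ..., then normalize rows as softmax would.
    s = q @ k.T / math.sqrt(q.shape[-1])
    w = sum(s**p / math.factorial(p) for p in range(num_terms))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def max_elementwise_error(scale, num_terms=4, n=64, d=32, seed=0):
    # `scale` is a hypothetical knob that controls how large the scores get.
    rng = np.random.default_rng(seed)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    q = q * scale
    exact = softmax_attention(q, k, v)
    approx = taylor_attention(q, k, v, num_terms)
    return float(np.abs(exact - approx).max())

if __name__ == "__main__":
    print("float16 eps:", float(np.finfo(np.float16).eps))
    for scale in (0.1, 0.5, 1.0):
        print(scale, max_elementwise_error(scale))
```

In this toy setup the error depends heavily on the magnitude of the scores, so an elementwise figure is only meaningful alongside the score distribution it was measured on.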
