This uses a Taylor series to approximate softmax, but that is still only an approximation. I wonder exactly what that trade-off costs in accuracy versus performance. I note that they say it gets close to float16 with four Taylor terms.
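For a feel of the accuracy side, here's a throwaway PyTorch sketch (my own guess at the scheme, not the paper's kernel, and the term count / expansion point are assumptions on my part):

```python
# Minimal accuracy sketch: exact softmax vs. one built from a truncated
# Taylor (Maclaurin) expansion of exp. NOT the paper's method, just the
# obvious version of the idea.
import torch

def taylor_softmax(x, n_terms=4):
    # Truncated series of exp: 1 + x + x^2/2! + ... + x^n/n!
    out = torch.ones_like(x)
    term = torch.ones_like(x)
    for k in range(1, n_terms + 1):
        term = term * x / k
        out = out + term
    return out / out.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
logits = torch.randn(4, 16)
err = (torch.softmax(logits, dim=-1) - taylor_softmax(logits)).abs().max()
print(f"max abs error vs exact softmax: {err.item():.2e}")
# The error grows quickly with logit magnitude, so whether the "close to
# float16" claim holds depends a lot on the logit distribution.
```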
My other concern would be that evaluating a Taylor series is itself fairly involved. I wonder how well GPUs handle this compared to good old-fashioned softmax? The last time I used a Taylor expansion in a custom Triton kernel it was still very slow. That could just have been my own janky vibe-coded implementation, though.
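For what it's worth, a naive eager-mode timing sketch (assumed shapes, no fused kernel, so definitely not a fair stand-in for a proper Triton implementation) would look something like:

```python
# Naive eager-mode micro-benchmark: torch.softmax vs. a degree-4 Taylor
# version. Eager PyTorch launches several kernels for the polynomial, so
# this mostly shows why fusion matters, not what a good kernel would do.
import time
import torch

def taylor_softmax(x, n_terms=4):
    out = torch.ones_like(x)
    term = torch.ones_like(x)
    for k in range(1, n_terms + 1):
        term = term * x / k
        out = out + term
    return out / out.sum(dim=-1, keepdim=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)

for name, fn in [("softmax", lambda t: torch.softmax(t, dim=-1)),
                 ("taylor", taylor_softmax)]:
    fn(x)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        fn(x)
    if device == "cuda":
        torch.cuda.synchronize()
    # total seconds * 10 == milliseconds per iteration over 100 iters
    print(name, f"{(time.perf_counter() - t0) * 10:.3f} ms/iter")
```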
If the model learns using the approximate softmax, why does it matter? We only need the behavior of softmax, not an exact numerical solution.