
mapontosevenths · today at 3:25 PM

This uses a Taylor series to approximate softmax, but that is only ever an approximation. I wonder exactly how much the trade-off costs in accuracy versus performance? They note it's close to float16 accuracy with four Taylor terms.
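
For anyone curious, here's a quick back-of-the-envelope sketch of what that looks like (my own NumPy toy, assuming they mean a plain truncated Maclaurin series of exp, which I haven't verified against the paper):

    import math
    import numpy as np

    def softmax(x):
        # Exact softmax with the usual max-subtraction for numerical stability.
        z = x - x.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def taylor_softmax(x, terms=4):
        # exp(x) ~= 1 + x + x^2/2! + x^3/3! (four terms). Note the truncated
        # series can go negative for strongly negative x, unlike true exp.
        approx = sum(x**k / math.factorial(k) for k in range(terms))
        return approx / approx.sum(axis=-1, keepdims=True)

    x = np.random.randn(4, 64).astype(np.float32)
    err = np.abs(softmax(x) - taylor_softmax(x)).max()
    print(f"max abs error vs exact softmax: {err:.2e}")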

My other concern is that evaluating the Taylor series is itself non-trivial. I wonder how well GPUs handle it compared to good old-fashioned softmax? The last time I used a Taylor expansion in a custom Triton kernel it was still very slow, though that could just have been my own janky vibe-coded implementation.
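
FWIW, the kernel I hacked together looked roughly like this (a hand-rolled sketch, not the article's kernel; the row-per-program layout and four-term series are my own assumptions):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def taylor_softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
        # One program instance per row; each lane handles one column.
        row = tl.program_id(0)
        offs = tl.arange(0, BLOCK_SIZE)
        mask = offs < n_cols
        x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=0.0)
        # Four-term series for exp: 1 + x + x^2/2 + x^3/6 -- just muls and
        # adds, no exp instruction.
        num = 1.0 + x + x * x * 0.5 + x * x * x * (1.0 / 6.0)
        num = tl.where(mask, num, 0.0)  # zero out padding lanes before the sum
        denom = tl.sum(num, axis=0)
        tl.store(out_ptr + row * n_cols + offs, num / denom, mask=mask)

    x = torch.randn(8, 128, device="cuda")
    out = torch.empty_like(x)
    taylor_softmax_kernel[(x.shape[0],)](
        x, out, x.shape[1], BLOCK_SIZE=triton.next_power_of_2(x.shape[1]))

The polynomial itself is just fused multiply-adds, so in principle it should be cheap; whether that actually beats the hardware exp in a real kernel is exactly what I'm unsure about.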


Replies

slashdave · today at 6:22 PM

If the model learns by using the approximate softmax, then why does it matter? We only need the behavior of softmax, not an exact numerical solution.
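
E.g., autograd treats it like any other differentiable op (toy PyTorch sketch, same truncated-series assumption as above):

    import math
    import torch

    def taylor_softmax(x, terms=4):
        # Truncated series in place of exp; gradients flow through the
        # polynomial just as they would through the exact softmax.
        num = sum(x**k / math.factorial(k) for k in range(terms))
        return num / num.sum(dim=-1, keepdim=True)

    logits = torch.randn(2, 5, requires_grad=True)
    weights = torch.randn(2, 5)
    loss = (taylor_softmax(logits) * weights).sum()
    loss.backward()
    print(logits.grad)  # well-defined gradients, no exact softmax needed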
