Send patches! But remember that many speedups end being not exactly correct and the logits drift. But there is extensive testing and even ds4-eval now to test how it performs.
Hah, it's quick hacks for me to understand CUDA better, I'm unlikely to have time to make them proper enough :( But maybe opening an issue talking about what I tried and what worked, makes sense.
I did confirm no logits drift, as you so nicely have provided tooling for ensuring exactly this, thanks for the great care that obviously gone into the project, been a pleasure to play around with! :)
Hah, it's quick hacks for me to understand CUDA better, I'm unlikely to have time to make them proper enough :( But maybe opening an issue talking about what I tried and what worked, makes sense.
I did confirm no logits drift, as you so nicely have provided tooling for ensuring exactly this, thanks for the great care that obviously gone into the project, been a pleasure to play around with! :)