MarsIronPI • today at 2:23 AM • 1 reply

It's because they're natively trained at 1 bit, so it's not losing anything. Now, the question is how they manage to get decent predictive performance with so little precision. That I don't know.

Replies

syntaxpr • today at 4:50 AM

It's not training. They transpose rows/columns of the weight matrices to group 128 parameters with a similar (shared) scale factor. The model is Qwen-3.
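
To make the "128 parameters with a shared scale factor" idea concrete, here is a rough sketch of group-wise 1-bit quantization. It is not the actual method used for the model discussed here: the mean-absolute-value scale, the flat weight layout, and the names `quantize_groups` and `dequantize_groups` are all assumptions for illustration, and the row/column regrouping step is not shown.

```python
import random

GROUP_SIZE = 128  # group size mentioned in the thread


def quantize_groups(weights, group_size=GROUP_SIZE):
    """Split a flat weight list into groups; keep one sign bit per weight
    and one shared scale per group.

    Assumption: the scale is the group's mean absolute value. The thread
    does not say how the real scale factor is chosen.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    signs, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scales.append(sum(abs(w) for w in group) / group_size)
        signs.append([1 if w >= 0 else -1 for w in group])
    return signs, scales


def dequantize_groups(signs, scales):
    """Rebuild approximate weights: each sign times its group's scale."""
    return [s * scale for group, scale in zip(signs, scales) for s in group]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical weights, drawn from a small Gaussian for the demo.
    weights = [rng.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]
    signs, scales = quantize_groups(weights)
    approx = dequantize_groups(signs, scales)
    err = sum((w - a) ** 2 for w, a in zip(weights, approx)) / len(weights)
    print(f"{len(scales)} groups, mean squared error {err:.6f}")
```

The point of the sketch is that the reconstruction error depends on how similar the magnitudes inside each group are, which is presumably why grouping parameters with a similar scale matters.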
