Hacker News

dollo_7 · today at 5:35 AM · 2 replies

Not sure if I buy it. First, computing the SVD to obtain U, Σ, V is computationally expensive, so it would only be practical if we are not finetuning very big models.
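For context on that cost claim, a full SVD of an m×n matrix scales roughly as O(min(m, n) · m · n), which is why doing it per weight matrix adds up for large models. A minimal NumPy sketch (the matrix size here is illustrative, not taken from the paper):

```python
import numpy as np

# Illustrative weight matrix; real LLM layers are this size or larger.
m, n = 1024, 768
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n)).astype(np.float32)

# Full (thin) SVD: W = U @ diag(S) @ Vt.
# Cost is roughly O(min(m, n) * m * n) and grows fast with layer size.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Sanity check: the factors reconstruct W up to float32 error.
W_hat = (U * S) @ Vt
print(np.allclose(W, W_hat, atol=1e-2))
```

For a 4096×4096 transformer layer this is already a noticeably heavy operation, and a model has hundreds of such matrices.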

But my real concern is the results. The "13 parameters" looks like bait, because it is one result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated on every model. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of the training set of the Qwen models, and this tiny LoRA finetuning just made the last adjustments to fully surface that overtraining.


Replies

sachaa · today at 6:01 AM

Fair points, especially on GSM8K saturation and Qwen possibly already sitting close to the solution. That said, even if this is mostly "last-mile alignment", the fact that it can be done with such a tiny signal is still interesting: it suggests the gap between capability and behavior might be much smaller (and cheaper to bridge) than we assume.

robrenaud · today at 5:57 AM

Yeah, my big problem with the paper is that it might just be an artifact of Qwen's training process.