
andai, today at 2:27 PM

I was reminded of "model alloys", where they randomly select an LLM for every agentic turn. This significantly boosted performance on security work.

(10 points on the benchmark, or a relative increase of over 20%)

https://news.ycombinator.com/item?id=44630724
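For anyone who hasn't seen the idea, here's a rough sketch of what "pick a random model each turn" could look like. This is my own toy version, not the code from the linked post: `run_alloy`, `model_a`, `model_b` and the "DONE" stop marker are all made-up names, and the "models" are stand-in functions rather than real API calls.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for an LLM call: takes the shared history,
# returns the next message. A real agent would call an API here.
Model = Callable[[List[str]], str]


def run_alloy(
    models: Sequence[Model],
    task: str,
    max_turns: int = 5,
    seed: Optional[int] = None,
) -> List[str]:
    """Run an agent loop, drawing a model uniformly at random each turn.

    All models read and append to the same history, so each one
    continues from whatever the previous one wrote.
    """
    rng = random.Random(seed)
    history = [task]
    for _ in range(max_turns):
        model = rng.choice(models)  # the "alloy" step
        reply = model(history)
        history.append(reply)
        if reply.endswith("DONE"):  # assumed stop marker, purely illustrative
            break
    return history


# Two dummy models so the sketch runs without any API keys.
def model_a(history: List[str]) -> str:
    return f"A saw {len(history)} messages"


def model_b(history: List[str]) -> str:
    return f"B saw {len(history)} messages"


if __name__ == "__main__":
    for line in run_alloy([model_a, model_b], "find the bug", max_turns=4, seed=0):
        print(line)
```

The only part that matters is that the history is shared: neither model is told that a different model wrote the previous turn.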

TFA, on the other hand, tests two things at once: mixing models, and "fusing a model with itself", the latter being just test-time compute. E.g. Opus was able to match Fable in TFA, at twice the cost (and presumably time).

These two dimensions are orthogonal but can be combined for further gains.

It's not clear that every task benefits from it, though. They only benchmarked deep research, and their results are a bit weird (e.g. they have DeepSeek outranking frontier models).

More research needed!