Hacker News

Someone, last Thursday at 10:17 PM

FTA: In our "Mobile Actions" evaluation, fine-tuning transformed the model’s reliability, boosting accuracy from a 58% baseline to 85%. This confirms that for edge agents, a dedicated, trained specialist is an efficient path to production-grade performance.

I would be wary of having an LLM with 85% accuracy call tools on my system. Isn’t that fairly far from production-grade performance?

I also don’t see how the fact that accuracy can be boosted from 58% to 85% indicates that it can be boosted further.


Replies

all2, last Thursday at 11:50 PM

There are ways around this. You can push the success rate close to 100% if you use chain of thought and quorum selection. It isn't great, and it slows response times, but if 85% isn't good enough, you just need to flip the coin about 5 times to get nearly(!) guaranteed results.
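
For a rough sense of what "about 5 times" buys, here is a minimal sketch of the majority-vote arithmetic. It assumes each attempt is correct independently with probability 0.85 and that the quorum is a strict majority. Independence is a strong assumption: repeated samples from the same model tend to make correlated mistakes, so treat the result as an upper bound. The function name `majority_vote_accuracy` is made up for illustration.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n attempts is correct.

    Assumes every attempt succeeds independently with probability p
    (an idealization; real model errors are usually correlated).
    """
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )

if __name__ == "__main__":
    # Compare a single call against 3-way and 5-way voting.
    for n in (1, 3, 5):
        print(n, round(majority_vote_accuracy(0.85, n), 4))
```

Under those assumptions, five votes come out at roughly 97.3%, at five times the inference cost. That is better than 85%, but still short of a guarantee.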
