He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?
It makes sense to me intuitively (though I'm not sure if my reasoning is actually correct).
Worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that it's output has high variance. But a better model might "know" enough, so it can be more confident and thus more consistent.