Not the parent poster, but I did get the wrong answer even with reasoning turned on.

gf000 • today at 7:46 AM

Thank you all! We needed further data points.

Comparing one-shot results is a foolish way to evaluate a statistical process like LLM answers. We need multiple samples per query.

For https://generative-ai.review I take at least three samples of output per query. This often yields very different results, even from the same query.
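
To make the point concrete, here is a rough sketch of what repeated sampling could look like. It is not how generative-ai.review actually works; `ask_model`, `sample_answers`, `accuracy`, and `fake_model` are hypothetical names, and `fake_model` is just a random stand-in for a real LLM call so the snippet runs on its own.

```python
import random
from collections import Counter
from typing import Callable


def sample_answers(ask_model: Callable[[str], str], prompt: str, n: int = 3) -> Counter:
    """Send the same prompt n times and tally the distinct (stripped) answers."""
    return Counter(ask_model(prompt).strip() for _ in range(n))


def accuracy(tally: Counter, expected: str) -> float:
    """Fraction of samples that exactly match the expected answer."""
    total = sum(tally.values())
    return tally[expected] / total if total else 0.0


# Hypothetical stand-in for an LLM: answers "right" roughly 70% of the time.
_rng = random.Random(0)


def fake_model(prompt: str) -> str:
    return "right" if _rng.random() < 0.7 else "wrong"


tally = sample_answers(fake_model, "some question", n=20)
print(tally)
print(f"share correct: {accuracy(tally, 'right'):.2f}")
```

A single call to `fake_model` can come back "right" or "wrong" on the same prompt, which is why one-shot comparisons are misleading; the tally over many samples is the thing worth comparing.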
