logoalt Hacker News

gbalduzzitoday at 6:25 AM0 repliesview on HN

Following the original comment concepts, if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We should create different prompts for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve