> Run the same agent n times to increase success rate.
Are there benchmarks out there that back this claim?
Yes, this is the pass@k metric from code generation research. Found the relevant paper Evaluating Large Language Models Trained on Code (Chen et al., 2021) which introduced the metric.
[dead]
Yes, this is the pass@k metric from code generation research. Found the relevant paper Evaluating Large Language Models Trained on Code (Chen et al., 2021) which introduced the metric.