no_multitudes • last Wednesday at 9:12 PM

Are there examples of the outputs the LLMs under test generated? I couldn't find any detailed ones in the paper or code.

The result here seems to be "Our Judge LLM gave another LLM a 21% grade for some code it generated", which is ... not qualitatively meaningful at all to me.
