
lelanthran, today at 8:21 AM

This sort of thing could be useful to get an idea of how good a specific AI is: start a thread with a specific SOTA AI, get it to argue with another specific AI (maybe a non-SOTA one, maybe you want to test your local setup), and let them go one-on-one for a limited duration (measured in message count).
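A rough sketch of that exchange loop, in Python. Everything here is an assumption for illustration: `run_debate`, the `Model` type, and the two stub functions are hypothetical stand-ins, and a real setup would replace the stubs with calls to whatever chat API each model exposes.

```python
from typing import Callable, List, Tuple

# Hypothetical: a "model" is any function that reads the transcript so far
# and returns its next message. Real models would call a chat API here.
Transcript = List[Tuple[str, str]]
Model = Callable[[Transcript], str]


def run_debate(sota: Model, test: Model, opening: str, max_messages: int) -> Transcript:
    """Alternate turns between the two models until the message cap is hit."""
    # The opening message counts toward the cap.
    transcript: Transcript = [("SOTA_AI", opening)]
    speakers = [("TEST_AI", test), ("SOTA_AI", sota)]
    turn = 0
    while len(transcript) < max_messages:
        name, model = speakers[turn % 2]
        transcript.append((name, model(transcript)))
        turn += 1
    return transcript


# Stub models so the sketch runs without any external service.
def stub_sota(transcript: Transcript) -> str:
    return f"SOTA reply to message {len(transcript)}"


def stub_test(transcript: Transcript) -> str:
    return f"TEST reply to message {len(transcript)}"


transcript = run_debate(stub_sota, stub_test, "Opening claim", max_messages=6)
for speaker, message in transcript:
    print(f"{speaker}: {message}")
```

Capping on message count rather than wall-clock time keeps runs comparable across a fast hosted model and a slow local one.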

Then get all the other SOTA AIs to evaluate every point in the exchange and determine a winner by percentage: add a % to $TEST_AI if it gets agreement from $SOTA_AI on a specific point it made, subtract a % if it loses a point and doesn't realize it, subtract a smaller % if it concedes a point, etc.

The %-delta between $SOTA_AI and $TEST_AI is probably a better measure of an AI chatbot's effectiveness than logic tests.
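To make the scoring concrete, here is a minimal Python sketch. The weights, the `Verdict` record, the outcome labels, and the choice to average across judges are all assumptions made up for illustration; the only thing taken from the idea above is the ordering (agreement adds, conceding subtracts a little, losing a point without noticing subtracts more).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Hypothetical weights, in percentage points per judged point.
WEIGHTS = {
    "agreed": 5.0,                # the opponent agreed with this point
    "conceded": -2.0,             # the speaker conceded the point
    "lost_unacknowledged": -5.0,  # the speaker lost the point without noticing
}


@dataclass
class Verdict:
    judge: str    # which judging AI issued this verdict
    speaker: str  # "SOTA_AI" or "TEST_AI"
    outcome: str  # one of the keys in WEIGHTS


def score(verdicts: List[Verdict], speaker: str) -> float:
    """Sum the weights per judge, then average across judges (an assumption)."""
    per_judge: Dict[str, float] = defaultdict(float)
    for v in verdicts:
        # Touch every judge so one with no verdicts for this speaker counts as 0.
        per_judge[v.judge] += WEIGHTS[v.outcome] if v.speaker == speaker else 0.0
    return mean(list(per_judge.values()))


def pct_delta(verdicts: List[Verdict]) -> float:
    """Positive means the judges favoured SOTA_AI over TEST_AI."""
    return score(verdicts, "SOTA_AI") - score(verdicts, "TEST_AI")


# Made-up verdicts from two hypothetical judges.
verdicts = [
    Verdict("judge_a", "TEST_AI", "agreed"),
    Verdict("judge_a", "TEST_AI", "conceded"),
    Verdict("judge_a", "SOTA_AI", "agreed"),
    Verdict("judge_b", "TEST_AI", "lost_unacknowledged"),
    Verdict("judge_b", "SOTA_AI", "agreed"),
]

print(score(verdicts, "TEST_AI"), score(verdicts, "SOTA_AI"), pct_delta(verdicts))
```

Note that `score` raises `statistics.StatisticsError` on an empty verdict list, which seems a reasonable way to flag a run where no judge returned anything.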

Don't think it will work for code or similar, though.