XCSme • today at 5:08 PM • 0 replies

I just started creating my own benchmarks: questions that are very simple for humans but tricky for AI, in the spirit of "how many r's are in strawberry?" (still a WIP).

Qwen3.5 is doing ok on my limited tests: https://aibenchy.com
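As a rough illustration of how a letter-counting question like this can be graded automatically, here is a minimal Python sketch. It is not how aibenchy.com scores answers (the comment doesn't say). `letter_count` and `grade` are hypothetical helpers, and treating the first integer in the model's reply as its answer is an assumption.

```python
import re


def letter_count(word: str, letter: str) -> int:
    """Ground truth for a 'how many <letter>'s in <word>?' question (case-insensitive)."""
    return word.lower().count(letter.lower())


def grade(model_answer: str, word: str, letter: str) -> bool:
    """Hypothetical grader: pass only if the first integer in the reply equals the true count."""
    match = re.search(r"\d+", model_answer)
    return match is not None and int(match.group()) == letter_count(word, letter)


print(letter_count("strawberry", "r"))  # 3
print(grade("There are 2 r's in strawberry.", "strawberry", "r"))  # False
print(grade("3", "strawberry", "r"))  # True
```

Computing the expected answer in code rather than hard-coding it makes it easy to generate many variants (other words, other letters) from the same template.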
