Which eval/benchmark is the best measure of how well a model can create frontend designs? Claude has practically been leading on this for a while now. Not sure how OpenAI is going to catch up on visual design.