As soon as they're publicly usable, people benchmark them carefully. All currently available models have clear, published metrics.