> I believe eval startups can work when they're targeting safety benchmarks specifically.
Are there any examples of successful startups doing this?