My favorite Google LLM benchmark is asking Gemini models to create a script that fetches API usage (just request counts) for a project from GCP.
100% failure rate.
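
For reference, here is a rough sketch of what the task seems to require. It assumes the request counts come from Cloud Monitoring's `serviceruntime.googleapis.com/api/request_count` metric, queried through the `projects.timeSeries.list` REST method. The helper name `build_request_count_url` and the project ID `my-project` are hypothetical, and the sketch only builds the query URL. Sending it still needs an OAuth access token in an `Authorization: Bearer` header, which is left out here.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: per-API request counts are exposed through this
# Cloud Monitoring metric (a DELTA metric on the consumed_api resource).
METRIC_TYPE = "serviceruntime.googleapis.com/api/request_count"


def build_request_count_url(project_id, end, days=1):
    """Build a timeSeries.list URL that sums request counts per service.

    `end` is expected to be a timezone-aware UTC datetime.
    Hypothetical helper for illustration, not part of any Google library.
    """
    start = end - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC
    params = [
        ("filter", f'metric.type = "{METRIC_TYPE}"'),
        ("interval.startTime", start.strftime(fmt)),
        ("interval.endTime", end.strftime(fmt)),
        # One aligned bucket covering the whole window.
        ("aggregation.alignmentPeriod", f"{days * 86400}s"),
        ("aggregation.perSeriesAligner", "ALIGN_SUM"),
        # Collapse methods, versions, etc. into one total per service.
        ("aggregation.crossSeriesReducer", "REDUCE_SUM"),
        ("aggregation.groupByFields", "resource.labels.service"),
    ]
    base = f"https://monitoring.googleapis.com/v3/projects/{project_id}/timeSeries"
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(build_request_count_url("my-project", datetime.now(timezone.utc)))
```

The returned URL would be sent as an authenticated GET. Each time series in the response should then carry one summed point per service, assuming the metric and grouping above are the right ones for the project.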