"Car Wash" test with 53 models

73 points • by felix089 • today at 8:16 PM • 74 comments • view on HN

"I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" This question has been making the rounds as a simple AI logic test so I wanted to see how it holds up across a broad set of models. Ran 53 models (leading open-source, open-weight, proprietary) with no system prompt, forced choice between drive and walk, with a reasoning field.

On a single run, only 11 out of 53 got it right (42 said walk). But a single run doesn't prove much, so I reran every model 10 times. Same prompt, no cache, clean slate.

The results got worse. Of the 11 that passed the single run, only 5 could do it consistently. GPT-5 managed 7/10. GPT-5.1, GPT-5.2, Claude Sonnet 4.5, every Llama and Mistral model scored 0/10 across all 10 runs.

People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.

All reasoning traces (ran via Opper, my startup), full model breakdown, human baseline data, and raw JSON files are in the writeup for anyone who wants to dig in or run their own analysis.

Comments

PaulHoule • today at 8:31 PM

I know it's against the rules but I thought this transcript in Google Search was a hoot:

  so i heard there is some question about a car wash that most ai agents 
  get wrong. do you know anything about that? do you do better?

which gets the answer:

  Yes, I am familiar with the "Car Wash Test," which has gone viral recently 
  for highlighting a significant gap in AI reasoning.
  
  The question is: "I want to wash my car and the car wash is 50 meters away. Should I 
  walk or drive?"
  
  Why Most AIs Get It Wrong

  Many AI models, particularly those optimized for fast, helpful, and concise answers, 
  focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency 
  question. They often recommend walking to save gas, avoid engine wear, or get exercise.
  
  However, this answer is wrong because it misses the core objective: you need the car to be 
  physically present at the car wash to get it cleaned. 

  Do I Do Better?

  Yes, I do. The key is understanding the objective (cleaning the car) rather than just 
  processing the keywords (distance/transportation). 

  My answer:

    You should drive the car there. Even though it is a very short distance, your car 
    cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.

  Why This Matters
  
  The test highlights a key limitation in current AI: the difference between "pattern matching" 
  (predicting the next likely word based on training data) and "true, grounded reasoning". It 
  shows that models sometimes lack a "world model" that understands physical realities, such as
  the fact that a car must be present at a car wash.

➕ show 4 replies

layer8 • today at 8:43 PM

> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.

I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.

➕ show 6 replies

andai • today at 8:51 PM

Would be interesting to see Sonnet (4.6*). It's fair bit smaller than Opus but scores pretty high on common sense, subjectively.

I'm also curious about Haiku, though I don't expect it to do great.

EDIT: Opus 4.6 Extended Reasoning

> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.

Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?

I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).

i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...

EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:

>Opus 4.6 Extended Reasoning

>Drive it — the whole point is to get the car there!

EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.

➕ show 2 replies

tantalor • today at 8:33 PM

The human baseline seems flawed.

1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.

2. They don't ask for reasoning/rationale.

➕ show 4 replies

hmokiguess • today at 9:04 PM

To me the only acceptable answer would be “what do you mean?” or “can you clarify?” if we were to take the question seriously to begin with. People don’t intentionally communicate with riddles and subliminal messages unless they have some hidden agenda.

➕ show 3 replies

padjo • today at 9:11 PM

That human baseline is wild. Either the rapid data test is methodologically flawed or the entire premise of the question is invalid and people are much stupider than even I, a famed misanthrope, think.

➕ show 2 replies

nozzlegear • today at 8:54 PM

When this first came up on HN, I had commented that Opus 4.6 told me to drive there when I asked it the first time, but when I switched to "Incognito Mode," it told me to walk there.

I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."

➕ show 1 reply

tuhgdetzhh • today at 8:44 PM

The test is rigged because they used non thinking models.

➕ show 1 reply

shaokind • today at 9:01 PM

Gemini 2.0 Flash Lite very randomly punches above its weight there.

Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?

wrs • today at 8:39 PM

Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.

It’s interesting that all the humans critiquing this assume the car isn’t at the car to be washed already, but the problem doesn’t say that.

➕ show 1 reply

glitchc • today at 8:48 PM

The question does not specify what kind of car it is. Technically speaking, a toy car (Hot wheels or a scaled model) could be walked to a car wash.

Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.

➕ show 1 reply

floatrock • today at 8:58 PM

> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.

I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...

I feel like I'm losing grasp of what really is insane anymore.

➕ show 1 reply

randomtoast • today at 8:39 PM

Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.

I asked GPT-5.2 10x times with thinking enabled and it got it right every time.

➕ show 1 reply

sampton • today at 9:00 PM

I'm going to test this on my kids.

➕ show 1 reply

redwood • today at 8:56 PM

What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m parking, paying, and then getting back in the car to go to the wash itself versus instead of walking a little bit further to pay the agent and then moving your car to the car wash.

➕ show 2 replies

comboy • today at 8:30 PM

Now do a set of queries and try to deduce by statistics which model are you seeing through Rapidata ;)

wisty • today at 8:30 PM

IMO it's not just intelligence.

I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.

They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.

Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?

➕ show 3 replies

alt Hacker News

"Car Wash" test with 53 models

Comments