logoalt Hacker News

atq2119today at 2:45 PM0 repliesview on HN

We're all groping around in the dark here, but something that could have happened is a tokenizer artifact.

The vocabularies I've seen tend to prefer tokens that start with a space. It feels somewhat plausible to me that an LLM sampling would "accidentally" pick the " Hello" token over the "Hello" token, leading to D:\ Hello in the command. And then that gets parsed as deleting the drive.

I've seen similar issues in GitHub Copilot where it tried to generate field accessors and ended up producing an unidiomatic "base.foo. bar" with an extra space in there.