
embedding-shape • yesterday at 7:54 PM

> Certainly not from interpretability research

What research shows that you can ask ChatGPT to explain its reasoning and why it said what it said, and that's guaranteed to actually be the motivation?

I've seen a bunch of experimentation looking at various things inside the black box while inference is happening, but I've never seen any research showing that tokens can explain why other tokens are there. I'd be very happy to be educated here if you have any resources at hand; I won't claim to know everything.

Replies

famouswaffles • yesterday at 8:11 PM

> What research shows that you can ask ChatGPT to explain its reasoning and why it said what it said, and that's guaranteed to actually be the motivation?

What research shows that you can ask a human to explain its reasoning and why it said what it said, and that's guaranteed to actually be the motivation? Because there's no such thing. If anything, the research that exists suggests that any explanation we give is a nice post-hoc rationalization, even if the human thinks otherwise.

https://transformer-circuits.pub/2025/introspection/index.ht...
