Hacker News

pixl97 01/20/2025

>if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email.

Did o1 actually do this in output that was hidden from the user?

At least in my mind, if you want to keep an AI from showing harmful output to users, this seems like a necessary step.

Also, if you have other user context stored, this seems like a way to pick that up and reason over it to produce a more useful answer.

Now for summarizing email itself it seems a bit more like a waste of compute, but in more advanced queries it's possibly useful.


Replies

ozgune 01/20/2025

Yes, o1 hid its reasoning. Still, it provided a summary of its reasoning steps. In the email case, o1 thought for six seconds, summarized its thinking as "summarizing the email", and then provided the answer.

We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]

[4] https://github.com/ubicloud/ubicloud/discussions/2608#discus...

[5] https://github.com/ubicloud/ubicloud/discussions/2608#discus...
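For readers who want a concrete picture of that CSV prompt, here is a minimal sketch of the kind of function it asks for. This is not o1's output (that is in the linked discussion). The names `infer_type`, `load_csv_text`, and `csv_url_to_sqlite` are made up for illustration, and it assumes a UTF-8 CSV with a header row and picks column types by trying INTEGER, then REAL, then TEXT.

```python
import csv
import io
import sqlite3
import urllib.request


def infer_type(values):
    """Pick INTEGER, REAL, or TEXT for a column from its non-empty values."""
    present = [v for v in values if v is not None]
    if not present:
        return "TEXT"
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in present:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def quote_ident(name):
    """Quote a table or column name so odd header text is safe in SQL."""
    return '"' + name.replace('"', '""') + '"'


def load_csv_text(conn, table, text):
    """Create `table` from CSV text (first row is the header) and insert the rows."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]  # skip blank lines
    if not rows:
        raise ValueError("CSV has no header row")
    header = rows[0]
    width = len(header)
    # Pad or trim each row to the header width; store empty cells as NULL.
    data = [
        [v if v != "" else None for v in (r + [""] * width)[:width]]
        for r in rows[1:]
    ]
    types = [infer_type([r[i] for r in data]) for i in range(width)]
    columns = ", ".join(f"{quote_ident(h)} {t}" for h, t in zip(header, types))
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})", data
    )
    conn.commit()
    return len(data)


def csv_url_to_sqlite(url, db_path, table):
    """Download a CSV from `url` and load it into `table` in the database at `db_path`."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")  # assumption: the file is UTF-8
    conn = sqlite3.connect(db_path)
    try:
        return load_csv_text(conn, table, text)
    finally:
        conn.close()
```

Splitting the download from the load keeps the table-building part usable on CSV text from any source. The values are inserted as strings and SQLite's column affinity converts them to the declared INTEGER or REAL type.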

coffeebeqn 01/20/2025

I think o1 does do that. It once spat out the name of the expert model for programming in its “inner monologue” when I used it. Click on the grey “Thought about X for Y seconds” and you can see the internal monologue.

whywhywhywhy 01/21/2025

>Now for summarizing email itself it seems a bit more like a waste of compute

This is the line of thinking that led to 4o being embarrassingly unable to do simple tasks. The second you fall into the level of task OpenAI doesn’t consider “worth the compute cost”, you get to see it fumble about trying to do the task with poorly written Python code, and suddenly it can’t even do basic things like correctly count items in a list, which OG GPT-4 would get right in a second.