logoalt Hacker News

grey-areatoday at 10:55 AM3 repliesview on HN

Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.

Given this detection method works so well in the use case of feeding reviewing LLMs instructions, it should also work for the original submitted paper itself, as long as it was passed along with its watermark intact. Even those just using LLMs to summarise could easily be affected if LLMs were instructed to generate very positive summaries.

So the 2% cheaters on policy A, AND 100% of policy B reviewers could fall for this and be subtly guided by the LLMs overly-positive summaries or even complete very positive reviews (based on hidden instructions).

That this sort of adversarial attack works is really quite troubling for those using LLMs to help them understand texts, because it would work even if asked to summarise something.


Replies

Tade0today at 11:09 AM

> Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.

I may or may not know a guy who added several hidden sentences in Finnish to his CV that might have helped him in landing an interview.

show 1 reply
bjournetoday at 12:10 PM

> Interesting, so someone submitting a paper for review could also submit one with hidden instructions for LLMs to summarise or review it in a very positive light.

Has been done: https://www.theguardian.com/technology/2025/jul/14/scientist...

show 1 reply
wood_spirittoday at 11:01 AM

Then these papers with these instructions get included in the training corpus for the next frontier models and those models learn to put these kinds of instructions into what they generate and …?