Hacker News

ck_one yesterday at 10:44 PM | 6 replies

It didn't use web search, but it certainly has some internal knowledge already. It's not a perfect needle-in-the-haystack problem, but Gemini Flash was much worse when I tested it last time.


Replies

viraptor yesterday at 10:57 PM

If you want to really test this, search/replace the names with your own random ones and see if it lists those.

Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
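A minimal sketch of that name-swap idea in Python (the file name and character names are placeholders, and the random-name generator is just one way to do it):

    import random
    import re

    def swap_names(text, names, seed=None):
        # Map each known name to a freshly invented one so the model
        # can't fall back on memorised passages.
        rng = random.Random(seed)
        consonants, vowels = 'bcdfghjklmnprstvz', 'aeiou'
        mapping = {
            name: ''.join(rng.choice(consonants if i % 2 == 0 else vowels)
                          for i in range(6)).capitalize()
            for name in names
        }
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, names)) + r')\b')
        return mapping, pattern.sub(lambda m: mapping[m.group(0)], text)

    # Placeholder input; keep the mapping so you can grade the answers.
    mapping, scrubbed = swap_names(open('chapter.txt').read(),
                                   ['Harry', 'Hermione', 'Ron'], seed=1)
    print(mapping)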

joshmlewis yesterday at 10:57 PM

I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.

obirunda today at 1:17 AM

This underestimates how much of the Internet is compressed into, and is now an integral part of, the model's weights. Gemini 2.5 can recite over 75% of the first Harry Potter book verbatim.

Trasmatta today at 12:54 AM

Do the same experiment in the Claude web UI with web search explicitly turned off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.

soulofmischief yesterday at 11:23 PM

The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of being harmful.

eek2121 yesterday at 10:58 PM

Honestly? My advice would be to cook up something custom. You don't need to write all the text yourself: have an AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here and there.

Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes the sentences, and outputs them in a random order with the secrets mixed in. Kind of like a "Where's Waldo?", but for text.
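For what it's worth, a rough sketch of what that script could look like in Python (the input file and the secret phrases here are made up):

    import random
    import re

    def build_haystack(text, secrets, seed=None):
        # Shuffle the sentences of the source text, then bury each
        # secret phrase at a random position.
        rng = random.Random(seed)
        # Naive sentence split; fine for a throwaway benchmark.
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text)
                     if s.strip()]
        rng.shuffle(sentences)
        for secret in secrets:
            sentences.insert(rng.randrange(len(sentences) + 1), secret)
        return ' '.join(sentences)

    # Placeholder corpus and needles:
    corpus = open('obscure_text.txt').read()
    needles = ['The magic word is kumquat.', 'Agent 7 meets at the old pier.']
    print(build_haystack(corpus, needles, seed=42))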

Just a few casual thoughts.

I'm actually thinking about coming up with some interesting coding exercises that I can run across all the models. I know we already have benchmarks, but some of the recent work I've done has exposed huge weak points in every model I've run it on.
