Hacker News

Search tool that only returns content created before ChatGPT's public release

661 points by dmitrygr today at 4:06 AM | 259 comments

Comments

shevy-java today at 9:28 AM

> This is a search tool that will only return content created before ChatGPT's first public release on November 30, 2022.

The problem is that Google's search engine (and, oddly enough, ALL search engines) had already gotten worse before that. I noticed search quality declining several years before 2022. So AI decreased the quality further, but it was already on a downward trend. There are some attempts to analyse this on YouTube (also owned by Google - Google ruins our digital world); some explanations made sense to me, but even then I am not 100% certain why Google decided to ruin Google search.

One key observation I made was that YouTube's search behaviour was copied onto Google's regular search, which makes no sense for Google search. If I casually search for a video on YouTube, I may be semi-interested in unrelated videos. But if I search on Google for specific terms, I am not interested in crap such as "others also searched for xyz" - that just ruins the UI with irrelevant information. This is not the only example; Google made the search results worse here and tries to confuse the user into clicking on things. Plus the placement of ads. The quality really worsened.

audiala today at 4:02 PM

It doesn't really work. I tried my website and it shows up, even though it was definitely built after 2023. There is a mistake in the page's metadata that dates it to 2011.

https://audiala.com/changelog

swyx today at 5:02 AM

somebody once said we are mining "low-background tokens" like we are mining low-background (radiation) steel post WW2 and i couldn't shake the concept out of my head

(wrote up in https://www.latent.space/i/139368545/the-concept-of-low-back... - but ironically repeating something somebody else said online is kinda what i'm willingly participating in, and it's unclear why human-origin tokens should be that much higher signal than ai-origin ones)

keiferski today at 10:48 AM

Projects like this remind me of a plot point in the Cyberpunk 2077 game universe. The "first internet" got too infected with dangerous AIs, so much so that a massive firewall needed to be built, and a "new" internet was built that specifically kept out the harmful AIs.

(Or something like that: it's been a while since I played the game, and I don't remember the specific details of the story.)

It makes me wonder if a new human-only internet will need to be made at some point. It's mostly sci-fi speculation at this point, and you'd really need to hash out the details, but I am thinking of something like a meatspace-first network that continually verifies your humanity in order for you to retain access. That doesn't solve the copy-paste problem, or a thousand other ones, but I'm just thinking out loud here.

tkgally today at 5:24 AM

Somewhat related, the leaderboard of em-dash users on HN before ChatGPT:

https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...

softwaredoug today at 1:16 PM

The other day I was researching with ChatGPT.

* ChatGPT hallucinated an answer

* ChatGPT put it in my memory, so it persisted between conversations

* When asked for a citation, ChatGPT found 2 AI created articles to back itself up

It took a while, but I eventually found human written documentation from the organization that created the technical thingy I was investigating.

This happens A LOT for topics at the edge of what is easily found on the Web, where you have to do true research, evaluate sources, and make good decisions about what you trust.

permo-w today at 5:55 AM

besides training future models, is this really such a big deal? most of the AI-gened text content is just replacing content-farm SEO-spam anyway. the same stuff that any half-aware person wouldn't have read in the past is now slightly better written, using more em dashes and instances of the word "delve". if you're consistently being caught out by this stuff then likely you need to improve your search hygiene, nothing so drastic as this

the only place I've ever had any issue with AI content is r/chess, where people love to ask ChatGPT a question and then post the answer as if they wrote it, half the time seemingly innocently, which, call me racist, but I suspect is mostly due to the influence of the large and young Indian contingent. otherwise I really don't understand where the issue lies. follow the exact same rules you do for avoiding SEO spam and you will be fine

themanmaran today at 5:18 AM

The low-background steel of the internet

https://en.wikipedia.org/wiki/Low-background_steel

tobr today at 6:01 AM

For images, https://same.energy is a nice option that, having been abandoned but still functioning for a few years, seems naturally not to have crawled any AI images. And it’s all around a great product.

Barathkanna today at 2:16 PM

I didn’t know “eccentric engineering” was even a term before reading this. It’s fascinating how much creativity went into solving problems before large models existed. There’s something refreshing about seeing humans brute force the weird edges of a system instead of outsourcing everything to an LLM.

It also makes me wonder how future kids will see this era. Maybe it will look the same way early mechanical computers look to us. A short period where people had to be unusually inquisitive just to make things work.

vertnerd today at 1:36 PM

Just the other evening, as my family argued about whether some fact was or was not fake, I detached from the conversation and began fantasizing about whether it was still possible to buy a paper encyclopedia.

zkmon today at 8:02 AM

Most college courses and school books haven't changed in decades. Some reputed colleges keep courses on Pascal and Fortran instead of Python or Java, just because changing might affect their reputation for being classical or pure, or to match the style of their campus buildings.

ricardo81 today at 6:59 AM

FWIW Mojeek (an organic search engine in the classic sense) can do this with the before: operator.

https://www.mojeek.com/search?q=britney+spears+before%3A2010...

dinkblam today at 9:24 AM

google results were already 90% SEO crap long before ChatGPT

just use Kagi and block all SEO sites...

GaryBluto today at 5:39 AM

Why use this when you can use the before: syntax on most search engines?

javaskrrt today at 3:35 PM

This is such a great idea

josephjrobison today at 3:12 PM

The real gold is content created before the internet!

anticensor today at 5:10 AM

You should call it Predecember, referring to the eternal December.

stopthe today at 1:46 PM

In hindsight, that would've been a real utility use case for NFTs. A decentralized cryptographic proof that some content existed in a particular form at a particular moment.
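
The idea in this comment does not depend on NFTs specifically: the core step is publishing a digest of the content somewhere whose timestamp is trusted. Below is a minimal sketch, assuming SHA-256 as the digest and leaving the trusted timestamping step (a blockchain, a timestamping service) out of scope. The names `content_fingerprint` and `matches` are hypothetical, not from any existing tool.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def matches(text: str, published_digest: str) -> bool:
    """Check whether text is byte-for-byte what was fingerprinted earlier."""
    return content_fingerprint(text) == published_digest


# Hypothetical usage: the digest would be published at time T; anyone can
# later confirm that a copy of the content is unchanged since then.
original = "My article, written by hand."
digest = content_fingerprint(original)
print(matches(original, digest))
print(matches(original + " (edited)", digest))
```

A digest only shows that the content has not changed since the digest was published. It says nothing about who or what wrote the content.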

lxgr today at 12:09 PM

> This is a search tool that will only return content created before ChatGPT's first public release on November 30, 2022.

How does it do that? At least Google seems to take website creation date metadata at face value.
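
Nothing in this thread says how the tool actually dates pages. The sketch below only illustrates the weakness raised here: if a filter trusts a page's self-reported date, a page with wrong metadata gets through. The function name `passes_cutoff` and the example dates are hypothetical.

```python
from datetime import date

# ChatGPT's first public release, per the tool description quoted above.
CUTOFF = date(2022, 11, 30)


def passes_cutoff(claimed_published: date, cutoff: date = CUTOFF) -> bool:
    """Return True when the page's self-reported date is before the cutoff."""
    return claimed_published < cutoff


# Illustrative dates only: a recently built page whose metadata claims 2011
# passes, because the filter never sees the real creation date.
print(passes_cutoff(date(2011, 1, 1)))
print(passes_cutoff(date(2023, 6, 1)))
```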

Roritharr today at 11:28 AM

I hope there's an uncensored version of the Internet Archive somewhere. I wish I could look at my website ca. 2001, but I think it got removed because of some fraudulent DMCA claim somewhere in the early 2010s.

1gn15 today at 4:55 AM

Does this filter out traditional SEO blogfarms?

defraudbah today at 8:09 AM

ChatGPT also only returns content created before ChatGPT's release, which is why I still have to google, damn it!

progman32 today at 5:41 AM

Not affiliated, but I've been using kagi's date range filter to similar effect. The difference in results for car maintenance subjects is astounding (and slightly infuriating).

RomanPushkin today at 7:49 AM

For that reason I do not update my Ruby book on LeanPub. I just know that one day people are gonna read it more, because human-written content will be gold.

dpedu today at 2:56 PM

I mean I get it, but it seems a bit silly. What's next - an image search engine that only returns images created before photoshop?

phplovesong today at 8:27 AM

The slop is getting worse: there is so much LLM-generated shit online that new models are now being trained on the slop. Slop training slop, and slop. We have come full circle in a matter of just a few years.

voiper1 today at 6:54 AM

Of course my first thought was: Let's use this as a tool for AI searches (when I don't need recent news).

erikpukinskis today at 12:35 PM

Interesting concept. As a side benefit this would allow you to make steady progress fighting SEO slop as well, since there can be no arms race if you are ignoring new content.

You could even add options for later cutoffs… for example, you could use today’s AIs to detect yesterday’s AI slop.

pknerd today at 7:15 AM

Something generated by humans does not mean high quality.

cryptozeus today at 7:26 AM

technically you can ask chatgpt to return the same result by asking it to filter by year

ETH_start today at 8:21 AM

I'm grateful that I published a large body of content pre-ChatGPT so that I have proof that I'm not completely inarticulate without AI.

johng today at 4:10 AM

I don't know how this works under the hood but it seems like no matter how it works, it could be gamed quite easily.

EGreg today at 9:55 AM

Can't we just append "before:2021-01-01" to Google?

I use this to find old news articles for instance.
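
A minimal sketch of the query described here, assuming the `before:` operator takes a YYYY-MM-DD date as written in this comment. The helper name `dated_query_url` is hypothetical, and the URL is the ordinary public search address rather than a documented API.

```python
from urllib.parse import urlencode


def dated_query_url(terms: str, before: str = "2022-11-30") -> str:
    """Build a Google search URL with a before: date restriction appended."""
    query = f"{terms} before:{before}"
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(dated_query_url("old news", "2021-01-01"))
# prints https://www.google.com/search?q=old+news+before%3A2021-01-01
```

Like any filter on reported dates, this depends on the search engine dating each page correctly.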

theodric today at 10:09 AM

This tool has no future. We have that in common with it, I fear.

What we really need to do is build an AI tool to filter out the AI automatically. Anybody want to help me found this company?
