It doesn't really work. I tried my website and it shows up, despite definitely having been built after 2023. A mistake in the page's metadata shows it as being from 2011.
somebody said once we are mining "low-background tokens" like we are mining low-background (radiation) steel post WW2 and i couldn't shake the concept out of my head
(wrote up in https://www.latent.space/i/139368545/the-concept-of-low-back... - but ironically repeating something somebody else said online is kinda what i'm willingly participating in, and it's unclear why human-origin tokens should be that much higher signal than ai-origin ones)
Projects like this remind me of a plot point in the Cyberpunk 2077 game universe. The "first internet" got too infected with dangerous AIs, so much so that a massive firewall needed to be built, and a "new" internet was built that specifically kept out the harmful AIs.
(Or something like that: it's been a while since I played the game, and I don't remember the specific details of the story.)
It makes me wonder if a new human-only internet will need to be made at some point. It's mostly sci-fi speculation at this point, and you'd really need to hash out the details, but I am thinking of something like a meatspace-first network that continually verifies your humanity in order for you to retain access. That doesn't solve the copy-paste problem, or a thousand other ones, but I'm just thinking out loud here.
Somewhat related, the leaderboard of em-dash users on HN before ChatGPT:
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
The other day I was researching with ChatGPT.
* ChatGPT hallucinated an answer
* ChatGPT put it in my memory, so it persisted between conversations
* When asked for a citation, ChatGPT found 2 AI created articles to back itself up
It took a while, but I eventually found human written documentation from the organization that created the technical thingy I was investigating.
This happens A LOT for topics at the edge of what is easily found on the Web, where you have to do true research, evaluate sources, and make good decisions about what you trust.
besides for training future models, is this really such a big deal? most of the AI-gened text content is just replacing content-farm SEO-spam anyway. the same stuff that any half-awares person wouldn't have read in the past is now slightly better written, using more em dashes and instances of the word "delve". if you're consistently being caught out by this stuff then likely you need to improve your search hygiene, nothing so drastic as this
the only place I've ever had any issue with AI content is r/chess, where people love to ask ChatGPT a question and then post the answer as if they wrote it, half the time seemingly innocently, which, call me racist, but I suspect is mostly due to the influence of the large and young Indian contingent. otherwise I really don't understand where the issue lies. follow the exact same rules you do for avoiding SEO spam and you will be fine
For images, https://same.energy is a nice option that, having been abandoned but still functioning for a few years, seems to naturally not have crawled any AI images. And it’s all around a great product.
I didn’t know “eccentric engineering” was even a term before reading this. It’s fascinating how much creativity went into solving problems before large models existed. There’s something refreshing about seeing humans brute force the weird edges of a system instead of outsourcing everything to an LLM.
It also makes me wonder how future kids will see this era. Maybe it will look the same way early mechanical computers look to us. A short period where people had to be unusually inquisitive just to make things work.
Just the other evening, as my family argued about whether some fact was or was not fake, I detached from the conversation and began fantasizing about whether it was still possible to buy a paper encyclopedia.
Most college courses and school books haven't changed in decades. Some reputed colleges keep courses on Pascal and Fortran instead of Python or Java, just because changing might affect their reputation for being classical or pure, or to match the style of their campus buildings.
FWIW Mojeek (an organic search engine in the classic sense) can do this with the before: operator.
https://www.mojeek.com/search?q=britney+spears+before%3A2010...
google results were already 90% SEO crap long before ChatGPT
just use Kagi and block all SEO sites...
Why use this when you can use the before: syntax on most search engines?
This is such a great idea
The real gold is content created before the internet!
You should call it Predecember, referring to the eternal December.
In hindsight, that would've been a real utility use case for NFTs. A decentralized cryptographic proof that some content existed in a particular form at a particular moment.
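A rough sketch of what such a proof would commit to, under my own assumptions rather than any existing NFT scheme: you hash the exact bytes of the content and anchor the digest somewhere timestamped. The helper names `content_fingerprint` and `matches` are hypothetical, and the anchoring step itself (chain, timestamping service, whatever) is left out.

```python
import hashlib


def content_fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the exact bytes of the content."""
    return hashlib.sha256(content).hexdigest()


def matches(content: bytes, anchored_digest: str) -> bool:
    """Check whether content hashes to a digest that was anchored earlier."""
    return content_fingerprint(content) == anchored_digest


if __name__ == "__main__":
    original = b"my blog post, exactly as published"
    digest = content_fingerprint(original)  # this is what would get anchored
    print(matches(original, digest))
    print(matches(original + b" (edited)", digest))
```

Note this only shows the content existed in that form by the time the digest was anchored. It says nothing about who or what wrote it.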
> This is a search tool that will only return content created before ChatGPT's first public release on November 30, 2022.
How does it do that? At least Google seems to take website creation date metadata at face value.
I hope there's an uncensored version of the Internet Archive somewhere, I wish I could look at my website ca. 2001, but I think it got removed because of some fraudulent DMCA claim somewhere in the early 2010s.
ChatGPT also returns content only created before ChatGPT release, which is why I still have to google damn it!
Not affiliated, but I've been using kagi's date range filter to similar effect. The difference in results for car maintenance subjects is astounding (and slightly infuriating).
For that purpose I do not update my book on LeanPub about Ruby. I just know one day people gonna read it more, because human-written content would be gold.
I mean I get it, but it seems a bit silly. What's next - an image search engine that only returns images created before photoshop?
The slop is getting worse, as there is so much llm generated shit online, now new models are getting trained on the slop. Slop training slop, and slop. We have gone full circle just in a matter of a few years.
Of course my first thought was: Let's use this as a tool for AI searches (when I don't need recent news).
Interesting concept. As a side benefit this would allow you to make steady progress fighting SEO slop as well, since there can be no arms race if you are ignoring new content.
You could even add options for later cutoffs… for example, you could use today’s AIs to detect yesterday’s AI slop.
technically you can ask chatgpt to return the same result by asking it to filter by year
I'm grateful that I published a large body of content pre-ChatGPT so that I have proof that I'm not completely inarticulate without AI.
I don't know how this works under the hood but it seems like no matter how it works, it could be gamed quite easily.
Can't we just append "before:2021-01-01" to Google?
I use this to find old news articles for instance.
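If you want to script that, here is a minimal sketch that builds the search URL with the `before:` operator appended to the query. The function name `search_url` and the example query are made up for illustration, and the base URL is just Google's regular search endpoint; other engines may spell the operator differently.

```python
from urllib.parse import urlencode


def search_url(query: str, before: str,
               base: str = "https://www.google.com/search") -> str:
    """Build a search URL whose query ends with a before:YYYY-MM-DD filter."""
    return base + "?" + urlencode({"q": f"{query} before:{before}"})


if __name__ == "__main__":
    print(search_url("old news", "2021-01-01"))
```

The date filter still depends on whatever dates the search engine has recorded for each page, so wrong page metadata will leak through.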
This tool has no future. We have that in common with it, I fear.
What we really need to do is build an AI tool to filter out the AI automatically. Anybody want to help me found this company?
> This is a search tool that will only return content created before ChatGPT's first public release on November 30, 2022.
The problem is that Google's search engine - but, oddly enough, ALL search engines - got worse before that already. I noticed that search engines got worse several years before 2022. So, AI further decreased the quality, but the quality had a downwards trend already, as it was. There are some attempts to analyse this on youtube (also owned by Google - Google ruins our digital world); some explanations made sense to me, but even then I am not 100% certain why Google decided to ruin google search.
One key observation I made was that the youtube search was copied onto Google's regular search, which makes no sense for google search. If I casually search for a video on youtube, I may be semi-interested in unrelated videos. But if I search on Google search for specific terms, I am not interested in crap such as "others also searched for xyz" - that is just ruining the UI with irrelevant information. This is not the only example; Google made the search results worse here and tries to confuse the user into clicking on things. Plus the placement of ads. The quality really worsened.