Hacker News

Fix your robots.txt or your site disappears from Google

92 points by bobbiechen | today at 5:03 PM | 60 comments | view on HN

Comments

WmWsjA6B29B4nfk today at 7:52 PM

Google docs are pretty clear (https://developers.google.com/crawling/docs/robots-txt/robot...):

> Google's crawlers treat all 4xx errors, except 429, as if a valid robots.txt file didn't exist. This means that Google assumes that there are no crawl restrictions.

This is a better source than a random SEO dude with a channel full of AI-generated videos.
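
For reference, a minimal sketch of that documented rule in Python (the function name and return labels are mine, not Google's):

    def effective_robots_policy(status: int) -> str:
        # Map the HTTP status of a robots.txt fetch to crawl behavior,
        # per the Google documentation quoted above.
        if status == 429:
            return "retry-later"       # the one 4xx Google treats as an error
        if 400 <= status < 500:
            return "no-restrictions"   # as if no robots.txt existed
        if 200 <= status < 300:
            return "parse-and-obey"    # apply the file's rules
        return "out-of-scope"          # 3xx/5xx handling isn't covered by the quote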

show 4 replies
jimberlage today at 8:19 PM

I remember back in the day, when SEO was a more viable channel, being surprised at how much of the game was convincing Google to crawl you at all.

I naively assumed that they would be happy to take in any and all data, but they had a fairly sophisticated algorithm for deciding "we've seen enough, we know what the next page in the sequence is going to look like." They value their bandwidth.

It led to a lot of gaming of how you optimally split content across high-value pages for search terms (the 5 most relevant reviews should go on pages targeting the New York metro, the next 5 most relevant for LA, etc.)

I'm surprised again, honestly. I kind of assumed the AI race meant that Google would go back to hoovering all data at the cost of extra bandwidth, but my assumption clearly doesn't hold. I can't believe I knew all that about Google and still made the same assumption twice.

show 2 replies
Igor_Wiwi today at 9:35 PM

Thanks for the heads up. I'm releasing 10 projects every month, so it's really easy to miss some of the SEO fundamentals. To fix that, I created a Chrome extension to verify the basic stuff: https://chromewebstore.google.com/detail/becgiilhpcpakkecdho...

skybrian today at 7:17 PM

Not sure if this is reliable.

- What does "unreachable" mean, exactly? A 404 or some more serious error?

- What is a "Diamond Product Expert" and do they speak for the company?

show 1 reply
cj today at 7:49 PM

Not having a robots.txt is fine as long as it's a 404. If it's a 403, you'll be de-indexed.

I have a feeling there's more to the story than what's in the blog post.
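
If you want to see which status your own robots.txt returns, here's a quick sketch (example.com is a placeholder). As an aside, Python's standard-library robotparser encodes the same split the parent describes: 401/403 is treated as disallow-all, while other 4xx codes are treated as allow-all.

    import urllib.error
    import urllib.request

    def robots_status(origin: str) -> int:
        # HTTP status code the server returns for /robots.txt
        try:
            with urllib.request.urlopen(origin + "/robots.txt", timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code  # 4xx/5xx responses arrive as exceptions

    print(robots_status("https://example.com"))  # 200/404: fine; 403: risky, per the parent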

franze today at 7:50 PM

Fake, or just misinformed.

This is the support page: https://support.google.com/webmasters/community-video/360202...

This is the creator's LinkedIn: https://www.linkedin.com/in/iskgti/

He does not work for Google; he's just an SEO somewhere who creates videos and posts his hypotheses in forums.

This is his YouTube account: https://m.youtube.com/@saket_gupta

Nice, high-quality (probably AI-created) videos, still with no relationship to reality.

forinti today at 7:00 PM

My logs tell me that Google ignores my robots.txt.
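
If anyone wants to check the same thing against their own logs, here's a rough sketch using the standard-library parser (assumes a combined-format access log; example.com is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # flag Googlebot requests for paths the robots.txt disallows
    for line in open("access.log"):
        if "Googlebot" in line:
            path = line.split('"')[1].split()[1]  # request target from "GET /path HTTP/1.1"
            if not rp.can_fetch("Googlebot", path):
                print("disallowed path fetched:", path)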

show 2 replies
senko today at 7:30 PM

> Your robots.txt file is the very first thing Googlebot looks for. If it can not reach this file, it will stop and won't crawl the rest of your site. Meaning your pages will remain invisible (on Google).

This implication (stopped crawl means your pages are invisible) directly contradicts Google's own documentation[0] that states:

> If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.

What I get from the article is that the big change is Google now treating a missing robots.txt as if it disallowed crawling. Meaning you can still get indexed but not crawled (as per above).
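
For reference, the noindex method the quoted docs mention is just a meta tag (or an equivalent X-Robots-Tag response header), and Google only honors it if the page can actually be crawled:

    <!-- keeps a crawlable page out of search results; if robots.txt
         blocks the page, Googlebot never sees this tag -->
    <meta name="robots" content="noindex">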

My cynical take is that this is preparation for a future AI-related lawsuit: everyone explicitly allowing Google (and/or other crawlers) is proof they're crawling with the website's permission.

Oh, you'd want to appear in Google search results without appearing in Gemini? Tough luck, bro.

[0] https://developers.google.com/search/docs/crawling-indexing/...

Aardwolf today at 7:50 PM

If true, this would mean more websites with genuine content from the "old" internet won't show up (since many personal websites don't have a robots.txt), while SEO-optimized content farms, which of course do put one up, will...

show 1 reply
dazc today at 8:23 PM

I've witnessed a few catastrophes caused by mistakes in robots.txt, especially when 'disallow' is used as an attempt to prevent pages from being indexed.

I don't know if the claims made here are true, but there really isn't any reason not to have a valid robots.txt available. One could argue that if you want Google to respect robots.txt, then not having one should result in Googlebot not crawling any further.
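
For what it's worth, a trivially valid allow-all robots.txt is only a few lines (the sitemap URL is a placeholder):

    # an empty Disallow matches nothing, so everything may be crawled
    User-agent: *
    Disallow:

    Sitemap: https://example.com/sitemap.xml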

linolevan today at 7:00 PM

This is a crazy change. I wonder if part of the reasoning is that sites without a robots.txt tend to be very low-quality. Search is a very hard problem, and in a world of LLM-generated content it's become way harder.

show 3 replies
estimator7292 today at 7:33 PM

What I'm hearing is that if I tweak robots.txt I can exclude my site from Google? Excellent news!

show 4 replies
crazygringo today at 7:49 PM

This is interesting and unexpected if true.

My only thought is that virtually all "serious" sites tend to have robots.txt, and so not having it indicates a high likelihood of spam.

show 1 reply
gmiller123456 today at 7:17 PM

Sounds like great news. Users will eventually figure out other search engines produce more relevant results and Google's dominance will fade. Hopefully they never "fix" it.

show 2 replies
vicpara today at 8:00 PM

A lot of websites have robots.txt and sitemap.xml protected by Cloudflare, if you can imagine that. That's crazy.

ArcHound today at 7:03 PM

To reach my site, users need to get through the AI summary first. Spoiler: more often than not, they don't get through. This is based on the drop in views since AI summaries started.

And honestly, I don't blame them. If the summary has the info, why risk going to a possibly ad-filled site?

show 3 replies
Animats today at 8:18 PM

So Google Search is now opt-in? Good.

show 1 reply
mwkaufma today at 7:35 PM

Yes, it's _our_ fault Google search was enshittified.

bflesch today at 7:50 PM

Don't invest a single second of your time in the US tech monopolies. That time is much better spent deploying non-US alternatives and backing up your data from US clouds, which could be blocked for us at any moment.

Google is a rent-seeking parasitic middleman leeching off productive businesses, let them hang out with their best friends at the US administration.

shevy-java today at 8:00 PM

We need to fix Google.

nikanj today at 8:50 PM

I remember how religiously people used to care about their Google ranking. It's almost shocking to realize how fast that has changed. People used to spend tons of effort gaming site load speed, optimizing sitemaps and writing blog content.

All of that is fast becoming completely irrelevant: people see ads on their favourite TikReels app, find their holiday presents on Temu, and ask their questions of ChatGPT.

show 1 reply
josefritzishere today at 6:56 PM

The irony is that their AI bots still hoover up all your site content regardless.

show 2 replies
Bengalilol today at 8:40 PM

...and not a single link to any Google dev page...

efilife today at 8:21 PM

"Here's the video from Google Support that covers it:"

This "Google Support" is another spammer who generates dozens of nonsense videos and uploads them to YouTube: https://www.youtube.com/watch?v=2LJKNiQJ8LA

This guy is not affiliated with Google in any way; he just spams their help forums.

https://www.iskgti.com/

His own website scores 92 for SEO in Lighthouse, despite his claim to be an "SEO expert".

From the article:

> I don't have a robots.txt right now. It hasn't been there in a long time. Google still shows two results when I search for files on my site though:

guess why

Onavo today at 7:16 PM

Is this a compliance issue? I can't imagine why they would willingly not scrape.

show 1 reply
brianbest101 today at 7:37 PM

[dead]