It seems that OpenAI is scraping [certificate transparency] logs

132 points • by pavel_lishin • today at 1:48 PM • 73 comments

Comments

Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.

Aurornis • today at 3:02 PM

This could be OpenAI, or it could be another company using their header pattern.

It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.

Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.

EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
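
A minimal sketch of that verification step, assuming the crawler operator publishes its ranges as CIDR blocks. The ranges below are placeholders from the RFC 5737 documentation space, not OpenAI's real list, and `ip_in_published_ranges` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks). A real check would
# load the list that the crawler operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


def ip_in_published_ranges(ip, ranges=PUBLISHED_RANGES):
    """Return True if ip falls inside any of the published CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    for candidate in ["192.0.2.17", "203.0.113.5"]:
        verdict = "matches" if ip_in_published_ranges(candidate) else "does not match"
        print(candidate, verdict, "the published ranges")
```

A request whose user agent claims to be a known crawler but whose source IP fails this check is most likely an imitator.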

throwaway150 • today at 5:46 PM

I don't understand the outrage in some of the comments. The certificate transparency logs are literally meant to be read by absolutely whoever wants to read them. The clue is right in the name. It's transparency logs! Transparency!

I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?

Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!

bombcar • today at 3:38 PM

If you somewhat want to avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...

Then all they know is the main domain, and you can somewhat hide in obscurity.
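
To make the wildcard point concrete, here is a rough sketch of the usual matching rule, where `*` stands in for exactly one leftmost label. The domain names and the `wildcard_covers` helper are made up for illustration.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name (possibly *.domain) covers hostname."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".example.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    # The wildcard stands in for exactly one non-empty label.
    return bool(label) and "." not in label


# Names that would appear in the CT log for a wildcard certificate.
logged_names = ["example.com", "*.example.com"]

# The subdomain is covered by the certificate but never written to the log.
secret = "staging.example.com"
print(secret in logged_names)                               # False
print(any(wildcard_covers(n, secret) for n in logged_names))  # True
```

The subdomain can still leak through DNS queries or links, so this is obscurity rather than secrecy.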

bigbuppo • today at 5:56 PM

For many years now. The crawlers, scanners, and bots start hammering a website within a minute of a certificate being issued. Remember to get your garbage WCM installed and secured before installing the real certificate, as you have about a 15-second window before they start hammering around for fresh WordPress installs. Granted, you people are all smart enough to have all that automated using a CI/CD pipeline so that you just commit a single file with the domain name to a git repo and all that magic happens.

poormathskills • today at 4:10 PM

Is it still “scraping” when these transparency logs exist precisely to be read like this?

throwaway613745 • today at 3:35 PM

OpenAI is scraping everything that is publicly accessible. Everything.

toddgardner • today at 5:06 PM

If you want to learn more about Certificate Transparency Logs, how to pull and search them, we just did a 3 part series about how we did this at CertKit: https://www.certkit.io/blog/searching-ct-logs
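
For a rough idea of what "pulling" a log involves, here is a sketch that assumes an RFC 6962 style log exposing `/ct/v1/get-entries` with inclusive `start` and `end` indices. It is not taken from the linked series; the log URL is a placeholder and `get_entries_urls` is a hypothetical helper. It only builds the request URLs and makes no network calls.

```python
from urllib.parse import urlencode


def get_entries_urls(log_url, tree_size, batch=256):
    """Build get-entries URLs covering entries 0 .. tree_size - 1 in batches."""
    base = log_url.rstrip("/") + "/ct/v1/get-entries"
    urls = []
    for start in range(0, tree_size, batch):
        end = min(start + batch, tree_size) - 1  # 'end' is inclusive
        urls.append(base + "?" + urlencode({"start": start, "end": end}))
    return urls


if __name__ == "__main__":
    # tree_size would normally come from the log's /ct/v1/get-sth response.
    for url in get_entries_urls("https://ct.example.test/log/", 600):
        print(url)
```

Logs may return fewer entries than requested, so a real client should advance by the number of entries it actually received.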

8cvor6j844qw_d6 • today at 4:26 PM

Has anyone gone with wildcard certificates to avoid disclosing subdomains in certificate transparency logs?

jcims • today at 3:27 PM

Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?

>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
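
A small sketch of pulling the claimed bot name out of a string like the one above, assuming the `compatible; Name/version` convention seen there. The regex and the `claimed_bot` helper are illustrative only, and since the header is trivially forged, the result is a claim rather than proof.

```python
import re

# Matches "compatible; <BotName>/<version>" anywhere in the header value.
BOT_TOKEN = re.compile(r"compatible;\s*([A-Za-z][\w-]*)/(\d+(?:\.\d+)*)")


def claimed_bot(user_agent):
    """Return (name, version) the user agent claims to be, or None."""
    match = BOT_TOKEN.search(user_agent)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    ua = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; "
        "OAI-SearchBot/1.3; robots.txt;"
    )
    print(claimed_bot(ua))
```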

basilikum • today at 3:51 PM

They definitely do. Before this comment CT logs – aside from DNS queries – were the only way to know about https://onion.basilikum.monster and you have to send the hostname in the SNI, otherwise you get another certificate back.

Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.

That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.

_pdp_ • today at 3:37 PM

I wonder if this can be used to contaminate OpenAI search indexes?

drwhyandhow • today at 1:51 PM

This has long been the case! I think their whole business model is based on scraping lol

xpe • today at 3:55 PM

Looking around at the comments, I have a bird's-eye view. People are quite skilled at jumping to conclusions or assuming their POV is the only one. Consider this simplified scenario to illustrate:

    - X happened
    - Person P says "Ah, X happened."
    - Person Q interprets this in a particular way
        and says "Stop saying X is BAD!"
    - Person R, who already knows about X...
        (and is indifferent to what others notice
         or might know or be interested in)
        ...says "(yawn)".
    - Person S narrowly looks at Person R and says
        "Oh, so you think Repugnant-X is ok?"

What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum

* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.

gmerc • today at 3:16 PM

Let's prompt inject it

matt3210 • today at 4:08 PM

Your content is stolen for training the moment you put it up

mxlje • today at 3:23 PM

So? It’s public information and a somewhat easily consumable stream of websites to scrape. If my job were to scrape the entire internet, I’d probably start there, too.

kirito1337 • today at 5:14 PM

yawn, I've seen this more than 1000 times

privacy doesn't exist in this world
