Hacker News

It seems that OpenAI is scraping [certificate transparency] logs

132 points by pavel_lishin today at 1:48 PM | 73 comments

Comments

827a today at 3:08 PM

Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.

Aurornis today at 3:02 PM

This could be OpenAI, or it could be another company using their header pattern.

It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.

Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.

EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
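
For anyone who wants to do the same check on their own logs, here is a minimal sketch of testing a request IP against a published list of crawler ranges. The CIDR blocks below are documentation-only placeholders, not OpenAI's real ranges, and `ip_in_published_ranges` is a made-up helper name; a real check would load whatever list the crawler operator actually publishes.

```python
import ipaddress

# Placeholder ranges for illustration only (reserved documentation blocks).
# Assumption: a real check would load the operator's published list instead.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_published_ranges(ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in ranges:
        net = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "198.51.100.5"):
        print(candidate, ip_in_published_ranges(candidate))
```

A match only says the request came from an address in the list; a miss is what suggests an imitator borrowing the header pattern.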

throwaway150 today at 5:46 PM

I don't understand the outrage in some of the comments. The certificate transparency logs are literally meant to be read by absolutely whoever wants to read them. The clue is right in the name. It's transparency logs! Transparency!

I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?

Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!

bombcar today at 3:38 PM

If you somewhat want to avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...

Then all they know is the main domain, and you can somewhat hide in obscurity.
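
A rough sketch of why that works, assuming the usual rule that a wildcard stands in for exactly one left-most label: the certificate (and so the CT entry) carries only the pattern, and the concrete hostnames it covers never appear in it. `wildcard_covers` and the example.com names are hypothetical, for illustration only.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Check whether a certificate name such as '*.example.com' covers `hostname`.

    Assumes the wildcard may only replace the single left-most label.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        # No wildcard: the names must match exactly.
        return pattern == hostname
    p_labels = pattern.split(".")
    h_labels = hostname.split(".")
    return (
        len(p_labels) == len(h_labels)      # wildcard spans exactly one label
        and h_labels[0] != ""               # the replaced label must not be empty
        and p_labels[1:] == h_labels[1:]    # everything after it must match
    )


if __name__ == "__main__":
    # The log entry would list only this pattern, not the hosts behind it.
    logged_name = "*.example.com"
    for host in ("git.example.com", "a.b.example.com", "example.com"):
        print(host, wildcard_covers(logged_name, host))
```

The names can still leak through other channels, so this is obscurity rather than secrecy.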

bigbuppo today at 5:56 PM

For many years now. The crawlers, scanners, and bots start hammering a website within a minute of a certificate being issued. Remember to get your garbage WCM installed and secured before installing the real certificate, as you have about a 15-second window before they're hammering around for fresh WordPress installs. Granted, you people are all smart enough to have all that automated using a CI/CD pipeline so that you just commit a single file with the domain name to a git repo and all that magic happens.

poormathskills today at 4:10 PM

Is it still “scraping” when being read like this is the whole purpose of these transparency logs?

throwaway613745 today at 3:35 PM

OpenAI is scraping everything that is publicly accessible. Everything.

toddgardner today at 5:06 PM

If you want to learn more about Certificate Transparency Logs, how to pull and search them, we just did a 3 part series about how we did this at CertKit: https://www.certkit.io/blog/searching-ct-logs
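
To give a feel for what pulling a log involves, here is a minimal sketch assuming a log that speaks the RFC 6962 HTTP API (`get-entries` returns JSON entries whose `leaf_input` is a base64 `MerkleTreeLeaf`). This is not the approach from the linked series, the log URL is a placeholder, and the helper names are invented. It only decodes the fixed 12-byte header; the certificate bytes that follow are left untouched.

```python
import base64
import struct

# LogEntryType values defined in RFC 6962.
ENTRY_TYPES = {0: "x509_entry", 1: "precert_entry"}


def get_entries_url(log_url: str, start: int, end: int) -> str:
    """Build the RFC 6962 get-entries URL for an inclusive index range."""
    return f"{log_url.rstrip('/')}/ct/v1/get-entries?start={start}&end={end}"


def parse_leaf_header(leaf_input_b64: str) -> dict:
    """Decode version, leaf type, timestamp and entry type from a leaf_input."""
    raw = base64.b64decode(leaf_input_b64)
    if len(raw) < 12:
        raise ValueError("leaf_input too short for a MerkleTreeLeaf header")
    # 1-byte version, 1-byte leaf type, 8-byte timestamp (ms), 2-byte entry type.
    version, leaf_type, timestamp_ms, entry_type = struct.unpack(">BBQH", raw[:12])
    return {
        "version": version,
        "leaf_type": leaf_type,
        "timestamp_ms": timestamp_ms,
        "entry_type": ENTRY_TYPES.get(entry_type, f"unknown({entry_type})"),
    }


if __name__ == "__main__":
    # Placeholder log URL; real logs are listed by the browser CT programs.
    print(get_entries_url("https://ct.example.invalid/log/", 0, 31))
```

Getting hostnames out of an entry additionally needs an X.509 parser for the certificate that follows the header, which is out of scope here.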

8cvor6j844qw_d6 today at 4:26 PM

Has anyone gone with wildcard certificates to avoid disclosing subdomains in certificate transparency logs?

jcims today at 3:27 PM

Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?

>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;

basilikum today at 3:51 PM

They definitely do. Before this comment CT logs – aside from DNS queries – were the only way to know about https://onion.basilikum.monster and you have to send the hostname in the SNI, otherwise you get another certificate back.

Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
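
Trusting the user agent amounts to something like the sketch below: pull out the bot token the client chose to send. The regex and the `claimed_bot` helper are assumptions based on the `compatible; Name/version` shape seen in these logs, and the result is only a claim until the source IP is checked.

```python
import re

# Assumption: the bot announces itself as "compatible; <Name>/<version>".
BOT_TOKEN = re.compile(r"compatible;\s*([\w-]+)/([\d.]+)")


def claimed_bot(user_agent: str):
    """Return (name, version) the client claims to be, or None if no token is found."""
    match = BOT_TOKEN.search(user_agent)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    ua = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; "
        "OAI-SearchBot/1.3; robots.txt;"
    )
    print(claimed_bot(ua))
```

Since any client can send this header, the token is useful for grouping log lines but not for attribution.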

That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.

_pdp_ today at 3:37 PM

I wonder if this can be used to contaminate OpenAI search indexes?

drwhyandhow today at 1:51 PM

This has long been the case! I think their whole business model is based off scraping lol

xpe today at 3:55 PM

Looking around at the comments, I have a bird's-eye view. People are quite skilled at jumping to conclusions or assuming their POV is the only one. Consider this simplified scenario to illustrate:

    - X happened
    - Person P says "Ah, X happened."
    - Person Q interprets this in a particular way
        and says "Stop saying X is BAD!"
    - Person R, who already knows about X...
        (and indifferent to what others notice
         or might know or be interested in)
        ...says "(yawn)".
    - Person S narrowly looks at Person R and says
        "Oh, so you think Repugnant-X is ok?"

What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum

* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.

gmerc today at 3:16 PM

Let's prompt inject it

matt3210 today at 4:08 PM

Your content is stolen for training the moment you put it up

mxlje today at 3:23 PM

So? It’s public information and a somewhat easily consumable stream of websites to scrape. If my job was to scrape the entire internet, I’d probably start there, too.

kirito1337 today at 5:14 PM

yawn, I've seen this more than 1000 times

privacy doesn't exist in this world
