
vintagedave (yesterday at 6:25 PM)

> We rolled out a change to update our fraud model, and that uses workload fingerprinting

> Since, in all likelihood, your projects are similarly structured...

Thanks for the info. For what it's worth, and to inform your retrospective, my affected services included:

* A WordPress frontend with just a few posts and minimal traffic -- but one that had been posted to LinkedIn yesterday

* A Docusaurus-generated static site. Completely static.

* A Python server whose workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and which, I am strongly skeptical, would look any different from any other hosted service that calls OpenAI)

These all seem pretty different to me. Some that _are_ similarly structured (e.g. a second Python server that calls OpenAI) were not killed.

Some things come to mind for your post-mortem:

* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.

* I'm speaking only for myself, but I cannot understand what these three services have in common, nor how at least two of the three (WordPress, static HTML) could seem anything other than completely normal.

* How or why were customers not notified? I have used services before where, if something seemed dodgy, they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down', or for something truly bad, e.g. massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMs to random containers that we find out about the hard way seems the exact opposite of sensible handling of supposedly questionable clients.


Replies

ndneighbor (yesterday at 7:58 PM)

We have more info coming soon, but I think the best way to frame this is to work backwards and then explain how it impacted your services and others'.

So Railway (and other cloud providers) deal with fraud near-constantly. The internet is a bad and scary place, and we spend maybe a third to half of our total engineering cycles just on fraud- and uptime-related work. I don't want to give any credit to anyone, from script kiddies to hostile nation states, but we (and others) are under near-constant bombardment from crap workloads: junk traffic, not-great CPU cycles, or, sometimes more benignly, movie pirating.

Most cloud providers understandably don't like talking about it because, ironically, the more they talk about it, the more the bad actors get a kick out of seeing that the chaos they cause works. Begin the vicious cycle...

This hopefully answers:

> If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.

In our five-year history, this is the third abuse-related major outage. One was a nation-state DDoS, one was coordinated denial. This is the first one where a false positive took down services automatically. We tune the system constantly, so it's not really an issue, except when it is.

So, with that background: we constantly tune our box of, let's say, "performance" rules. When we see bad workloads or bad traffic, we have automated systems that "discourage" that use entirely.

When we updated those rules because we detected a new pattern, and then rolled the change out, that's when we nailed legitimate users. Since this went through the abuse path, it didn't show up on your dashboard, hence the immediate gaslighting.
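
To make that failure mode concrete, here is a rough sketch -- every name, field, and threshold below is made up, this is not our actual pipeline -- of how an over-broad fingerprint rule, enforced through a path that only logs internally, produces exactly the symptom you saw: a legitimate OpenAI-calling service gets SIGTERMed with nothing on the dashboard.

```python
# Hypothetical sketch only: it shows how an over-broad workload-fingerprint
# rule, shipped through an abuse-enforcement path that skips user-visible
# event logging, can kill legitimate services with no trace for the customer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Workload:
    service_id: str
    outbound_hosts: set[str]      # e.g. {"api.openai.com"}
    req_rate_per_min: float       # coarse signal from the fingerprinter

@dataclass
class Rule:
    name: str
    matches: Callable[[Workload], bool]

# The buggy rule: meant to flag API-relay abuse, but the predicate is so
# broad that any steady service calling api.openai.com satisfies it.
bad_rule = Rule(
    name="api-relay-v2",
    matches=lambda w: "api.openai.com" in w.outbound_hosts
                      and w.req_rate_per_min > 1,   # threshold far too low
)

def sigterm_container(service_id: str) -> None:
    print(f"SIGTERM -> {service_id}")               # stand-in for the real kill call

def internal_audit_log(service_id: str, rule: str) -> None:
    print(f"(internal only) {service_id} matched {rule}")

def enforce(workload: Workload, rules: list[Rule]) -> None:
    for rule in rules:
        if rule.matches(workload):
            # Abuse path: terminate the container and log internally only.
            # Nothing is written to the customer-facing event feed, which is
            # why the outage looked like an unexplained SIGTERM to users.
            sigterm_container(workload.service_id)
            internal_audit_log(workload.service_id, rule.name)
            return

# A perfectly legitimate Python service that calls OpenAI trips the rule.
enforce(Workload("legit-openai-app", {"api.openai.com"}, 12.0), [bad_rule])
```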

Which leads to the other question:

> How or why were customers not notified? I have used services before where, if something seemed dodgy, they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down', or for something truly bad, e.g. massive CPU usage affecting other services, they'd kill it right away but would _tell you_.

We don't want to tell fraudulent customers whether they're being effective or not. In this instance, it was a straight-up logic bug in the heuristics match. But we've handled abuse this way for our whole existence: black-holing illegitimate traffic first, for example, then banning. We do this because some coordinated actors will deploy, get banned with a stated "reason", and then fall back to backup accounts once they've confirmed that whatever they were doing was working. If you knew where to look, they would sometimes brag about it on their IRCs/Discords.
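
For illustration only, and again not our real code, this is roughly the shape of that "black-hole first, ban later, say nothing" flow; the account ID and reason string are invented, and the point is that the detection reason never leaves internal logs.

```python
# Hypothetical sketch of silent, staged enforcement: a coordinated actor
# never learns which behaviour tripped detection, so rotating to backup
# accounts tells them nothing.
from enum import Enum, auto

class Action(Enum):
    BLACKHOLE = auto()   # silently drop traffic; the actor only sees failures
    BAN = auto()         # close the account once probing it gains them nothing

def apply_action(account_id: str, action: Action) -> None:
    print(f"{action.name} applied to {account_id}")  # stand-in for real enforcement

def internal_log(account_id: str, reason: str) -> None:
    print(f"(internal only) {account_id}: {reason}")

def enforce_silently(account_id: str, detection_reason: str) -> None:
    # Stage 1: black-hole with no notification and no stated reason.
    apply_action(account_id, Action.BLACKHOLE)

    # (In practice there would be an observation window here, hours or days.)

    # Stage 2: ban. The detection reason stays in internal logs only.
    apply_action(account_id, Action.BAN)
    internal_log(account_id, detection_reason)

enforce_silently("acct-123", "matched api-relay fingerprint")
```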

Candidly, we don't want to be transparent about this, but with user impact like this, it's the least we can do. Zooming out, macro-wise, this is why Discord and other services are leaning towards ID verification. It's hard for people on the non-service-provider side to appreciate the level of garbage out there on the internet. That said, that's an excuse: we shovel that garbage so that you can do your job, and if we stop you from doing it, that's on us, which we own and will hopefully do better about.

That said, you and others are understandably miffed (understatement); all we can do is work through our actions to rebuild trust.