Hacker News

keeda, yesterday at 8:28 PM

I wrote this in response to the below comment, which is now edited and unfortunately dead, so posting here:

I understand, that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!

Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.

It is really hard to predict how things will break until they do.

(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish those themselves. There were many people who researched their products and published their findings. For instance, someone analyzed Wireshark traces and published a very detailed report about how they tuned their server-side networking stack. One thing that stood out was that Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)


Replies

dijit, yesterday at 8:48 PM

We used TCP for The Division. This was a major mistake, and I don't think it's something people should repeat.

For example, if you have TCP_NODELAY and a few thousand players, you'll be swimming in about 1.2M packets per second pretty quickly.
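As a rough illustration of that scale (a sketch, not the actual setup from The Division): the snippet below sets `TCP_NODELAY` on a socket and does the back-of-envelope multiplication. The player count and per-player packet rate are hypothetical figures chosen only to show what per-connection rate the quoted total would imply; `make_nodelay_socket` and `estimate_pps` are illustrative names.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes go out immediately instead of
    # being coalesced into fewer, larger segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def estimate_pps(players: int, packets_per_player_per_sec: int) -> int:
    """Rough aggregate packet rate; retransmits are not modelled."""
    return players * packets_per_player_per_sec


if __name__ == "__main__":
    # Hypothetical: 4,000 players, 300 packets/s each (both directions,
    # ACKs included).
    print(estimate_pps(4000, 300))  # 1200000
```

Because every small write becomes its own segment, the packet count grows with the number of writes rather than with the number of bytes sent.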

This is enough to completely crush any stateful firewall (UDP would pass through because there is no need to check state), so we had to do ACLs in network hardware instead, and append a magic number so that we could prevent flooding.
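The magic-number idea can be sketched in software as a stateless check: each packet is accepted or rejected on its own bytes, with no connection table to consult. This is only an illustration under assumptions; the magic value, its 4-byte width, and its position at the very end of the payload are hypothetical, and the real filtering described here ran as ACLs in network hardware.

```python
import struct

# Hypothetical 32-bit magic value; the real one is not given in the thread.
MAGIC = 0xC0FFEE42
_TRAILER = struct.Struct("!I")  # 4 bytes, network byte order


def tag(payload: bytes) -> bytes:
    """Append the magic number to an outgoing payload."""
    return payload + _TRAILER.pack(MAGIC)


def accept(packet: bytes) -> bool:
    """Stateless filter: look only at the trailing bytes of this packet."""
    if len(packet) < _TRAILER.size:
        return False
    (value,) = _TRAILER.unpack(packet[-_TRAILER.size:])
    return value == MAGIC
```

Because `accept` needs no per-connection state, the same comparison can be expressed as a fixed byte match, which is the kind of rule a hardware ACL can apply at line rate.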

Another thing we found was that Windows networking activity only happens on Core0 (Windows 2012 R2), and that at 1.2M PPS the driver crashes.

Logging in to a Windows machine which is AD connected when its network interface is dead is not ideal.

So, yeah, avoid TCP.
