Hacker News

asib last Wednesday at 11:54 AM

I guess I've never worked on something of Klarna's scale, but 15ms seems like a very small amount of time to cause a post-mortem-worthy event!


Replies

abrookewood last Wednesday at 12:35 PM

Yes, it does seem small, but the BEAM often has response times in microseconds (μs). If you are used to that and something 'blows out' to milliseconds, then I can see why alarms might get triggered.

benmmurphy last Wednesday at 12:36 PM

This can easily happen in a BEAM system. Say you have some shared state you want to access, so you create a gen_server to protect it. A gen_server is basically a huge mutex: it is just a normal BEAM process that handles requests sent to its message queue and then sends a reply message back. Let's say it can normally process a request in 20 µs. If requests keep arriving at that rate, a 15 ms pause would stack up 750 messages in its message queue.

Now maybe this is not enough to cause a huge outage on its own, but maybe as part of your handling you are using the message queue in an unsafe way. When you check the message queue for a message, the BEAM will just search the whole queue for one that matches. There are certain patterns the BEAM is able to optimize so that the whole queue is not searched (I think almost every pattern is unsafe and the BEAM only optimizes the gen RPC-style message patterns). But if you are using an unsafe pattern when you have a message queue backlog, it will destroy the throughput of the system, because the time taken to process a message is a function of the message queue length, and the message queue length becomes a function of how long it takes to process a message.
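The arithmetic and the feedback loop described above can be sketched with a toy model. This is Python, not BEAM code, and it does not reproduce how the BEAM actually scans a mailbox. The names `backlog_after_pause`, `simulate`, and `scan_us_per_msg`, the 25 µs arrival interval, and the 0.05 µs per-message scan cost are all assumptions made up for illustration; only the 20 µs service time and the 15 ms pause come from the comment.

```python
def backlog_after_pause(pause_us: int, service_us: int) -> int:
    """Messages queued during a pause, assuming one request arrives
    per normal service time (i.e. the server was running at capacity)."""
    return pause_us // service_us


def simulate(initial_backlog: int,
             arrival_interval_us: float,
             base_us: float,
             scan_us_per_msg: float,
             duration_us: float) -> int:
    """Return the queue length after roughly duration_us of simulated time.

    Handling one message costs base_us plus scan_us_per_msg for every
    message currently queued. The second term is a stand-in (assumed)
    for a receive that has to search the whole message queue.
    """
    queue_len = initial_backlog
    now = 0.0
    next_arrival = 0.0
    while now < duration_us:
        # Admit every request that has arrived up to the current time.
        while next_arrival <= now:
            queue_len += 1
            next_arrival += arrival_interval_us
        if queue_len == 0:
            # Idle: jump ahead to the next arrival.
            now = next_arrival
            continue
        # Cost depends on queue length, which is the feedback loop.
        cost = base_us + scan_us_per_msg * queue_len
        queue_len -= 1
        now += cost
    return queue_len


if __name__ == "__main__":
    backlog = backlog_after_pause(15_000, 20)
    print("backlog after pause:", backlog)
    print("safe pattern:", simulate(backlog, 25, 20, 0.0, 200_000))
    print("unsafe pattern:", simulate(backlog, 25, 20, 0.05, 200_000))
```

With these assumed numbers, the unsafe variant only keeps up while 20 + 0.05·n < 25, that is, while fewer than 100 messages are queued. A backlog of 750 is past that point, so the queue keeps growing, whereas the same backlog drains when the per-message cost does not depend on queue length.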

Also, the great thing is you might not even have an explicit `receive` statement in your gen_server code. You might just be using a library that has a `receive` somewhere that is unsafe with a large message queue, and now you are burned. The BEAM also added some kind of alternative message queue that you can use instead of a process's main message queue, which should be a lot safer, but I think a lot of libraries still do not use it. This alternative is 'alias' (https://www.erlang.org/doc/system/ref_man_processes.html#pro...), which does something slightly different from what I thought: it protects the queue from 'lost' messages. Without aliases, timeouts can end up polluting the process message queue with messages that are no longer being waited on. This can lead to the same problem of a large message queue causing the throughput of a process to drop. However, long-lived processes will usually have a loop that handles messages in the queue.
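The timeout problem can be illustrated with another toy model, again in Python rather than Erlang. `Mailbox`, `run`, `late_every`, and `drop_late_replies` are hypothetical names for this sketch. The `drop_late_replies` flag only approximates the effect of a deactivated alias (late replies never enter the queue) and is not the alias API itself.

```python
class Mailbox:
    """Toy mailbox whose receive scans from the front for a matching tag."""

    def __init__(self):
        self.messages = []
        self.scanned = 0  # total messages inspected across all receives

    def deliver(self, msg):
        self.messages.append(msg)

    def receive_matching(self, ref):
        """Return the payload tagged with ref, or None if it is not queued.

        Returning None models the timeout branch of a receive firing.
        """
        for i, (tag, _payload) in enumerate(self.messages):
            self.scanned += 1
            if tag == ref:
                return self.messages.pop(i)[1]
        return None


def run(calls: int, late_every: int, drop_late_replies: bool):
    """Make `calls` request/reply round trips; every late_every-th reply
    arrives only after the caller has already timed out."""
    box = Mailbox()
    for ref in range(calls):
        late = (ref % late_every == 0)
        if not late:
            box.deliver((ref, "ok"))
        box.receive_matching(ref)  # None when the reply is late
        if late and not drop_late_replies:
            # The reply shows up after the timeout; nobody is waiting for it.
            box.deliver((ref, "ok"))
    return len(box.messages), box.scanned


if __name__ == "__main__":
    print("late replies kept:   ", run(1000, 10, drop_late_replies=False))
    print("late replies dropped:", run(1000, 10, drop_late_replies=True))
```

In this model each stale reply stays queued forever and is inspected again by every later receive, so the scan work grows with the number of past timeouts. A long-lived process with a catch-all handler, as mentioned above, would consume those stale messages instead of leaving them in place.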

__jonas last Wednesday at 12:06 PM

I am also curious about the mentioned incident. Does anyone have a link to the postmortem the post talks about? I couldn't find anything online.
