Biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that 10x's, 100x's, or 1000x's the rate of dead messages, and that will overload your DLQ database. You need a circuit breaker or rate limit on it.
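A minimal sketch of what such a guard could look like in SQL, assuming a hypothetical dead_letters table and an arbitrary threshold (a real circuit breaker would also surface a metric or an alert rather than silently dropping):

    -- skip the write once the last minute already holds 1000 dead letters,
    -- so a runaway producer can't flood the table
    INSERT INTO dead_letters (payload, error, created_at)
    SELECT $1, $2, now()
    WHERE (SELECT count(*)
           FROM dead_letters
           WHERE created_at > now() - interval '1 minute') < 1000;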
Great application of first principles. I think it's totally reasonable too, even at most production loads. (Example: my last workplace had a service that constantly roared along at 30k events per second, and our DLQs would at most have on the order of hundreds of messages in them.) We would get paged if a message's age in the queue exceeded an hour (roughly the check sketched below).
The idea is that if your DLQ consistently has high volume, something is wrong with your upstream data or your data-handling logic, not the architecture.
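The age check mentioned above is just a query, something like this (table and column names are made up):

    -- page if this exceeds one hour
    SELECT now() - min(created_at) AS oldest_pending_age
    FROM dlq_messages
    WHERE status = 'pending';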
> FOR UPDATE SKIP LOCKED
Learned something new today. I knew what FOR UPDATE did, but somehow I've never RTFM'd hard enough to know about the SKIP LOCKED directive. That's pretty cool.
Why use ShedLock and SELECT ... FOR UPDATE SKIP LOCKED together? ShedLock stops things running in parallel (sort of), while the other makes parallel processing possible.
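For anyone wondering how the SKIP LOCKED side gives you the parallelism, here's a minimal sketch (table and column names are assumptions, not from the article). Each worker runs the same statement concurrently; FOR UPDATE locks the rows a worker claims, and SKIP LOCKED makes the other workers pass over those rows instead of blocking on them:

    UPDATE dlq_messages
    SET status = 'processing', claimed_at = now()
    WHERE id IN (
        SELECT id
        FROM dlq_messages
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 10
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;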
Re: SKIP LOCKED, introduced in Postgres 9.5. Here's an archived copy [†] of the excellent 2016 2ndQuadrant post discussing it:
https://web.archive.org/web/20240309030618/https://www.2ndqu...
corresponding HN discussion thread from 2016 https://news.ycombinator.com/item?id=14676859
[†] It seems that all the old 2ndquadrant.com blog post links have been broken since their acquisition by EnterpriseDB.
Segment uses MySQL as a queue, not even just as a DLQ, and it works at their scale. So there are many (though not all) systems that can tolerate a relational database as the queue.
I have a simple flow: tasks on the order of thousands an hour. I just use PostgreSQL. High visibility, easy requeue, durable storage. With an appropriate index, it's perfectly fine. An LLM will write the SKIP LOCKED code right the first time. Easy local dev. I always reach for Postgres as the event bus in low-volume systems.
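For illustration only (my actual schema differs), the "appropriate index" and the requeue are both one-liners:

    -- partial index so the dequeue query only ever scans pending rows
    CREATE INDEX IF NOT EXISTS tasks_pending_idx
        ON tasks (created_at) WHERE status = 'pending';

    -- requeueing a failed task is just an UPDATE
    UPDATE tasks
    SET status = 'pending', attempts = attempts + 1
    WHERE id = $1;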
Would be interesting to see the numbers this system processes. My bet is that they are not that high.
I think that using Postgres as the message/event broker is valid, and having a DLQ on that Postgres system is also valid, and usable.
Having SEPARATE DLQ and event/message broker systems is not (IMO) valid, because a new point of failure is being introduced into the architecture.
lol a FOR UPDATE SKIP LOCKED post hits the HN homepage every few months it feels like
Another day, another “Using PostgreSQL for…” thing it wasn’t designed for. This isn’t a good idea. What happens when the queue goes down and all messages are dead lettered? What happens when you end up with competing messages? This is not the way.
Postgres is essentially a B-tree with a remote interface. Would you use a B-tree to store a dead letter queue? What is the big O of insert and delete? What happens when it grows?
Postgres has a query interface, replication, backup and many other great utilities. And it’s well supported, so it will work for low-demand applications.
Regardless, you’re using the wrong data structure with the wrong performance profile, and at the margins you will spend a lot more money and time than necessary running it. And the service will suffer.
Ofc I wouldn't use it for extremely high-scale event processing, but it's a great default for a message/task queue for 90% of business apps. If you're processing under a few hundred million events/tasks per day, with fewer than ~10k concurrent processes dequeuing from it, it's what I'd default to.
I work on apps that use such a PG-based queue system, and it provides indispensable features for us that we couldn't achieve easily or cleanly with a normal queue system, such as being able to dynamically adjust the priority/order of tasks being processed and easily query and report on the contents of the queue. We have many other interesting features built into it that are more specific to our needs as well, which I'm more hesitant to describe in detail here.
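As a generic illustration of the kind of thing that becomes trivial (the schema here is invented, not ours): bumping the priority of one customer's pending tasks and reporting on the backlog are just ordinary SQL statements.

    -- reorder work that hasn't been claimed yet
    UPDATE task_queue
    SET priority = priority + 100
    WHERE status = 'pending' AND customer_id = $1;

    -- quick backlog report
    SELECT status, count(*) AS n, min(created_at) AS oldest
    FROM task_queue
    GROUP BY status;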