The article says:
> Never Handroll Your Own Two-Phase Commit
And then buried at the end:
> A few notable features of this snippet:
> Limiting number of retries makes the code more complicated, but definitely worth it for user-facing side-effects like emails.
This isn't two-phase commit. This is "lock the DB indefinitely while the remote system is processing, and pray we don't crash committing the transaction after it completes." That lock also eats up a database connection, so your concurrency is limited by the size of your DB pool.
More importantly, if the email sends but the transaction to update the task status fails, it will try again. And again. Forever. If you're going to track retries, the count has to be incremented before you start the attempt; otherwise the "update the attempts count" write itself could fail and lead to more retries.
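Concretely, that ordering looks something like this (a sketch; the attempts/max_attempts columns are my assumption, not from the article):

    // Bump the attempt counter in its own short transaction *before* the send,
    // so a crash mid-attempt still consumes one try instead of retrying forever.
    const claimAttempt = `
      update task
      set attempts = attempts + 1
      where task_id = $1
        and attempts < max_attempts
      returning attempts
    `;
    // If this returns no row, the task is out of retries: park it instead.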
The real answer to all this is to use a provider that supports idempotency keys. Then you can retry the action repeatedly without it actually happening again. My favorite article on this subject: https://brandur.org/idempotency-keys
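A minimal sketch of the idea, assuming a hypothetical provider endpoint that honors an Idempotency-Key header (the endpoint URL and the idempotency_key column are both illustrative):

    import { randomUUID } from "crypto";
    import { Pool } from "pg";

    const pool = new Pool();

    // Generate the key once, at enqueue time, and store it on the task row.
    // Every retry of the same task reuses the same key, so the provider
    // deduplicates and the email goes out at most once.
    async function enqueueEmail(params: object) {
      await pool.query(
        `insert into task (task_type, params, idempotency_key)
         values ('send_email', $1::jsonb, $2)`,
        [JSON.stringify(params), randomUUID()]
      );
    }

    // Hypothetical provider call; real providers (e.g. Stripe-style APIs)
    // accept the key as a request header.
    async function sendEmail(params: object, idempotencyKey: string) {
      const res = await fetch("https://api.email-provider.example/send", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey,
        },
        body: JSON.stringify(params),
      });
      if (!res.ok) throw new Error(`send failed: ${res.status}`);
    }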
Missing from the article: how to communicate progress and failure to the user?
This is much more complicated with task queues. Doable still! But often skipped, because it's tempting to imagine that the backend will just handle the failure by retrying. But there are lots of kinds of failure that can happen.
The recipient's server doesn't accept the email. The recipient's domain name expired. Actually, we don't have an email address for that recipient at all.
The user has seen "got it, will do, don't worry about it" but if that email is time sensitive, they might want to know that it hasn't been sent yet, and maybe they should place a phone call instead.
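One way to make that visible, sketched against the article's single-table design (the status and last_error columns are my assumptions):

    import { Pool } from "pg";

    const pool = new Pool();

    // Instead of deleting a task that keeps failing, park it with the error,
    // so the UI can show "not sent yet" rather than a silent retry loop.
    async function markFailed(taskId: string, err: Error) {
      await pool.query(
        `update task
         set status = 'failed', last_error = $2
         where task_id = $1`,
        [taskId, err.message]
      );
    }

    // Endpoint-ish helper: lets the frontend poll for the email's fate.
    async function emailStatus(taskId: string) {
      const { rows } = await pool.query(
        `select status, last_error from task where task_id = $1`,
        [taskId]
      );
      // In a delete-on-success design, a missing row means the email went out.
      return rows[0] ?? { status: "sent", last_error: null };
    }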
I feel like this is basically what the Rails world does. Sidekiq handles a lot of this for you and it's honestly an amazing tool.
It does rely on Redis, but it's basically the same idea.
Fair warning: don't bounce off the first paragraph like I did. "Dumb queries" made me think the author was arguing against SQL sophistication — I was halfway through composing a rebuttal about stored procedures and keeping logic in the database before I actually read the rest.
Turns out the article advocates exactly that. The example uses CTEs with multi-table inserts. "Dumb" here means "no synchronous external service calls," not "avoid complex SQL."
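Something like this shape, presumably (my reconstruction of the pattern, not the article's exact code):

    // Enqueue the side effect in the same transaction that creates the user:
    // a data-modifying CTE inserts into both tables atomically.
    const createUserWithWelcomeEmail = `
      with new_user as (
        insert into usr (email) values ($1)
        returning usr_id
      )
      insert into task (task_type, params)
      select 'send_welcome_email', jsonb_build_object('usr_id', usr_id)
      from new_user
      returning task_id
    `;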
We use a similar approach.
Fun fact: A query like this will, once in a blue moon, return more rows than the limit (here 1), since the inner query can be executed multiple times and return different ids, which surprises a lot of people. If your code does not expect that, it may cause problems. (The article's code seems to expect it, since it uses a list and iteration to handle the result.)
delete from task
where task_id in
( select task_id
from task
order by random() -- use tablesample for better performance
for update
skip locked
limit 1
)
returning task_id, task_type, params::jsonb as params
You can avoid that by using a materialized Common Table Expression.
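For reference, a sketch of that rewrite (Postgres 12+ syntax; "as materialized" forces the picking query to be evaluated exactly once):

    // Same take-one-task query as above, with the picking subquery pinned
    // in a materialized CTE so the planner cannot re-execute it.
    const takeTask = `
      with picked as materialized (
        select task_id
        from task
        order by random()
        for update
        skip locked
        limit 1
      )
      delete from task
      where task_id in (select task_id from picked)
      returning task_id, task_type, params::jsonb as params
    `;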
https://stackoverflow.com/questions/73966670/select-for-upda...

Also, if your tasks take a long time, it will create long-running transactions, which may cause dead tuple problems. If you need to avoid that, you can mark the task as "running" in a short-lived transaction and delete it in another. It becomes more complicated then, since you need to handle the case where your application dies while it holds a task.
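A sketch of that two-step shape (the status and locked_at columns, and the timeout-based recovery for dead workers, are my assumptions):

    // Step 1: claim in a short transaction; nothing stays open while the
    // task actually runs, so no long-lived dead-tuple pressure.
    const claim = `
      update task
      set status = 'running', locked_at = now()
      where task_id = (
        select task_id
        from task
        where status = 'pending'
           or (status = 'running' and locked_at < now() - interval '5 minutes')
        for update skip locked
        limit 1
      )
      returning task_id, task_type, params::jsonb as params
    `;

    // Step 2: after the side effect succeeds, delete in another short
    // transaction. The timeout in step 1 reclaims tasks from crashed workers.
    const finish = `delete from task where task_id = $1`;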
The sophisticated solution to this problem is Temporal, but yes, I also use an async task queue frequently because it's very easy to roll one's own.
I like it! We have a service with a similar postgres task queue but we use an insert trigger on the tasks table that does NOTIFY and the worker runs LISTEN, it feels a bit tidier than polling IMO.
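A rough sketch of that setup with node-postgres (the channel name is my own choice; the trigger DDL runs once at migration time):

    import { Client } from "pg";

    // One-time DDL: notify on every insert into task.
    const migration = `
      create or replace function notify_task() returns trigger as $$
      begin
        perform pg_notify('task_created', new.task_id::text);
        return new;
      end;
      $$ language plpgsql;

      drop trigger if exists task_created_trigger on task;
      create trigger task_created_trigger
        after insert on task
        for each row execute function notify_task();
    `;

    async function worker() {
      // LISTEN needs a dedicated long-lived connection, not a pooled one.
      const client = new Client();
      await client.connect();
      await client.query("listen task_created");
      client.on("notification", (msg) => {
        // The payload is only a wake-up hint; still grab the row with
        // skip locked, since another worker may beat us to it.
        console.log("task inserted:", msg.payload);
        // ...run the delete ... returning query from upthread here...
      });
    }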
I can recommend this architecture. So much easier to maintain and understand than using an extra service. I didn't go into much detail on the implementation here, but you can surely roll your own if this doesn't cut it for you, or use a library like pgboss.
> I like slim and stupid servers, where each endpoint wraps a very dumb DB query.
I thought I was alone in this, but I build all my personal projects this way! I wish I could use this approach at work, but too many colleagues crave "engineering."
If you're in TS/JS land, I like to use an open source version of this called graphile-worker [0].
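The basic shape, from graphile-worker's README as I remember it (double-check the docs; task and payload names here are illustrative):

    import { run, quickAddJob } from "graphile-worker";

    // Producer: enqueue a job (can also be done in SQL via
    // graphile_worker.add_job, e.g. from a trigger).
    async function enqueue() {
      await quickAddJob(
        { connectionString: "postgres:///mydb" },
        "send_welcome_email",
        { usr_id: "123" }
      );
    }

    // Consumer: a worker with task functions keyed by job name.
    async function main() {
      const runner = await run({
        connectionString: "postgres:///mydb",
        concurrency: 5,
        taskList: {
          send_welcome_email: async (payload, helpers) => {
            const { usr_id } = payload as { usr_id: string };
            helpers.logger.info(`sending welcome email to ${usr_id}`);
          },
        },
      });
      await runner.promise; // run until stopped
    }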
Never do RPCs during an xact like this! It's the fastest way to lock up your DB, and I don't even mean at large scale. I've been forced many times to set up proper two-phase commit instead. That way you also get more flexibility and visibility into what it's doing.
From a brief scan of the code, is there a bug in the way task rows are selected and rolled back?
It looks like multiple task rows are retrieved via a delete...returning statement, and for each row an email is sent. If there's an error, the delete statement is rolled back.
Let's hypothesize that a batch of ten tasks is retrieved, and the 9th has a bad email address, so the batch gets rolled back on error. On the next retry, the welcome email would be sent again for the ones that succeeded, right?
Even marking the task as "processed" within the tx in the task code wouldn't work, because that update statement would also get rolled back.
Am I missing something? (Entirely possible; the code is written in a "clever" style that makes it harder for me to understand, especially the bit where $sql(usr_id) is passed into the sql template before it's been returned. Is there CTE magic there that I don't understand?)
I thought that this was the reason that most systems like this only pull one row at a time from the queue with skip locked...
Thanks to anyone who can clarify this for me if it is indeed correct!
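For what it's worth, the usual fix is one transaction per task rather than per batch; a sketch with node-postgres (the table shape follows the snippet upthread, everything else is my assumption):

    import { Pool } from "pg";

    const pool = new Pool();

    // Stub for the task's side effect (send the welcome email, etc.).
    async function handleTask(task: { task_id: string; task_type: string; params: unknown }) {
      // ...
    }

    // One transaction per task: a failure rolls back only that task's delete,
    // so tasks that already succeeded are never replayed.
    async function workOne(): Promise<boolean> {
      const client = await pool.connect();
      try {
        await client.query("begin");
        const { rows } = await client.query(`
          with picked as materialized (
            select task_id from task
            for update skip locked
            limit 1
          )
          delete from task
          where task_id in (select task_id from picked)
          returning task_id, task_type, params::jsonb as params
        `);
        if (rows.length === 0) {
          await client.query("commit");
          return false; // queue is empty
        }
        await handleTask(rows[0]);
        await client.query("commit");
        return true;
      } catch (err) {
        await client.query("rollback");
        throw err;
      } finally {
        client.release();
      }
    }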
It can be more complicated depending on your environment, but I'd prefer to use a Pub/Sub pattern instead. You can have a generic topic if you want to model it like the generic table, and it handles retries, metrics, scaling, long-running tasks, etc. for you. If you're running in the cloud, Pub/Sub is something you don't need to maintain, and it will scale better than this solution. You also won't need to VACUUM the Pub/Sub solution ;)