I see that the author took a heuristic approach to retrying tasks (each task has a predetermined amount of time it is expected to take, and is considered failed if it wasn't updated in time) and uses SQS. If the solution is homemade anyway, I can only recommend leveraging your database's transactionality for this, which is a common pattern I have often seen recommended and have also used successfully myself:
- At processing start, update the schedule entry to 'executing', then open a new transaction and lock it, while skipping already locked tasks (`SELECT ... FOR UPDATE SKIP LOCKED`).
- At the end of processing, set it to 'COMPLETED' and commit. This also releases the lock.
This has the following nice characteristics:
- You can have parallel processors polling tasks directly from the database without another queueing mechanism like SQS, and have no risk of them picking the same task.
- If you find an unlocked task in 'executing', you know for sure that the processor died. No heuristic needed.
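For illustration, here is a rough sketch of how I read the steps above. It is not the article author's code. It assumes a Postgres table called `tasks` with `id` and `status` columns and a psycopg-style DB-API connection (`%s` placeholders, a transaction opened implicitly on the first statement). The names `process_next_task` and `handle` are made up for the example.

```python
def process_next_task(conn, handle):
    """Claim one task, run handle(task_id) while holding its row lock.

    Returns the task id, or None if nothing could be claimed.
    Assumes a psycopg-style connection and a hypothetical `tasks` table.
    """
    cur = conn.cursor()
    # Transaction 1: pick a candidate nobody has locked. An unlocked row that
    # is still 'executing' means its previous processor died.
    cur.execute(
        "SELECT id FROM tasks "
        "WHERE status IN ('pending', 'executing') "
        "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED"
    )
    row = cur.fetchone()
    if row is None:
        conn.commit()
        return None
    task_id = row[0]
    cur.execute("UPDATE tasks SET status = 'executing' WHERE id = %s", (task_id,))
    conn.commit()  # makes the status visible; also releases the lock

    # Transaction 2: lock the row again and hold the lock while working.
    cur.execute(
        "SELECT id FROM tasks WHERE id = %s AND status = 'executing' "
        "FOR UPDATE SKIP LOCKED",
        (task_id,),
    )
    if cur.fetchone() is None:
        conn.commit()  # another processor grabbed it in between
        return None
    try:
        handle(task_id)
        cur.execute("UPDATE tasks SET status = 'COMPLETED' WHERE id = %s", (task_id,))
        conn.commit()  # commit releases the lock
    except Exception:
        conn.rollback()  # row stays 'executing' and unlocked, so it gets retried
        raise
    return task_id
```

If the process crashes mid-task, the database drops the connection, the lock disappears, and the row is left unlocked in 'executing', which is exactly the state the first query picks up again.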
You don't have to keep a transaction open. What I do is:
1. Select next job
2. Update status to executing where jobId = thatJob and status is pending
3. If the update affected 0 rows, you didn't get the job; go back to step 1 and select the next job
If "time to select" <<< "time to do", this works great. But if the two are closer together, you can see how this mostly ends up in contention, and you shouldn't do it.
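Roughly, the three steps look like this. This is a minimal sketch using Python's built-in `sqlite3` so it runs anywhere. The `jobs` table, its `id` and `status` columns, and the function name `claim_next_job` are assumptions made up for the example.

```python
import sqlite3


def claim_next_job(conn):
    """Return the id of a job this worker now owns, or None if none are pending."""
    while True:
        # 1. Select the next job.
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id = row[0]
        # 2. Flip it to 'executing', but only if it is still 'pending'.
        cur = conn.execute(
            "UPDATE jobs SET status = 'executing' "
            "WHERE id = ? AND status = 'pending'",
            (job_id,),
        )
        conn.commit()  # short transaction; no lock is held while the job runs
        # 3. Zero affected rows means another worker got there first.
        if cur.rowcount == 1:
            return job_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 2)
    conn.commit()
    print(claim_next_job(conn), claim_next_job(conn), claim_next_job(conn))
```

The conditional `WHERE ... AND status = 'pending'` is what makes the claim atomic. Note that, unlike the lock-based approach, a worker that dies leaves the job stuck in 'executing', so you still need some way to detect that.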
This is exactly what we're doing. Works like a charm.
This introduces long-running transactions, which at least in Postgres should be avoided.