A couple of years ago, we experienced a silent data corruption incident in our checkout process due to this specific edge case.
A user would generate the idempotency key when loading the front-end application, add items to their cart, and submit their order, only for the request to time out. The user would then navigate back to the front-end application, add another item, and submit again. Because the retry reused the same idempotency key, our payment gateway looked up the transaction by key, saw a cached successful (200 OK) response to the previous request, and returned it without processing the new order. The user believed they had purchased three items, but our system only charged for and shipped two of them.
The lesson we took away from that incident is that idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).
If a system receives a previously seen idempotency key with a different request payload, it should reject the request with a 409 Conflict and a message like "Idempotency key already used with different request payload". (Some teams argue for a 400 Bad Request instead.) What the system must never do is silently replay the cached response for a mismatched payload, or overwrite the stored entry.
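The composite-key check above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict as a stand-in for a persistent store, with hypothetical names (`validate_key`, `_seen`):

```python
import hashlib
import json

# Hypothetical in-memory store; a real system would use a persistent
# store (e.g. a database table or Redis) keyed by the idempotency key.
_seen = {}

def payload_hash(payload: dict) -> str:
    # Canonicalize the payload so that key ordering does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def validate_key(key: str, payload: dict) -> int:
    """Return an HTTP status for the composite-key check only."""
    stored = _seen.get(key)
    if stored is None:
        _seen[key] = payload_hash(payload)  # first use: record the payload hash
        return 200                          # proceed with processing
    if stored != payload_hash(payload):
        # Same key, different payload: this retry is not actually a retry.
        return 409
    return 200                              # genuine retry, safe to replay
```

Canonicalizing the JSON before hashing matters: two payloads that differ only in key order should count as the same request.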
This article explains how to lock this flow down. The idempotency record cannot appear only after the first request completes; it must already exist while the request is in progress, or a concurrent retry slips through.
To do this safely, follow these steps:
1. Acquire a distributed lock on the idempotent key.
2. Check for the existence of a key in your persistent store.
3. If an existing key is found, compare the hash of the incoming payload against the stored hash. If the hashes do not match, return a 409 Conflict.
4. If the hashes match, look up the request's status. If the persistent store shows COMPLETED, return the cached response. If it shows PENDING, return a 429 Too Many Requests, or hold the connection open until the original request completes.
5. After processing the request, save the response to the persistent store before releasing the lock.
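The five steps above can be sketched as a single handler. This is a minimal single-process sketch: the dict and `threading.Lock` are stand-ins for a persistent store and a distributed lock (e.g. a Redis `SET NX` with a TTL), and all names are illustrative:

```python
import hashlib
import json
import threading

_store = {}            # key -> {"hash": ..., "status": ..., "response": ...}
_locks = {}            # key -> per-key lock (distributed lock in real life)
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def _hash(payload):
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

def handle(key, payload, process):
    with _lock_for(key):                       # 1. acquire lock on the key
        record = _store.get(key)               # 2. look up existing record
        if record:
            if record["hash"] != _hash(payload):
                return 409, "key reused with different payload"  # 3. mismatch
            if record["status"] == "COMPLETED":
                return 200, record["response"]                   # 4. replay
            return 429, "original request still in progress"     # 4. pending
        _store[key] = {"hash": _hash(payload), "status": "PENDING"}
        response = process(payload)            # do the real work
        _store[key].update(status="COMPLETED", response=response)
        return 201, response                   # 5. saved before lock release
```

Note that the sketch holds the lock across processing, matching step 5; a production system would add lock TTLs so a crashed worker cannot leave a key stuck in PENDING forever.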
While this may look simple on paper, building a distributed locking state machine for a single API endpoint is typically where developers have their first aha moment with idempotency. True idempotency is often an enormous architectural shift, not just a middleware header check.
Sounds like an interesting case of incorrectly trusting user input.
The idempotency key should have been viewed as the untrustworthy hint it really is. Then you can decide whether an untrustworthy hint is what you really need. At that point I'd hope someone on the team says "This is ordering - I think we need something trustworthy"
> Consequently, the lesson we take away from the aforementioned incident is idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).
Did the postmortem result in any other (wider) changes/actions, out of curiosity?
No idea if this was anything like what happened in your case, and probably going off on a tangent, but I've seen so many cases where teams are split into backend and frontend, and they stop thinking about the product as a single distributed system (or it exacerbates an existing lack of that thinking). Frontend often suggest "Oh we can just create an idempotency key" and any concerns from backend are dismissed. If they implement it incorrectly, backend are on the wrong 'team' to provide input.
I wholeheartedly agree. Luckily it is a lot easier to reliably run a distributed KV store that only needs to lock idempotency keys for relatively short periods than to run a whole database with millions of records, or to make arbitrary systems idempotent.