What if you are integrated to a third party app and it gives you 5xx once? What do you log it as, an...

mewpmewp2 • yesterday at 5:38 PM • 6 replies • view on HN

What if you are integrated to a third party app and it gives you 5xx once? What do you log it as, and let's say after a retry it is fine.

Replies

kiicia • yesterday at 6:03 PM

As always „it depends”

- info - when this was expected and system/process is prepared for that (like automatic retry, fallback to local copy, offline mode, event driven with persistent queue etc) - warning - when system/process was able to continue but in degraded manner, maybe leaving decision to retry to user or other part of system, or maybe just relying on someone checking logs for unexpected events, this of course depends if that external system is required for some action or in some way optional - error - when system/process is not able to continue and particular action has been stopped immediately, this includes situation where retry mechanism is not implemented for step required for completion of particular action - fatal - you need to restart something, either manually or by external watchdog, you don’t expect this kind of logs for simple 5xx

bqmjjx0kac • yesterday at 5:41 PM

I would log a warning when an attempt fails, and an error when the final attempt fails.

➕ show 1 reply

marcosdumay • yesterday at 6:05 PM

Well, the GPs criteria are quite good. But what you should actually do depends on a lot more things than the ones you wrote in your comment. It could be so irrelevant to only deserve a trace log, or so important to get a warning.

Also, you should have event logs you can look to make administrative decisions. That information surely fits into those, you will want to know about it when deciding to switch to another provider or renegotiate something.

cpburns2009 • yesterday at 6:11 PM

It really depends on the third party service.

For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.

For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.

For service C, a 500 error may be an anomaly and treat it as hard error and log as error.

➕ show 1 reply

eterm • yesterday at 5:54 PM

This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.

Because what I'd want to know is how often does it fail, which is a metric not a log.

So expose <third party api failure rate> as a metric not a log.

If feeding logs into datadog or similar is the only way you're collecting metrics, then you aren't treating your observablity with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.

If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

By implementing a retry you planned for that third party to be down, so it's just business as usual if it suceeds on retry.

➕ show 3 replies

alt Hacker News

Replies