This might be controversial, but I'd say if it's fine after a retry, then it doesn't ...

eterm • last Saturday at 5:54 PM • 3 replies

This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.

Because what I'd want to know is how often does it fail, which is a metric not a log.

So expose <third party api failure rate> as a metric not a log.

If feeding logs into Datadog or similar is the only way you're collecting metrics, then you aren't treating your observability with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.

If the third party being down has a knock-on effect on your own system's functionality or uptime, then it needs to be a warning or error, but you should also put a ticket in the backlog to decouple your uptime from that third party, be it through retries, queues, or other mitigations (alternate providers?).

By implementing a retry you planned for that third party to be down, so it's just business as usual if it succeeds on retry.
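
A minimal sketch of the approach described above, assuming an in-process `Counter` stands in for a real metrics client. The names `metrics`, `call_with_retry`, `failure_rate`, and the `third_party_api.*` keys are hypothetical and not from the thread. Every attempt and failure is counted, a call that succeeds on retry logs nothing, and only exhausted retries produce an error log.

```python
import logging
from collections import Counter

logger = logging.getLogger("third_party")

# Hypothetical in-process counters; a real service would use its metrics client.
metrics = Counter()


def call_with_retry(call, max_attempts=3):
    """Call `call()` up to `max_attempts` times, counting every outcome."""
    for attempt in range(1, max_attempts + 1):
        metrics["third_party_api.attempts"] += 1
        try:
            result = call()
        except Exception as exc:
            metrics["third_party_api.failures"] += 1
            if attempt == max_attempts:
                # Retries exhausted: the failure now affects our own functionality.
                logger.error(
                    "third-party call failed after %d attempts: %s", attempt, exc
                )
                raise
            # Not the last attempt: no warning, the counter already recorded it.
            continue
        return result


def failure_rate():
    """Fraction of attempts that failed, or 0.0 if nothing was attempted."""
    attempts = metrics["third_party_api.attempts"]
    return metrics["third_party_api.failures"] / attempts if attempts else 0.0
```

With this shape, the failure rate is something you can graph and alert on, while the logs stay quiet for calls that recovered on their own.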

Replies

mewpmewp2 • last Saturday at 6:05 PM

> If the third party being down has a knock-on effect on your own system's functionality or uptime, then it needs to be a warning or error, but you should also put a ticket in the backlog to decouple your uptime from that third party, be it through retries, queues, or other mitigations (alternate providers?).

How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?

You can't always easily replace providers.


hk__2 • last Saturday at 6:10 PM

> This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.
>
> Because what I'd want to know is how often does it fail, which is a metric not a log.

It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.
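
A rough sketch of the kind of log entry described here, assuming a JSON-serializable payload. The names `call_and_log` and `send` are hypothetical, and a real system would likely redact secrets before logging a payload. On failure it records what was sent and what came back, then re-raises so the caller can still retry or fail.

```python
import json
import logging

logger = logging.getLogger("third_party")


def call_and_log(send, payload):
    """Call `send(payload)`; on failure, log the payload and error, then re-raise."""
    try:
        return send(payload)
    except Exception as exc:
        # Keep the request and the error together so the failure can be
        # debugged later or shown to the third party.
        logger.warning(
            "third-party call failed: %s",
            json.dumps({"payload": payload, "error": repr(exc)}, default=str),
        )
        raise
```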

hamandcheese • last Saturday at 8:12 PM

My main gripe with metrics is that they are not easily discoverable like logs are. Even if you capture a list of all the metrics emitted from an application, they often have zero context and so the semantics are a bit hard to decipher.
