Yes. Examples of non-defects that should not be in the ERROR loglevel:
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* Internal server error (ISE) in a downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically, when you make a request to another service and get back a status code, your handler should look like:
logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning
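Spelled out as a minimal runnable sketch, assuming Python's standard `logging` module. The names `pick_logfunc`, `handle_response`, and the logger name are hypothetical, and gating on `status >= 400` is my assumption about where the one-liner would sit:

```python
import logging

logger = logging.getLogger("downstream_client")  # hypothetical logger name


def pick_logfunc(status: int):
    """Return logger.error for 4xx responses other than 429, else logger.warning."""
    # A 4xx (other than 429 Too Many Requests) means our own request was bad,
    # so it is treated as a defect on our side. Anything else is treated as
    # a downstream problem and only warned about.
    if 400 <= status <= 499 and status != 429:
        return logger.error
    return logger.warning


def handle_response(status: int, url: str) -> None:
    """Log a failed downstream call at the level chosen by pick_logfunc."""
    # Assumption: only failure statuses reach the logging path at all.
    if status >= 400:
        pick_logfunc(status)("request to %s returned HTTP %d", url, status)
```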
(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)
I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.
My rough guess is that, across everywhere I’ve worked, 75% of incidents on internal services were only ever reported by service consumers (humans posting in channels). Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.
In other words, all the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30 seconds of a GSO”.
The corollary is: “Internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important to other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”
4xx is for invalid requests. You wouldn't log a 404 as an error.
> Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?