No; I’m not understanding your point about guessing. Could you restate?
As for queries that time out: those should definitely be tracked as a metric, but they shouldn't pollute the error log level, especially if they happen at some noisy background rate all the time.
> the database is owned by a separate oncall rotation
Not OP, but this part hits the same for me.
If your client app is killing the DB with too many calls (e.g. your cache is not working), you should be able to detect that and react yourself, without waiting for the DB team to come to you after they've investigated the whole thing.
But you can't know in advance whether the DB connection errors are your fault or not, so logging them to cover the worst-case scenario (you're the cause) is sensible.
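To make the "detect it and react" part concrete, here is a minimal sketch of client-side detection (all names, like `CallRateMonitor`, are invented for illustration): track your own DB call timestamps in a sliding window and flag when the rate spikes, e.g. because a cache stopped working.

```python
import time
from collections import deque

class CallRateMonitor:
    """Tracks DB call timestamps in a sliding window and flags
    when the rate exceeds a threshold (e.g. a cache stopped working)."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_call(self, now=None):
        """Record one DB call; return True if the rate is now excessive."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_calls
```

The point is that this signal lives in *your* service, so you can alert on it without waiting for the DB team's investigation.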
> No; I’m not understanding your point about guessing. Could you restate?
In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.
Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or that it will report the salient details. The database may not even know that the request was ever made; maybe the requests are timing out because they never got there. And it may well be that the requests are timing out because you're sending too many of them.
I mean, no, it doesn't make sense to log a million identical messages, but that's a rate-limiting problem. It's still an error if you can't reach your database, and for all you know it's an error that your admin will have to fix.
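A minimal sketch of that kind of rate limiting (the `RateLimitedLogger` here is invented for illustration, not any particular library's API): keep the message at error level, but emit at most one per key per interval and report how many repeats were suppressed.

```python
import logging
import time

class RateLimitedLogger:
    """Logs at most one error per key per interval; counts suppressed
    repeats and reports the count when logging resumes."""

    def __init__(self, logger, interval_seconds=60.0):
        self.logger = logger
        self.interval = interval_seconds
        self._last_emit = {}   # key -> timestamp of last emitted message
        self._suppressed = {}  # key -> count of suppressed messages

    def error(self, key, message, now=None):
        """Return True if the message was emitted, False if suppressed."""
        now = time.monotonic() if now is None else now
        last = self._last_emit.get(key)
        if last is not None and now - last < self.interval:
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        skipped = self._suppressed.pop(key, 0)
        if skipped:
            message = f"{message} ({skipped} similar messages suppressed)"
        self.logger.error(message)
        self._last_emit[key] = now
        return True
```

This way the error level stays honest (the failure is still an error) without drowning the log in a million identical lines.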
As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
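As a sketch of what such an "ideal" timeout message might look like (hypothetical helper, assuming JSON-structured logs): one record per failure that names the exact operation, so a metric can be derived by counting records while the log line itself still identifies the query.

```python
import json
import logging

logger = logging.getLogger("app.db")

def log_query_timeout(query, params, timeout_seconds):
    """Emit one structured error record naming the exact operation that
    timed out. Counting these records gives you the metric; reading one
    record gives you the failing query."""
    record = {
        "event": "db_query_timeout",
        "query": query,
        "params": params,
        "timeout_seconds": timeout_seconds,
    }
    logger.error(json.dumps(record))
    return record
```

A dashboard can count `db_query_timeout` events, but unlike a bare counter, each event still says which query failed.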
I think OP is making two separate but related points, a general point and a specific point. Both involve guessing something that the error-handling code, on the spot, might not know.
1. When I personally see database timeouts at work, it's rarely the database's fault: 99 times out of 100 it's the caller's fault for writing a bad query, and they should have looked at the query plan before deploying it. But how is the error-handling code supposed to know that? I log timeouts (ones that still fail after a retry) as errors so that someone looks at them, and the stack trace leads straight to the bad query. The database itself tracks timeout metrics, but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether the problem is your service's fault or not, and the error-handling code doesn't have enough information to decide.
2. As with exceptions vs. return values in code, low-level code often doesn't know how a higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, yet the low-level code is the one doing the logging, and it might even be a third-party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another.
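A small sketch of point 2 (all names invented): the same low-level `QueryTimeout` is an error-level event for one consumer and an expected, debug-level event for another, and only the caller has enough context to classify it.

```python
import logging

class QueryTimeout(Exception):
    """Low-level error: the library can report *what* happened,
    but not how severe it is for any particular caller."""
    def __init__(self, query):
        super().__init__(f"query timed out: {query}")
        self.query = query

def run_query(query):
    # Stand-in for a low-level DB call; always times out, for illustration.
    raise QueryTimeout(query)

def load_user_balance(user_id):
    # Consumer A: critical-path read -> log at error level and re-raise.
    try:
        return run_query("SELECT balance FROM accounts WHERE id = %s")
    except QueryTimeout as exc:
        logging.getLogger("billing").error("critical read failed: %s", exc)
        raise

def prefetch_recommendations(user_id):
    # Consumer B: best-effort prefetch -> expected failure, debug and move on.
    try:
        return run_query("SELECT * FROM recs WHERE user_id = %s")
    except QueryTimeout as exc:
        logging.getLogger("recs").debug("prefetch skipped: %s", exc)
        return None
```

The design choice is that the library raises and the *caller* picks the log level, rather than the library logging at a severity it can't actually know.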
I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.