Log level 'error' should mean that something needs to be fixed

454 points • by todsacerdoti • last Wednesday at 12:08 PM • 276 comments

Comments

> When implementing logging, it's important to distinguish between an error from the perspective of an individual operation and an error from the perspective of the overall program or system. Individual operations may well experience errors that are not error level log events for the overall program. You could say that an operation error is anything that prevents an operation from completing successfully, while a program level error is something that prevents the program as a whole from working right.

This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but they still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.

Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.

eterm • yesterday at 5:30 PM

How I'd personally like to treat them:

  - Critical / Fatal:  Unrecoverable without human intervention, someone needs to get out of bed, now.
  - Error : Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
  - Warning: Recoverable without intervention. Must have an issue created and prioritised. ( If business as usual, this could be downgraded to INFO. )

The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
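
A minimal sketch of that JSON example, assuming Python's standard `logging` and `json` modules. The function name `parse_payload` and the flag `produced_by_us` are made up for illustration; the only point is that the same parse failure maps to a different level depending on who generated the data.

```python
import json
import logging

log = logging.getLogger("ingest")  # hypothetical logger name


def parse_payload(raw: str, produced_by_us: bool):
    """Parse JSON; pick the log level by who is responsible for the data."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Our own serialiser produced garbage: an assumption didn't hold.
        # Someone else's data is malformed: we thought this might happen.
        level = logging.ERROR if produced_by_us else logging.WARNING
        log.log(level, "could not parse JSON: %s", exc)
        return None
```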

alex-moon • today at 2:08 PM

"If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'."

But it is still an error condition, i.e. something does need to be fixed: either something about the connection string (i.e. in the local system) is wrong, or something in the other system or somewhere between the two is wrong (and therefore needs to be fixed there). Either way, developers on this end (I mean someone reading the logs - true that it might not be the developers of the SMTP mailer) need to get involved, even if it is just to reach out to the third party and ask them to fix it on their end.

A condition that fundamentally prevents a piece of software from working not being considered an error is mad to me.

mfuzzey • yesterday at 4:15 PM

I think it's difficult to say without knowing how the system is deployed and administered. "If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system"

Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...

If you know the deployment scenario then you can make reasonable decisions on logging levels, but quite often code is generic and can be deployed in multiple configurations, so that's hard to do.

bytefish • today at 7:06 AM

Making software is 20% actual development and 80% maintenance. Your code and your libraries need to be easy to debug, and this means logs, logs, logs, logs and logs. The more the better. It makes your life easy in the long run.

So the library you are using fires too many debug messages? You know, that you can always turn it off by ignoring specific sources, like ignoring namespaces? So what exactly do you lose? Right. Almost nothing.
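
As a rough illustration of "ignoring specific sources", here is how it could look with Python's standard `logging` module, where logger names form a dotted namespace. The names `noisylib` and `myapp` are placeholders, not real packages.

```python
import logging

# Hypothetical logger names: "noisylib" stands in for a chatty third-party
# library, "myapp" for the application's own code.
logging.basicConfig()
logging.getLogger("myapp").setLevel(logging.DEBUG)

# Raise the threshold for the library's namespace only. Child loggers such as
# "noisylib.http" inherit this level unless they set their own.
logging.getLogger("noisylib").setLevel(logging.ERROR)

lib_log = logging.getLogger("noisylib.http")
app_log = logging.getLogger("myapp")

lib_log.debug("opening connection")  # dropped: below ERROR for noisylib.*
lib_log.error("connection refused")  # still emitted
app_log.debug("starting request")    # emitted: myapp is set to DEBUG
```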

As for my code and libraries I always tend to do both, log the error and then throw an exception. So I am on the safe side both ways. If the consumer doesn’t log the exception, then at least my code does it. And I give them the chance to do logging their way and ignore mine. I am doing a best-guess for you… thinking to myself, what’s an error when I’d use the library myself.

You don’t trust me? Log it the way you need to log it, my exception is going to transport all relevant data to you.

This has saved me so many times, when getting bug reports by developers and customers alike.

There are duplicate error logs? Simply turn my logging off and use your own. Problem solved.

If it is a program level error, maybe a warning and returning the error is the correct way to go. Maybe it’s not? It depends on the context.

And this basically is the answer to any software design question: It depends.

jayofdoom • yesterday at 4:51 PM

In OpenStack, we explicitly document what our log levels mean; I think this is valuable from both an Operator and Developer perspective. If you're a new developer, without a sense of what log levels are for, it's very prescriptive and helpful. For an operator, it sets expectations.

https://docs.openstack.org/oslo.log/latest/user/guidelines.h...

FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)

yoan9224 • today at 2:22 PM

I've found the most practical rule is: "Would I want to be paged for this at 2 AM?"

If yes: ERROR

If I want to check it tomorrow: WARNING

If it's useful for debugging: INFO

Everything else: DEBUG

The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.

We solve this by having libraries emit structured events with severity hints, then the application layer decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
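
One way this split could be sketched in Python, under stated assumptions: `LibraryEvent`, `CRITICAL_SOURCES`, and `final_level` are invented names, and the policy (payment failures are errors, everything else is capped at warning) is just the example from the comment, not a description of any real system.

```python
import logging
from dataclasses import dataclass


@dataclass
class LibraryEvent:
    """What a library reports: facts plus a hint, not a final verdict."""
    source: str                   # which dependency failed, e.g. "payments"
    message: str
    hint: int = logging.WARNING   # the library's best guess at severity


# Hypothetical policy owned by the application, not by the library.
CRITICAL_SOURCES = {"payments"}


def final_level(event: LibraryEvent) -> int:
    """Application-side decision based on business impact."""
    if event.source in CRITICAL_SOURCES:
        return logging.ERROR
    # A non-critical dependency is never escalated above WARNING.
    return min(event.hint, logging.WARNING)


def report(event: LibraryEvent, logger: logging.Logger) -> None:
    logger.log(final_level(event), "%s: %s", event.source, event.message)


app_log = logging.getLogger("myapp")
report(LibraryEvent("recommendations", "HTTP 500", hint=logging.ERROR), app_log)
report(LibraryEvent("payments", "HTTP 500"), app_log)
```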

rwmj • yesterday at 4:46 PM

And the second rule is make all your error messages actionable. By that I mean it should tell me what action to take to fix the error (even if that action means hard work, tell me what I have to do).

teo_zero • yesterday at 5:41 PM

This doesn't resonate with my experience. I place the line between a warning and an error at whether the operation can or can't be completed.

A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.
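
A small sketch of that rule in Python, assuming the standard `logging` module. The helper `connect_with_retries` and its `connect` argument are hypothetical, and the back-off is left as a comment so the example stays short.

```python
import logging

log = logging.getLogger("connector")  # hypothetical logger name


def connect_with_retries(connect, max_attempts=5):
    """Call connect() up to max_attempts times.

    A failed attempt that will be retried is a WARNING; giving up is an ERROR.
    `connect` is any zero-argument callable that raises OSError on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:
            if attempt < max_attempts:
                log.warning("attempt %d/%d failed: %s; retrying",
                            attempt, max_attempts, exc)
                # A real client would sleep or back off here (e.g. 30 seconds).
            else:
                log.error("gave up after %d failed attempts: %s",
                          max_attempts, exc)
                raise
```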

I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.

Xss3 • yesterday at 10:42 PM

Some programs are error resistant and need an additional level: Fatal.

A warning can be ignored safely. Warnings may be 'debugging enabled, results cannot be certified' or something similar.

An error should not be ignored, an operation is failing, data loss may be occurring, etc.

Some users may be okay with that data loss or failing operation. Maybe it isn't important to them. If the program continues and does not error in the parts that matter to the user, then they can ignore it, but it is still objectively an error occurring.

A fatal message cannot be ignored, the system has crashed. It's the last thing you see before shutdown is attempted.

AndroTux • yesterday at 5:43 PM

“cannot contact port 25 on <remote host>” may very well be a configuration error. How should the program know?

t43562 • today at 1:38 PM

Errors can be recovered automatically sometimes but at the level at which you log them you don't know if that's going to happen. I therefore think this suggestion is not easy to follow.

Even if your libraries use nothing but exceptions or return codes you still end up with levels. You still end up with logs that have information in them that gets ignored when it shouldn't be because there's so much noise that people get tired of all the "cries of wolf."

Occasionally one is at a high enough level to know for sure that something needs fixing and for this I use "CRITICAL" which is my code for "absolutely sure that you can't ignore this."

IMO it's about time AI was looking at the logs to find out if there was something we really need to be alerted to.

jillesvangurp • yesterday at 4:09 PM

Errors mean I get alerted. Zero tolerance on that from my side.

hedayet • yesterday at 7:56 PM

I agree with the principle: log level error should mean someone needs to fix something.

This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.

In practice, if there is a human who needs to take action - whether that’s a developer fixing a bug, an infra issue, or coordinating with an external dependency - then it’s an error. The solution isn’t to downgrade severity, but to route and notify the right owner.

Severity should encode actionability, not just system correctness.

jedberg • yesterday at 9:08 PM

I feel like it's more nuanced than OP writes. Presumably every log line comes from something like a try/catch. An edge case was identified, and the code did something differently.

Did it do what it was supposed to do, but in a different way or defer for retrying later? Then WARN.

Did it fail to do what it needed to do? ERROR

Did it do what it needed to do in the normal way because it was totally recoverable? INFO

Did data get destroyed in the process? FATAL

It should be about what the result was, not who will fix it or how. Because that might change over time.
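
The outcome-based mapping above can be written down as a table. This is only a sketch: the `Outcome` names are invented here, and since Python's `logging.FATAL` is just an alias of `logging.CRITICAL`, CRITICAL stands in for FATAL.

```python
import logging
from enum import Enum


class Outcome(Enum):
    RECOVERED_NORMALLY = "did the job in the normal way"
    DEGRADED_OR_DEFERRED = "did the job differently, or deferred it for retry"
    FAILED = "did not do the job"
    DATA_DESTROYED = "data was destroyed"


# The level depends on what happened, not on who will fix it or how.
LEVEL_FOR_OUTCOME = {
    Outcome.RECOVERED_NORMALLY: logging.INFO,
    Outcome.DEGRADED_OR_DEFERRED: logging.WARNING,
    Outcome.FAILED: logging.ERROR,
    Outcome.DATA_DESTROYED: logging.CRITICAL,
}


def log_outcome(logger: logging.Logger, outcome: Outcome, message: str) -> None:
    logger.log(LEVEL_FOR_OUTCOME[outcome], "%s (%s)", message, outcome.value)
```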

aunty_helen • today at 12:33 AM

Good logging is critical, and so is actually having the logs turned on in production. No point writing logs if you silence them.

My company now has a log aggregator that scans the logs for errors and, when it finds one, creates a Trello card, uses Opus to fix the issue, and then proposes a PR against the card. These then get reviewed, finished if tweaks are necessary, and merged if appropriate.

aqme28 • yesterday at 6:05 PM

I agree with this take in a steady state, but the process of building software is just that-- it's a process.

So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.

georgefrowny • yesterday at 6:38 PM

Easy to say, but there's "yes we know this is wrong but this will have to do for now" and "we don't expect to see this in real life unless something has gone sideways".

umpalumpaaa • yesterday at 10:12 PM

What I like about Objective-C’s error handling approach is that a method that can fail is able to tell whether the caller intends to handle errors or not. If the passed *error is NULL, you know that there is no way for the caller to properly handle the error. My implementations usually have this logic:

  if error == NULL and operationFailed then
    log error
  otherwise
    let client side do the error handling (in terms of logging)

knallfrosch • today at 9:12 AM

> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'.

A mail program not being able to *checks notes* send emails sounds like an error to me. (Unless you implement retries.)

bandrami • today at 9:19 AM

It's like how an alert system that sends more than ~8 alerts a day effectively sends zero alerts.

Waterluvian • yesterday at 9:22 PM

I think this is one of those discussions where there's no one right answer (though there's many wrong answers). All you have to do is pick a reasonable definition, write it down, socialize it, and be consistent when using it.

I think discussions that argue over a specific approach are a form of playing checkers.

Insanity • today at 5:07 AM

Coincidentally was reviewing code yesterday that had a confusing/contradictory statement..

  error_msg = "xyz went wrong"
  log.warn(error_msg)

My comment on the CR was about this being an inherent contradiction and incredibly confusing to know if it's actually an error or a warning..

jmull • yesterday at 10:33 PM

I encourage people to think a few moments about what to log and at what level.

You’re kind of telling a story to future potential trouble-shooters.

When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.

But this article isn’t right either. Lower-level components typically don’t have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstractions and boxes things live in, actually not much is in a position to know this, even to a standard of “probably”.

HarHarVeryFunny • yesterday at 4:56 PM

I agree with the sentiment, although not sure if "error" is the right category/verbiage for actionable logs.

In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.

If you don't do this then it's like "the boy that cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.

raldi • yesterday at 3:58 PM

Yes. Examples of non-defects that should not be in the ERROR loglevel:

* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)

* Network error

* Downstream service overloaded

* Invalid request

Basically, when you make a request to another service and get back a status code, your handler should look like:

    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning

(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)
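
Spelled out as a function, the one-liner above might look like the following. The comments give one reading of the rule (a 4xx other than 429 suggests the caller sent something wrong, while 5xx and 429 point at the other service); that reading and the name `log_for_status` are assumptions, not part of the original comment.

```python
import logging

logger = logging.getLogger("client")  # hypothetical logger name


def log_for_status(status: int):
    """Pick the log method for a failed response from another service."""
    # 4xx other than 429: we probably sent a bad request, so it is our defect.
    # 5xx or 429: the other side is failing or overloaded; warn, don't alarm.
    if 400 <= status <= 499 and status != 429:
        return logger.error
    return logger.warning


log_for_status(404)("downstream rejected our request: %d", 404)
log_for_status(503)("downstream unavailable: %d", 503)
```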

makeitdouble • yesterday at 4:39 PM

> This assumes an error/warning/info/debug set of logging levels instead of something more fine grained, but that's how many things are these days.

Does it ?

Don't most stacks have an additional level of triaging logs to detect anomalies etc ? It can be your New Relic/DataDog/Sentry or a self made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether a single event has any chance of being problematic.

I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.

rsanek • yesterday at 9:28 PM

If something needs to be fixed, why is it just a log? How is someone supposed to even notice a random error log? At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice. Throw an exception / exit the program if it's something that actually needs fixing!

alexwasserman • yesterday at 4:36 PM

I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on by monitoring.

eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.

Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.
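
For contrast, a sketch of alerting on the record's level rather than on words in the message, using Python's standard `logging` module. `AlertOnLevel` and the `billing` logger are hypothetical; a list stands in for the paging system.

```python
import logging


class AlertOnLevel(logging.Handler):
    """Collects records that should page, judged by level, not message text."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.paged = []

    def emit(self, record):
        self.paged.append(record.getMessage())


log = logging.getLogger("billing")
log.setLevel(logging.DEBUG)
log.propagate = False
handler = AlertOnLevel()
log.addHandler(handler)

log.warning("This error is expected during failover")  # WARNING: no page
log.error("Ledger write failed")                        # ERROR: page
```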

peanut-walrus • yesterday at 8:16 PM

Disagree. If you have an error that NEEDS fixing, your program should exit. Error level logs for operation level errors are fine.

Glyptodon • yesterday at 10:21 PM

I agree errors should be errors. Many things that are logged for other reasons should use a different label.

That said, the thing I've come to find useful as a subcategory of error is errors due to data problems vs errors due to other issues.

dpc_01234 • yesterday at 8:36 PM

Error log level should be renamed. It's just a terrible name that confuses usage.

dnautics • yesterday at 4:39 PM

let's say you get a bunch of database timeouts in a row. this might mean that nothing needs to be fixed. But also, the "thing that needs to be fixed" might be "the ethernet cable fell out the back of your server".

How do you know?

theli0nheart • yesterday at 4:20 PM

I agree with this.

Not everything that a library considers an error is an application error. If you log an error, something is absolutely wrong and requires attention. If you consider such a log as "possibly wrong", it should be a warning instead.

Kinrany • yesterday at 7:36 PM

Why are logs usually assumed to be for human consumption only? It seems weird to me that log storage usually exists outside of the system and isn't a general purpose message bus.

tgv • yesterday at 6:09 PM

I log authorization errors as errors. Are they errors? It depends on how you read the logs. Perhaps you want to distinguish between internal, external and non-attributable errors for easier grepping.

Too • yesterday at 6:52 PM

Agree with the post. The job of blackbox is to turn probes into metrics. If a probe fails, that should just become a probe_success=0 metric. Blackbox did its job and should not log an error.

BiraIgnacio • yesterday at 7:22 PM

It means something is wrong, yes. Now, if it's worth fixing (granted, most of the time it would), that's another story.

leni536 • yesterday at 6:29 PM

I make error logs fail happy path functional/integration tests for the backend applications I'm currently writing.
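
One possible shape for such a check, sketched with Python's standard `logging` module. The helper name `fail_on_error_logs` is invented, and it assumes the code under test logs through loggers that propagate to the chosen logger (the root logger by default).

```python
import logging
from contextlib import contextmanager


class _Collector(logging.Handler):
    """Remembers every record at ERROR or above."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def fail_on_error_logs(logger_name=None):
    """Raise AssertionError if anything logs at ERROR or above in the block."""
    logger = logging.getLogger(logger_name)
    collector = _Collector()
    logger.addHandler(collector)
    try:
        yield
    finally:
        logger.removeHandler(collector)
    if collector.records:
        messages = [r.getMessage() for r in collector.records]
        raise AssertionError(f"unexpected error logs: {messages}")
```

A happy-path test would wrap the call under test in `with fail_on_error_logs():` and fail as soon as the block exits with any error-level record collected.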

plandis • yesterday at 7:04 PM

I agree. Error or higher should result in an alarm and indicates that some corrective action needs to be taken.

mycall • yesterday at 9:54 PM

Severity is the value and you set thresholds based on context of the error type.

shadowgovt • yesterday at 3:57 PM

This is the standard I use as well. In general, my rule of thumb is that if something is logging error, it would have been perfectly reasonable for the program to respond by crashing, and the only reason it didn't is that it's executing in some kind of larger context that wants to stay up in the event of the failure of an individual component (like one handler suffering a query that hangs it and having to be terminated by its monitoring program in a program with multiple threads serving web requests). In contrast, something like an ill-formed web query from an untrusted source isn't even an error because you can't force untrusted sources to send you correctly formed input.

Warning, in contrast, is what I use for a condition that the developer predicted and handled but probably indicates the larger context is bad, like "this query arrived from a trusted source but had a configuration so invalid we had to drop it on the floor, or we assumed a default that allowed us to resolve the query but that was a massive assumption and you really should change the source data to be explicit." Warning is also where I put things like "a trusted source is calling a deprecated API, and the deprecation notification has been up long enough that they really should know better by now."

Where all of this matters is process. Errors trigger pages. Warnings get bundled up into a daily report that on-call is responsible for following up on, sometimes by filing tickets to correct trusted sources and sometimes by reaching out to owners of trusted sources and saying "Hey, let's synchronize on your team's plan to stop using that API we declared is going away 9 months ago."
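
The routing described above (errors page, warnings go into a daily report) could be sketched with two handlers on one logger. This is an illustration only: `ListHandler` is a stand-in for a real pager or report sink, and the logger name and messages are made up.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for a pager or report sink: just keeps messages in a list."""

    def __init__(self, level):
        super().__init__(level=level)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


pager = ListHandler(logging.ERROR)           # ERROR and above: page now
daily_report = ListHandler(logging.WARNING)  # WARNING only: bundle for later
daily_report.addFilter(lambda record: record.levelno < logging.ERROR)

log = logging.getLogger("frontend")
log.setLevel(logging.WARNING)
log.propagate = False
log.addHandler(pager)
log.addHandler(daily_report)

log.warning("trusted source called a deprecated API")
log.error("handler hung and was terminated by the monitor")
```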

azov • yesterday at 7:55 PM

If my system doesn’t work - I want to be alerted. If notification was supposed to be sent but wasn’t - it’s an error regardless of whether it wasn’t sent because of a bug in my code or external service being down. It may be a warning if I’m still retrying, but if I gave up - it’s an error.

“External service down, not my problem, nothing I can do” is hardly ever the case - e.g. you may need to switch to a backup provider, initiate a support call, or at least try to figure out why it’s down and for how long.

29athrowaway • yesterday at 7:56 PM

Input errors do not need fixing, so no.

mschuster91 • yesterday at 6:44 PM

> If error level messages are not such a sign, I can assure you that most system administrators will soon come to ignore all messages from your program rather than try to sort out the mess, and any actual errors will be lost in the noise and never be noticed in advance of actual problems becoming obvious.

Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.

mkoubaa • yesterday at 5:59 PM

To me it's always a neat trick when you're not allowed to use print() in production code

vpribish • yesterday at 4:38 PM

I just started playing in the Erlang ecosystem and they have EIGHT levels of logging messages. it seems crazily over-specific, but they are the champions of robust systems.

I could live with 4

Error - alert me now.

Warning - examine these later.

Info - important context for investigations.

Debug - usually off in prod.

blkflcn3 • yesterday at 8:24 PM

> What an error log level should mean (a system administrator's view)

That says it all:

- Backseat driving

- Not a developer by trade
