And the second rule is make all your error messages actionable. By that I mean it should tell me what action to take to fix the error (even if that action means hard work, tell me what I have to do).
Can you please explain this? That sounds like identifying bugs but not fixing them but I realize you don’t mean that. One hopes the context information in the error will make it actionable when it occurs, never completely successfully, of course.
Also put the fucking data in the message that led to the decision to emit the logs. I can't remember how many times I have had a three part test trigger a log "blah: called with illegal parameters, shouldn't happen" and the illegal parameters were not logged.
So what error do you put if the server is over 500 miles away?
https://web.mit.edu/jemorris/humor/500-miles
Or you can't connect because of a path MTU error.
Or because the TTL is set to low?
Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.
Maybe that makes sense for a single-machine application where you also control the hardware. But for a networked/distributed system, or software that runs on the user's hardware, the action might involve a decision tree, and a log line is a poor way to convey that. We use instrumentation, alerting and runbooks for that instead, with the runbooks linking into a hyperlinked set of articles.
My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.
This can be difficult or just not possible.
What is possible is to include as much information about what the system was trying to do. If there's an file IO error, include the the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.
If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.
Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.
Exactly. Some applications keep running way after you have long gone. If there is useful information to provide give it.
This is just plain wrong, I vehemently disagree. What happens if a payment fails on my API, and today that means I need to go through a 20-step process with this pay provider, my database, etc. to correct that. But what’s worse is if this error happens 11,000 times and I run a script to do my 20 step process 11,000 times, but it turns out the error was raised in error. Additionally, because the error was so explicit about how to fix it, I didn’t talk to anyone. And of course, the suggested fix was out of date because docs lag vs. production software. Now I have 11,000 pissed off customers because I was trying to be helpful.
Suppose I'm writing an http server and the error is caused by a flaky power supply causing the disk to lose power when the server attempts to read a file that's been requested. How is the http server supposed to diagnose this or any other hardware fault? Furthermore, why should it even be the http server's responsibility to know about hardware issues at all?