Hacker News

nik282000 · yesterday at 4:14 PM · 6 replies

I work at a plant with a site-wide SCADA/HMI (Siemens WinCC) system, where every alarm is displayed on every HMI regardless of that station's proximity to the machine or even its ability to address the issue. In any given minute a hundred or more alarms can be generated, the majority being nuisance messages like "air pressure almost low" or my favorite, " " (no message set), but scattered among those is the occasional "no cooling water - explosion risk".

This plant is operated and designed to the spec of an international corp with more than 20 factories; it's not a mom-and-pop operation. No one seems to think the excessive, useless alarms are an issue, and any damage caused by missed warnings is treated as the fault of the operator. When I approach management and engineering about this, the responses range from "it's not in the budget" to "you're maintenance, fix all the problems and the alarms will go away".

The only way for this kind of issue to be resolved is with regulation and safety standards. An operator can't safely operate equipment when alarms are not filtered or sorted in some way. It's like forcing your IT guy to watch web server access logs live to spot vulnerabilities being exploited.
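As a rough illustration of the kind of filtering and routing the parent comment is asking for, here is a minimal Python sketch; the station names, area tags, severity levels, and routing rule are invented for the example and are not how WinCC actually configures alarms.

    # Hypothetical sketch: route alarms to HMI stations by severity and area,
    # instead of broadcasting every alarm to every screen.
    from dataclasses import dataclass

    @dataclass
    class Alarm:
        tag: str
        area: str        # plant area that raised the alarm
        severity: int    # 0 = info, 1 = warning, 2 = critical

    def stations_for(alarm, stations):
        """Critical alarms go everywhere; everything else stays local."""
        if alarm.severity >= 2:
            return list(stations)
        return [s for s in stations if s["area"] == alarm.area]

    stations = [{"name": "HMI-North", "area": "north"},
                {"name": "HMI-South", "area": "south"}]

    for alarm in [Alarm("air pressure almost low", "north", 1),
                  Alarm("no cooling water - explosion risk", "south", 2)]:
        targets = [s["name"] for s in stations_for(alarm, stations)]
        print(f"{alarm.tag!r} -> {targets}")

In this toy model the nuisance message only appears on the local station, while the explosion-risk alarm still reaches every screen.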


Replies

terminalshort · yesterday at 5:17 PM

This is a fundamental organizational and societal problem. An engineer would look at the situation and think "what is the best way to get the failure rate below a tolerable limit?" But a lawyer looks at the situation and thinks "how do I minimize liability and bad PR?" and a bureaucrat thinks "how can I be sure the blame doesn't land on me when something goes wrong?" And the answer to both of those questions is to throw an alarm on absolutely everything. So if there is a problem they can always say "our system detected the anomaly in advance and threw an alarm." Overall the system will be less safe and more expensive, but the lawyer's and bureaucrat's problems are solved. Our society is run by lawyers and bureaucrats, so their approach will win out over the engineer's. (And China's society is run by engineers, so it will win out over ours.)

anonymousiam · yesterday at 4:55 PM

Alerts should be classified by criticality, and that classification presented along with the alert. Users should have the ability to filter non-critical messages on certain platforms.

Unfortunately, some systems either don't track criticality, or some of the alerts are tagged with the wrong level.

(One example of the latter is the Ruckus WAP, which has a warning message tagged at the highest level of criticality, so about two or three times a month, I see the critical alert: "wmi_unified_mgmt_rx_event_handler-1864 : MGMT frame, ia_action 0x0 ia_catageory 0x3 status 0x0", which should be just an informational level alert, with nothing to be done about it. I've reported this bug to Ruckus a few times over the past five years, but they don't seem to care.)
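A minimal sketch of severity-tagged, operator-filterable alerts, using Python's standard logging module purely as an illustration; the console threshold is an assumption for the example, and the last line reuses the Ruckus message quoted above to show how a mistagged level defeats the filter.

    # Hypothetical sketch: alerts carry a criticality level, and each console
    # sets a minimum level so informational noise never reaches the operator.
    import logging

    logging.basicConfig(format="%(levelname)s %(message)s")
    log = logging.getLogger("alerts")

    # A console that should only see warnings and above:
    log.setLevel(logging.WARNING)

    # Correctly tagged messages:
    log.info("MGMT frame status 0x0")          # filtered out as informational
    log.warning("air pressure almost low")     # shown
    log.critical("no cooling water")           # shown

    # The failure mode described above: an informational event tagged
    # critical sails past the filter and trains operators to ignore it.
    log.critical("wmi_unified_mgmt_rx_event_handler: MGMT frame, status 0x0")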

varjag · yesterday at 4:31 PM

I think it's regulated in places; it has certainly been an HMI concern ever since Three Mile Island. Our customer really grills vendors over generating excessive alarms. Generally, for a system to pass commissioning it has to be all green, and if it starts event-bombing afterward, you're going to get chewed out.

miki123211 · yesterday at 6:10 PM

Useless warnings are a great CYA tactic.

The more of them you have, the more likely it is that there was a warning when something happens. Whether the warning is ever noticed is secondary; what matters is that there was a warning and the operator didn't react to it appropriately, which makes the situation the operator's fault.

CamperBob2 · yesterday at 4:45 PM

> The only way for this kind of issue to be resolved is with regulation and safety standards.

Are you sure that's not what caused the problem in the first place? Unqualified and/or captured regulators who come up with safety standards that are out of touch with how the system needs to work in the real world?

lostdog · yesterday at 5:04 PM

I wonder if you could calculate a "probability of response to a major alert" and make it inversely proportional to the total number of irrelevant alerts. Then you get to ask "our probability of major alert saliency is only 6%. Why have the providers set it at this level, and what can we do to raise it?"
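A back-of-the-envelope version of that calculation in Python, assuming the simplistic model that an operator can meaningfully review only a fixed number of alarms per minute; the numbers are illustrative, not measured, and the formula is just one way to make the idea concrete.

    # Hypothetical: the chance that any given major alarm gets looked at
    # shrinks as nuisance alarms pile up.
    def alert_saliency(alarms_per_minute, reviewable_per_minute=6):
        """Fraction of alarms an operator can plausibly attend to."""
        return min(1.0, reviewable_per_minute / alarms_per_minute)

    print(f"{alert_saliency(100):.0%}")   # ~6% at 100 alarms/minute
    print(f"{alert_saliency(10):.0%}")    # 60% after rationalizing the alarm list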