Hacker News

cletus · yesterday at 8:55 PM · 4 replies

Story time. I used to work for Facebook (and Google) and lots of games were played around bugs.

At some point the leadership introduced an SLA for high- and then medium-priority bugs. Why? Because bugs would sit in queues for years. The result? Bugs would often get downgraded in priority at or close to the SLA deadline. People even wrote automated rules to alert them if bugs they had filed got downgraded.
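A minimal sketch of what such a downgrade-watcher rule might look like. The tracker endpoint, field names, and polling approach are all hypothetical stand-ins, not any real internal tool:

    # Hypothetical downgrade watcher: poll a generic issue-tracker REST API
    # and alert the reporter when a bug they filed drops in priority.
    import time
    import requests

    TRACKER = "https://tracker.example.com/api"  # hypothetical endpoint
    ME = "some_username"
    last_seen: dict[int, int] = {}  # bug id -> last known priority number

    def notify(message: str) -> None:
        print(message)  # stand-in for an email/chat alert

    def check_my_bugs() -> None:
        resp = requests.get(f"{TRACKER}/bugs",
                            params={"reporter": ME, "state": "open"})
        resp.raise_for_status()
        for bug in resp.json():
            prev = last_seen.get(bug["id"])
            # Higher number = lower priority (p0 is most urgent),
            # so an increase in the number is a downgrade.
            if prev is not None and bug["priority"] > prev:
                notify(f"Bug {bug['id']} downgraded: p{prev} -> p{bug['priority']}")
            last_seen[bug["id"]] = bug["priority"]

    if __name__ == "__main__":
        while True:
            check_my_bugs()
            time.sleep(3600)  # re-check hourly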

Another trick was to throw it back to the user, usually after months, ostensibly to request information: asking "is this still a problem?" or just adding "could not reproduce". Often you'd get no response. Sometimes the person was no longer on the team or with the company. Or they just lost interest or didn't notice. Great, it's off your plate.

If you waited long enough, you could say it was "no longer relevant" because that version of the app or API had been deprecated. It's also a good reason to bounce it back with "is this still relevant?"

Probably the most Machiavellian trick I saw was to merge your bug into a vaguely similar one that you didn't own. Why? Because this was hard to unwind and not always obvious.

Anyone who runs a call center or customer line knows this: you want to throw it back at the customer because a certain percentage will give up. It's a bit like health insurance companies automatically sending a denial for a prior authorization: to make people give up.

I once submitted some clear bugs to a supermarket's app and got a response asking me to call some 800 number and make a report. My bug report included complete steps to reproduce the issue. I knew what was going on: somebody simply wanted to mark the issue as "resolved". I'm never going to do that.

I don't think you can trust engineering teams (or, worse, individuals) to "own" bugs. They're not going to want to fix them. Bugs need to be owned by a QA team or a program team that will collate similar bugs and verify something is actually fixed.

Google had their own versions of things. IIRC bugs had both a priority and severity for some reason (they were the same 99% of the time) between 0 and 4. So a standard bug was p2/s2. p0/s0 was the most severe and meant a serious user-facing outage. People would often change a p2/s2 to p3/s3, which basically meant "I'm never going to do this and I will never look at it again".

I've basically given up on filing bug reports because I'm aware of all these games, and getting someone to actually pay attention is incredibly difficult. So much of this comes down to stupid organizational-level metrics about bug resolution SLAs and policies.


Replies

scottlamb · yesterday at 10:34 PM

> Google had their own versions of things. IIRC bugs had both a priority and severity for some reason (they were the same 99% of the time) between 0 and 4. So a standard bug was p2/s2. p0/s0 was the most severe and meant a serious user-facing outage. People would often change a p2/s2 to p3/s3, which basically meant "I'm never going to do this and I will never look at it again".

Yeah, I've done that. I find it much more honest than automatically closing it as stale or asking the reporter to repeatedly verify it even if I'm not going to work on it. The record still exists that the bug is there. Maybe some day the world will change and I'll have time to work on it.

I'm sure the leadership who set SLAs on medium-priority bugs anticipated a lot of bugs would become low-priority. They forced triage; that's the point.

> People even wrote automated rules to alert them if bugs they had filed got downgraded.

This part, though, is a sign that people are using the "don't notify" box inappropriately, denying reporters/watchers the opportunity to speak up if they disagree about the downgrade.

toast0 · today at 6:32 AM

> IIRC bugs had both a priority and severity for some reason (they were the same 99% of the time) between 0 and 4. So a standard bug was p2/s2. p0/s0 was the most severe and meant a serious user-facing outage

I've seen this at a couple of places... I think it's supposed to help model things like: if something is totally down, that's an S0. But if it's the site for the Olympics and it's a year with no Olympics, it's not a P0.

Personally, that kind of detail doesn't seem to matter to me, and it's hard to get people to agree on standards for it, so the data quality isn't likely to be good enough to use for reporting. A single priority value is probably more useful: priority helps responsible parties decide which issue to fix first, and helps reporters guess when their issue might be addressed.

> People would often change a p2/s2 to p3/s3, which basically meant "I'm never going to do this and I will never look at it again".

I learned this behavior because closing with wontfix would upset people who filed issues for things that I understand but am not going to change. I'm done with it, but you're going to reopen it if I close it, so whatever, I'll leave it open and ignore it. Stalebot is terrible, but it will accept responsibility for closing these kinds of things.
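The mechanism itself is simple. A minimal sketch of that kind of bot, against the same hypothetical tracker API as the sketch above (real ones, like GitHub's actions/stale, are configured rather than written, but the logic is roughly this):

    # Minimal stale-bot sketch: warn on open bugs with no activity for
    # 60 days, then close them 14 days later. Endpoints are hypothetical.
    import datetime as dt
    import requests

    TRACKER = "https://tracker.example.com/api"  # hypothetical endpoint
    STALE_AFTER = dt.timedelta(days=60)
    CLOSE_AFTER = dt.timedelta(days=14)

    def sweep() -> None:
        now = dt.datetime.now(dt.timezone.utc)
        resp = requests.get(f"{TRACKER}/bugs", params={"state": "open"})
        resp.raise_for_status()
        for bug in resp.json():
            # Assumes the API returns timezone-aware ISO 8601 timestamps.
            updated = dt.datetime.fromisoformat(bug["updated_at"])
            if "stale" in bug["labels"]:
                if now - updated > CLOSE_AFTER:
                    requests.post(f"{TRACKER}/bugs/{bug['id']}/close",
                                  json={"reason": "no recent activity"})
            elif now - updated > STALE_AFTER:
                requests.post(f"{TRACKER}/bugs/{bug['id']}/labels",
                              json=["stale"])
                requests.post(f"{TRACKER}/bugs/{bug['id']}/comments",
                              json={"body": "Marking stale; will close in "
                                            "14 days without activity."})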

AlBugdy · yesterday at 9:36 PM

> Google had their own versions of things. IIRC bugs had both a priority and severity for some reason (they were the same 99% of the time) between 0 and 4.

At the company I worked at (not Google, but a major one) this was the same. We used Salesforce, the "Lightning Experience" or whatever it was called [0]. Our version was likely customized for our company, but I think the idea was the same: one (the "priority", I think) was for our eyes only; the other (the "severity") was for the customer. If the customer insisted on raising the severity, we'd put it at sev1, but the priority reflected what we actually thought it was. I was actually surprised that in the ~4 years I was there, no one made the mistake of revealing the priority to the customer, especially when a lot of people were sloppily copy-pasting text from Slack or other internal tools that sometimes referred to a case by either the severity or the priority.

Those were heavy customers with SLAs, though, not supermarket apps or anything like that.

What was sad was that our internal tools, no matter how badly written, with their '90s UIs and awful security practices, were 50 times as fast as whatever Salesforce garbage we had to deal with. Of course, there was a lot of unneeded redundancy between the tools, so the complexity didn't stay in the Salesforce tool. But somehow the internal tools, written by someone 10 years ago and barely maintained, which still had to deal with complex databases of who-what-when-how, felt like you had the DB locally on a supercomputer, while SF felt like you were asking a very overworked person to manually run your query on each click. I'm exaggerating, but only by a bit.

[0] That name was funny because it was slow as shit. Each click took 5 to 20 seconds to update the view. I wonder what the non-Lightning version was like.

vovavili · yesterday at 11:18 PM

What an interesting display of a principal-agent problem.