Oversimplified clickbait. The purpose never changed: catch bad bugs before they reach prod. The goal of CI is to prevent the resulting problems from doing damage and requiring emergency repairs.
This is of course true as a blanket "gotcha" headline, although I wouldn't call a failed test the CI itself failing. A real failure would be a false positive, a pass where there wasn't coverage, or a failure when there was no breaking change. Covering all of these edge cases can become as tiresome as maintaining the application in the first place (this is, of course, a generalization).
> Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
Or you have a concurrency issue in your production code?
The premise of the article has some weight, but the final conclusion with the suggestion to change the icons seems completely crazy.
Green meaning "to the best of our knowledge, everything is good with the software" is well understood.
Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).
And whilst flaky tests are the most problematic for a CI system, that's precisely because they often work (in my experience, most flaky tests model situations that don't usually happen in production), so they often still represent potentially viable builds for deployment, with a caveat. If anything, tests that are known to be problematic should be marked orange.
I agree. The same can be said for testing too: their main purpose is to find mistakes (with secondary benefits of documenting, etc.). Whenever I see my tests fail, I'm happy that they caught a problem in my understanding (manifested either as a bug in my implementation, or a bug in my test statement).
> One dreaded and very common situation is when a failing CI run can be made to pass by simply re-running it. We call this flaky CI.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the specialties that I have (unwillingly!) developed at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they really some bogeyman unreliability that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
But once you get from "it's flaky" (failing "seemingly" "at" "random") to "it fails 100% of the time on my laptop when run this way", it becomes much easier to debug, b/c you can re-run it, attach a debugger, etc. The biggest categories of "flakes" are probably: database sort issues (SQL results are not deterministically ordered unless you ORDER BY), database ID issues (e.g., a test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), and timezones.
While I know what people mean by "flake", in practice the word usually means "failure mode I don't understand".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs; parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own order changes, since we sometimes rebalance a test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded; it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id` unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
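A sketch of the FooId/BarId idea in Python (the commenter's language isn't stated, and these type names come straight from the footnote): distinct wrapper types make the accidental cross-table comparison fail loudly instead of passing whenever the row IDs happen to line up.

```python
from dataclasses import dataclass

# Weakly-typed IDs: two unrelated IDs compare equal whenever the
# database sequences happen to line up.
foo_id, bar_id = 3, 3
assert foo_id == bar_id  # "passes" by coincidence; flakes when a sequence bumps

# Distinct ID types: a FooId never compares equal to a BarId, so the
# coincidence can't hide. dataclass __eq__ only matches the same class.
@dataclass(frozen=True)
class FooId:
    value: int

@dataclass(frozen=True)
class BarId:
    value: int

assert FooId(3) == FooId(3)  # same type, same value: equal
assert FooId(3) != BarId(3)  # different ID types never compare equal
```

At runtime this is just a thin wrapper; the payoff is that the mismatched comparison is now visibly wrong in the test, and a type checker can flag it before the flake ever happens.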
This is stupidly obvious, but you'd be surprised how many people have the attitude that competent developers should have tested their code manually before making PRs, so you shouldn't need CI.
I think this can be generalised into saying that the purpose of tests is to fail. I've seen far too many tests that are written to pass. You need to write tests to fail.
> When it passes, it's just overhead: the same outcome you'd get without CI.
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. This is the real purpose of CI. Test automation is necessary, but only to keep things sane while you continually throw in fractionally complete work.
The biggest problem I've seen with CI isn't the failing part, it's what teams do when it fails. The "just rerun it" culture kills the whole point.
We had a codebase where about 15% of CI runs were flaky. Instead of fixing the root causes (mostly race conditions in tests and one service that would intermittently time out), the team just added auto-retry. Three attempts before it actually reported failure. So now a genuinely broken build takes 3x longer to tell you it's broken, and the flaky stuff just gets swept under the rug.
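A sketch (Python, purely illustrative, not any real CI feature) of what that auto-retry does to the feedback loop: the flaky test quietly "passes" on attempt two so nobody ever sees it fail, while the genuinely broken test burns all three attempts before reporting anything.

```python
# Illustrative retry wrapper: re-run each test up to `attempts` times and
# only report failure after the last attempt, like the auto-retry above.
def run_with_retries(test, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            test()
            return ("pass", attempt)
        except AssertionError:
            if attempt == attempts:
                return ("fail", attempt)

calls = {"flaky": 0}

def flaky_test():
    calls["flaky"] += 1
    assert calls["flaky"] >= 2  # fails only on its first attempt

def broken_test():
    assert False  # genuinely broken

# The flake is swept under the rug: CI reports green, root cause untouched.
assert run_with_retries(flaky_test) == ("pass", 2)

# The real failure takes all 3 attempts (3x the wall-clock time) to surface.
assert run_with_retries(broken_test) == ("fail", 3)
```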
The article's right that failure is the point, but only if someone actually investigates the failure instead of clicking retry.