Hacker News

deathanatos · today at 3:36 PM

> One dreaded and very common situation is when a failing CI run can be made to pass by simply re-running it. We call this flaky CI.

> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.

> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.

One of the specialties that I have (unwillingly!) developed at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they some bogeyman unreliability that can't be understood.

The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests locally, in the same order, makes most flakes reproduce instantly. The next biggest reproducer is probably the time of day at which the test ran, and/or running it in UTC.

But once you get from "it's flaky" (and fails "seemingly" "at" "random") to "it fails 100% of the time on my laptop when run this way", it becomes easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL results are not deterministically ordered unless you ORDER BY), issues with database IDs (e.g., a test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones — those are probably the biggest categories of "flakes".
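The database-sort category can be shown in a few lines of `sqlite3` (the table and data are invented). A test that compares against a fixed list without ORDER BY depends on whatever order the engine happens to pick, which can change with the query plan; pinning the order makes it deterministic.

```python
# Sketch of the "database sort" flake: SQL makes no ordering guarantee
# without ORDER BY. Table and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("bob",), ("alice",)])

# Fragile: the engine may return rows in any order it likes.
rows = [r[0] for r in conn.execute("SELECT name FROM users")]

# Deterministic: the test pins the order it actually cares about.
sorted_rows = [r[0] for r in conn.execute(
    "SELECT name FROM users ORDER BY name")]
print(sorted_rows)  # ['alice', 'bob']
```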

While I know what people mean when they say "flake", in practice "flake" usually just means "failure mode I don't understand".

(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)

(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, since a test sometimes rebalances to a different shard as devs add or remove tests.)
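One hypothetical way deterministic-but-rebalancing sharding can work: sort the test names and deal them round-robin across shards. The assignment is fully reproducible from the test list, yet adding or removing a test can shift other tests to different shards, as the footnote describes. Test names and shard count are invented.

```python
# Sketch: deterministic sharding that still rebalances when the test
# suite changes. All names here are invented for illustration.
def shards(tests: list[str], num_shards: int) -> list[list[str]]:
    ordered = sorted(tests)  # fixed, deterministic order
    # Round-robin deal: shard i gets every num_shards-th test.
    return [ordered[i::num_shards] for i in range(num_shards)]

before = shards(["test_login", "test_signup", "test_billing"], 2)
after = shards(["test_login", "test_signup", "test_billing", "test_audit"], 2)
print(before)  # [['test_billing', 'test_signup'], ['test_login']]
print(after)   # adding test_audit moved test_billing and test_signup
```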

²this isn't usually because the ID is hardcoded; it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id` unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
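In Python, the distinct-ID-type idea can be sketched with `typing.NewType`: zero cost at runtime, but a type checker such as mypy flags the mix-up. `FooId`, `BarId`, and `load_foo` are invented names.

```python
# Sketch: distinct ID types catch "ID type confusion" statically.
# All names are invented for illustration.
from typing import NewType

FooId = NewType("FooId", int)
BarId = NewType("BarId", int)

def load_foo(foo_id: FooId) -> str:
    return f"foo #{foo_id}"

bar_id = BarId(3)
# load_foo(bar_id)  # mypy: incompatible type "BarId"; expected "FooId"
print(load_foo(FooId(3)))  # foo #3
```

At runtime a `FooId` is still just an `int`, so existing code keeps working; only the checker sees the distinction.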


Replies

pm215 · today at 4:09 PM

A fairly large category of the flaky CI jobs I see is "dodgy infrastructure". For instance, one recurring type for our project, which I just saw fail this afternoon, is where a gitlab CI runner tries to clone the git repo from gitlab itself and gets an HTTP 502 error. We've also had issues with "the s390 VM that runs CI jobs is on an overloaded host, so mostly it's fine but occasionally the VM gets starved of CPU and some of the tests time out".

We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.

win311fwg · today at 3:51 PM

> those are probably the biggest categories of "flakes".

Interesting. In my experience, it is always either a concurrency issue in the program under test or property-based tests (PBTs) finding some extreme edge case that was never visited before.
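The concurrency category boils down to an unsynchronized read-modify-write: whether an update is lost depends on thread scheduling, which is exactly why such a test only fails "at random". A minimal sketch (all names invented):

```python
# Sketch of a concurrency flake: a racy read-modify-write on shared
# state, plus the lock-protected fix. Names are invented.
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        tmp = counter      # read
        counter = tmp + 1  # write; a racing thread's update can be lost

def safe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:         # the fix: make the read-modify-write atomic
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000 every time; with unsafe_increment it may be less
```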