I'm a huge fan of property based testing, I've built some runners before, and I think it can be great for UI things too so very happy to see this coming around more.
Something I couldn't see was how those examples actually work, there are no actions specified. Do they watch a user, default to randomly hitting the keyboard, neither and you need to specify some actions to take?
What about rerunning things?
Is there shrinking?
edit - a suggestion for examples, have a basic UI hosted on a static page which is broken in a way the test can find. Like a thing with a button that triggers a notification and doesn't actually have a limit of 5 notifications.
How effective is property based testing in practice? I would assume it has no trouble uncovering things like missing null checks or an inverted condition because you can cover edge cases like null, -1, 0, 1, 2^n - 1 with relatively few test cases and exhaustively test booleans. But beyond that, if I have a handful of integers, dates, or strings, then the state space is just enormous and it seems all but impossible to me that blindly trying random inputs will ever find any interesting input. If I have a condition like (state == "disallowed") or (limit == 4096) when it should have been 4095, what are the odds that a random input will ever pass this condition and test the code behind it?
Microsoft had a remotely similar tool named Pex [1] but instead of randomly generating inputs, it instrumented the code to enable executing the code also symbolically and then used their Z3 theorem proofer to systematically find inputs to make all encountered conditions either true or false and with that incrementally explore all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output and you could then judge if the output is what you expected.
[1] https://www.microsoft.com/en-us/research/publication/pex-whi...
Hey, yeah the default specification includes a set of action generators that are picked from randomly. If you write a custom spec you can define your own action generators and their weights.
Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.
Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.
Thanks for the questions and feedback!