Interesting writeup. I do wonder what this looks like from the customer side; one downside I've observed with some serverless platforms in the past is that they can introduce up-front latency as the system spins up capacity to handle your spike. I know the CI consensus seems to be that latency matters little in a process that's going to take a long time to run to completion anyway... But I'm also a developer of CI, and that latency is painful during a tight-loop development cycle.
(The good news is that if the spikes are regular, a sufficiently advanced serverless platform can "prime the pump": prep and launch instances into surplus compute ahead of the spike, since historical data suggests the spike is coming).
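Something like this back-of-the-envelope pre-warm loop, to make the idea concrete (all the helper names here are hypothetical stand-ins for real telemetry and provisioning APIs, not anyone's actual implementation):

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

def historical_job_counts(weekday: int, hour: int) -> list[int]:
    # In practice: query your metrics store for job counts seen in this
    # weekday/hour slot over the last few weeks. Faked here for illustration.
    return [12, 15, 11, 14]

def launch_warm_vm() -> None:
    # Stand-in for whatever actually provisions a VM ahead of demand.
    print("launching warm VM ahead of predicted spike")

def prewarm_for_next_hour(current_pool_size: int) -> None:
    """Size the warm pool to the historical average for the coming hour."""
    nxt = datetime.now(timezone.utc) + timedelta(hours=1)
    expected = mean(historical_job_counts(nxt.weekday(), nxt.hour))
    for _ in range(max(0, round(expected) - current_pool_size)):
        launch_warm_vm()

prewarm_for_next_hour(current_pool_size=10)
```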
> one downside I've observed with some serverless platforms in the past is that they can introduce up-front latency as the system spins up capacity to handle your spike
[cofounder of blacksmith here]
This is exactly one of the symptoms of running CI on traditional hyperscalers that we're setting out to solve. The fundamental requirement for CI is that each job gets its own fresh VM (unlike traditional serverless workloads such as Lambdas). To provision an EC2 instance for a CI job:
- you're contending with general on-demand production workloads (which have their own demand curve based on, say, the time of day), which typically means high variance in instance provisioning times (see the sketch after this list);
- since AWS/GCP/Azure hand out spare capacity as spot instances with a guaranteed pre-emption warning (e.g. two minutes on AWS), you're also waiting for those pre-emption windows to expire before a VM can be handed to you!
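A rough way to see that provisioning variance for yourself with boto3 (the AMI id, instance type, and region are placeholders; this is my sketch, not Blacksmith's tooling):

```python
import time
import boto3

AMI_ID = "ami-0123456789abcdef0"  # placeholder: substitute a real AMI id
ec2 = boto3.client("ec2", region_name="us-east-1")

def time_instance_launch() -> float:
    """Request one on-demand instance and time how long until it's running."""
    start = time.monotonic()
    resp = ec2.run_instances(
        ImageId=AMI_ID, InstanceType="c5.xlarge", MinCount=1, MaxCount=1
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    # Poll until the instance reaches the 'running' state.
    ec2.get_waiter("instance_running").wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": 5, "MaxAttempts": 120},
    )
    elapsed = time.monotonic() - start
    ec2.terminate_instances(InstanceIds=[instance_id])
    return elapsed

print(f"provisioned in {time_instance_launch():.1f}s")
```

Run it a handful of times at different hours of the day and the spread in launch times speaks for itself.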