Hacker News

Show HN: I built a fuse box for microservices

18 points by rodrigorcs | yesterday at 2:04 PM | 14 comments

Hey HN! I'm Rodrigo, I run distributed systems across a few countries. I built Openfuse because of something that kept bugging me about how we all do circuit breakers.

If you're running 20 instances of a service and Stripe starts returning 500s, each instance discovers that independently. Instance 1 trips its breaker after 5 failures. Instance 14 just got recycled and hasn't seen any yet. Instance 7 is in half-open, probing a service you already know is dead. For some window of time, part of your fleet is protecting itself and part of it is still hammering a dead dependency and timing out, and all you can do is watch.

Libraries can't fix this. Opossum, Resilience4j, Polly are great at the pattern, but they make per-instance decisions with per-instance state. Your circuit breakers don't talk to each other.
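
For comparison, here's roughly what the per-instance version looks like with Opossum (the config numbers are just examples). Every instance constructs this itself and keeps these counters in its own memory, which is exactly the problem:

  import CircuitBreaker from 'opossum';

  // Each process tracks its own failures and makes its own trip decision.
  const breaker = new CircuitBreaker(chargeCustomer, {
    errorThresholdPercentage: 50, // trip when half of recent calls fail...
    volumeThreshold: 5,           // ...but only after at least 5 calls
    resetTimeout: 30000           // go half-open and probe again after 30s
  });

  const result = await breaker.fire(payload);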

Openfuse is a centralized control plane. It aggregates failure metrics from every instance in your fleet and makes the trip decision based on the full picture. When the breaker opens, every instance knows at the same time.

It's a few lines of code:

  const result = await openfuse.breaker('stripe').protect(
    () => chargeCustomer(payload)
  );
The SDK is open source, so anyone can see exactly what runs inside their services.
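
Conceptually (this is a simplified sketch of the idea, not the actual service code), the trip decision runs against the fleet-wide picture instead of any one instance's counters:

  // Each instance reports rolling-window counts; the decision uses the sum.
  type InstanceWindow = { successes: number; failures: number };

  function shouldTrip(windows: InstanceWindow[], errorThresholdPct: number, minVolume: number): boolean {
    const total = windows.reduce((n, w) => n + w.successes + w.failures, 0);
    const failed = windows.reduce((n, w) => n + w.failures, 0);
    return total >= minVolume && (failed / total) * 100 >= errorThresholdPct;
  }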

The other thing I couldn't let go of: when you get paged at 3am, you shouldn't have to dig through logs across 15 services to figure out what's broken. Openfuse gives you one dashboard showing every breaker state across your fleet: what's healthy, what's degraded, what tripped and when. And you shouldn't need a deploy to act. You can open a breaker from the dashboard and every instance stops calling that dependency immediately. Planned maintenance window at 3am? Open it beforehand. Fix confirmed? Close it instantly. Thresholds need adjusting? Change them in the dashboard and they take effect across your fleet in seconds. No PRs, no CI, no config files.
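
To make the 3am story concrete, here's a sketch of what the calling code's failure path can look like when a breaker is open (queueChargeForRetry is just an illustrative name, not part of the SDK):

  try {
    await openfuse.breaker('stripe').protect(() => chargeCustomer(payload));
  } catch (err) {
    // Breaker is open (or the call itself failed): fail fast and fall back.
    queueChargeForRetry(payload);
  }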

It has a decent free tier for trying it out, then $99/mo for most teams, and $399/mo for higher throughput and some enterprise features. Solo founder, early stage, being upfront.

Would love to hear from people who've fought cascading failures in production. What am I missing?


Comments

nevon | today at 9:52 AM

How does it deal with partial failures like the upstream being unreachable from one datacenter but not the other, or from one region but not another? Or when the upstream uses anycast or some other way to route to different origins depending on where the caller is?

Making your circuit breaker state global seems like it would just exacerbate the problem. Failures are often partial in the real world.

netik | today at 7:02 AM

This is a great idea, but it's only a great idea when it's on-prem.

During some thread, somewhere, there's going to be a roundtrip time between my servers and yours, and once I am at a scale where this sort of thing matters, I'm going to want this on-prem.

What's the difference between this and checking against a local cache before firing the request and marking the service down in said local cache so my other systems can see it?
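
Roughly what I mean, as a sketch (ioredis, names illustrative, and yes it's naive about thresholds):

  import Redis from 'ioredis';

  const redis = new Redis(); // a shared cache my app servers already reach over my own network

  async function callStripe<T>(fn: () => Promise<T>): Promise<T> {
    // Check the shared flag before firing the request.
    if (await redis.get('breaker:stripe') === 'open') throw new Error('stripe marked down');
    try {
      return await fn();
    } catch (err) {
      // Mark the service down so my other systems see it; the flag expires on its own.
      await redis.set('breaker:stripe', 'open', 'EX', 30);
      throw err;
    }
  }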

I'm also concerned about a false positive or a single system throwing an error. If it's a false positive, then the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working vs none when money is in play.

You also state that "The SDK keeps a local cache of breaker state" -- If I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's in a local cache like redis or memcache, I'm better off using my own network for "sub microsecond response" vs the time to go over the wire to talk to your service.

I've fought huge cascading issues in production at very large social media companies. It takes a bit more than breakers to solve these problems. Backpressure is a critical component of this, and often turning things off completely isn't the best approach.

kkapelon | today at 7:33 AM

> Your circuit breakers don't talk to each other.

Doesn't this suffer from the opposite problem though? There is a very brief hiccup for Stripe and instance 7 triggers the circuit breaker. Then all other services stop trying to contact Stripe even though Stripe has recovered in the meantime. Or am I missing something about how your platform works?

cluckindan | today at 7:52 AM

Instead of paying for a SaaS, a team can autoprogram an on-prem clone for less.

whalesalad | today at 4:32 AM

const openfuse = new OpenfuseCloud(...);

what happens when your service goes down

dsl | today at 5:36 AM

Now I have seen it all... a SaaS solution for making local outages global.
