<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: rodrigorcs</title><link>https://news.ycombinator.com/user?id=rodrigorcs</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 07:25:22 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=rodrigorcs" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>This is incredibly generous context... thank you. A few of these hit close to problems I'm thinking about.<p>The Decider pattern you're describing (reading keys from memcache to decide behavior at runtime) is essentially what Openfuse is trying to productize: a centralized place that tells your fleet how to behave, without each process figuring it out independently. So it's validating to hear that's where Twitter landed organically.<p>On the PM2 point: you're right, holding a connection per process doesn't scale well at that size. A local sidecar that receives state updates and exposes them via socket or shared memory to sibling processes is a much better model at that density. That's not how it works today (each process holds its own connection), but your framing is exactly how I'd want to evolve it. I can't say it's a short-term goal, though: I need to validate the product first, add some important features, and publish the self-hosted version.<p>On the dogpile: the half-open state is where this matters most. When a breaker opens and then transitions to half-open, you don't want 50 instances all sending probe requests simultaneously. The coalescing pattern you're describing from DataLoader is a neat way of solving it; I wonder if I can implement it without adding a service/proxy closer to the clients just for that.<p>On failure modes: agreed, "service is down" is the simplest case. Catatonic connections, slow degradation, partial responses that look valid but aren't: those are harder to classify. Right now Openfuse trips on error rates, timeouts, and latency. The back-end is ready for custom metrics, though; I just haven't implemented them yet. Tripping the breaker based on OpenTelemetry metrics is also something I'm looking forward to trying, which opens up a whole new world.<p>I'm not going to pretend this is built for Twitter-scale problems today. 
But hearing that the patterns you arrived at are directionally where this is headed is really encouraging.</p>
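<p>For the curious, here's a rough sketch of what that coalescing could look like (hypothetical code, not how Openfuse works today): concurrent callers share a single in-flight probe instead of each firing their own.

```javascript
// Hypothetical sketch of DataLoader-style probe coalescing: while a
// breaker is half-open, concurrent callers share one in-flight probe
// instead of each sending their own request to the recovering service.
class ProbeCoalescer {
  constructor() {
    this.inflight = null; // at most one probe promise at a time
  }

  // probeFn fires the actual request; every caller awaiting during the
  // same window gets the result of that single probe.
  probe(probeFn) {
    if (!this.inflight) {
      this.inflight = Promise.resolve()
        .then(probeFn)
        .finally(() => { this.inflight = null; }); // allow the next probe
    }
    return this.inflight;
  }
}
```

The same idea generalizes to any "only one of these should be in flight at a time" situation, which is exactly the half-open dogpile.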
]]></description><pubDate>Sat, 21 Feb 2026 13:04:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47100455</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47100455</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47100455</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>I know this sounds weird, but it is in fact self-hosting first :)<p>The reason I launched only the cloud version is so I could keep a faster iteration pace on the back-end while having people actually use it reliably.<p>Now it is pretty solid, and self-hosting is the next thing to go out.<p>If you check the SDK code, it is ready for self-hosting.</p>
]]></description><pubDate>Thu, 19 Feb 2026 18:24:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47077129</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47077129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47077129</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>> I don't really see what problem this solves. If you have proper timeouts and circuit breakers in your service this shouldn't really matter.<p>Each service discovering failures on its own isn't really the main problem my proposal solves; the thing is that by doing it locally, we lack observability and there is no way to act on the breakers.<p>> what we done is to create flag where we put the % value we want to bring back<p>Oh I see, well that is indeed a good problem to solve. Openfuse doesn't do that gradual recovery yet, but it would be possible to add.<p>Do you think that with that feature, plus a self-hosted version of Openfuse, you would give it a try? Not trying to sell you anything, just gathering feedback so I can learn from the discussion.<p>By the way, if you don't mind, how often do you have to run that type of recovery?</p>
]]></description><pubDate>Thu, 19 Feb 2026 17:20:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47076312</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47076312</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47076312</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>You're right: for intra-cluster calls where failures are scoped to the node itself and the infra around it, per-instance breakers are what you want. I wouldn't suggest centralizing those, and, I might be wrong, but in most of those scenarios there is no fallback anyway (except maybe Redis?).<p>Openfuse is aimed at the other case: shared external dependencies, where 15 services all call the same dependency and each one independently discovers the same outage at a different time. Different failure modes, different coordination needs, and you have no way to manually intervene or even just see what's open. Think of your house: every appliance has its own protection system, but that doesn't exempt you from having a distribution board.<p>You can also put it between your service/monolith and your own other services, e.g. if a recommendations engine or a loyalty system in an e-commerce or POS product goes down, hot-path flows in every other service just bypass their calls to it. So by "external" I mean another service, whether it's yours or a vendor's.<p>On the feature flag point: that's interesting, because you're essentially describing the pain of building circuit breaker behavior on top of feature flag infrastructure. The "switching back" problem you mention is exactly what the half-open state solves: controlled probe requests that test recovery automatically and restore traffic gradually, without someone manually flipping a flag and hoping.
That's the gap between "we can turn things off" and "the system recovers on its own." But yeah, we could call Openfuse feature flags for resilience; as I said, it's a fuse box for your microservices.<p>Curious how you handle the recovery side: is it a feature flag provider itself, or have you built something around one that stores state in your own database?</p>
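<p>For anyone following along, the closed/open/half-open cycle I keep referring to is roughly this (an illustrative sketch of the general pattern, not the Openfuse implementation):

```javascript
// Minimal sketch of a circuit breaker state machine: closed -> open on
// a failure threshold, open -> half-open after a cooldown, half-open ->
// closed on a successful probe (or back to open on a failed one).
class Breaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.state = 'closed';
    this.failures = 0;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.openedAt = 0;
  }

  // Called before each request; may move open -> half-open after cooldown.
  allowRequest(now = Date.now()) {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // let a probe through
    }
    return this.state !== 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'closed'; // a successful probe restores traffic
  }

  recordFailure(now = Date.now()) {
    this.failures++;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'; // probe failed, or threshold crossed
      this.openedAt = now;
    }
  }
}
```

The "switching back" pain with flag-based setups is that the half-open transition and the probe are exactly the parts a human has to do by hand.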
]]></description><pubDate>Thu, 19 Feb 2026 14:38:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47074225</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47074225</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47074225</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>Great question. Openfuse has a "systems" concept for exactly this. Each system is an isolated unit within an environment with its own breaker state. So you'd have us-east/stripe and eu-west/stripe as separate breakers. If Stripe is unreachable from us-east but healthy from eu-west, only the us-east breaker trips. The state is coordinated across all instances within a system, not globally across everything. You scope it to match your actual failure domains.</p>
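<p>In sketch form, scoping state by system is essentially keying breakers by system + dependency (made-up names, not the real API):

```javascript
// Illustrative sketch of per-system breaker scoping: "us-east/stripe"
// and "eu-west/stripe" are independent keys, so one region tripping
// never affects the other.
class BreakerRegistry {
  constructor() {
    this.states = new Map(); // `${system}/${dependency}` -> 'open' | 'closed'
  }

  key(system, dependency) {
    return `${system}/${dependency}`;
  }

  trip(system, dependency) {
    this.states.set(this.key(system, dependency), 'open');
  }

  isOpen(system, dependency) {
    return this.states.get(this.key(system, dependency)) === 'open';
  }
}
```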
]]></description><pubDate>Thu, 19 Feb 2026 11:26:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47072675</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47072675</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47072675</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>Feel free to take a look at the SDK code if you want to, it's open :)
<a href="https://github.com/openfuseio/openfuse-sdk-node" rel="nofollow">https://github.com/openfuseio/openfuse-sdk-node</a></p>
]]></description><pubDate>Thu, 19 Feb 2026 08:11:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071209</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071209</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>It makes the awareness global, so instances stop independently hammering a service that the rest of the fleet already knows is down. You can always override manually too, and it will propagate to all servers in under 15s.</p>
]]></description><pubDate>Thu, 19 Feb 2026 08:10:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071203</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071203</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071203</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>Yup, that is true for both the cloud and self-hosted versions: it never blocks any execution due to external factors; the only thing that blocks a call is a breaker that is KNOWN to be open. The state sync and the hot path are two completely separate flows.</p>
]]></description><pubDate>Thu, 19 Feb 2026 08:07:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071185</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071185</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071185</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>Good question; that's exactly why the trip decision isn't based on a single instance seeing a few errors. Openfuse aggregates failure metrics across the fleet before making a decision.<p>So instance 7 seeing a brief hiccup doesn't trip anything; the breaker only opens when the collective signal crosses your threshold (e.g., a 40% failure rate across all instances in a 30s window). A momentary blip from one instance doesn't affect the others.<p>And when it does trip, the half-open state sends controlled probe requests to test recovery, so if Stripe bounces back quickly, the breaker closes again automatically.</p>
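<p>The aggregation idea in sketch form (illustrative numbers and shapes, not the actual Openfuse API): the trip decision looks at the fleet-wide failure rate over the window, so one instance's blip can't move the needle.

```javascript
// Sketch: compute the collective failure rate across all instances'
// window reports, and trip only when the fleet-wide rate crosses the
// configured threshold (e.g. 0.4 = 40%).
function fleetFailureRate(reports) {
  // reports: [{ successes, failures }, ...] one entry per instance
  let total = 0, failed = 0;
  for (const r of reports) {
    total += r.successes + r.failures;
    failed += r.failures;
  }
  return total === 0 ? 0 : failed / total;
}

function shouldTrip(reports, threshold = 0.4) {
  return fleetFailureRate(reports) >= threshold;
}
```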
]]></description><pubDate>Thu, 19 Feb 2026 08:05:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071172</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071172</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071172</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>Totally possible, and some teams do. You need a state store, an evaluator job, a propagation layer to push state changes to every instance, an SDK, a dashboard, alerting, audit logging, RBAC, and a fallback strategy for when the coordination layer itself goes down.<p>None of it is complex individually, but it takes time, and it's the ongoing maintenance that gets you. Openfuse is a bet that most teams would rather pay $99/mo than maintain all that.<p>That said, a self-hosted option is on the near-term roadmap for teams that need it.</p>
]]></description><pubDate>Thu, 19 Feb 2026 08:01:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071147</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071147</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071147</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Show HN: I built a fuse box for microservices"]]></title><description><![CDATA[
<p>I agree with more of this than you might expect.<p>On-prem: You're right, and it's on the roadmap. For teams at the scale you're describing, a hosted control plane doesn't make sense. The architecture is designed to be deployable as a self-hosted service; the SDK doesn't care where the control plane lives, just that it can reach it (you can swap the OpenfuseCloud class for the plain Openfuse one, using your own URL).<p>Roundtrip time: The SDK never sits in the hot path of your actual request. It doesn't check our service before firing each call. It keeps a local cache of the current breaker state and evaluates locally; the decision to allow or block a request is pure local memory, not a network hop. The control plane pushes state updates asynchronously, so your request latency isn't affected. The propagation delay is how quickly a state change reaches all instances, not how long each request waits.<p>False positives / single system errors: This is exactly why aggregation matters. Openfuse doesn't trip because one instance saw one error. It aggregates failure metrics across the fleet, and you set thresholds on the collective signal (e.g., 40% failure rate across all instances in a 30s window). A single server throwing an error doesn't move that needle. The thresholds and evaluation windows are configurable precisely for this reason.<p>Local cache location: It's in-process memory, not Redis or Memcache. Each SDK instance holds the last known breaker state in memory. The control plane pushes updates to connected SDKs. So the per-request check is: read a boolean from local memory. The network only comes into play when state changes propagate, not on every call.
The cache size is ~57KB for 100 breakers, and ~393KB for 1,000 (which is quite extreme).<p>Backpressure: 100% agree, breakers alone don't solve cascading failures. They're one layer. Openfuse is specifically tackling the coordination and visibility gap in that layer, not claiming to replace load shedding, rate limiting, retry budgets, or backpressure strategies. Those are complementary. The question I'm trying to answer is narrower: when you do have breakers, why is every instance making that decision independently? Why do you have no control over what's going on? Why do you need a code change to temporarily disconnect your server from a dependency? And if you have 20 services, why configure them 20 times, once per repo?<p>Would love to hear more about what you've seen work at scale on the backpressure side. That would be a next step :)</p>
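<p>The hot-path model I described above, as a sketch (illustrative only, the actual SDK is linked elsewhere in the thread): per-request checks read local memory, and the network only appears when the control plane pushes a state change.

```javascript
// Sketch of the in-process state cache: the push/update channel writes
// to a local Map off the request path, and the per-request check is a
// plain in-memory read with no network hop.
class LocalStateCache {
  constructor() {
    this.open = new Map(); // breaker name -> boolean (true = open)
  }

  // Called by the async update channel, never by the request path.
  applyUpdate(name, isOpen) {
    this.open.set(name, isOpen);
  }

  // Called on every request: local memory only.
  isAllowed(name) {
    return this.open.get(name) !== true; // unknown breakers default to closed
  }
}
```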
]]></description><pubDate>Thu, 19 Feb 2026 07:58:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=47071129</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47071129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47071129</guid></item><item><title><![CDATA[Show HN: I built a fuse box for microservices]]></title><description><![CDATA[
<p>Hey HN! I'm Rodrigo, I run distributed systems across a few countries. I built Openfuse because of something that kept bugging me about how we all do circuit breakers.<p>If you're running 20 instances of a service and Stripe starts returning 500s, each instance discovers that independently. Instance 1 trips its breaker after 5 failures. Instance 14 just got recycled and hasn't seen any yet. Instance 7 is in half-open, probing a service you already know is dead. For some window of time, part of your fleet is protecting itself and part of it is still hammering a dead dependency and timing out, and all you can do is watch.<p>Libraries can't fix this. Opossum, Resilience4j, Polly are great at the pattern, but they make per-instance decisions with per-instance state. Your circuit breakers don't talk to each other.<p>Openfuse is a centralized control plane. It aggregates failure metrics from every instance in your fleet and makes the trip decision based on the full picture. When the breaker opens, every instance knows at the same time.<p>It's a few lines of code:<p><pre><code>  const result = await openfuse.breaker('stripe').protect(
    () => chargeCustomer(payload)
  );
</code></pre>
The SDK is open source, so anyone can see exactly what runs inside their services.<p>The other thing I couldn't let go of: when you get paged at 3am, you shouldn't have to hunt through logs across 15 services to figure out what's broken. Openfuse gives you one dashboard showing every breaker state across your fleet: what's healthy, what's degraded, what tripped and when.
And you shouldn't need a deploy to act. You can open a breaker from the dashboard and every instance stops calling that dependency immediately. Planned maintenance window at 3am? Open it beforehand. Fix confirmed? Close it instantly. Thresholds need adjusting? Change them in the dashboard; it takes effect across your fleet in seconds. No PRs, no CI, no config files.<p>It has a decent free tier for trying it out, then $99/mo for most teams, or $399/mo with higher throughput and some enterprise features. Solo founder, early stage, being upfront.<p>Would love to hear from people who've fought cascading failures in production. What am I missing?</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47061013">https://news.ycombinator.com/item?id=47061013</a></p>
<p>Points: 28</p>
<p># Comments: 23</p>
]]></description><pubDate>Wed, 18 Feb 2026 14:04:04 +0000</pubDate><link>https://www.openfuse.io</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=47061013</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47061013</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Ask HN: What are you working on? (February 2026)"]]></title><description><![CDATA[
<p>Really interesting idea! I've only seen stuff like that in ETL pipelines (which are a pain). This sits somewhere between a Python notebook and an ETL pipeline.<p>By the way, I just shared it in my company's Slack, and it looks like there is no OpenGraph data for it. Not a complaint, just pointing it out in case you didn't notice/think of it :)<p>Best of luck!</p>
]]></description><pubDate>Tue, 10 Feb 2026 22:34:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=46967933</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=46967933</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46967933</guid></item><item><title><![CDATA[New comment by rodrigorcs in "Ask HN: What are you working on? (February 2026)"]]></title><description><![CDATA[
<p>I've been building Openfuse (<a href="https://openfuse.io" rel="nofollow">https://openfuse.io</a>), a centralized circuit breaker platform.<p>I started building it about a year ago after dealing with the same problem across multiple companies: circuit breakers scattered across dozens of services, each configured slightly differently, with no single place to see what's happening when things go sideways. The existing options are either libraries you embed in every service (Resilience4j, opossum, etc.), leaving every server stateful, or a full service mesh, which is overkill for most teams.<p>Openfuse gives you a central control plane for circuit breaker policies across your stack. You define your reliability rules in one place, get visibility into breaker states, and can react without redeploying anything.<p>It's been a great project and I'm genuinely happy with where it landed. If you're running microservices or an integration-heavy monolith and have ever cursed at a cascading failure, I'd love to hear how you're handling it today! :)</p>
]]></description><pubDate>Tue, 10 Feb 2026 14:26:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=46960108</link><dc:creator>rodrigorcs</dc:creator><comments>https://news.ycombinator.com/item?id=46960108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46960108</guid></item></channel></rss>