Hacker News

Cloudflare outage on December 5, 2025

186 points by meetpateltech | today at 3:35 PM | 123 comments | view on HN

Comments

flaminHotSpeedo today at 3:49 PM

What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me that sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

show 10 replies
paradite today at 3:56 PM

The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms. Whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards needed to be prepared beforehand, covering the systems and key business metrics that would be affected by the deployment, and reviewed by teammates.

I never saw downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
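For illustration only, here is a minimal sketch of the kind of gate described above, under stated assumptions: the metric source, thresholds, and function names are all hypothetical, not Cloudflare's or any particular firm's tooling. The idea is simply to watch the prepared dashboard metric for a fixed window after a change and trigger the prepared rollback automatically on a spike.

```rust
use std::time::{Duration, Instant};

/// Hypothetical metric source; in practice this would query a dashboards
/// backend. Returns the current error rate in [0, 1].
fn current_error_rate() -> f64 {
    0.0 // stubbed out for the sketch
}

/// Watch the key dashboard metric for `window` after a change and invoke
/// the prepared `rollback` as soon as it crosses `threshold`.
fn watch_deploy(window: Duration, threshold: f64, rollback: impl Fn()) -> bool {
    let start = Instant::now();
    while start.elapsed() < window {
        if current_error_rate() > threshold {
            rollback(); // the rollback plan prepared before the change
            return false;
        }
        std::thread::sleep(Duration::from_secs(5));
    }
    true // the change held for the whole observation window
}

fn main() {
    // A 15-30 minute window in real life; a short one for the example.
    let healthy = watch_deploy(Duration::from_secs(30), 0.01, || {
        println!("error spike detected, rolling back");
    });
    println!("deploy healthy: {healthy}");
}
```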

show 3 replies
Scaevolus today at 3:45 PM

> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!
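As a sketch of what that correlation could look like (the types and config keys below are hypothetical, not Cloudflare's systems): keep a log of recent global config pushes and, when an error spike starts, surface every change that propagated in the seconds before it.

```rust
use std::time::{Duration, SystemTime};

struct ConfigChange {
    key: String,
    pushed_at: SystemTime,
}

/// Given recent global config pushes and the moment an error spike began,
/// return the changes that propagated shortly before the spike. With a
/// system that fans out worldwide in seconds, "shortly before" can be a
/// very tight window.
fn suspects(changes: &[ConfigChange], spike_at: SystemTime, window: Duration) -> Vec<&ConfigChange> {
    changes
        .iter()
        .filter(|c| {
            spike_at
                .duration_since(c.pushed_at)
                .map(|d| d <= window)
                .unwrap_or(false) // pushed after the spike: not a suspect
        })
        .collect()
}

fn main() {
    let now = SystemTime::now();
    let changes = vec![
        ConfigChange { key: "waf.monitoring_tool.enabled".into(), pushed_at: now - Duration::from_secs(20) },
        ConfigChange { key: "unrelated.flag".into(), pushed_at: now - Duration::from_secs(3600) },
    ];
    for c in suspects(&changes, now, Duration::from_secs(60)) {
        println!("recent global change before spike: {}", c.key);
    }
}
```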

show 1 reply
Bender today at 5:05 PM

Suggestion for Cloudflare: Create an early adopter option for free accounts.

Benefit: Earliest uptake of new features and security patches.

Drawback: Higher risk of outages.

I think this should be possible since they already differentiate between free, pro and enterprise accounts. I do not know how the routing for that works but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything PCI audit or FEDRAMP naughty-word related.
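A sketch of that idea under stated assumptions: the plan tiers are the real ones the commenter names, but the ring logic and field names are invented, and nothing here reflects how Cloudflare actually routes or stages releases. Opted-in free zones land in an early canary ring; everyone else gets changes only after that ring has soaked.

```rust
#[derive(Debug, PartialEq)]
enum Plan {
    Free,
    Pro,
    Enterprise,
}

#[derive(Debug)]
enum Ring {
    Canary,   // earliest features and security patches, highest outage risk
    Standard, // changes arrive after the canary ring has soaked
}

struct Zone {
    plan: Plan,
    early_adopter_opt_in: bool,
}

/// Hypothetical ring assignment: only free zones that explicitly opted in
/// act as crowd-sourced beta testers.
fn rollout_ring(zone: &Zone) -> Ring {
    if zone.plan == Plan::Free && zone.early_adopter_opt_in {
        Ring::Canary
    } else {
        Ring::Standard
    }
}

fn main() {
    let beta_tester = Zone { plan: Plan::Free, early_adopter_opt_in: true };
    let paying = Zone { plan: Plan::Enterprise, early_adopter_opt_in: false };
    println!("{:?}", rollout_ring(&beta_tester));
    println!("{:?}", rollout_ring(&paying));
}
```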

lionkor today at 4:45 PM

Cloudflare is now below 99.9% uptime, for anyone keeping track. I reckon my home PC is at least 99.9%.

uyzstvqs today at 5:03 PM

What I'm missing here is a test environment. Gradual or not, why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full isolated model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you, good procedure will.

jakub_g today at 5:00 PM

The interesting part:

After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%), which exercised a code path that had never been executed before:

> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset

> a straightforward error in the code, which had existed undetected for many years
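The post doesn't publish the Lua involved, but the shape of the failure is recognisable: skipping a step leaves a result unset, and the code that aggregates results assumes it is always set. Below is a hypothetical Rust sketch of that shape (not Cloudflare's actual rules module), where modelling the skipped sub-ruleset as an Option forces the aggregation step to handle the killswitched case explicitly.

```rust
#[derive(Debug)]
enum Action {
    Block,
    Execute { sub_ruleset: &'static str },
}

#[derive(Debug)]
struct RuleOutcome {
    matched: bool,
    // None when the rule was killswitched and its sub-ruleset never ran.
    sub_result: Option<bool>,
}

/// Evaluate one rule, honouring a killswitch. Skipping the "execute"
/// action is modelled explicitly as `sub_result: None`.
fn evaluate(action: &Action, killswitched: bool) -> RuleOutcome {
    match action {
        Action::Block => RuleOutcome { matched: true, sub_result: None },
        Action::Execute { .. } if killswitched => {
            RuleOutcome { matched: false, sub_result: None }
        }
        Action::Execute { .. } => {
            // Normally the sub-ruleset pointed to by the rule would run here.
            RuleOutcome { matched: true, sub_result: Some(true) }
        }
    }
}

/// The aggregation step. With an Option the compiler makes the
/// "sub-ruleset never evaluated" branch impossible to forget; in code
/// that assumes the field is always present, this is where a nil access
/// surfaces the first time the untested path actually runs.
fn overall_result(outcome: &RuleOutcome) -> bool {
    match outcome.sub_result {
        Some(sub_matched) => sub_matched,
        None => outcome.matched,
    }
}

fn main() {
    let rule = Action::Execute { sub_ruleset: "waf_subrules" };
    let outcome = evaluate(&rule, true); // killswitch applied
    println!("request blocked: {}", overall_result(&outcome));
}
```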

show 1 reply
miyuru today at 3:50 PM

What's going on with Cloudflare's software team?

I have seen similar bugs in cloudflare API recently as well.

There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
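As a sketch of the ordering concern (a hypothetical handler and types, not Cloudflare's API code): the plan check belongs at the top of the handler, before any other work is done, so non-enterprise callers get a clean rejection instead of failing at the last step.

```rust
#[derive(PartialEq)]
enum Plan {
    Free,
    Pro,
    Enterprise,
}

struct User {
    plan: Plan,
}

/// Hypothetical handler for an enterprise-only endpoint. The plan check
/// happens before any validation or backend work, so non-enterprise
/// callers get a consistent 403 rather than an error from the final step.
fn enterprise_only_endpoint(user: &User) -> Result<&'static str, &'static str> {
    if user.plan != Plan::Enterprise {
        return Err("403: enterprise plan required");
    }
    // ... validate input, touch backends, and so on, only after the gate.
    Ok("200: feature response")
}

fn main() {
    let free = User { plan: Plan::Free };
    let ent = User { plan: Plan::Enterprise };
    println!("{:?}", enterprise_only_endpoint(&free));
    println!("{:?}", enterprise_only_endpoint(&ent));
}
```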

rachr today at 4:24 PM

Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/

8cvor6j844qw_d6 today at 4:47 PM

Are there some underlying factors that resulted in the recent outages (e.g., new processes, layoffs, etc.) or is it just a series of pure coincidences?

show 2 replies
xnorswap today at 3:47 PM

My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

But a more important takeaway:

> This type of code error is prevented by languages with strong type systems

show 3 replies
egorfine today at 4:36 PM

> provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.

I have mixed feelings about this.

On one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.

At the same time, this is exactly what CloudFlare is good for: protecting sites from malicious requests.
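For what "buffers HTTP request body content in memory for analysis" roughly means in practice, here is a minimal sketch under assumptions: the marker, the cap, and the function names are made up, and this is not Cloudflare's proxy code. The body is read into memory up to a limit and scanned before forwarding; anything past the cap is simply never seen by the check, which is why the size of the buffer matters to the scan.

```rust
use std::io::{self, Read};

/// Buffer up to `limit` bytes of a request body in memory so it can be
/// scanned before the request is forwarded upstream.
fn buffer_body<R: Read>(body: &mut R, limit: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(limit.min(64 * 1024));
    body.take(limit as u64).read_to_end(&mut buf)?;
    Ok(buf)
}

/// A stand-in for WAF analysis: flag a known-bad marker in the body.
fn looks_malicious(body: &[u8]) -> bool {
    body.windows(b"$EVIL$".len()).any(|w| w == b"$EVIL$")
}

fn main() -> io::Result<()> {
    let payload: &[u8] = b"harmless prefix ... $EVIL$ ... rest of body";
    let mut reader = payload; // &[u8] implements Read
    let buffered = buffer_body(&mut reader, 1024 * 1024)?; // 1 MB cap
    println!("blocked: {}", looks_malicious(&buffered));
    Ok(())
}
```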

show 1 reply
nish__ today at 5:01 PM

Is it crazy to anyone else that they deploy every 5 minutes? And that it's not just config updates, but actual code changes with this "execute" action.

liampulles today at 4:35 PM

The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.

The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.

show 1 reply
gkoz today at 3:54 PM

I sometimes feel we'd be better off without all the paternalistic kitchen-sink features. The solid, properly engineered features used intentionally aren't causing these outages.

show 1 reply
dreamcompiler today at 4:03 PM

"Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."

"Why?"

"I've just been transferred to the Cloudflare outage explanation department."

denysvitali today at 3:53 PM

Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.

Say what you want, but I'd prefer to trust CloudFlare, who admits and acts upon their fuckups, rather than covering them up or downplaying them like some other major cloud providers.

@eastdakota: ignore the negative comments here; transparency is a very good strategy, and this article shows a good plan to avoid further problems.

show 4 replies
rany_ today at 4:07 PM

> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

show 2 replies
blibble today at 5:00 PM

amateur level stuff again

nish__ today at 4:57 PM

No love lost, no love found.

hrimfaxi today at 3:57 PM

Having their changes fully propagate within 1 minute is pretty fantastic.

show 2 replies
snafeau today at 3:53 PM

A lot of these kinds of bugs feel like they could be caught by a simple review bot like Greptile... I wonder if Cloudflare uses an equivalent tool internally?

show 2 replies
antiloper today at 3:52 PM

Make faster websites:

> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS office suite or reddit.

show 2 replies
_pdp_ today at 4:07 PM

So no static compiler checks and apparently no fuzzers used to ensure these rules work as intended?
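To make the fuzzing suggestion concrete, here is a property-style sketch (the evaluator and its types are invented, not Cloudflare's rules engine): enumerate the action/killswitch combinations and assert the evaluator never errors. A real setup would use a fuzzer such as cargo-fuzz or a property-testing crate over a far larger input space; this only shows the shape of the idea.

```rust
#[derive(Clone, Copy, Debug)]
enum Action {
    Block,
    Log,
    Execute,
}

/// Hypothetical rule evaluator; the point is only that it should never
/// return an error for any combination of inputs.
fn evaluate(action: Action, killswitched: bool) -> Result<bool, String> {
    match (action, killswitched) {
        (_, true) => Ok(false),           // killswitched rules never match
        (Action::Block, _) => Ok(true),
        (Action::Log, _) => Ok(false),
        (Action::Execute, _) => Ok(true), // sub-ruleset evaluation elided
    }
}

fn main() {
    let actions = [Action::Block, Action::Log, Action::Execute];
    let mut failures = 0;
    for &action in &actions {
        for &killswitched in &[false, true] {
            if evaluate(action, killswitched).is_err() {
                failures += 1;
                eprintln!("evaluator errored on {:?} (killswitch={})", action, killswitched);
            }
        }
    }
    println!("{failures} failing combinations");
}
```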

show 1 reply
fidotron today at 3:50 PM

> This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Come on.

This post-mortem raises more questions than it answers, such as why exactly China would have been immune.

show 1 reply
lapcat today at 3:55 PM

> This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.
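For a sense of what a single test of that path could look like, here is a hypothetical regression test; the types and logic are invented, since the real code isn't public. The point is that running the "killswitch applied to an execute rule" case even once under a test harness would have surfaced the error.

```rust
struct RulesetResult {
    // None models "the execute action's sub-ruleset was skipped".
    executed_sub_ruleset: Option<bool>,
}

/// Processing of the overall evaluation results. The None arm models the
/// branch that, per the post-mortem, had never been exercised; in the
/// buggy version this is where the error would surface.
fn finalize(result: &RulesetResult) -> Result<bool, String> {
    match result.executed_sub_ruleset {
        Some(sub_matched) => Ok(sub_matched),
        None => Ok(false),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn killswitched_execute_rule_still_finalizes() {
        // Simulate the killswitch: the sub-ruleset was never evaluated.
        let result = RulesetResult { executed_sub_ruleset: None };
        assert!(finalize(&result).is_ok());
    }
}

fn main() {
    // Running the path once, even outside a test, is enough to hit it.
    let skipped = RulesetResult { executed_sub_ruleset: None };
    println!("finalize on killswitched execute rule: {:?}", finalize(&skipped));
}
```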

show 1 reply
iLoveOncall today at 4:29 PM

The most surprising thing from this article is that CloudFlare handles only around 85M TPS.

show 1 reply
jgalt212 today at 4:38 PM

I do kind of like how they are blaming React for this.

rvz today at 4:04 PM

> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.

There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all; I expected Cloudflare to learn from the previous outage very quickly.

But this is quite a pattern; we might need to consider putting their unreliability next to GitHub's (which goes down every week).

kachapopopow today at 3:45 PM

why does this seem oddly familiar (fail-closed logic)

da_grift_shift today at 3:54 PM

It's not an outage, it's an Availability Incident™.

https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

show 1 reply
websiteapi today at 3:53 PM

I wonder why they cannot do a partial rollout. Like the other outage, they had to do a global rollout.

show 3 replies
jpeter today at 3:39 PM

Unwrap() strikes again

show 3 replies
barbazoo today at 3:41 PM

> Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Interesting.

show 1 reply
alwaysroot today at 4:02 PM

[flagged]