The lack of additional alerts in the Remediation section is a little bit concerning. Adding an alert for serving stale root zone data is great, but I think a few more would be very useful too:
- There's a clear uptick in SERVFAIL responses at 7:00 UTC, but they didn't start responding until an hour later, after receiving external reports. This uptick should have automatically triggered an alert; it can't have been within the normal range, because they got customer reports about it.
- The resolver failed to load the root zone data on startup and resorted to a fallback path. Even if this isn't an error for the resolver, it should still be an alert for the static_zone service, because its only client is failing to consume its data.
- The static_zone service should also alert when some percentage of instances fail to parse the root zone data, to get ahead of potential problems before the existing data becomes stale.
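For illustration, here's a rough sketch of what those three alerts might look like as Prometheus-style rules. The metric names (dns_responses_total, resolver_root_zone_fallback_total, static_zone_last_parse_success) and the thresholds are hypothetical stand-ins I've made up, not anything from Cloudflare's post:

```yaml
groups:
  - name: resolver-root-zone
    rules:
      # 1. Sustained SERVFAIL rate well above what should be normal.
      - alert: ResolverServfailRateHigh
        expr: |
          sum(rate(dns_responses_total{rcode="SERVFAIL"}[5m]))
            / sum(rate(dns_responses_total[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "SERVFAIL responses above 2% of all answers for 10 minutes"

      # 2. Resolvers taking the fallback path instead of consuming static_zone data.
      - alert: StaticZoneClientLoadFailures
        expr: sum(rate(resolver_root_zone_fallback_total[15m])) > 0
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Resolvers are falling back instead of loading static_zone root data"

      # 3. A fraction of static_zone instances failing to parse the root zone.
      - alert: StaticZoneParseFailures
        expr: |
          count(static_zone_last_parse_success == 0)
            / count(static_zone_last_parse_success) > 0.05
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "More than 5% of static_zone instances failed to parse the root zone"
```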
The alerts you suggested are sufficiently obvious that I'm sure the team has already implemented, or plans to implement, them. The public postmortem report is likely just a small snippet of the more interesting remediation actions.
SERVFAIL might not be a good enough signal for alerting. I definitely see third-party DNS providers returning SERVFAIL for their own reasons; if it's a popular host (looking at you, Route 53), then you'll proxy those through and end up alerting on AWS's issue instead of your own.
They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems. (I remember Google flagging itself as malware many, many years ago. Nice to have a continuous check for that sort of thing, which I'm sure they do now.)
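A minimal prober along those lines is easy to sketch. Here's roughly what it could look like in Python with dnspython; the probe name, resolver addresses, and interval are just example values, and a real one would push a metric or page rather than print:

```python
#!/usr/bin/env python3
"""Minimal DNS prober: ask each resolver address for a known-good name once a
minute and complain loudly if any lookup fails."""

import time

import dns.exception
import dns.resolver

PROBE_NAME = "cloudflare.com"        # a name that should always resolve
RESOLVERS = ["1.1.1.1", "1.0.0.1"]   # addresses to probe
INTERVAL_SECONDS = 60


def probe(address: str) -> None:
    """Raise a DNSException if `address` cannot answer for PROBE_NAME."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [address]
    resolver.lifetime = 5  # overall timeout per query, in seconds
    resolver.resolve(PROBE_NAME, "A")


def main() -> None:
    while True:
        for address in RESOLVERS:
            try:
                probe(address)
            except dns.exception.DNSException as exc:
                # A real prober would page someone or emit a metric here.
                print(f"PROBE FAILURE: {address} could not resolve {PROBE_NAME}: {exc!r}")
        time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```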
> They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems.
Yeah this was my first thought too. Why don't they have such a system in place for as many permutations of their public services as they can think of? We're a small company and we've had this for critical stuff for several years.
In my experience it's because the monitoring system is usually controlled by another team. So they don't know what they should be testing, and the developers who do know aren't easily able to set it up as part of deployment.
Add in issues like network visibility, and you wind up needing a cross-team, cross-org effort just to stand up an HTTP poller and get the traffic back to some server - so going into production without it winds up being the easier path (because it'll work fine - provided nothing goes wrong).
The real problem here is that things started failing on September 21, but no one noticed until October 4th. Why was there no logging when resolvers started failing to load the local root zone on September 21?
I noticed this issue quite quickly ("reported" at 7:54 UTC [1]), and I noticed I wasn't alone thanks to Twitter / X. I tried to get in touch with Cloudflare to report this issue - but I haven't found any meaningful contact other than Twitter.
For such an important service, I'm impressed there is no contact email / form where you can get in touch with the engineers responsible for keeping the service up and running.
Other than that, kudos for the well written blog post - as always!
Do you truly believe that the turnaround time for you contacting support, speaking to a first-tier support agent, having the request escalated to engineering, having a support engineer triage it to the right team, and finally having the report gauged as urgent enough to page the on-call for that team, is faster than the automated infrastructure monitoring those teams already have?
For something as core as DNS, I'm sure engineers were aware of the issue within minutes. There's a lot of politics and process between that and public acknowledgment of an incident.
Even when I had just 2 paid domains with Cloudflare, there was real-time chat support that always seemed to have an engineer available when something came up.
Maybe they don't provide support like that to free users.
Discord is the place to go. All CF Devs are there.
I got one particular fix in just a few minutes, and it escalated pretty quickly. I got a recommended fix on my side (a code change in my deployment that worked around the problem) and a permanent fix on CF's side 4h later (the time it took for the change to propagate to all colos).
I don't think that statement makes sense here. In what way is the user of 1.1.1.1 the product? And for cloud services in general, they mostly tend to be subscription or usage based. In both cases, the user is clearly the customer.
I get the "you're the product, not the customer" for social networks or ad-based search engines, but certainly not for cloud products.
Might be a stretch for cloud services generally, but for public dns resolvers there's definitely an argument to be made: a) you pay nothing for the service, b) providing the service costs money, c) the provider benefits indirectly from the added insights and market power of user traffic
People who are actually correct on the basis of significant experience don't tend to feel it necessary to remind others at every possible occasion (unless not doing so would lead to significant harm to others).
The buried lede here seems to be that this is yet another serious outage (indirectly) caused by using DNSSEC, though I understand why they don’t emphasize this part, given their strong advocacy for DNSSEC adoption.
Having had to troubleshoot a third-party service not so dissimilar to 1.1.1.1, and prove to them that their infrastructure was misbehaving in a similar manner, I'll take the error, thank you.
I don't know why Cloudflare, like Amazon, often gets a free pass on HN for their DNS implementation bugs. Regardless of DNSSEC's merits or otherwise, this bug isn't inherent to DNSSEC.
Strangely I noticed this because some parts of eBay stopped loading. I spent a while troubleshooting my privacy/adblock nonsense because surely CloudFlare couldn't be down but that's the only conclusion I could come to.
I thought it was a WSL (Linux on Windows) issue. I did some googling and saw people discussing DNS issues on WSL, which were supposedly resolved by setting up a manual DNS config.
They used 8.8.8.8 in their example, which I of course changed to 1.1.1.1!
(Un)surprisingly the issue did not resolve after this, and I went on to do other things, hoping the gremlins would find something better to do in the meantime.
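For anyone who hits the same WSL rabbit hole: the manual DNS workaround those threads describe boils down to telling WSL to stop regenerating resolv.conf and then pinning your own resolver. Roughly (the paths and restart step below are the usual ones, but check your distro):

```
# /etc/wsl.conf inside the distro; run `wsl --shutdown` afterwards
[network]
generateResolvConf = false
```

```
# /etc/resolv.conf, created by hand after removing the auto-generated one
nameserver 1.1.1.1
nameserver 1.0.0.1
```

Of course, as the parent found out, pointing at 1.1.1.1 couldn't have helped here, since the resolver itself was what was failing.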
CF Workers (which run WebAssembly) are all over the place. They may not run the main logic (not the actual Nginx or DNSSEC code), but they are used for several maintenance tasks.
These guys always have the best write-ups. I aspire to write postmortems like they do. In fact, maybe my next context for a ChatGPT query will be “you are a member of Cloudflare’s public relations team…”
All joking aside, it occurred to me that the vast majority of internet and even tech users know very little about DNS. For the longest time, I was in the same boat. After having been in a role where it was necessary to understand the record types and deploy DNS configs, I’m quite thankful I learned. Just remember…it’s always DNS.
I love the visualisations. It's really hard to make visualisations that are easy to read, look pretty and are still technically accurate. Cloudflare really excels at that.
It's certainly possible to do. Public recursive DNS isn't that big of a spec that you can't have two independent teams doing it, and mandate they do things differently when possible. Run the two /24s on different ASNs, too.
I've had sales teams pitch me on their authoritative DNS service running heterogeneously, although I'm guessing both partitions gathered config from a single place, but we didn't get that far in the potential customer pipeline.
Of course, just because it's possible doesn't mean they're likely to do it.
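As a purely illustrative sketch of what "heterogeneous" could mean at the software level (an assumption about one possible setup, not how Cloudflare actually runs 1.1.1.1): one partition runs Unbound answering on the 1.1.1.0/24 address, the other runs Knot Resolver on 1.0.0.0/24, each announced from its own ASN.

```
# Partition A: Unbound serving 1.1.1.1 (unbound.conf)
server:
    interface: 1.1.1.1
    do-ip4: yes
    do-udp: yes
    do-tcp: yes
```

```lua
-- Partition B: Knot Resolver serving 1.0.0.1 (kresd config, which is Lua)
net.listen('1.0.0.1', 53, { kind = 'dns' })
net.listen('1.0.0.1', 853, { kind = 'tls' })
```

The hard part, as noted above, is keeping the pipeline that feeds both partitions their config and zone data genuinely independent; otherwise you've just moved the single point of failure.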
I mean, if it's truly the same then it sort of defeats the purpose.
That's my way of saying: it is separate infrastructure.
Remember that Cloudflare is probably the second largest DNS resolver in the world. They aren't just going to tack two IPs onto the same system. The entire system would be completely independent, for reasons exactly like what happened today.
Some people on my team experienced random breakage around 3rd party 2FA yesterday during this time period. I wouldn't be at all surprised if this was the cause, somehow. I've no idea how, though. It's so easy to become dependent on single points of failure without realising.
We noticed this through our own, homegrown scripts that check for this, having been screwed by an outage a few years ago. I'm happy they so quickly acknowledge and explain these issues. Good work!
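We obviously don't know what the parent's scripts look like, but a minimal homegrown check in that spirit could compare a few public resolvers against each other, so that a single provider failing stands out from a problem on your own side. A sketch in Python with dnspython, with the test name and resolver list as placeholder choices:

```python
#!/usr/bin/env python3
"""Cross-check a few public resolvers and report any that fail or return an
empty answer for a name that should always exist."""

import dns.exception
import dns.resolver

TEST_NAME = "example.com"
PUBLIC_RESOLVERS = {
    "cloudflare": "1.1.1.1",
    "google": "8.8.8.8",
    "quad9": "9.9.9.9",
}


def healthy(address: str) -> bool:
    """Return True if the resolver at `address` gives a non-empty A answer."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [address]
    resolver.lifetime = 5
    try:
        return len(resolver.resolve(TEST_NAME, "A")) > 0
    except dns.exception.DNSException:
        return False


def main() -> None:
    results = {name: healthy(addr) for name, addr in PUBLIC_RESOLVERS.items()}
    failing = [name for name, ok in results.items() if not ok]
    if failing and len(failing) < len(results):
        # Only some resolvers failing points at those providers, not at us.
        print("Resolver(s) likely broken: " + ", ".join(failing))
    elif failing:
        print("All resolvers failing: suspect our own network or the test domain.")


if __name__ == "__main__":
    main()
```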
I would rather they be open about their failures than deceptive about it. Of course simply not failing would be ideal, but we don't live in a perfect world. If a single, external point of failure causes your system to crumble, that's a design problem, not a dependency problem.
To your point, Cloudflare leadership are pretty active on HN. They generally do a pretty good job of providing detailed explanations to good-faith questions here and providing decent post-mortems of major incidents to the HN community.
They do take care to avoid engaging with people who are opposed to their dominance on ideological levels ("no one should be the gatekeeper for that much of the internet", etc) and there are a small handful of questions they seem to avoid (e.g. direct feature-to-feature comparisons between Warp and Mullvad)
They use transparency as a cover for rookie mistakes; that's not the same as actual transparency. Especially since these are really bad examples of getting it wrong.
They're practicing "just culture" (as in justice), which rewards explaining and root causing your failures, and rejects the concept that "someone sucks" in favor of "systems can always be improved".
In general, I find comments along these lines very easy to thoroughly disprove. There has been consistent criticism of Cloudflare for many years, ever since the majority of web traffic started going through their anti-DDoS and anti-bot gateways.
Here's a HN post with lots of very critical comments[0] from 7 years ago, including a fairly scathing one from 'tptacek. Even way back then, you'd get the same comments you hear today like:
> So rather than demand fixes for the fundamental issues that enable ddos attacks (preventing IP spoofing, allowing infected computers to remain connected, etc), we just continue down this path of massive centralization of services into a few big players that can afford the arms race against botnets. Using services like Cloudflare as a 'fix' is wrecking the decentralized principles of the Internet. At that point we might as well just write all apps as Facebook widgets.
> My guess would be their weird ‘site protection’ stuff is burning too many people and negatively impacting their reputation.
What's always been interesting to me about this take is it's not as though Cloudflare is randomly inserting themselves in internet traffic.
Cloudflare customers have choice in the marketplace and they chose Cloudflare for whatever reasons. If end-users take issue with accessing the site of a Cloudflare customer they should take it up with the owners of the site that chose Cloudflare. Theoretically the Cloudflare customer would take it up with them if it becomes problematic. Cloudflare has no obligation to the site end-users other than meeting the needs of their customer who does have obligation to their end-users (theoretically).
Cloudflare is, ostensibly, providing a solution for their customers. How that impacts their customer's end-users is between Cloudflare and the customer.
I've never loved Cloudflare - as someone who was doing this long before they existed, I see through their wordy blog posts about rookie mistakes. It's embarrassing, really.
300 pops around the world delivering 210 Tbps of capacity, mitigation of some of the largest DDoS attacks in history, 20% of internet traffic. Workers, Pages, R2, D1, Zero Trust, Stream, Images, Warp, 1.1.1.1, etc, etc, etc - all at incredible scale.
But yes, of course you have been doing the exact same thing since before Prince was born.
Look at a historical graph of internet users, bandwidth, etc. "Same scale long before" just isn't possible.
I'm not saying there isn't CloudFront, Akamai, Fastly, Azure Edge/Verizon, etc. Hell, UUNet, whatever you want. I'm saying the idea of someone providing hundreds of terabits of connectivity, connecting to over 12,000 networks, and supporting what is likely at least a billion users before Prince was born is completely absurd and impossible. There were only 118 million telephones worldwide in 1958 (the year Prince was born)[0].
I'm not referring to "DevOps crap". I'm referring to a wide product suite of functionality and geographic spread and scale that at best even 20 years ago would have taken an army of sysadmins and developers to build and maintain with a staggering fleet of Linux boxes running LAMP or whatever you prefer.
I was a very early Fastly customer (~2013 or so). I continue to look at them from time to time and what Oracle has done to them is atrocious (yet typical). They clearly have some usage and market share within that target market.
Speaking of market share, Cloudflare gets the most attention because they (by far) have the largest market share in terms of CDN/DDoS/etc and anything they do has the most significant impact on internet users at large. Depending on your source Cloudflare has roughly 50% of "CDN" marketshare, Fastly has something in the single digit percent range. Even Amazon CloudFront is around half that of Cloudflare.
Between Oracle and having less than 1/10th the market share that's why no one talks about Fastly. Compared to Cloudflare they're essentially irrelevant unless you're one of Oracle's enterprise customers that will deal with their sales people and tactics.
I was confused at first how Prince is relevant, but it seems GP is referring to the CEO of Cloudflare, Matthew Prince, who was born in 1974. (1958 appears to be the birth year of Prince the musician)