1.1.1.1 lookup failures on October 4th, 2023 (cloudflare.com)
226 points by todsacerdoti on Oct 4, 2023 | 80 comments


The lack of additional alerts in the Remediation section is a little bit concerning. Adding an alert for serving stale root zone data is great, but I think a few more would be very useful too:

- There's a clear uptick in SERVFAIL responses at 7:00 UTC, but they didn't start their response until an hour later, after receiving external reports. This uptick should have automatically triggered an alert (a rough version of such a check is sketched after this list). It can't have been within the normal range, because they got customer reports about it.

- The resolver failed to load the root zone data on startup and resorted to a fallback path. Even if this isn't an error for the resolver, it should still be an alert for the static_zone service, because its only client is failing to consume its data.

- The static_zone service should also alert when some percentage of instances fail to parse the root zone data, to get ahead of potential problems before the existing data becomes stale.
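
To make the first bullet concrete, here's a minimal sketch of a SERVFAIL-ratio check, assuming hypothetical per-interval counters scraped from the resolver fleet (the metric names and thresholds are made up for illustration):

    def should_alert(servfail_count: int, total_count: int,
                     baseline_ratio: float = 0.001, multiplier: float = 3.0) -> bool:
        """Fire when the SERVFAIL share of answers exceeds a multiple of its baseline."""
        if total_count == 0:
            return False
        return servfail_count / total_count > baseline_ratio * multiplier

    # e.g. should_alert(servfail_count=40_000, total_count=1_000_000) -> True

In practice you'd feed this from whatever metrics pipeline already counts response codes per colo, which is presumably data they already have.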


The alerts you suggested are sufficiently obvious that I'm sure the team has already implemented, or plans to implement, them. The public postmortem report is likely just a small snippet of the more interesting remediation actions.


SERVFAIL might not be a good enough signal for alerting. I definitely see third-party DNS providers returning SERVFAIL for their own reasons; if it's a popular host (looking at you, Route 53), then you'll proxy those through and end up alerting on AWS's issue instead of your own.

They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems. (I remember Google adding itself as malware many many years ago. Nice to have a continuous check for that sort of thing, which I am sure they do now.)
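
For what it's worth, a bare-bones version of that prober is only a few lines; this sketch assumes dnspython and treats the resolver list, probe name, and "alerting" as placeholders:

    import dns.message
    import dns.query
    import dns.rcode

    RESOLVERS = ["1.1.1.1", "1.0.0.1"]   # probe each public address separately
    NAME = "cloudflare.com"

    def probe(server: str) -> bool:
        """Return True if the resolver answers NAME successfully."""
        query = dns.message.make_query(NAME, "A")
        try:
            response = dns.query.udp(query, server, timeout=2)
        except Exception:
            return False   # timeouts and network errors count as failures
        return response.rcode() == dns.rcode.NOERROR

    if __name__ == "__main__":
        for server in RESOLVERS:
            if not probe(server):
                print(f"ALERT: {NAME} lookup failed via {server}")

Run that from cron every minute and the 7:00 UTC uptick pages someone well before the external reports arrive.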


> They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems.

Yeah this was my first thought too. Why don't they have such a system in place for as many permutations of their public services as they can think of? We're a small company and we've had this for critical stuff for several years.


In my experience it's because the monitoring system is usually controlled by another team. So they don't know what they should be testing, and the developers who know how to do it aren't easily able to set it up as part of deployment.

Add issues like network visibility, and you wind up talking about a cross-team, cross-org effort to stand up an HTTP poller and get the traffic back to some server - and so going into production without it winds up being the easier path (because it'll work fine - provided nothing goes wrong).


The real problem here is that things started failing on September 21, but no one noticed until October 4th. Why was there no logging when resolvers started failing to load the local root zone on September 21?


not to be glib but you should consider working for them


If you can see shit falling apart and you're not even inside the org, it's probably a tire fire, and working there will just be stressful


"shit falling apart" is a little dramatic. They're a big reputable org that always does writeups when they fuck up. tech people appreciate that.

I have my own unrelated issues with Cloudflare as a company


Sure. CloudFlare recruiters, DM me :-)


My only concern:

7:57 UTC: first reports coming in

I noticed this issue quite quickly ("reported" at 7:54 UTC [1]), and I noticed I wasn't alone thanks to Twitter / X. I tried to get in touch with Cloudflare to report this issue, but I couldn't find any meaningful contact other than Twitter.

For such an important service, I'm surprised there is no contact email / form where you can get in touch with the engineers responsible for keeping the service up and running.

Other than that, kudos for the well written blog post - as always!

[1]: https://nitter.net/DenysVitali/status/1709476961523835246


Do you truly believe that the turnaround time for you contacting support, speaking to a first-tier support agent, having the request escalated to engineering, having a support engineer triage it to the right team, and finally having the report gauged as urgent enough to page the on-call for that team, is faster than the automated infrastructure monitoring that the teams already have?

For something as core as DNS, I'm sure engineers were aware of the issue within minutes. There's a lot of politics and process between that and public acknowledgment of an incident.


The timeline doesn't mention any internal monitoring system, only "external reports".

What I want as an engineer is an engineering contact (similar to a NOC) that "normal customers" aren't aware of.

Something as critical as DNS should have that - but maybe I expect too much from a free service.


But you are sure of this, even though there were no indications in that direction, and several in the opposite.


This sounds so foreign to my experience.

Even when I just had 2 paid domains with Cloudflare, there was real time chat support that seemed to always have an available engineer when something came up.

Maybe they don't provide support like that to free users.


You have to pay $200/month for access to live chat support.

https://www.cloudflare.com/plans/


I don't even have a Cloudflare account. You don't need a Cloudflare account to use their public DNS servers


You get what you pay for, I guess.


Discord is the place to go. All CF Devs are there.

One time I got a particular fix in just a few minutes, and it escalated pretty quickly. I got a recommended fix on my side (a code change in my deployment that covered the problem) and a permanent fix on CF's side 4h later (the time it took for the change to propagate to all colos).


> For such an important service, I'm impressed there is no contact email

cloud: you are the product, not the customer


I don't think that statement makes sense here. In what way is the user of 1.1.1.1 the product? And for cloud services in general, they mostly tend to be subscription or usage based. In both cases, the user is clearly the customer.

I get the "you're the product, not the customer" for social networks or ad-based search engines, but certainly not for cloud products.


Might be a stretch for cloud services generally, but for public DNS resolvers there's definitely an argument to be made: a) you pay nothing for the service, b) providing the service costs money, c) the provider benefits indirectly from the added insights and market power of user traffic.


And I wouldn't consider 1.1.1.1 to be a "cloud product". You don't need a Cloudflare subscription to use it


dude, there's even cloud in the name of the vendor


I'm expecting @tptacek to come in anytime reminding us of his thoughts on DNSSEC

https://sockpuppet.org/blog/2015/01/15/against-dnssec/


People who are actually correct on the basis of significant experience don't tend to feel it necessary to remind others at every possible occasion (unless not doing so would lead to significant harm to others).

They're usually busy doing other things.


> unless not doing so would lead to significant harm to others

Ding ding ding.

At any rate: it's very funny that DNSSEC took 1.1.1.1 down, but this bug can't honestly be pinned on DNSSEC itself.


In a way they're lucky DNSSEC took it down; otherwise they may not have noticed the issue of using stale data for much longer.


They expired the data correctly, which uncovered a bug in fetching a new DNSSEC record.

If DNSSEC hadn't added unnecessary new complexity to an otherwise working system, there would have been no bug and no stale data.


"Grant me the serenity to accept the things I cannot change, the courage to change the things I can and the wisdom to know the difference."

but yea, repeatedly running into a wall of ignorance encourages shifting to "busy doing other things"


The buried lede here seems to be that this is yet another serious outage (indirectly) caused by using DNSSEC, though I understand why they don’t emphasize this part, given their strong advocacy for DNSSEC adoption.


As I read it, DNSSEC signature expiry was what triggered people noticing the root data was stale. That would seem to be a somewhat positive outcome.
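
That mechanism is easy to observe from the outside: the RRSIG expiration timestamps on the stale answers fall into the past, and a validating resolver starts returning SERVFAIL. A rough sketch with dnspython (the CD flag asks the resolver to hand back data even when its own validation fails; everything else is illustrative):

    import time

    import dns.flags
    import dns.message
    import dns.query
    import dns.rdatatype

    # Ask 1.1.1.1 for the root SOA with DNSSEC records included (DO bit),
    # and with checking disabled so stale, signed data still comes back.
    query = dns.message.make_query(".", dns.rdatatype.SOA, want_dnssec=True)
    query.flags |= dns.flags.CD
    response = dns.query.udp(query, "1.1.1.1", timeout=2)

    now = time.time()
    for rrset in response.answer:
        if rrset.rdtype != dns.rdatatype.RRSIG:
            continue
        for sig in rrset:
            # sig.expiration is a UNIX timestamp; a value in the past means
            # the signature (and the data it covers) has gone stale.
            if sig.expiration < now:
                covered = dns.rdatatype.to_text(sig.type_covered)
                print(f"expired RRSIG covering {covered}")

A monitor on the minimum remaining RRSIG lifetime of the data being served would presumably have fired days before the signatures actually lapsed.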


> DNSSEC signature expiry was what triggered people noticing the root data was stale

If you noticed your brakes had failed because you ended up in a ditch I wouldn't really say that's a positive outcome.

Frankly I can't believe they don't have better monitoring for a system as critical as that.
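
A staleness check like that isn't hard to build externally either: compare the root zone SOA serial the resolver serves against one fetched directly from a root server. A sketch, again assuming dnspython (the choice of a.root-servers.net and the alert condition are arbitrary, and there's no error handling; an empty answer would itself be alert-worthy):

    import dns.message
    import dns.query

    A_ROOT = "198.41.0.4"   # a.root-servers.net
    RESOLVER = "1.1.1.1"

    def root_serial(server: str) -> int:
        """Return the root zone SOA serial as seen by the given server."""
        response = dns.query.udp(dns.message.make_query(".", "SOA"), server, timeout=2)
        return response.answer[0][0].serial

    if __name__ == "__main__":
        expected = root_serial(A_ROOT)
        observed = root_serial(RESOLVER)
        if observed < expected:
            print(f"ALERT: {RESOLVER} serves root serial {observed}, root says {expected}")

The root zone serial is typically bumped a couple of times a day, so a resolver sitting on a weeks-old serial (as in this incident) stands out immediately; a small tolerance avoids paging on ordinary propagation lag.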


Would you rather have an error or stale DNS?


Having had to troubleshoot a third-party service not so dissimilar to 1.1.1.1 and prove to them that their infrastructure was misbehaving in a similar manner, I'll take the error thank you.


Well, it's a free public resolver service. There are plenty of other options, and you can even run your own very easily.


I don't know why Cloudflare, like Amazon, often gets a free pass on HN for their DNS implementation bugs. Regardless of DNSSEC's merits or otherwise, this bug isn't inherent to DNSSEC.


Strangely, I noticed this because some parts of eBay stopped loading. I spent a while troubleshooting my privacy/adblock nonsense because surely Cloudflare couldn't be down, but that's the only conclusion I could come to.


I thought it was a WSL (Linux on Windows) issue. I did some googling and saw some people discussing DNS issues on WSL, which were supposedly resolved by setting up a manual DNS config.

They used 8.8.8.8 in their example, which I of course changed to 1.1.1.1!

(Un)surprisingly the issue did not resolve after this, and I went on to do other things, hoping the gremlins would find something better to do in the meantime.


> 1.1.1.1 has a WebAssembly app called static_zone running on top of the main DNS logic that serves those new versions when they are available.

webassembly? what is that word even doing in a post mortem about DNSSEC failures?


CF Workers (which run WebAssembly) are all over the place. They may not run the main logic (not the actual Nginx or DNSSEC code), but they are used for several maintenance tasks.


Wasm runtimes are great for stuff like plugins! Seems like this static_zone thing was something like that (but they call them apps)
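
For anyone who hasn't seen the embedding side: hosting a Wasm "plugin" next to your main logic is only a handful of lines with, say, the wasmtime Python bindings. The toy module below is obviously nothing like static_zone; it just shows the shape of the approach:

    from wasmtime import Engine, Instance, Module, Store

    # A trivial guest module exporting one function; a real plugin would be
    # compiled from Rust/C/etc. and given a much richer host interface.
    WAT = """
    (module
      (func (export "add") (param i32 i32) (result i32)
        local.get 0
        local.get 1
        i32.add))
    """

    engine = Engine()
    store = Store(engine)
    module = Module(engine, WAT)          # accepts WAT text or a .wasm binary
    instance = Instance(store, module, [])
    add = instance.exports(store)["add"]
    print(add(store, 2, 3))               # -> 5

The appeal of the approach is that the host can load or replace modules at runtime without rebuilding the main binary, which is presumably why it shows up in a DNS data-serving path at all.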



Earlier discussion while outage was active: https://news.ycombinator.com/item?id=37763143


These guys always have the best write-ups. I aspire to write postmortems like they do. In fact, maybe my next context for a ChatGPT query will be “you are a member of CloudFlare’s public relations team…”

All joking aside, it occurred to me that the vast majority of internet and even tech users know very little about DNS. For the longest time, I was in the same boat. After having been in a role where it was necessary to understand the record types and deploy DNS configs, I’m quite thankful I learned. Just remember…it’s always DNS.

edit: typo


I aspire to never write a postmortem again.


I love the visualisations. It's really hard to make visualisations that are easy to read, look pretty and are still technically accurate. Cloudflare really excels at that.


Did 1.0.0.1 also go down? The article doesn't say.


Of course it did; it's the same service.


A highly reliable service might run one partition on a completely separate serving stack. It's worth asking.


With a completely separate architecture that also loads the root zones in a completely different way?


It's certainly possible to do. Public recursive DNS isn't that big of a spec that you can't have two independent teams doing it, and mandate they do things differently when possible. Run the two /24s on different ASNs, too.

I've had sales teams pitch me on their authoritative DNS service running heterogeneously, although I'm guessing both partitions gathered config from a single place; we didn't get that far in the potential customer pipeline.

Of course, just because it's possible doesn't mean they're likely to do it.


Maybe also run by a separate company with a different CEO for additional redundancy. :) so now we've got 1. and 8.


Best to serve from a different planet, in case something happens to this one.


I hear Musk has been working on that. (j/k, 40 min ping time...)


I mean, if it's truly the same then it sort of defeats the purpose.

That's my way of saying: it is separate infrastructure.

Remember that Cloudflare is probably the second largest DNS resolver in the world. They aren't just going to tack two IPs onto the same system. The entire system would be completely independent, for reasons exactly like what happened today.


It is the same system. It's multicast


*anycast


So, another monitoring issue here.

It's incredible how far the SW industry has come in the last few decades, yet the way monitoring is done is the same as in the '90s.


Some people on my team experienced random breakage around 3rd party 2FA yesterday during this time period. I wouldn't be at all surprised if this was the cause, somehow. I've no idea how, though. It's so easy to become dependent on single points of failure without realising.


This got me. I spent an hour trying to figure out why my Internet seemingly went down, but not fully.


Would this have affected any authoritative DNS served by them (e.g. even taking 1.1.1.1 out of the equation)?

I'm thinking not, correct?


We noticed this through our own, homegrown scripts that check for this, having been screwed by an outage a few years ago. I'm happy they so quickly acknowledge and explain these issues. Good work!


[flagged]


I would rather they be open about their failures than deceptive about it. Of course simply not failing would be ideal, but we don't live in a perfect world. If a single, external point of failure causes your system to crumble, that's a design problem, not a dependency problem.


To your point, Cloudflare leadership are pretty active on HN. They generally do a pretty good job of providing detailed explanations to good-faith questions here and providing decent post-mortems of major incidents to the HN community.

They do take care to avoid engaging with people who are opposed to their dominance on ideological levels ("no one should be the gatekeeper for that much of the internet", etc) and there are a small handful of questions they seem to avoid (e.g. direct feature-to-feature comparisons between Warp and Mullvad)


They use transparency as a cover for rookie mistakes; it's not the same as actual transparency. Especially as these are really egregious examples of getting it wrong.


They're practicing "just culture" (as in justice), which rewards explaining and root causing your failures, and rejects the concept that "someone sucks" in favor of "systems can always be improved".


Up until a few months ago the HN crowd loved Cloudflare. How sentiment has changed in such a short period.

My guess would be their weird ‘site protection’ stuff is burning too many people and negatively impacting their reputation.


In general, I always find that comments along these lines are very easy to thoroughly disprove. There has been consistent criticism of Cloudflare for many years, ever since the majority of web traffic started going through their anti-DDoS and anti-bot gateways.

Here's an HN post with lots of very critical comments[0] from 7 years ago, including a fairly scathing one from 'tptacek. Even way back then, you'd get the same comments you hear today, like:

> So rather than demand fixes for the fundamental issues that enable ddos attacks (preventing IP spoofing, allowing infected computers to remain connected, etc), we just continue down this path of massive centralization of services into a few big players that can afford the arms race against botnets. Using services like Cloudflare as a 'fix' is wrecking the decentralized principles of the Internet. At that point we might as well just write all apps as Facebook widgets.

0: https://news.ycombinator.com/item?id=13718947


> My guess would be their weird ‘site protection’ stuff is burning too many people and negatively impacting their reputation.

What's always been interesting to me about this take is that it's not as though Cloudflare is randomly inserting itself into internet traffic.

Cloudflare customers have a choice in the marketplace, and they chose Cloudflare for whatever reasons. If end-users take issue with accessing the site of a Cloudflare customer, they should take it up with the owners of the site that chose Cloudflare. Theoretically, the Cloudflare customer would take it up with Cloudflare if it becomes problematic. Cloudflare has no obligation to the site's end-users other than meeting the needs of their customer, who does have an obligation to their end-users (theoretically).

Cloudflare is, ostensibly, providing a solution for their customers. How that impacts their customer's end-users is between Cloudflare and the customer.


I've never loved Cloudflare. As someone who was doing this long before they existed, I see through their wordy blog posts about rookie mistakes. It's embarrassing, really.


Maybe it's to compensate for Cloudflare's success blog posts, where they usually represent themselves as the saviors of the world.


Quite. Nobody else can do what they do! (Brb doing the same thing before Prince was even born)


This is peak HN comment.

300 PoPs around the world delivering 210 Tbps of capacity, mitigation of some of the largest DDoS attacks in history, 20% of internet traffic. Workers, Pages, R2, D1, Zero Trust, Stream, Images, Warp, 1.1.1.1, etc, etc, etc - all at incredible scale.

But yes, of course you have been doing the exact same thing since before Prince was born.


People had global networks of the same scale long before; they just didn't offer the same features because they had different products.

Also, DevOps crap is not a selling point, as much as HN wishes it to be.

Also, Fastly et al. have the same or better ability, but nobody talks about them?


Look at a historical graph of internet users, bandwidth, etc. "Same scale long before" just isn't possible.

I'm not saying there isn't CloudFront, Akamai, Fastly, Azure Edge/Verizon, etc. Hell, UUNet, whatever you want. I'm saying the idea of someone providing hundreds of terabits of connectivity, connecting to over 12,000 networks, and supporting what is likely at least a billion users before Prince was born is completely absurd and impossible. There were only 118 million telephones worldwide in 1958 (the year Prince was born)[0].

I'm not referring to "DevOps crap". I'm referring to a wide product suite of functionality and geographic spread and scale that at best even 20 years ago would have taken an army of sysadmins and developers to build and maintain with a staggering fleet of Linux boxes running LAMP or whatever you prefer.

I was a very early Fastly customer (~2013 or so). I continue to look at them from time to time and what Oracle has done to them is atrocious (yet typical). They clearly have some usage and market share within that target market.

Speaking of market share, Cloudflare gets the most attention because they (by far) have the largest market share in terms of CDN/DDoS/etc., and anything they do has the most significant impact on internet users at large. Depending on your source, Cloudflare has roughly 50% of "CDN" market share; Fastly has something in the single-digit percent range. Even Amazon CloudFront is around half that of Cloudflare.

Between Oracle and having less than 1/10th the market share, that's why no one talks about Fastly. Compared to Cloudflare they're essentially irrelevant unless you're one of Oracle's enterprise customers that will deal with their salespeople and tactics.

[0] - https://www.encyclopedia.com/science-and-technology/computer....


> in 1958 (the year Prince was born)

I was confused at first how Prince is relevant, but it seems GP is referring to the CEO of Cloudflare, Matthew Prince, who was born in 1974. (1958 appears to be the birth year of Prince the musician)

Not that it affects your point in any way.


Hah, I don't know enough about Cloudflare to know who Matthew Prince is.

Yet he was born in 1974 so I still maintain this is a ridiculous viewpoint.


> Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

Ironic.


I like how it's a 42 joke.

4(0b10) 7:00 ends at 11:02 (4 hr 2 min) on a 4 sum 2x2. And refs to 1.1.1.1 vs 1.0.0.1



