This is a recent video presentation by Jonah Edwards, who runs the Core Infrastructure Team at the Internet Archive. He explains the IA’s server, storage, and networking infrastructure, and then takes questions from other people at the Archive.
I found it all interesting. But the main takeaway for me is his response to Brewster Kahle’s question, beginning at 13:06, about why the IA does everything in-house rather than having its storage and processing hosted by, for example, AWS. His answer: lower cost, greater control, and greater confidence that their users are not being tracked.
I have ADD and typically eschew watching video if I can get the same quality content faster in text.
I loved this and watched it to the end.
To anyone who feels the need to hide or disparage this because it seems to promote doing things on your own vs in the cloud: this isn't some tech ops Total Money Makeover where you read a book and you're suddenly in some sort of anti-credit cult. This is hard shit, and it's the basics of the hard shit that I grew up with as being the only shit.
Yes, you can serve your own data. No one should fault you for doing that if you want. It takes the humble intelligence of the core team and everyone at IA to pull that off at this scale. If you don't want to do the hard things, you could use the cloud. There are financial reasons for one or the other, just as there are reasons people live with their family, rent, lease, or buy homes and office space (an imperfect analogy, of course).
I hope that some of the people who could go on to work at the big players, or who have been working there and want a challenge, consider applying to IA when there's an opening. They've done an incredible job, and I look forward to the cool things they accomplish in the future.
Thank you for this! It was definitely geared towards an internal audience but it makes me very happy to know that it was enjoyed and appreciated more broadly.
I am going to get a transcript done and up soon as well -- I just gave the talk on Friday so haven't had time to do so yet.
Speaking of infrastructure, it is amazing that the initial set of Apache big-data projects started at the Internet Archive [0], while Alexa Internet, a startup Brewster Kahle sold to Amazon in 1999, formed the basis of the Alexa Web Information Service, one of the first "AWS" products [1], which is still up: https://aws.amazon.com/awis/
On the scale of big hosting operations, 60Gbps outbound is not that much. If you're buying full-table IP transit from major carriers at IX points, I've seen 10GbE for $700-900/mo, and 100GbE circuits for under $7k/month. Of course you wouldn't want just one transit provider, but I'm fairly sure that if somebody said to me 'here's $20,000 a month to buy transit' on the west coast, it's within the realm of the possible.
Ideally of course they should be able to meet a fairly wide number of downstream eyeball ISPs at the major IX points in the bay area and offload a lot of traffic with settlement-free peering.
60Gbps outbound from AWS, Azure or GCP would be astronomically expensive.
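To put rough numbers behind that claim, here is a back-of-envelope sketch. The transit price is the ~$7k/month 100GbE figure quoted above; the cloud egress rate is an assumed blended ~$0.05/GB (list prices start higher; real contracts vary widely):

```python
# Back-of-envelope: sustained 60 Gbps outbound for a 30-day month.
# All prices are assumptions based on rough list/quoted rates,
# not anyone's actual contract.
GBPS = 60
seconds_per_month = 30 * 24 * 3600
gb_per_month = GBPS / 8 * seconds_per_month  # gigabytes pushed per month

transit_cost = 7_000        # one 100GbE transit port, $/month (figure quoted above)
cloud_egress_rate = 0.05    # assumed blended $/GB after volume discounts
cloud_cost = gb_per_month * cloud_egress_rate

print(f"{gb_per_month / 1e6:.1f} PB/month")
print(f"transit: ${transit_cost:,}/month  vs  cloud egress: ${cloud_cost:,.0f}/month")
```

Even if the per-GB rate were negotiated down several-fold, the gap between a fixed-price port and metered egress stays enormous at this volume, which is the point being made above.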
Exactly this. There are some logistical complexities (e.g. some of our bandwidth is funded by the E-Rate Universal Service Program for libraries, which runs on a July-June fiscal year, so rapid upgrades on that front aren't possible), but by and large egress bandwidth isn't our primary challenge. Intersite links, as I noted in the video, are the current big one, and that can and does involve occasional time-consuming construction. But honestly, over the past year a combination of a total blowout of my usual capacity planning (including equipment budgets) and the logistical complexities of lockdown has meant we haven't been able to upgrade as fast as we'd like.
God bless HE Fremont. They are the unsung story of the Internet backbone. If one were to make a list of companies that at some point had a major fraction of their hosted physical infrastructure at HE I suspect it would make people's jaws drop.
It's been a huge blessing for a whole generation of startups to have a radically well-connected space that just about anyone can drop in equipment at multi-gig unmetered (albeit admittedly extremely constrained on power and cooling). It is honestly a part of what has made Silicon Valley great. Well, that and being able to cobble together a few replacement servers from Fry's components (RIP!) or schlep out to some ex-CTO's Sunnyvale garage in the middle of the night to offload some lightly used VA Linux 1U's...
Even today, you can get 15A and a 42U cabinet to call your own with unmetered gigabit for $400/mo - and probably less if you ask nicely.
IME with cloud in the small and in the large: network prices are artificially high on the cloud providers and are very easy to get discounted if you are a big spender.
It seems they are regularly maxing out their network infrastructure. If it's so cheap, how come they don't just buy more? Is it the cost of the actual hardware? (I know they recently upgraded)
They are maxing out the fiber links between their own datacenters, which is in the process of being addressed. If the bits can't get from the datacenter full of hard drives to the datacenter that connects to the internet, not much point in buying additional transit capacity.
I have no clue how people can afford what AWS charges for bandwidth. I did the math once for migrating a project to AWS, and the bandwidth alone cost 10x my entire current infrastructure for that project, a project I run for free.
Because people have nothing to compare their AWS cost to. They don't know how much it would cost them to host their service outside of AWS.
And it is not only a cost comparison. You need a different kind of people to manage in-house vs cloud; not fewer or more, just different skills.
Not only that: It is really hard to predict AWS cost. So many variables go in. And starting with a small side project in AWS is easy, and then each additional step is a small step ...
For lots of orgs I've seen, it creeps up slowly until you're paying 10-50x the cost of full transit without any peers, but by that point you're too locked in to do anything.
Depends on whether bandwidth is important to what you're doing. In many applications it isn't, so even the inflated prices charged by AWS et al. don't really matter in the context of other expenses.
To be fair here, when you're pouring that much money into AWS you probably have a better contract and can negotiate the price down quite a bit. Additionally, you could use CloudFront to further reduce your bandwidth costs.
That's not to say that it wouldn't be incredibly expensive, but probably far less than what you see on the pricing page.
A hybrid solution might be possible. Put your core infra on AWS, but get a very cheap CDN (or custom solution) in front of it to handle the 60Gbps so that only a small fraction will hit your AWS infra. Do the same for storage, e.g. build your own ceph cluster on bare-metal instead of Amazon S3.
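A minimal sketch of why fronting the origin with a cheap CDN works. The AWS and CDN per-GB rates below are hypothetical figures for illustration; the cache hit ratio determines how much traffic still hits the expensive origin:

```python
# Sketch of the hybrid-CDN cost argument (all rates are assumed,
# illustrative numbers, not real quotes).
def origin_egress(total_gb: float, cdn_hit_ratio: float) -> float:
    """GB that still leave the expensive origin (e.g. AWS) per month."""
    return total_gb * (1 - cdn_hit_ratio)

total = 19_440_000  # ~60 Gbps sustained, expressed in GB/month
aws_rate = 0.05     # assumed blended origin egress, $/GB
cdn_rate = 0.004    # assumed cheap-CDN delivery, $/GB

for hit in (0.0, 0.90, 0.99):
    cost = origin_egress(total, hit) * aws_rate + total * hit * cdn_rate
    print(f"hit ratio {hit:.0%}: ${cost:,.0f}/month")
```

For largely static content like an archive's, hit ratios in the high 90s are plausible, so most bytes never touch the origin at origin prices. The same shape of argument applies to putting a bare-metal Ceph cluster in front of (or instead of) S3 for storage.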
Tracking should be STRICTLY illegal and ONLY acceptable with a verifiable OPT-IN and transparency about WHOM the data was sent to, read by, or received by. With STILL an option to selectively opt out.
I agree with the sentiment, but there needs to be some degree of leeway.
For example are server logs considered tracking? It seems unreasonable to require that logs not be kept.
Edit: On further thought, I don't even know that tracking should be banned. Instead, I would argue that advertising in the way enabled by tracking should be banned. That way the incentive is removed with less bureaucracy.
I don't agree there. No one has building bureaucracy as their goal. The goal is to get something out of it, which is accomplished via the bureaucracy.
> We need to kill bureaucratic interests in EVERY THING.
While I don't necessarily disagree, this is orthogonal to the original issue.
> Politicians should be conscripted servants with no method for empowering or financing themselves.
I don't think conscripted means what you think it means...
Also, that would be impossible to implement. Either you don't try to cut off every path, in which case you have the option of limiting bureaucracy, or you try to cut off every path, increasing bureaucracy.
> They should run on policy ALONE.
I agree. Also orthogonal to the topic at hand. Also, please let me know if you find a way of implementing this without requiring bureaucracy as a critical component.
And yet, the opposite is happening. Government agencies are happy to tap into user data. And sometimes it is mandatory to keep data depending on what you are doing and your country legislation.
AFAIK the US is among the countries where you have the least such requirements, but there are still some sectors where logging is mandatory, like financial services. Many other first world countries (never mind dictatorships) require ISPs to keep data for a year or more.
Curious about the cost. Does that already include manpower and the various acquisition costs of constructing their internal network (hardware, fiber links between sites)?
I guess the biggest downside is the speed at which they can scale, since it's limited by how fast they can purchase and install new storage devices. But for the Internet Archive's use case, that shouldn't matter much.
I've seen this argument a lot but I'm not sure how well it holds. The price for performance ratio on cloud providers is so poor that you can overprovision in advance (to mitigate the extra delay involved in adding extra hardware) and still come out ahead.
Also, bare-metal doesn't necessarily mean owning the hardware. You can rent it too. There are providers that offer bare-metal servers in one click, sometimes available within minutes.
> The price for performance ratio on cloud providers is so poor that you can overprovision in advance (to mitigate the extra delay involved in adding extra hardware) and still come out ahead.
It really depends on what scale you're talking about. When you're a startup and suddenly land on the front page of HN, you might need 100x or 1000x your current capacity - in which case AWS will be useful to no end.
If, on the other hand, you're an established name with quite a bit of traffic already and the maximum uptick you will reasonably experience is 2x-3x, the argument holds far less water.
He said his storage pricing is 2-5x cheaper than Google's archive tier. That is roughly $1.2/TB/month divided by 3x, or about $0.4/TB/month. Compare that against roughly $20/TB/month for S3: about 50x less cost. He can afford to overprovision.
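Spelling that arithmetic out, using rounded public list prices and taking 3x as the middle of the quoted 2-5x range:

```python
# Storage cost comparison from the comment above.
# Prices are rounded, assumed list prices in $/TB/month.
google_archive = 1.2              # roughly Google Cloud's archive storage class
ia_estimate = google_archive / 3  # "2-5x cheaper", taking 3x as the midpoint
s3_standard = 20.0                # roughly S3 Standard

ratio = s3_standard / ia_estimate
print(f"IA estimate: ${ia_estimate:.2f}/TB/month, ~{ratio:.0f}x cheaper than S3")
```

At that ratio, keeping significant spare capacity on hand still costs far less than paying cloud storage rates for the provisioned amount.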