
I think what you're seeing in this post has product implications for Google Cloud.

The OP is asking for a guaranteed IOPS/latency SLA, which they're willing to pay HUGE money for by ROLLING THEIR OWN.

Possibly think about higher-priced service tiers for systems that require it. This is a standard operations problem that can co-exist within pooled service models.



See my note about SLAs and ToR failures. We probably could promise something for our Local SSD offering (tail latency < 1ms!), but high-performance, guaranteed networked storage is just tricky.

As I said, rolling your own will not give you a guarantee, it will just give you the responsibility for failure. We don't offer the guarantee, because we don't want you to believe it can't fail.


At a certain scale, it's better to just take that responsibility rather than rely on someone else to be able to handle that burden. When something fails, the worst thing is having no control over it and being unable to do anything about it.

Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process, and had to explain there was nothing I could do... For five days.

It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.


One of the things I stress when talking to customers is that when one is moving to cloud, it's not just the business model that changes, but the architecture model.

That may sound like cold comfort on the face of it, but it's key to getting the most out of cloud and exceeding the possibilities of an on-prem architecture. Rule #1 is, everything fails. The key advantage of a good cloud provider (and there are many) is not that they can deliver a guarantee against failure (as boulos stated correctly) but that they'll allow you to design for failure. The issue arises when the architecture in the cloud resembles that which was on-premise. While there are still some advantages, they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.

The key to having a good cloud deployment is effectively utilizing the features that eliminate single points of failure, so that the same storage controller failure that might knock you out on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but brings huge advantages when it comes together.

Disclosure: I work for AWS.


Most people just want to deal with one provider, though. Meaning there has to be a middleman between you and the customer.


That is, quite honestly, an amazing turnaround. Most people aren't in a position to get a replacement delivered from a vendor that quickly, but if you are, awesome. Again though, there's a big gap between simple, local storage (we could probably provide a tighter SLO/SLA on Local SSD, for example) and networked storage as a service.

As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.


> Most people aren't in a position to get a replacement delivered from a vendor that quickly

Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.

If you're running a serious business, support contracts are a must.


Physically delivered, no matter customer location? Sign me up ;).


I'm not sure why you're so surprised; this is standard logistics management for warranty replacement services.

The manufacturer specifically asks the customer to keep it informed of where the equipment is physically located, and then prepositions spares at appropriate depots in order to meet the contractual requirement established when the customer paid them (often a large amount of money) for 4-hour Same Day (24x7x365) coverage for that device.

This isn't how hyperscale folks operate, for the same reason Fortune 100s rarely take out anything more than third-party coverage when their employees rent vehicles: it becomes an actuarial decision, weighing the number of 'Advance Replacement' warranty contracts and the dollars involved against buying a percentage of spares, keeping those in the datacenter, and RMA'ing the defective component for refund/replacement (on a 3-5 week turnaround).

tl;dr - Operating 100-500 servers, you should likely pay Dell and they'll ship you the spare and a technician to install it; operating >500 servers, you should do the sums and make the best decision for your business; operating >5000 servers, you probably want to just 'cold spare' and replace things yourself.
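
To make the actuarial point concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (server cost, contract price, failure rate, spare ratio) is made up purely for illustration; plug in your own quotes before drawing any conclusions.

    # Rough break-even sketch: advance-replacement contract vs. keeping cold spares.
    # All figures below are hypothetical placeholders, not vendor pricing.
    server_count = 500
    server_cost = 6000            # USD per server (hypothetical)
    contract_per_server = 600     # USD/year for 4-hour advance replacement (hypothetical)
    annual_failure_rate = 0.04    # fraction of the fleet expected to fail per year (hypothetical)
    spare_ratio = 0.03            # fraction of the fleet kept as cold spares (hypothetical)

    contract_cost = server_count * contract_per_server
    spares_cost = server_count * spare_ratio * server_cost   # one-time buy, RMA failed parts later
    expected_failures = server_count * annual_failure_rate

    print(f"4-hour contract, per year:  ${contract_cost:,.0f}")
    print(f"Cold spares, one-time buy:  ${spares_cost:,.0f}")
    print(f"Expected failures per year: {expected_failures:.0f}")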


We had such a contract with Dell when I worked in HPC (at a large university). Since our churn was so high (Top500 cluster), we had common spare parts on site (RAM, drives), but when we needed a new motherboard it was there within 4 hours.

Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.

As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.


As nixgeek says, this is standard -- for a price.

That said, I suspect most enterprises pay too much; as in, their money might be better spent buying triple mirroring on a JBOD rather than a platinum service contract with a fancy all-in-one, high-end RAID-y machine.


Back when I worked at a big corp with an actual DC onsite, we had similar contracts. I assume they worked anywhere in the US, as we had multiple locations. IIRC they were with HP.


At the scale GitLab is talking about, it is usually better to source from a vendor that doesn't provide ultra-fast turnaround and just keep a few spare servers and switches, so you can have 0-hour turnaround for hardware failures.

Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.

If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.


I kinda disagree; see my other comment on logistical models. This isn't 5000+ system scale, or even 500+ system scale.

There's non-trivial cost involved in simply being staffed to accommodate the model you propose; all of the ODMs have some "sharp edges", and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling, dealing with firmware glitches, etc.

Remember that 500-rack (per buy) scale is table stakes for "those cloud providers"; it is their business, whereas GitLab is a software company. Play to your strengths.


In my experience, you'll be dealing with firmware glitches even with mainstream OEMs. You can avoid a lot of the RFP/RFQ and scale issues by going to an ODM/OEM hybrid like Edge-core or Quanta, or to Penguin Computing or Supermicro. If you already have a relationship with Dell or HP, you probably won't get quite as good pricing, but they're still options.

I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.

The scale GitLab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation make sense. I think 500 racks is a huge overestimate; the cross-over point is closer to 5 racks.


I think you're missing the point a bit -- the claim is not that the business model prevents these guarantees, but that they're inherently difficult for any party to provide.

> had to explain there was nothing I could do... For five days

The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.


Wikipedia, Stack Overflow, and GitHub all do just fine hosting their own physical infrastructure.

It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.

Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see if it's cost effective comparatively speaking.


Sure, and GitHub also has literally an order of magnitude more infrastructure than is being discussed in these proposals, and it retains Amazon Web Services and Direct Connect for a bunch of really good reasons.

Transparency from GitLab is excellent but you shouldn't really generalise statements about cloud suitability without the full picture or "Walking a mile in their shoes".


Yep, and we at GitLab have Direct Connect to AWS as a requirement for picking out new colo.


If you guys use AWS, have you taken a look at EBS Provisioned IOPS? It comes to mind here since it allows you to specify the desired IOPS of a volume, and it provides an SLA.

> Providers don't provide a minimum IOPS, so they can just drop you.

The reason I ask is because the blog post generalizes a lot about what cloud providers offer or what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like provisioned IOPS with EBS, dedicated instances with EC2, provisioned throughput with DynamoDB, and so on.


EBS volumes go up to 16TB; we need an order of magnitude more.

You're right that the blog post was generalizing.

BTW, we looked into AWS but didn't want to use an AWS-only solution because of maximum volume size, costs, and the reusability of the solution.


Per disk. You can attach multiple disks to a single EC2 node.

So, RAID multiple EBS volumes and you have a larger disk.
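
For a sense of how that aggregation works out, here is a minimal sketch; the 160TB target is simply "an order of magnitude more than 16TB", and the 20,000 provisioned-IOPS-per-volume ceiling is an assumption for illustration, not a figure from this thread.

    # Striping EBS volumes (e.g. RAID0/LVM) to get past the per-volume size cap.
    import math

    target_capacity_tb = 160     # hypothetical target, ~10x the per-volume cap
    ebs_max_volume_tb = 16       # per-volume limit mentioned above
    piops_per_volume = 20000     # assumed provisioned-IOPS ceiling per volume

    volumes_needed = math.ceil(target_capacity_tb / ebs_max_volume_tb)
    aggregate_piops = volumes_needed * piops_per_volume

    print(f"Volumes to stripe:                  {volumes_needed}")
    print(f"Aggregate provisioned IOPS ceiling: {aggregate_piops:,}")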


That's not a strawman. It's just a false statement.


Thanks for the correction.


As a point of comparison, we had storage problems that went on for months or years. We had a vendor that was responsive to SLAs, but the problem was bigger than just random hardware failures; it was just fundamentally unsuited to what we were trying to do with it. That's the risk you take when you try to build your own.


In this case, it sounds like the cloud was fundamentally unsuited for what GitLab was trying to do with it. So definitely still a risk!


As someone on the private cloud team at a large internet company, I can say that dealing with storage problems is a nightmare that never ends. AWS is a walk in the park by comparison.


> See my note about SLAs and ToR failures. [...] but high-performance, guaranteed networked storage is just tricky

New job title: Networked Infrastructure Actuary


OTOH, solving the tricky parts in scalable ways is exactly the kind of unique selling point that cloud providers should offer.


A ToR failure doesn't have to mean the end if you're willing to wire each server to two ToRs and duplicate traffic streams to both. It's a waste, but it's one way to achieve high reliability if you have customers willing to pay for it.


I'm not sure we want to pay huge money. Right now it looks like our cloud hosting bill for 2 months (about $250k) can pay for the hardware to host 4x as much https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N...
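
As a quick sanity check of that claim, here is a minimal sketch of the arithmetic; the 36-month amortization period is an assumption, and it deliberately ignores colo, power, and staffing costs.

    # Comparing monthly cost per unit of capacity: cloud bill vs. amortized hardware capex.
    cloud_monthly = 250_000 / 2        # ~$125k/month for 1x capacity (from the comment above)
    hardware_capex = 250_000           # buys hardware for 4x capacity (from the comment above)
    amortization_months = 36           # assumed hardware lifetime

    hardware_monthly_per_1x = hardware_capex / amortization_months / 4

    print(f"Cloud, per 1x capacity:    ${cloud_monthly:,.0f}/month")
    print(f"Hardware, per 1x capacity: ${hardware_monthly_per_1x:,.0f}/month (capex only)")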


My business partner and I ran a website (rubylane.com) for many years, just the 2 of us, colocated at he.net. With good hardware (we used Supermicro servers bought from acme.com), it's not a big deal. We mirrored every disk drive with hardware RAID1, had all the servers connected in a loop via their serial ports, had the consoles redirected to the serial ports, and it was not much of a hassle. When we were first starting, we used cheap hardware and that caused us some pain.

The other very useful thing we had set up was virtual IP addresses for all of the services: search engine, database, www server, image server, etc. The few times we ever had trouble with the site or needed to take a machine out of service, we could redirect its services to another machine with fakeip.

Recently I bought an old SuperMicro server on eBay, configured it with 2 6-way AMD Opterons, 3 8-way SATA controllers, and 32GB of memory. With 4TB 5700RPM drives in an 8-way software RAID6, it could do 800MB/sec. I realize it's not small file random I/O, but a blended configuration where you put small files on SSD and large files on spinning disk would probably be pretty sweet.
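
To make those numbers concrete, here is a small sketch of the RAID6 arithmetic; the per-drive sequential throughput is an assumed round figure, while the 800MB/sec is the measured number above.

    # RAID6 usable capacity and a naive aggregate-throughput estimate for an 8-drive array.
    drives = 8
    drive_size_tb = 4
    parity_drives = 2             # RAID6 sacrifices two drives' worth of capacity

    usable_tb = (drives - parity_drives) * drive_size_tb
    per_drive_mb_s = 100          # assumed sequential throughput of a slow spinning drive

    print(f"Usable capacity: {usable_tb} TB")
    print(f"Naive aggregate sequential read: ~{drives * per_drive_mb_s} MB/sec (measured: 800 MB/sec)")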

My intuition is that learning all the ins-and-outs of AWS, and how to react and handle all kinds of situations, is not that much easier than learning how to react with your own hardware when problems come up. Especially consider that AWS is constantly changing things and it's out of your control, whereas with your own hardware, you get to decide when things change.

If you can colocate physically close, it's a lot easier. Our colocation was in Fremont but our office was in San Francisco, so it was a haul if we had to install or upgrade equipment. But even so, there were only 1 or 2 times in 7 years that we needed to spend 2 consecutive days at the colo. One of those was during a major upgrade where it turned out that the (cheap) hardware we bought was faulty.


Thanks. We plan to use a remote hands service to install new servers.


How are you calculating the cost to your organization of running your own hardware? With a cloud provider you're benefiting from their engineering pooled across thousands of customers.

When you run your own hardware, you have all the engineering you were already doing plus the ongoing investment to maintain and improve your architecture.

As Google's boulos said, that's where the real costs are.


Indeed, with metal you have less flexibility and much higher engineering costs that offset your savings. We think metal will be more affordable as we scale, but that is not the reason for doing it. We do it because it is the only way to scale Ceph.


I know of at least two 500TB+ clusters running on IaaS and don't think the "only way to scale Ceph" is to buy and rack machines.

In an earlier comment you said EBS only goes to 16TB and that your requirement is an order of magnitude more; however, that's per volume, and you can attach many volumes in much the same way as servers have many disks.

Scale horizontally, not vertically: add more OSD instances? To each you can attach a number of EBS or PD volumes, each with IOPS characteristics that in aggregate are sufficient to service your workload.

If you want to avoid EBS or PD entirely, is there a reason you can't look at 'i2' or 'd2' instance types?

https://cloud.google.com/compute/docs/disks/performance
https://aws.amazon.com/ebs/details/#VolumeTypes

At a fundamental level you're just moving the problem and trading managing metal (which is hard) for I/O guarantees.

"Why is this harder than you might expect?" - you stated elsewhere that you'll have Remote Hands do rack/stack. Providers like Equinix refer to this as "Smart Hands". Everyone who's managed a reasonable-sized environment finds this term highly ironic, as the technician can and will replace the wrong drive, pull the wrong cable, etc.

I've done a non-trivial amount of infrastructure 'stuff' (design, procurement, install, maintenance, migration) for some well-known companies; if you want to Hangout for an hour and pick my brain, gratis, my e-mail is in my profile.


> The OP is asking for a guaranteed IOPS/latency SLA, which they're willing to pay HUGE money for by ROLLING THEIR OWN.

There is no guaranteed SLA in any distributed system. The best you can do is measure things and know what you'll get most of the time.

If you want an SLA, you can make a single server with 10TB of memory as storage. That's a solid choice! :D



