It's great to see this stuff take center stage more and more. Those lucky enough to work at places like Twitter, Google, Facebook and other large tech companies will have already seen how this kind of thing dominates the datacenters there and has been at the core of their systems for many years. People on the outside, though, are rarely exposed to the concept of datacenter-scale computing aside from things like Hadoop, so the idea of leveraging a datacenter-level API is hard to fathom. I believe it's still going to be a while before this truly becomes mainstream. We'll likely see the usual two-track shift: on one side we've already achieved compute as a service with EC2 and GCE, while on the other we've been developing platform as a service with Google App Engine and Heroku. These serve different demographics, and we'll continue to see abstractions rise a level, providing SREs/devops engineers with cluster-as-a-service and giving developers/programmers microservice platforms with more flexibility.
Google SREs by last count were 1 engineer to 1000 machines. In 10 years the common devops engineer at a 200 person startup will leverage the same number of resources using layers of abstraction like Mesos and Kubernetes.
I agree with you but I don't think we'll have to wait 10 years to see that happening. We're still a fairly small startup but we're transitioning our whole platform to be built around Mesos (and we're definitely not the only ones in that case).
I don't know why many people think that they need to be at datacenter scale computing to benefit from abstractions like Mesos, it's completely wrong imo.
It's quite a big shift in mindset but it makes the life of everyone (dev and ops) so much easier when you stop having to think about single machines.
>>> I don't know why many people think that they need to be at datacenter scale computing to benefit from abstractions like Mesos, it's completely wrong imo
My next startup will be built on Mesosphere. Faster time to MVP and no "go dark for 18 months" when I have to scale.
> Google SREs by last count were 1 engineer to 1000 machines
That number does not seem particularly impressive, if it is accurate. Even "traditional" well-run enterprise IT organizations are often in the range of 1 admin/SRE to 600-ish machines, so I have a hard time seeing that Google can only do about 2x better at their scale and with their level of focus.
1 SRE to 5k machines, 10k machines, that makes more sense to me.
I don't know which kinds of enterprise orgs you're envisioning, but when I think about "traditional" enterprise IT orgs I'm picturing companies in healthcare, insurance, certain finance business units, non-profits, public sector, and defense: not the kinds of places that come up in technical discussions here on HN. I've worked with a LOT of them, and I'd be surprised if the ratio of ops engineers to servers was anything better than 1:10 as a ball-park number. Network guy, storage guy, DB guy, VM guy, automation guy, net-sec guy, system-sec guy... that's pretty typical for maybe a 20-server rollout, and it grows sub-linearly with the number of users (wild guess of O(n^1/2)). If you count only SRE-type roles, or count auto-scaled groups in EC2 / GCE as servers, perhaps 1:600 is plausible, but my idea of a traditional enterprise IT org is one that views automation with fear and hesitation, preferring to add more people before trying to "disrupt" its business with automation tools, and so it will be stuck somewhere around 1:30 at best.
Admittedly, I probably have a bad impression of enterprise from consulting because who would pay for automation consulting at $$$ / hr when you do it pretty well with existing resources in the first place?
I think part of the difference is that the business of administering healthcare, insurance, financials and the like is infinitely more complex than, say, serving up search results or 140-character microblog posts.
It seems much easier in my opinion to scale a single function (search or tweet) than the kinds of tasks that a healthcare company has to do like say...scanning faxes from doctors, applying OCR and properly placing them into a pharmacy order system.
Having a ratio where you scale linearly is really not common in mega-scale web services, since you have infrastructure services like Borg / Mesos / Omega / Autopilot / etc.
You can take a snapshot and say you are 1:n because today you have so many SREs and so many machines, but it is very unlikely to be the same ratio down the road.
And Google doesn't have to worry about the hard stuff in the same way that a full-on IBM Sysplex that's running 20% of a country's bank accounts does. RBS COUGHCOUGH
The 1 admin to 600 machines quoted as the high end for a traditional IT datacenter is, in my experience, a murky number. It's usually a ratio of people to virtual machines, not physical machines.
When you remove the VM smokescreen and count physical boxes it's more like 1 person/100 machines, which is abysmal. I've seen order-of-magnitude people efficiency increases with automation like we're discussing here.
The article's premise is a poor introduction to the project. Sure, reinvent MOSIX if you want :) but don't pretend it'll serve more than a niche of a niche.
Firstly, "distributed computing is the norm"? It's just not. Most businesses & app authors will never need to care about ultra-distributed computing, with all its problems and trade-offs. You can move faster with "local-only" computing and scale vertically very cheaply compared to a few years ago - 4 dedicated CPUs + hundreds of gigabytes of RAM save programmer hours, and get your problem solved faster.
For light scaling issues (compared to Google) Redis, MariaDB and other abstractions over local files have some great options for future scaling, and are well-trodden, obvious choices.
Secondly, who cares about "wasted" resources of a whole underutilised server when reliable dedicated servers are so cheap, and in such plentiful supply?
Thirdly, "organizations must employ armies of people to manually configure and maintain each individual application on each individual machine"? - in the 90s maybe! Surely anyone with more than a few applications to worry about is on board with some basic configuration management.
Twitter-size scaling is a "nice problem to have". For all but the best-funded & bullish companies, solve them only when you start to have them.
(my bias: I run a managed service provider in the UK - we tend to help customers scale vertically by shovelling server images around with minimum down time. We say "underused" dedicated server capacity at fixed monthly costs is usually cheaper than chasing the phantom of "optimum" AWS usage.)
> Secondly, who cares about "wasted" resources of a whole underutilised server when reliable dedicated servers are so cheap, and in such plentiful supply?
The people bankrolling Google / Facebook / Twitter's electricity bills seem to care quite a bit.
There is another trope that gets repeated often (and this is not even remotely directed at you, just a digression, hopefully somewhat on topic): "performant language runtimes are an anachronism; a bog-slow language in which a programmer can code fast is way more useful than any of that performance bull crap". Typically the person repeating that would be a webdev. However, in these large-scale scenarios, core infrastructural code can save orders of magnitude more money in running costs than it costs in extra days of software development. So yeah, at the interesting places, algorithms and efficiency continue to matter. A reason that Google always managed to stay ahead is partly how successful it was in minimizing running costs.
Totally agree on that front. Efficient software = fewer servers. You can't know when to invest in new capacity if you can't tell the difference between hitting the limit of the hardware and a fixable performance problem. Seemed like Twitter wasn't sure what was going on there for a few years :)
We do a lot of finding ENORMOUS performance problems with customer servers: simple stuff like a thundering herd, a vital missing index or a filesystem that's being overtaxed. That's the kind of scale we work at. But those sorts of insights can make the difference between "help, we might need a new server" and "oh thank god, it's all working again".
At Google scale it matters, sure. But most companies aren't Google scale. Writing your single-company 100-user CRUD webapp in Java or C++ "for performance" is the ultimate false economy.
I used to work for a company that had a big Java app. We laughed at our client who needed 60 Rails servers to deliver worse performance than our single-instance app. But they probably saved more on dev costs than they spent on servers.
Arguably, most systems are distributed systems today. Even if you're just connecting a Rails app with a MySQL DB.
If we had a fabric (which is called Mesos ;) which allowed you to write elastic, distributed systems without the need for worrying about interconnecting hosts and segmenting hosts into static partitions, etc. wouldn't that be a big win?
Spark is also a great example for an app that was built directly on top of Mesos - the authors could focus on implementing the actual logic rather than worrying about interconnection. Same is true by the way for systems like Chronos and Marathon - all of which run on top of Mesos.
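Very roughly, a framework on Mesos is just a scheduler that receives resource offers and decides what to launch on them. Here's a sketch using the old Python bindings (module paths and protobuf fields varied between Mesos releases, and the task here is only a placeholder, so treat it as illustrative rather than copy-paste code):

    # Rough sketch of a Mesos framework scheduler with the old Python
    # bindings (mesos 0.2x era). The "hello" task is a placeholder.
    from mesos.interface import Scheduler, mesos_pb2
    from mesos.native import MesosSchedulerDriver

    class HelloScheduler(Scheduler):
        def registered(self, driver, framework_id, master_info):
            print("Registered with framework id %s" % framework_id.value)

        def resourceOffers(self, driver, offers):
            # Mesos hands us offers of spare cluster resources; we decide
            # what to launch where. No machine names are hard-coded anywhere.
            for offer in offers:
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "hello-%s" % offer.id.value
                task.slave_id.value = offer.slave_id.value
                task.name = "hello"
                task.command.value = "echo hello from the datacenter"

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 0.1

                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 32

                driver.launchTasks(offer.id, [task])

        def statusUpdate(self, driver, update):
            print("Task %s moved to state %s" % (update.task_id.value, update.state))

    if __name__ == "__main__":
        framework = mesos_pb2.FrameworkInfo()
        framework.user = ""                 # let Mesos fill in the current user
        framework.name = "hello-framework"
        driver = MesosSchedulerDriver(HelloScheduler(), framework, "127.0.0.1:5050")
        driver.run()

The point being: the scheduler reasons about offers and tasks, never about specific hosts.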
We'll get stacked-memory HMC soon. It's not hard to envision a 2U server with ultra-high-speed 1TB+ memory in 2020.
At 10nm, a 2U server would get dual 32-core Xeons.
Everyone would do in-memory databases and computing, and most of today's scaling problems would be 10x easier then. It would probably take a lot longer to reach the scale problems Google and Twitter once had.
So I don't see distributed computing becoming the norm either, at least not in the next few years.
This author talks about POSIX as if a bunch of people sat down and invented a portable OS API. The reality is closer to: a group of people at AT&T created Unix, a number of other companies and universities modified Unix, and then people sat down and said "how can we unify the APIs of the fragmented Unix world"
The Programming API layer is fragmented in the UNIX world. There's mostly only ANSI C, X, and gcc, which exist everywhere, but it's still not trivial to port between *NIX.
The kernel API is the most useless fragmentation of them all. It's addressed by POSIX and UDI (http://www.project-udi.org/), but the differences between the most important UNIX kernels don't justify the problems they make.
Interesting to think that the assumptions of C as a language evolved mostly against the VAX architecture. It seems that if the VAX doesn't make a distinction, then C doesn't (tend to) have any concept of that distinction either. Examples:
- Pointer types are basically fungible in C (otherwise there would be no void-ptr type)
- There's no compile-time knowledge of the segment a pointer references to prevent you from dereferencing a pointer to an offset from segment A when segment B is loaded (compare this to Rust's parameterization of Box types by their allocator)
- Struct padding is painful and tacked on
- "unsigned char" isn't the default even though it would make much more sense for it to be. (What "char"acter is negative? You can have a signed byte/octet, but a character was, in 1979 at least, basically an enum/sum type.)
For pre-ANSI C, most of these questions are answered by "it depends on the compiler", which is why ANSI C happened. Of course the answer is still "it depends on the compiler" for this behavior, since ANSI C doesn't define what it does, just that it's not portable to do it.
I remember working on a Fujitsu box in 1993, which used some sort of weird System V Unix with pre-ANSI C and weirdly abbreviated man pages in Engrish. (shudder) Yeah.
Oh this is definitely true. Most unix programs are not portable to POSIX, even when they don't intentionally use features that aren't in POSIX (e.g. kqueue).
UNIX API fragmentation is still a problem, but my point is that POSIX was trying to unify existing implementations rather than being a greenfield project to generate a set of portable APIs.
Mesos doesn’t run one big distributed app across servers in a cluster, but rather interleaves the workloads on nodes in the cluster to maximize the utilization of compute, memory, and I/O resources on the cluster. Mesos can run on bare metal clusters or on machines that have some level of virtualization on them to provide some isolation between workloads running on the same machine. Google did a lot of the original work to create control groups, or cgroups, a lightweight application container for Linux, which is the basis for all LXC containers.
In light of all the excitement around Docker & containers, it's important to note that Mesos has been using cgroups / LXC (for isolation) since its inception.
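For anyone who hasn't poked at cgroups directly: the mechanism itself is surprisingly plain. A cgroup is a directory in a special filesystem, and limits are files you write to. A rough sketch (cgroup v1 layout; controller paths are distro-dependent and it needs root):

    # Minimal cgroup v1 sketch: create a group, cap its CPU weight and
    # memory, then move the current process into it. Paths assume the cpu
    # and memory controllers are mounted under /sys/fs/cgroup.
    import os

    CG_CPU = "/sys/fs/cgroup/cpu/demo"
    CG_MEM = "/sys/fs/cgroup/memory/demo"

    os.makedirs(CG_CPU, exist_ok=True)   # creating a cgroup is just mkdir
    os.makedirs(CG_MEM, exist_ok=True)

    with open(os.path.join(CG_CPU, "cpu.shares"), "w") as f:
        f.write("512")                   # half the default weight of 1024
    with open(os.path.join(CG_MEM, "memory.limit_in_bytes"), "w") as f:
        f.write(str(256 * 1024 * 1024))  # hard 256 MB limit

    for path in (CG_CPU, CG_MEM):
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(os.getpid()))    # this process and its children are now confined

Mesos's cgroups isolator (and Docker) essentially automate bookkeeping like this per task or container.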
Yes, we need new, standardizing APIs. However this made me cringe:
> Exposing machines as the abstraction to developers unnecessarily complicates the engineering, causing developers to build software constrained by machine-specific characteristics, like IP addresses and local storage.
It reminded me of the old RPC approach of making potentially any method call a remote method call. It didn't work out, simply because remote call latencies are orders of magnitudes higher than for in-process calls, so you need different (coarser) APIs for them.
By the same token, while abstractions are very welcome, they should still allow distinctions between local and non-local resources, for various definitions of "local".
One thing worth noting is that the indirection Ben describes is, basically, the indirection that is already in the Apache Mesos kernel. In real world environments (e.g., Twitter), there is negligible overhead.
Also, locality and latency can both be expressed in terms of placement rules and schedulers on Mesos can use those rules to guarantee or express preference for task placement that optimizes around reduced latencies.
The point that I take away is that the ability to express your needs in a declarative way (e.g., "place these two tasks such that they have such-and-such latency") is much more scalable, flexible and resilient than coding to machine-specific internals. The former is also easier to update and supports delegation of responsibilities.
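As a small, concrete illustration of that declarative style (the app id, the rack_id agent attribute and the Marathon URL are made up, and note Marathon's constraints cover placement rather than latency directly):

    # Hypothetical sketch: declaring placement intent to Marathon instead
    # of picking machines by hand. Constraint syntax is Marathon's; the
    # app id, rack_id attribute and URL are made up for illustration.
    import requests

    app = {
        "id": "/frontend",
        "cmd": "python -m SimpleHTTPServer $PORT0",
        "cpus": 0.5,
        "mem": 128,
        "instances": 4,
        # Declarative placement: one instance per host, spread across racks.
        "constraints": [
            ["hostname", "UNIQUE"],
            ["rack_id", "GROUP_BY"],
        ],
    }

    resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
    resp.raise_for_status()

The scheduler then keeps four instances running somewhere that satisfies those rules, and reschedules them if hosts disappear.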
John Wilkes of Google puts it this way:
"Our own experience has been that allowing our developers unfettered access to the internals of infrastructure systems has been a problem, and we're moving away from that model as fast as we can.
Constructing large-scale complex systems with many interdependencies leads to brittle, fragile systems if they rely on internal implementation mechanisms.
Allowing internal customers to rely on internal implementation mechanisms has made it hard to adopt new technologies, because we only know what knobs they set - not why.
The fix for both is similar: describe the desired end state, not how to get there."
> The point that I take away is that the ability to express your needs in a declarative way (e.g., "place these two tasks such that they have such-and-such latency") is much more scalable, flexible and resilient than coding to machine-specific internals. The former is also easier to update and supports delegation of responsibilities.
While I take your point, "local" is a leaky concept inside a data-center. On a fast network, "in cache on the neighboring computer" can be closer than "on my hard drive" for most purposes.
It seems like what might be ideal would be a specification language that doesn't care where things are, an implementation that tries to deal with that automagically, and a way to specify portions (up to all of it) precisely, checked against the high-level specification.
To further your point, you can get faster access directly via RDMA on an Infiniband network than the SATA bus can push data. The fastest SATA busses are around 16gbps I believe and you can get HDR IB switches that clock 50gbps today.
With the direction things are moving, "the data center as a computer" is absolutely the right approach.
This stuff is very confused, and it confuses all the HN commenters even more.
There is no such thing as a datacenter OS.
Simplification: an OS is a kernel and its associated base software.
The kernel drives the hardware. You need to talk to disks, memory, whatnot. A kernel is needed.
You need to talk to the kernel and tie these components together. You write software. Boom, you have an OS.
A datacenter isn't a disk and memory and devices. It's a bunch of computers which themselves drive those devices.
If you invented a "datacenter" with a bunch of disks and an API to drive them, then a bunch of CPUs and an API to drive them, and so on, you'd end up with a supercomputer and a single, unreliable, unsafe OS (which is exactly why nobody makes supercomputers anymore: they make clusters of computers. Cheaper, more reliable. Clusters. I.e... datacenters).
The only thing that could be needed is a universal API for resource access. Need a DB? Here's an API. Need disk space? Here's an API. And so on.
This is exactly what AWS is and does. S3 doesn't expose the OS. It's just a filesystem API.
ELBs aren't an OS. They're a load balancer API.
It turns out that below all that there's an actual traditional OS, because that's the way it works reliably.
There are ways to interconnect these systems more smartly (see Plan 9), but it's always a "regular" OS running underneath in the end, too.
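To make the "API, not OS" point concrete, this is roughly all a consumer of S3 ever sees. A sketch with boto3; the bucket and key names are made up, and credentials come from the usual AWS config:

    # Sketch: S3 exposes storage purely as API calls. No OS, no disks,
    # no machines are visible to the caller. Bucket/key names are made up.
    import boto3

    s3 = boto3.client("s3")

    # "Write a file": one API call.
    s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello")

    # "Read it back": another API call.
    obj = s3.get_object(Bucket="example-bucket", Key="hello.txt")
    print(obj["Body"].read())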
Is a 40,000 core disaggregated rack really that different from a 4 core laptop? Google pioneered this way of thinking [1]. tl;dr the datacenter is a computer and that computer will inevitably have an OS.
Indeed, i.e. the datacenter doesn't expose an OS but a set of APIs. There are enough differences between that and a full OS (which also provides its own API) to call BS, IMO.
One way to think about this problem is to look at components that people thought were useful in an operating system and then think about what a distributed version might look like.
E.g. for an OS we have:
filesystem
scheduler
cron
Then you can go look at the history of these primitives so the same lessons don't have to be relearned. For example, the Linux kernel has gone through many iterations of its scheduler with cgroups, cfs etc. Why did they do that? Why were previous incarnations not good enough? etc.
While not a complete data center operating system, Mesos, along with some of the distributed applications running on top, provide some of the essential building blocks from which a full data center operating system can be built: the kernel (Mesos), a distributed init.d (Marathon/Aurora), cron (Chronos), and more.
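For what it's worth, the "distributed cron" piece is just an HTTP API as well. A hedged sketch of registering a daily job with Chronos (the hostname, port and job details are made up; the field names follow the Chronos docs as I recall them):

    # Hedged sketch: submitting a job to Chronos, the distributed cron
    # mentioned above, over its HTTP API. Host, port and job are made up.
    import json
    import requests

    job = {
        "name": "nightly-report",
        "command": "python /opt/reports/build.py",
        # ISO 8601 repeating interval: repeat forever, once per day, from midnight UTC.
        "schedule": "R/2015-01-01T00:00:00Z/P1D",
        "epsilon": "PT30M",   # run within 30 minutes of the scheduled time, or skip
        "owner": "ops@example.com",
    }

    requests.post(
        "http://chronos.example.com:4400/scheduler/iso8601",
        data=json.dumps(job),
        headers={"Content-Type": "application/json"},
    ).raise_for_status()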
Interested in learning more about or contributing to Mesos? Check out mesos.apache.org and follow @ApacheMesos on Twitter. We’re a growing community with users at companies like Twitter, Airbnb, Hubspot, OpenTable, eBay/Paypal, Netflix, Groupon, and more.
Upvoted. In fact, while reading the article I was just thinking to myself that the HN comments would be all about Plan 9. Kind of surprised that it hasn't been mentioned more.
What I am really keen to find out in the coming years is what MirageOS makes of this. If you are not familiar with it, this article http://queue.acm.org/detail.cfm?id=2566628 explains it way better than I could. I wouldn't claim it is there yet, but it seems to be sitting in an enviable position full of realizable potential. It's written in OCaml, to boot.
In my mind when I think of abstractions above the level of individual machines, I pretty much automatically go to Erlang as the only language that I know of that operates at the right level of abstraction for that.
It's the only (semi-mainstream?) language that I know of that includes the infrastructure in the language itself to make the individual underlying machines (OS / hardware) appear irrelevant and allow programming to seamlessly span a group of computers.
I definitely wouldn't consider Mesos to be anything like an operating system for the datacenter. That's just marketing language and confuses things.
Mesos is basically an application scheduler. It doesn't manage the base operating systems or machine provisioning. Mesos is concerned with ensuring that one or multiple applications are launched and running on a cluster of machines.
The Saltstack framework is the only thing I know of that would provide all the primitives and control necessary for an operator to truly control a full 5-300,000 node datacenter from one station. It's the only thing out there that will allow for super low latency response to commands across the cluster. It also can do configuration management etc if you want, or a lot of people use it to just trigger existing chef/puppet jobs.
I've been using Saltstack in conjunction with Mesos to build out this full datacenter-as-OS stack. Works... :-)
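For the curious, driving a fleet from one station with Salt looks roughly like this through its Python client (a sketch: it assumes a salt-master with connected minions, and the targets and commands are just examples):

    # Sketch: fan commands out to connected minions from the salt-master
    # via Salt's Python API. Replies come back over the 0MQ bus rather
    # than per-host SSH sessions. Targets and commands are examples.
    import salt.client

    local = salt.client.LocalClient()

    # Ping everything; returns {minion_id: True, ...}.
    alive = local.cmd("*", "test.ping")
    print("%d minions responded" % len(alive))

    # Run an arbitrary command on a targeted subset, e.g. all web nodes.
    uptimes = local.cmd("web*", "cmd.run", ["uptime"])
    for minion, output in sorted(uptimes.items()):
        print(minion, output)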
Mesos itself is only considered the kernel of an operating system for the datacenter (which should also be the message of the article), not the entirety of what that OS _will_ be. It is Mesos in conjunction with distributed storage, app schedulers, and interactive and batch systems for data processing that will make the OS. Just like Mac OS X isn't the Mach kernel alone, but a wide palette of applications that makes the entire OS experience.
> It's the only thing out there that will allow for super low latency response to commands across the cluster. It also can do configuration management etc if you want, or a lot of people use it to just trigger existing chef/puppet jobs.
FWIW, Ansible also does low-latency communications through pipelining/accelerated mode. Ditto for the config management and triggering.
I don't believe that even with accelerated mode Ansible can come anywhere close to Salt for large deployments. It is still agentless and has to run from a single source over SSH, which is going to be slower than 0MQ. Also, using something like salt-syndic to create topologies lets you reliably scale to thousands of machines and still cope with the load.
A singular OS that acts on well defined abstractions and interfaces making the number of resources the massive machine uses transparent. Software won't be the only thing changing to support this shift in large scale computing but hardware as well. Would this type of machine be autonomous with little user management? AI could assist the machine in optimizing configuration. Pretty soon it could interface with drones that repair hardware and even add new nodes. Zero configuration environment. The operator simply tells it what it should be running, until Skynet is unleashed.
The thing I like about Mesos is that when you approach your problem you are planning for high availability from the beginning. You don't get that with Docker. Docker lets you make your provisioning stable and repeatable, but you don't get any advantage on HA; you are on your own there. Mesos, by contrast, guides the whole architecture of your app towards HA, which is nice.
Docker aids software deployment through the automation of Linux Containers (LXC). While Docker has made the life of developers and operators easier, Docker plus Mesosphere provides an easy way to automate and scale deployment of containers in a production environment.
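As a rough illustration (the image, ports and counts are made up, and the container field layout follows Marathon's Docker support as I understand it, so treat it as a sketch), scaling a container across the cluster is mostly a matter of declaring how many copies you want:

    # Hedged sketch: a Marathon app definition that runs a Docker image
    # and keeps 20 copies of it alive somewhere on the cluster. If a node
    # dies, Marathon reschedules the lost instances elsewhere, which is
    # the "HA by default" point above. Image, ports and counts are made up.
    docker_app = {
        "id": "/web",
        "instances": 20,
        "cpus": 0.25,
        "mem": 256,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "nginx:1.7",
                "network": "BRIDGE",
                "portMappings": [{"containerPort": 80, "hostPort": 0}],
            },
        },
    }
    # POST this to Marathon's /v2/apps endpoint, as in the earlier sketch.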
Combined with rump kernels, I can see this as the data center of the future: simply a microkernel running a specific application without the bloat of a full OS, while an orchestration unit/OS such as Mesos does all the setup and teardown of such services, providing resources on demand. Lovely.
Mesos is just one piece of the puzzle. I think the Hadoop project proper will be the first to get there, maybe in v3 or v4. That said, given the diversity under the Hadoop umbrella, I expect distributions to appear: just as RHEL is a GNU/Linux OS, there might be a (Cloudera?) Hadoop OS.
I wouldn't be betting my marbles on Hadoop, or anything MapReduce, nowadays. Spark beats it pretty handily for data processing, although HBase isn't too bad.
Not sure about that (I haven't used YARN, however). YARN seems extremely Hadoop-specific, whereas Mesos is a "closer to the metal" generic framework on which to build frameworks for scheduling things.
If I were to compare the two, I'd call YARN an application scheduler and Mesos more of a meta-scheduler. You could build YARN on top of Mesos; you likely couldn't easily build Mesos on top of YARN. It should be relatively trivial to make a YARN framework for Mesos, which someone has already done:
True, but Hadoop is the whole tooling, and with YARN they are distancing themselves from MapReduce. IMHO it looks like they are heading in the right direction but aren't quite there. That's why it may be v4 before we can think of Hadoop as an OS for the datacenter.
Mesos is closer to the metal but is still only a piece of what is needed. Picturing it as an OS is misleading, while picturing Hadoop v2 as a proto-OS is not that inaccurate.
YARN is still in its infancy and doesn't run in production for user-facing applications at any large site. Also, there are big limitations when it comes to scaling YARN. YARN also has a lot of overhead on each machine the worker processes run on.
Twitter runs Mesos on multi-10k-node clusters for user-facing, production applications.
I feel like this is on the right track, but sadly fails to address the underlying problem. Until the network is virtualized everything else will suffer. If the cloud is to become the default platform for services (I think it will), then the user must be able to define their own virtual data center. This includes their own virtual nets. What the author writes about will be built upon these virtual nets, but until the network itself is able to be programmed we'll likely be drifting between optimal states of computing.
The DCOS supports network virtualization. Changes in underlying infrastructure can be manifested all the way up the stack to applications and their schedulers. E.g., the DCOS can rewrite network routing tables based on placement decisions.
So true; think about what multi-tasking and virtual memory have done for developer convenience _and_ system performance over the last decades. Bet we are going to see the same effects in this space too - there are so many interesting things to do in scheduling! Especially when you can distinguish between clever placement of long lived tasks vs fitting short-lived low-latency jobs.
On another note; next up: "data center_s_ need an operating system". Very interested to see the move towards multi DC/availability zones.
Datacenter and back-end apps simply don't fit on a single machine anymore. Every app of reasonable scale is probably a distributed system of some sort. On top of that, there is a new class of "apps" (or, more precisely, datacenter services) that were built to operate across fleets of machines from day one, such as Spark, Hadoop, Cassandra, Kafka, Elasticsearch, and so on.
On the other hand, a single machine can get pretty big today. For example, http://www.supermicro.com/products/system/2U/6028/SYS-6028U-... is 2U, has 12 drive bays and can have 1.5 TB of RAM. You could do 24 bays if you are using 2.5" drives (most SSDs). A lot of complexity can be avoided by getting many fewer, but bigger, machines (it doesn't help much if you need massive CPU, though, but Intel's architectures are getting better every cycle).
The problem with this is that singular systems require a lot of engineering effort and produce interesting side effects when you attempt to make them reliable.
And sharing state between them? Or ensuring that only one is doing things when only one should be doing things? (STONITH needs to be implemented somehow.)
Then there are the performance penalties you enjoy if you go down the shared-disk path (which is still going to be a fun failure when it does fail).
Distributed computing is now the norm, not the exception, and we need a data center operating system that delivers a layer of abstraction and a portable API for distributed applications.
And then we'll need an OS for handling multiple data centers. It's gonna be a crazy day the first time someone realizes they accidentally rebooted 5% of all servers in the western hemisphere.
This screams Erlang all over the place: "portable", a common API everywhere, automatic aggregation of resources, processes spawning wherever there is a CPU that's a little bit too idle, process isolation...
I'm no Erlang expert, but it seems to me all the fundamental bricks are here (and were for multiple decades!) to build this "datacenter platform".
Mosix, Virtual Iron, vNUMA, ScaleMP, etc. are variants of the RPC fallacy: making remote/slow/independent things pretend to be local/fast/fate-shared. http://www.eecs.harvard.edu/~waldo/Readings/waldo-94.pdf Successful distributed software tends to go the opposite direction, making asynchrony and unreliability the base case.
This article feels out of touch to somebody who frequently types mpirun on a login node. It seems like that level of abstraction is already achieved by MPI and the task schedulers that come with it.
Given that VMS futures, POSIX support, RMS features and programming, server and application scaling, and VM sprawl have all been discussed recently, this is really something else to ponder.
Well the main idea behind Mesos is to create an aggregation layer that spans server capacity in local and remote datacenters. This aggregation does not tightly bind the servers into a shared memory system, but rather treats them like a giant resource pool onto which applications can be deployed. It is a bit like the inverse of server virtualization really.
Mesos runs quite nicely on CoreOS. Mesos doesn't replace the native Linux on each of the boxes in the datacenter. The Linux on each box provides the execution environment.
CoreOS is still restricted to one machine: it's about installing one operating system image which you then run on multiple machines. With CoreOS you have X machines, each running CoreOS, and Y applications running on top, with the CoreOS installations coordinating.
I think the point of the article is to have one operating system, running multiple physical boxes. You could then in turn have something like containers running on top of that OS.