Say a non-OS hacker wants a unikernel. What's the sanest way to go about getting to that?
Options that come to mind are:
- build your application as a linux kernel module, load it into a normal kernel, and generally ignore the userspace that runs anyway
- take Linux and hack it down pretty aggressively plus splice your code into it
- find some github unikernel effort and go from there (which I think the OP does)
- take some other OS - freebsd? - and similarly hack out parts
Other?
I like the idea of an x64 machine running a VM connected to a network card as a generic compute resource that does whatever tasks are assigned by sending it data over the network. It's not been worth the hassle relative to a userspace daemon, but one day I may find the time, and I'd be interested in the HN perspective on where best to start the OS-level hackery.
> The Unikernel Linux (UKL) project started as an effort to exploit Linux’s configurability.. Our experience has led us to a more general goal: creating a kernel that can be configured to span the spectrum between a general-purpose operating system, amenable to a large class of applications, and a highly optimized, possibly application- and hardware-specialized, unikernel... other technologies occupying a similar space have come along, especially io_uring and eBPF. io_uring is interesting because it amortizes syscall overhead. eBPF is interesting because it’s another way to run code in kernel space (albeit for a very limited definition of “code”).
> Unikernel Linux (UKL) is a small patch to Linux and glibc which allows you to build many programs, unmodified, as unikernels. That means they are linked with the Linux kernel into a final vmlinuz and run in kernel space. You can boot these kernels on baremetal or inside a virtual machine. Almost all features and drivers in Linux are available for use by the unikernel.
For starters, assuming the Linux variant: build a statically compiled application, pack it into an initramfs as the only file there (for simplicity name it `/init`), bundle the initramfs with the kernel, boot. At that point your app should be PID 1 and the only process running (with the exception of a bunch of kernel threads), and you can do whatever you want.
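A minimal sketch of what that `/init` could be, assuming a statically linked C program (anything that doesn't need a dynamic loader works); note the initramfs also needs a /dev/console node (or an early devtmpfs mount) for printf() to reach the kernel console:

```c
/* init.c - sketch of a do-everything-yourself PID 1. Assumes it is statically
 * linked (e.g. gcc -static -o init init.c) and packed into the initramfs as
 * /init. For printf() to go anywhere, the initramfs also needs a /dev/console
 * device node, or you mount devtmpfs yourself early on. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("hello from PID %d\n", getpid());  /* should say PID 1 */

    /* ... mount filesystems, bring up networking, run your workload ... */

    for (;;)
        pause();  /* never return: if PID 1 exits, the kernel panics */
}
```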
There are projects that permit statically linking a traditional kernel with a traditional application into a unikernel. NetBSD pioneered this with their rump kernel build framework, and I believe there's at least one Linux build framework that mimics this. The build frameworks cut out the syscall layer; an application calling read(2) is basically calling the kernel's read syscall implementation directly. Often you don't need to change any application source code. The build frameworks handle configuring and building the kernel image, and statically linking the kernel image with your application binary to produce the unikernel image.
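To make the "no source changes" point concrete: the application side is just ordinary POSIX code, and under a rump-style build the read() below ends up resolving (roughly speaking) to the kernel's own implementation as a function call rather than a syscall trap. A minimal sketch, with an arbitrary example file:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ordinary POSIX code: nothing here knows whether read() traps into a kernel
 * or, in a rump-style unikernel build, links straight against the kernel's
 * read implementation. */
int main(void)
{
    char buf[256];
    int fd = open("/etc/hostname", O_RDONLY);  /* just an example file */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof buf - 1);
    if (n < 0) { perror("read"); close(fd); return 1; }

    buf[n] = '\0';
    printf("%s", buf);
    close(fd);
    return 0;
}
```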
I probably should have mentioned I've built unikernels with some of the tooling you've described here. It just seems very academic and edge-case compared to a single static userspace Linux binary; that technically isn't a by-the-book unikernel, but I guess my point was that it's diminishing returns beyond that.
There is a framework for OCaml for this: https://mirage.io/
So if you are interested in learning OCaml and want a unikernel, this would be a possible path to take.
OCaml is a good language but perhaps unikernel does not mean what I thought it did:
> fully-standalone, specialised unikernel that runs under a Xen or KVM hypervisor.
Or maybe xen / kvm are no longer called operating systems?
I'm interested in having my code be responsible for thread scheduling and page tables - no OS layer to syscall into - but am not as keen on DIYing the device drivers to get it talking to the rest of the world.
> I replace the [QubesOS] Linux firewall VM with a MirageOS unikernel. The resulting VM uses safe (bounds-checked, type-checked) OCaml code to process network traffic, uses less than a tenth of the memory of the default FirewallVM, boots several times faster, and should be much simpler to audit or extend.
> Or maybe xen / kvm are no longer called operating systems?
> I'm interested in having my code be responsible for thread scheduling and page tables - no OS layer to syscall into [...]
You might be confusing Xen and KVM here? Xen and KVM are rather different in this regard.
KVM runs on a full Linux kernel (as far as I know). But running your application as a unikernel on top of Xen is more comparable to the old Exokernel concept.
> - take Linux and hack it down pretty aggressively plus splice your code into it
But rather than starting with a Linux distro and hacking it down, I'd start the other way: boot the kernel directly (via a UEFI bootloader). You can embed a basic filesystem structure (/dev, /proc, /etc, etc.) in a binary blob inside the kernel file itself at build time (kind of dumb that this is required at all, but it is). The kernel itself has basically everything you'd need (for any reason you'd want a unikernel).
The problem with Unikernels is that there is no middle ground between a button-smashing user and a kernel hacker. If you open the hood, everything is part of the kernel, and most (all?) existing examples of Unikernels lack proper tracing and debugging support. It will feel like debugging an eight-bit MCU (printf() and GPIO writes) running a far larger (and more complex) code base through upward emulation.
Actually this isn't a fundamental issue with unikernels, but rather an implementation one. For instance, check out debugging in Unikraft: https://unikraft.org/docs/internals/debugging .
A couple unikernel projects that caught my eye in the past may be of interest to you. I have no experience with them, so I can't speak to their quality though.
A very basic kernel isn't that hard to make. I think currently the easiest way would be to follow this series of blog posts by Philip Oppermann: https://os.phil-opp.com/
He made a few crates which handle the boot process, paging, x86 structures and more.
It still exists: https://wiki.xenproject.org/wiki/Mini-OS . But beware that this is no more than a small reference OS, there's a massive gap between getting it to just boot and running real-world applications with it.
Nice project! I love WASM. It's designed to be sandboxed and portable from day one. I wish WASM had been invented instead of JavaScript in the 90s. WASM will eat the world.
What I hope for most is endurance. There are many programs that we are not able to run anymore; the best examples are probably older games. I hope WASM will change that. I'm a little bit nervous about the addition of new features, because simple specs have a higher chance of surviving, but the future of binaries looks exciting.
Believe it or not, back in the 90s we thought (on the whole) that web browsers were for browsing hypertext documents, not for replacing the operating system. There's a reason JS started out limited to basic scripting functionality for wiring up e.g. on-click handlers and form validation. That it grew into something else is not indicative of any design fault in JS (tho it has plenty), but of the use it was shoehorned into. The browser as delivery mechanism for the types of things you're talking about is... not what Tim Berners-Lee or even Marc Andreessen had in mind?
I have very mixed feelings about WASM. There is a large... hype-and-novelty screen held up in front of it right now.
There are many Bad Things about treating the web browser as nothing more than a viewport for whatever UI designer and SWE language-of-the-week fantasy is going around. Especially when we get into things like accessibility, screen readers, etc.
As for the people treating WASM as the universal VM system outside the browser... Yeah, been down that road 30 years ago, that's what the JVM was supposed to be? But I understand that's not "cool" now, so...
The main problem with HTML/CSS/JS is that programmers want more than these languages offer. With WASM you can pick the language (as long as it compiles to .wasm) that fits your use case best. This is the freedom most programmers want.
There will always be programmers who will draw their custom buttons (instead of modifying the DOM from WASM) and ignore accessibility. They can do this with JS as well, but most of them don't.
The original "sin" is that the browser became the delivery tool for what you're talking about. Whether it's a sin or not is of course a matter of opinion.
But it's odd that after all these years the browser killed off a big chunk of "native" apps on the desktop, while on mobile it's a whole other story.
Which makes me think the problem all along was about distribution, not technology.
I keep hoping others see this as well. Sun was so close to the right thing, but the problem is too hard to monetize and it's too vulnerable to embrace, extend, and extinguish.
Well, Sun did, I think, couple the JVM (the VM) too closely to Java (the language). And really, on purpose. WASM doesn't make that mistake at least.
But it's also missing, like, a garbage collector and other things that the JVM offered up and did really really well. People are doing dumbass stuff like running garbage collected interpreters inside WASM, inside V8 (which has its own GC) in the browser. It's like nested dolls, just pointless tossing of CPU cycles into the wastebin. Their (or their VC's) money, but jeez.
You can say "oh, that's coming" (GC extensions in WASM) but that hardly inspires confidence because it took 20 years for the JVM to reach maturity on this front. Best case scenario we'll have a decent GC story in WASM in 10.
That is always bound to happen. Even when a bytecode is designed from the ground up to support multiple languages, eventually one of them ends up winning, as it is too much mental complexity to always keep moving the platform forward with all of them in mind.
Eventually one of them emerges as the main one, and then there are all the others, not necessarily having access to everything like in the early days.
One sees this in the Amsterdam toolkit, IBM TIMI, TDF, and more recently CLR, where it seems to mean C# Language Runtime instead of the original Common Language Runtime, since the .NET Framework to .NET Core transition, and decrease of investment into VB, F# and C++/CLI development and feature parity with C#.
The thing that nags me with WASM is how so many people try to sell it, as if it was the very first of its kind.
> The thing that nags me with WASM is how so many people try to sell it, as if it was the very first of its kind.
I don't get that vibe. Just ask, how do you get to write applications with good, predictable performance, perhaps with multithreading and explicit memory management, in the browser?
It doesn't matter how much of this has existed before in some form or shape. It's about the "product" more than it is about grandiose ideas (and the product might not be completely there yet; at least it wasn't some 3 years ago).
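As a hedged sketch of that pitch: plain C with explicit memory management, compiled for the browser with a toolchain like Emscripten (the build line and export name here are just an example):

```c
#include <emscripten.h>

/* Plain C with explicit, predictable memory behaviour, exported to JS.
 * Build sketch (assuming the Emscripten toolchain is installed):
 *   emcc dot.c -O2 -o dot.js      # emits dot.js + dot.wasm
 * From JS, after the runtime loads, call Module._dot(ptrA, ptrB, n)
 * with arrays copied into the wasm heap. */
EMSCRIPTEN_KEEPALIVE
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```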
There are two separate, orthogonal channels of discussion that I think people are poking at.
1. WASM as a browser tech for delivering rich applications inside the browser. On this one I will shrug. I understand the motivation. I don't particularly like it, because my vision of the "web" is not that, but it's a lost battle and I don't have a horse in this race. It's effectively the resurrection of Java applets, but done better, and more earnestly. It's going to solve the kinds of problems you're talking about, I guess, but introduce new ones (even more inconsistency of UX, accessibility features, performance issues, etc.)
2. WASM as a general / universal runtime for server side work. On this, I see a lot of hype, thin substance, a lot of smoke but no fire, and I'm quite skeptical. It looks to me like classic "Have a Hammer, Going to Go Find Nails" syndrome. I was initially enthused about this aspect of WASM, but I was employed working with WASM for a bit and I found a lot to be skeptical about. And while I likely will be using WASM in some fashion similar to this for a project I have, I am also not convinced that WASM itself makes a lot of sense as some sort of generic answer for containerization; it looks to me like duplication of effort, claims of novelty where there is none, unhealthy cycles in the tech industry, etc.
Anyways, I think the person you're replying to, and myself, are primarily talking about #2 -- as was the original article
All those VC-powered companies selling WASM containers in Kubernetes as if application servers weren't a thing 20 years ago, or as if IBM hadn't been shipping TIMI executables for decades.
Or talking about how "safe" WASM happens to be, while there are already some USENIX papers slowly making their appearance regarding WASM-based attacks.
I naively hope the web bifurcates into sandboxed wasm apps and document content that doesn't even need js, much less wasm. I'm not sure what a middle ground would look like or why I'd want it. But the realist in me knows wasm will eat the document content too, meaning adblockers and reader view are doomed...
Maybe. As inconvenient as accessibility is, with any luck the need to make web content legible to screen readers will also keep adblockers working. Even with wasm, I don’t think the DOM is going anywhere any time soon. I haven’t seen any proposal to replace it.
You are probably right. Raster frameworks that talk straight to a GL context are out there; eframe/egui is one I've used. And yeah, accessibility is bad. Pair that with encrypted websockets and webTPM (which, if it isn't a thing yet, will be) and you won't have any control over the chain between the screen and the server.
I absolutely love this. I also hadn't seen several of the linked technologies before, so I'm bookmarking all of them, too.
Next up, I want to configure the hypervisor with a WireGuard connection (possibly through something like Tailscale to establish connections?)...
So I have WebAssembly over here on this machine, talking directly to this WebAssembly over there. Based on configuration and capabilities being passed in. Rather than based on the process opening TCP connections to random locations.
I am fairly sure someone will make some. Just like we had Lisp machines and even specific JVM CPUs.
But my prediction is that those will always stay niche, because running WASM on conventional stock hardware will always be faster in general. Mostly because WASM was designed to run fast on stock hardware, and the economics of scale for conventional general purpose processors are much better.
Compare also how the 'International Conference on Functional Programming' started out as the 'Functional Programming and Computer Architecture' conference, but then people figured out how to compile lazy functional programming languages like Haskell to run efficiently on conventional hardware.
Similarly for the Lisp and Java machines: one reason we don't see things like them anymore is that compiler technology has caught up.
Won't speak to WASM, or I'll go all "get off my lawn."
But to me the value-sell of unikernels is: 1) Perf; squeeze out some extra cycles by throwing overboard things you don't need and pulling things into "ring 0" that you do 2) Simplicity; potentially reduce complexity by ditching some of the things you don't need and 3) Security; potentially change the attack surface ... again, by....
To be clear: I don't think this is right for writing microservices and webapps like most of the people on this forum are employed doing... I think the use case is more for people building infrastructure (databases, load balancers, etc. etc.)
as micro VMs can for some (not all) tasks compete with Linux containers, but have the benefit of not exposing your Linux kernel to less trusted code
hence why e.g. some cloud-on-the-edge providers convert your Docker image to a micro VM when running it
so maybe some use can be found there
though WASM in a micro VM on the edge will probably have a hard time competing with WASM as a sandbox on the edge, as such providers probably have an easier time adding useful boundary features/integrations
Only because they are already five years late adding GC support. And even then, WASM isn't the first bytecode format supporting C and C++; there have already been a couple since the 1980s.
As "promised" years ago in Birth & Death of Javascript [0], at some point we shall get a unikernel running a safe GC-collected runtime in kernel-space, at which point we could drop virtual memory mapping support from CPUs, making them faster. While in 2014 the author predicted this will be JS with asm.js, now WASM seems like the way to go. Can't wait (haha)!
> drop virtual memory mapping support from CPUs, making them faster.
In the video, his argument was that the browsers are single-process anyway, and if everything runs in that process, we don't need that separation. However, since then, we've learned that single-process browsers are a security nightmare, so these days browsers are actually not single-process anymore to provide proper sandboxing.
But I love how close to correct that video is, and it's interesting to see in what ways it turned out to be wrong.
Defense-in-depth is always best practice in security. The more layers the attacker has to break and the harder each layer is, the better. All layers can and will be broken.
Apple has spent a long time hardening the JavaScriptCore web sandbox to run untrusted code. We’ve come a long way since JailbreakMe’s web-based jailbreak, but ultimately memory safety requires participation from all parts of the stack and JavaScriptCore and V8 are still both written in C++. You can trigger memory-safety vulnerabilities in the host VM using guest code.
wasmtime is supposedly a hardened WebAssembly runtime written in Rust, but it’s also a JIT, and I have no idea if anyone has put it through its paces security-wise yet. The idea is that WebAssembly can have JIT-like performance without JIT-like security concerns thanks to a simpler translation layer and minimal runtime.
I could see an argument for dropping some layers if the VM isolation becomes stronger.
> The more layers the attacker has to break and the harder each layer is, the better.
No it's not, when it comes to end-user app performance, experience or privacy.
Sure, by adding security we can have another reason to let developers end up with a golang app compiled to WASM, running within Electron, sandboxed through API redirection (OS + antimalware/antivirus/BPF-based EDR), and use it for, like, listening to music in a very secure way...
With all these layers happily streaming all kinds of telemetry to who knows where, with owning nothing but a bunch of numbers behind a ton of DRM layers, and with no ability to change things, to the point where we can't have an app's theme matching system colors because of crossplatform compatibility/security reasons.
I don't want to accept developers' assumption that these have to be enabled by default.
Case 2, Windows: can't even do a build of a trusted codebase under IntelliJ without antimalware adding, like, +150% to build time. While IntelliJ (or some of its extensions or plugins that creep up during development) is happily reporting that performance issue back to its masters. Ugly.
This may change if you're using a bunch of WASM sandboxes. The browser would split its memory up into multiple sandboxes with a process-like interface, but one that doesn't need virtual memory.
Amen. Single-address-space OSes would be cool for running trusted code with minimal overhead in-kernel while avoiding crashing the machine because of a bug. But I want more sandboxing, not less, when running untrusted code.
Virtual memory and paging isn't just about protection/security/process isolation. It's also about making the most effective use of physical memory -- process virtual usage can exceed process RSS and not just because of swapping -- and providing a set of abstractions for managing memory generally. The OS and the allocator are working together, with the OS having a lot of smarts on machine usage in order to make that Fairly Smart in the general case.
So I don't think there's an automatic win in terms of performance by ridding yourself of it. Especially if you're running through the (pretty slow) WASM VM layer anyways.
For general applications esp those written to a POSIX standard or making assumptions that the machine they're running on looks like a typical modern day computer? Dubious. You'd end up writing a bunch of what the VMM layer does in user code.
Also how would software memory protection (like seen in JVM, JavaScript, Python, ...) be faster than hardware MMU? Hardware simply adds more transistors that run the translation concurrently. Faults are either bugs (segfaults) or features you'd have to reimplement anyways.
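To make the overcommit point above concrete, here's a rough Linux-flavoured sketch (sizes are arbitrary) of virtual size vs resident size, one of the things you'd have to give up or reimplement without paging:

```c
#define _GNU_SOURCE   /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = 8UL << 30;   /* 8 GiB of address space ... */
    size_t touch   = 64UL << 20;  /* ... of which only 64 MiB gets used */

    char *p = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 1, touch);  /* only these pages become resident */

    /* Compare VmSize (~8 GiB) with VmRSS (~64 MiB) in /proc/<pid>/status */
    getchar();
    return 0;
}
```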
Paging implemented naively needs a handful of extra memory accesses to fetch and decode page tables, for each application memory access, which is obviously very expensive. Which is why we have TLBs, which are (small) caches of page table data.
However, the 4kiB page size that is typically used and is baked into most software was decided on in the mid-1980s, and is tiny compared to today's memory and application working set sizes, causing TLB thrashing, often rendering the TLB solution ineffective.
Whatever overhead software memory protection would add is likely going to be small in comparison to cost of TLB thrashing. Fortunately, TLB thrashing can be reduced/avoided by switching to larger page sizes, as well as the use of sequential access rather than random access algorithms.
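For reference, one way to get larger pages on Linux today without changing an application's memory layout is transparent huge pages via madvise(); a minimal sketch, assuming a THP-enabled kernel:

```c
#define _GNU_SOURCE   /* for MAP_ANONYMOUS / MADV_HUGEPAGE on glibc */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;  /* 1 GiB working set */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advisory: with THP enabled, this region can be backed by 2 MiB pages,
     * cutting the number of TLB entries needed by up to ~512x. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise");  /* falls back to 4 KiB pages, still correct */

    /* ... touch the memory, run the workload ... */
    return 0;
}
```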
I don’t get this. Any software implementation of virtual address space is going to need translation tables and “lookaside” caches. But now those structures are competing with real application data for L1 space and bandwidth, not to mention the integer execution units when you use them.
As I understand, the Smalltalk world put a lot of engineering effort into making the software-based model work with performance and efficiency. I don’t think the results were encouraging.
The software-implementation would not have to be a direct emulation of what the hardware does. You are working with the type-system of whatever sandboxed language you are running, and can make much more high-level decisions about what accesses would be legal or not, or how they should get translated, instead of having to go through table lookups on each and every memory access. If you trust the JIT or the compiler you can even make many of the decisions ahead of time, or hoist them outside of loops to virtually eliminate any overhead.
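A sketch of what "hoisting the check out of the loop" can look like; the region/length names here are purely illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Naive software protection would bounds-check region[start + i] on every
 * iteration. A compiler/JIT that can reason about the loop proves the whole
 * range once, up front, and the loop body runs with no per-access check. */
int64_t sum_region(const uint8_t *region, size_t region_len,
                   size_t start, size_t count)
{
    /* single hoisted check, phrased to avoid overflow in start + count */
    if (start > region_len || count > region_len - start)
        return -1;  /* a real runtime would trap here */

    int64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += region[start + i];  /* no check inside the loop */
    return sum;
}
```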
Real answer: because the software implementation works by proving mathematically (without running the code) that it won't violate the virtual address space reserved for it by the kernel.
Then, at runtime, it does nothing at all. Which is very fast.
Paging and lookaside tables are needed for virtual->physical translation. The idea is that a pure software-based implementation wouldn't need it at all; at most it would use something segment-like (with just a base offset and segment bound), which is much easier to handle.
Then again, that's the theory; in practice there are many reasons why hardware moved from early segment-based architectures to paging, and memory isolation is only one of them.
You have lighter context switches [0] and finer-grained security domains; consider e.g. passing a pointer versus de/serialising across process boundaries. (The former benefits the latter too, since there's less of a performance cost to cutting up software into more domains.)
It probably isn’t worth digging too much into what was essentially a joke. I think the claim is that one would sufficiently trust the safety guarantees of the compiler/runtime to not need any runtime memory protection (software or hardware).
The hardware MMU does have costs: TLBs are quite small, and looking things up in a several-layer tree adds a lot of latency. If VM were fine, no one would care much about hugepages, and yet people do care about them. (Larger pages mean fewer TLB misses and fewer levels in the tree to look up when there is a miss.)
> Consequently, modern processors have extremely large and highly associative two-level TLBs per CPU — for example, Intel’s Skylake chip uses 64-entry level-1 (L1) TLBs and 12-way, 1,536-entry level-2 (L2) TLBs. These structures require almost as much area as L1 caches today, and can consume as much as 10 to 15 percent of the chip energy.
Bhattacharjee, Abhishek. "Preserving virtual memory by mitigating the address translation wall." IEEE Micro 37.5 (2017): 6-10.
The thing is, now we use and pay the price for both - memory is managed in software, and yet CPU MMU and caches have to sacrifice space on the die for complex memory mappings. Instead we could get extra transistors for better performance (or, like in Apple CPUs, dedicated instructions for GC languages).
There are no instructions for GC'd languages. That was the old Jazelle ARM extension (which could microcode some of Java's bytecode for direct execution).
The "JavaScript instruction" is FJCVTZS, which is a rounding mode matching x86 semantics, which is incidentally what JS specifies for double -> int32 conversions, and soft-coding it on top of FCVTZS is rather expensive (it requires a dozen additional instructions to fix up edge cases).
This is beneficial to JavaScript (on the order of a percentage point on some benchmark suites, though pure JavaScript crypto can get high double-digit gains), but it's also beneficial for any replication of x86 rounding on ARM, including but not limited to emulating x86 on ARM (aka Rosetta 2).
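For the curious, the edge cases being fixed up are the ECMAScript ToInt32 rules; a rough C rendition of what the extra instructions have to accomplish when all you have is a plain truncating convert:

```c
#include <math.h>
#include <stdint.h>

/* Rough sketch of ECMAScript ToInt32: NaN and infinities become 0, everything
 * else is truncated toward zero and wrapped modulo 2^32. This is the fix-up
 * work a plain FCVTZS leaves to software. */
static int32_t js_to_int32(double d)
{
    if (isnan(d) || isinf(d))
        return 0;

    d = trunc(d);                       /* round toward zero */
    double m = fmod(d, 4294967296.0);   /* wrap modulo 2^32 (fmod is exact) */
    if (m < 0)
        m += 4294967296.0;

    return (int32_t)(uint32_t)m;        /* reinterpret as signed 32-bit */
}
```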
Does anyone know of attempts to add CPU instructions that allow JITs and compilers to mitigate Spectre by using speculation-safe instructions for safety critical checks? I could imagine a "load if less than" or similar instruction, which the compiler could use to incorporate the safety check into the load instruction and avoid a separate branch that could be mispredicted. Such an instruction would be documented to have no side effects (even timing side effects) if the condition were not met.
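For comparison, the software workaround in use today (in the spirit of Linux's array_index_nospec or WebKit's index masking) replaces the branch with a data-dependent mask; a sketch only, since production versions need inline asm or compiler barriers to keep the compiler from turning the mask back into a branch:

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp the index with a data-dependent mask instead of relying on a
 * conditional branch, so a mispredicted bounds check can't steer a
 * speculative out-of-bounds load. */
static inline size_t index_nospec(size_t idx, size_t size)
{
    size_t mask = (size_t)0 - (size_t)(idx < size);  /* all-ones iff in range */
    return idx & mask;                               /* 0 when out of range */
}

uint8_t table_read(const uint8_t *table, size_t size, size_t idx)
{
    if (idx >= size)
        return 0;                           /* architectural check */
    return table[index_nospec(idx, size)];  /* safe even if mispredicted */
}
```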
If the hardware is designed to support single address space OSes, it doesn't have to be a security problem. It can help avoid Spectre-like problems because it can lower the expected overhead of permission checks so far that there is no advantage to speculating on them instead of performing them.
I think you are confusing meltdown (a speculation attack on hardware permission checks which was patched in later revisions of intel silicon and never affected other vendors) with Spectre, a general family of attacks on speculative execution, which are generally unsolved.
You could of course add dedicated hardware to lower the overhead specifically of memory access permission checks. In fact most CPUs already do, it is called an MMU.
Safe-language OSes like Theseus don't have this class of problems, by their very design. I think it's a superior architecture to current conventional OSes which rely on hardware for protection.
My understanding is that conventional OSes rely on hardware to provide kernel and userspace data isolation, while Theseus relies on Rust compiler, as in safe Rust you can't access arbitrary memory locations.
By "this class of problems" I assumed you were talking about speculation attacks. How does the rust compiler help? Sorry, I'm not going to watch a talk.
I'm sorry, I did not mean that; a misunderstanding twice on my part. I meant that you can have a SAS SPL OS and have it be safe too. The Theseus Book simply states that relying on hardware for data isolation has proven a deficient approach, given the existence of such attacks.
Some folks in this subthread would benefit from re-acquainting themselves with some old OS research. I am specifically thinking of Opal [0], which differentiates the various roles virtual memory management plays. In Opal, all tasks (processes) share a single 64-bit address space (so you can just share pointers) but hardware provides page-level protection.
Without an MMU, swapping to disk becomes a sizeable challenge. I don't think WASM (or Java, or any other kind of VM) should assume it has infinite physical resources of any kind, but am not surprised that JS folk are so far away from hardware they will sometimes forget how computers actually work...
> Without an MMU, swapping to disk becomes a sizeable challenge.
Swapping object graphs out to disk (and substituting entry points by swap-in proxies) was a thing in Smalltalk systems, and I expect Lisp machines must have had their own solutions. For that matter, 16-bit Windows could (with great difficulty) swap on an 8086, and other DOS “overlay managers” existed. Not that I like the idea, necessarily, but this one problem is not unsolvable.
Well, let's add efficiency to the mix then (I used Smalltalk and LISP machines, and neither managed RAM effectively enough, to the point where emacs was... fast! at the time).
You'd still very much need virtual memory to isolate WASM linear memories of different processes, unless you want to range check every memory access. If we're dropping linear memory and using the new age GC WASM stuff, sure.
An exploit of the runtime in such a system would of course be a disaster of the utmost proportions, and to have any chance of decent performance you'd need a very complex (read: exploitable) runtime.
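For what it's worth, the way many existing runtimes avoid a per-access range check on 64-bit hosts is itself built on virtual memory: reserve more address space than a 32-bit linear memory can ever index and let the guard region fault. A hedged sketch of the trick:

```c
#define _GNU_SOURCE   /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Reserve 8 GiB: 4 GiB of 32-bit index space plus guard, all PROT_NONE. */
    size_t reserve = 8UL << 30;
    unsigned char *base = mmap(NULL, reserve, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Commit only the module's current linear memory. */
    size_t committed = 64UL << 20;
    if (mprotect(base, committed, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }

    base[committed - 1] = 42;  /* in bounds: fine, no explicit check emitted */
    /* An access past `committed` lands in the PROT_NONE region and faults,
     * which the runtime turns into a trap, with no per-access range check. */
    return 0;
}
```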
I suspect the underlying assumption here is that each WASM module/program would/could likely exist in its own unikernel on the hypervisor. Which is something I guess you could do since boot and startup times could be pretty minimal. How you would share state between the two, I'm unclear on, though.
The question is.. if you have full isolation and separation of the processes etc... why are you bothering with the WASM now?
haha, today's shiny network effect attractor is tomorrow's legacy quicksand to be abstracted, emulated or deprecated. The addition and deletion of turtles will continue.
> Put some WASM in a JVM in the WASM. In an OS. In a hypervisor.
I would like to see a single address space kernel with hardware for permissions and remapping split. This would enable virtually tagged and indexed caches all the way down to the last level cache without risking aliasing. There could be special cases for a handful of permission checks using (base, size) for things like the current stack, the largest few code blocks etc., relieving the pressure on the page-based permission check hardware, which could also run in parallel with cache accesses (just pretty please don't leave observable uarch state changes behind on denied accesses). To support efficient fork(), the hardware could differentiate between local and global addresses by xor-ing or adding/subtracting the process identifier into the address if a tag bit is present in the upper address bits. This should move a lot of expensive steps off the critical path to memory without breaking anything userspace software has to do. Add a form of efficient delegation of permissions (e.g. hardware-protected capabilities) and you have the building blocks to allow very fast IPC even for large messages.
In reality, I think there is always going to be a hypervisor to separate the various workloads, and the hypervisor is likely to keep using paging, to support dynamic memory partitioning -- though perhaps with a larger page size, so as to not create too much pressure on the TLB.
The earliest implementation was the Burroughs B5500, in 1961: a bytecode OS written in a safe systems language (ESPOL, shortly thereafter replaced with NEWP), where all hardware operations are exposed via intrinsics, and it is one of the first recorded uses of explicit unsafe code blocks.
The CPUs were microcoded, so the bytecode was for all practical purposes Assembly.