Unikernels are unfit for production

So, what’s the problem with unikernels? Let’s get a definition first: a unikernel is an application that runs entirely in the microprocessor’s privileged mode. (The exact nomenclature varies; on x86 this would be running at Ring 0.) That is, in a unikernel there is no application at all in a traditional sense; instead, application functionality has been pulled into the operating system kernel. (The idea that there is “no OS” serves to mislead; it is not that there isn’t an operating system but rather that the application has taken on the hardware-interfacing responsibilities of the operating system — it is “all OS”, if a crude and anemic one.)
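
To make the Ring 0 point concrete, here is a minimal C sketch (x86, GCC-style inline assembly; my own illustration, not from the article) that reads the current privilege level from the low two bits of the CS segment register. Run as an ordinary process it prints ring 3; the same code linked into a unikernel image would report ring 0, since everything in the image runs in the processor's privileged mode.

```c
#include <stdio.h>

/*
 * On x86, the low two bits of the CS segment register hold the
 * current privilege level (CPL): 0 for kernel/unikernel code,
 * 3 for an ordinary userland process.
 */
static unsigned current_ring(void)
{
    unsigned long cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    return (unsigned)(cs & 3);
}

int main(void)
{
    printf("running at ring %u\n", current_ring());
    return 0;
}
```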

So those are the reasons for unikernels: perhaps performance, a little security theater, and a software crash diet. As tepid as they are, these reasons constitute the end of the good news from unikernels. Everything else from here on out is bad news: costs that must be borne to get to those advantages, however flimsy.

Worth a read if you think unikernels are the new hotness.

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping! And these are just the crude, decades-old tools. There is certainly nothing modern like DTrace or MDB. From a debugging perspective, to say this is primitive understates it: this isn’t paleolithic — it is precambrian. As one who has spent my career developing production systems and the tooling to debug them, I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy. Production problems are simply hand-waved away — services are just to be restarted when they misbehave. This attitude — even when merely implied — is infuriating to anyone who has ever been responsible for operating a system.
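
A brief aside on why those tools vanish: ps, strace, and the rest are built on process abstractions that a conventional kernel exposes, such as /proc and ptrace(2). The sketch below (Linux on x86-64; my own illustration, not from Cantrill's article) attaches to a running process and reads its register state, which is roughly the mechanism strace builds on; a unikernel offers no process to attach to and no equivalent interface, so this whole class of tooling has nothing to hook into.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* Attach to the target; the kernel stops it and notifies us. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        perror("ptrace(PTRACE_ATTACH)");
        return 1;
    }
    waitpid(pid, NULL, 0);

    /* Read the stopped process's registers, as strace-like tools do. */
    struct user_regs_struct regs;
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == 0) {
#if defined(__x86_64__)
        /* Report where the target was stopped (instruction pointer). */
        printf("pid %d stopped at rip=%#llx\n", (int)pid,
               (unsigned long long)regs.rip);
#endif
    } else {
        perror("ptrace(PTRACE_GETREGS)");
    }

    /* Let the process continue on its way. */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}
```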

He mentions a talk he gave at DockerCon 2015 where he received strong applause after emphasizing the need to debug rather than just restart systems. I do see the point, but I also think the industry as a whole is pulling in the other direction on this. If your system is generally reliable [enough] and easily distributed, then there is a certain elegance to the notion of ignoring edge cases completely and just letting them die.

When you had one huge server, any small issue with it was worthy of detailed investigation - but when you have 10,000 tiny servers and it only takes 20 seconds to spin up a new one… it becomes a lot harder to justify the debugging, and pragmatism starts to kick in. Hard to say whether this will bite us in the long run.

It's probably already having a negative effect on the reliability of personal-computer software in general. It's easy to forgive Vivaldi for getting sluggish after a few days since it restarts so smoothly and preserves its state so well, but it would certainly be nicer if I never had to restart it.