Virtualization, emulation, simulation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
Notes related to (mostly whole-computer) virtualization, emulation and simulation.


Hardware level - emulation, simulation, and hardware virtualization

Generally,

simulation means imitating the resulting behaviour
sometimes just at the surface level (but that's often so specific-purpose that it's too much work)
so you often go lower level, but stop at the highest level you can get away with, because that often keeps things simpler
emulation means replicating how the internals work
as precisely as is sensible or required, often at a much lower level
either out of necessity (e.g. emulating how an old CPU works so you can run anything written for it)
or out of preference (e.g. emulating its imperfections as well as its specs)



Yes, this is regularly something of a sliding scale, but it turns out a lot of cases fall mostly on one side or the other.




When hardware is compatible, and software environment is quite similar

When software environment is quite different

When hardware, or software environment, is completely different

Same-architecture emulation, virtualization - virtual machines

Tangent: protection

Protection is not virtualization, but plays a useful part in it - and arguably a pretty necessary one in practice.

Memory protection

Protection rings



Protection rings are the general idea of "you must be this privileged (or better) to do the thing."


Basically all platforms designed for memory isolation will have some sort of privilege system.


For example, on x86 there are multiple privilege levels, called rings.

This is effectively the CPU being told (by the kernel) when to say no to IO accesses. Which is complementary to the kernel (VMM) telling the MMU (and then CPU) when to say no to memory accesses.


x86 processors have four such levels, and diagrams always make a point of this, but as far as I can tell, most OSes only actually use two of them:

"stuff the the kernel can do", i.e. everything - often in ring 0
programs, which have to go via the OS to speak to most hardware anyway - often in ring 3


A few OSes have experimented with using more, and e.g. put drivers in a third ring (less privileged than the kernel, more than programs), but most have apparently decided this is not worth the extra bother, for both technical and practical reasons.

OpenVMS apparently uses four.


For more concrete implementation details, look for IOPL.


ARM processors have no rings like that, but have a fairly regular user mode, and a supervisor mode that gives more control, which can be used in much the same way(verify).

not-really-rings

AMD-V and Intel VT-x and other hardware-assisted virtualization are regularly referred to as Ring -1.

Which they are not, because it's an entirely different mechanism from IOPL.

Yet it gets the point across: it allows multiple OSes to run, unmodified, while each can assume and act like it is Ring 0 on the CPU, and the hardware still ensures this doesn't cause conflicts, or big security holes.


There is also a System Management Mode (SMM), referring to a CPU mode where the regular OS is suspended and external firmware runs instead.

This was initially (386 era) meant for hardware debugging, later also used to implement some BIOS-level/assisted features like Advanced Power Management, USB legacy support, and a handful of other relatively specific things.

SMM is sometimes referred to as Ring -2. Also not accurate, though it could potentially be used for something resembling that.


There was a specific exploit of a specific chipset[1] that did something similar to SMM, and is referred to as Ring -3. This is even less accurate; it instead seems to be a reference to the CPU sleep state this could be exploited in(verify), and is not related to IO protection.


Some footnotes:

  • VT-x and such are often disabled by default on e.g. most home PCs
partly because people theorize about Ring -1 malware, "...so if you don't need it, why risk it?"
on certain server hardware it's enabled by default because of how likely it is you will need it
  • there is no point for software to use rings that the OS doesn't, because you would be relying on a security model that the OS doesn't enforce.
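
A quick check on Linux for whether the CPU advertises these extensions (when disabled in firmware, the flag typically doesn't show up):

grep -E -c 'vmx|svm' /proc/cpuinfo   # nonzero means VT-x (vmx) or AMD-V (svm) is advertised
lscpu | grep Virtualization          # similar summary, where lscpu is available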

Hardware virtualization

Paravirtualization

Nested virtualization

OS level - Virtualization, jails, application containers, OS containers

chroot jails

chroot() changes a process's apparent filesystem-root directory to a given path.

So it only really affects how path lookups work.

This is useful for programs that want to run in an isolated set of files - such as a clean build system, which seems to be its original use.


At the same time, if a program wants to break out of one, it will manage. It was never designed as a security feature, so it isn't a security feature. See chroot for more words about that.
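
The mechanics, as a minimal sketch (the path here is made up, and the tree must already contain whatever binaries and libraries you want to run):

sudo chroot /srv/buildroot /bin/sh    # runs sh with / now pointing at /srv/buildroot
# inside, path lookups can't name anything outside that tree - but that's all that changed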


FreeBSD Jails

Inspired by early chroot() and the need to compartmentalize, BSD jails actually did have some security in mind.

It only lets processes talk to others in the same jail, and considers syscalls, sockets/networking and some other things.

It's been around for a while, is mature, and allows decently fine-grained control.
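
A rough sketch of starting one from the command line (name, path, and address are made up, and the tree at that path must already be populated):

jail -c name=test path=/usr/jail/test host.hostname=test.example.org ip4.addr=192.0.2.10 command=/bin/sh
jls    # lists running jails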


Solaris/illumos Zones

If you come from a Linux angle: these are much like LXC(verify), and were mature before it.

Due to the Solaris pedigree, this combines well with things like ZFS's snapshots and cloning.


Linux containers

There are a few kernel features that isolate, monitor, and limit processes.


So when we say that processes running in containers are really just processes on the host, that's meant literally. The main way in which they differ is that by default they can share/communicate nothing with processes that are not part of the same container.


While these isolations are a "pick which you want" sliding scale, 'Linux containers' basically refers to using all of them, to isolate things in all the ways that matter to security. ...and often with a toolset to keep your admin person sane, and usually further convenience tools. Docker is one such toolset.


The main two building blocks that make this possible are namespaces and cgroups.

Just those two would probably still be easier to break out of than a classical VM, so when you care about security you typically supplement that with

capabilities, intuitively understood as fine-grained superuser rights
seccomp, which filters allowed syscalls,
SELinux / AppArmor / other mandatory access control. Often not necessary, but still good for peace of mind.

(Note that a good number of these fundamentals resemble BSD jails and illumos zones.)
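
For example, capabilities let you grant a binary one specific root-ish power instead of full setuid root. A minimal sketch (the binary path is made up; setcap/getcap come from libcap):

sudo setcap cap_net_bind_service=+ep /usr/local/bin/mydaemon   # may now bind ports below 1024
getcap /usr/local/bin/mydaemon                                 # shows what was granted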


Uses

Various other names around this area - docker, LXC, kubernetes, rkt, runC, systemd-nspawn, OpenVZ - are tools around the above: runtimes, management, provisioning, configuring, etc.

Some of them are the gears and duct tape, some the nice clicky interfaces around that, some aimed at different scales, or different purposes.


rkt and runC focus just on running containers (lightweight container runtimes, fairly minimal wrappers around libcontainer)

containerd - manages image transfer, lifecycle, storage, networking, execution via runc

docker is built on top of containerd (and indirectly runc), and is aimed at making single-purpose containers - at building them, and at portability.

LXC shares history, libraries, and other code with docker, but is aimed more at being a multitenant machine virtualisation thing.

While you could e.g. do microservices with LXC, and a fleshed-out OS in docker, both are more work and bother and tweaking to get them quite right (e.g. docker avoids /sbin/init not only because it's not necessary, but also because standard init does some stuff, like setting the default gateway route, that the inside of a container would have to work around).

kubernetes focuses almost purely on orchestrating (automated deployment, scaling, and other management of) systems within one or more hosts.
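
For a taste of the docker end of that, one command pulls an image and runs a single process in its own namespaces, with cgroup limits attached (the image and limits here are just examples):

docker run --rm -it --memory 256m --pids-limit 100 alpine sh
# 'ps aux' inside shows almost nothing, while the host sees it as ordinary processes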


See also:

Xen

Linux KVM

From a hosting perspective

A few notes on...

namespaces

Namespaces limit what you can see of a specific type of resource, by implementing a mapping between within-a-container resources and the host's.


This is a cheap way to have containers each see their own - and have the host manage these as distinct subsets.

Linux has grown approximately six of these so far:

  • PID - allows containers to have distinct process trees
  • user - user and group IDs. (E.g. allows UID 0 (root) inside a container to be non-root on the host)
  • mount - can have its own filesystem root (chroot-alike), can have own mounts (e.g. useful for /tmp)
  • network - network devices, addresses and routing, sockets, ports, etc.
some interesting variations/uses
  • UTS - mostly for nodename and domainname
  • IPC - for SysV IPC and POSIX message queues


For example, you can

sudo unshare --fork --pid --mount-proc bash

which separates off only the process's PID namespace. You can run top (because your filesystem is still there as before), you can talk to the network as before, etc. - but no matter what you do, you'll only see the processes you started under this bash.
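
Similarly, a user-namespace-only example (again using util-linux's unshare):

unshare --user --map-root-user bash
id    # reports uid=0(root) inside, while the host still sees your regular user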


See also namespaces [2]

cgroups

The control groups concept is about balancing and metering resources.

While they became well known via containers, cgroups apply to the whole host and not just containers - they just effectively default to no limits.


They're a useful tool to limit how crazy a known-to-misbehave app can go, without going near anything resembling namespaces or containers.

And/or just to report how much a process or group of them has used.


Resource types (which cgroups calls subsystems) include:

  • Memory - heap and stack and more (interacts with page cache(verify))
allows OOM killer to work per group
this makes sense for containers, but also for isolating likely/known miscreants, so the OOM killer doesn't accidentally shoot someone else in the foot
  • cpu - sets weights, not limits
  • cpuset - pin to a CPU
  • blockIO - limits and/or weights
also network, but does little more than tagging for egress
  • devices
  • hugeTLB
  • freezer - allows pausing in a cgroup


Cgroups set up a hierarchy for each of these. Nodes in each hierarchy refer to (groups of) processes.

The API for cgroups is a filesystem, /sys/fs/cgroup. Which, being verbose/finicky, there are now nicer tools for.
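
To make that concrete, a minimal sketch using the cgroup v2 filesystem directly (the group name is made up; needs root, and the memory controller enabled for children, see cgroup.subtree_control):

sudo mkdir /sys/fs/cgroup/demo
echo 104857600 | sudo tee /sys/fs/cgroup/demo/memory.max    # cap the group at 100MB
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs         # move this shell (and children) into it
cat /sys/fs/cgroup/demo/memory.current                      # metering: current usage in bytes

Of the nicer tools: on systemd machines, systemd-run --scope -p MemoryMax=100M somecommand does roughly the same in one line.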


See also

https://en.wikipedia.org/wiki/Cgroups
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
https://www.youtube.com/watch?v=sK5i-N34im8

LXC

More specific software

(...rather than things that mostly just name the building blocks)


Virtualbox

VMware

Parallels

Linux OpenVZ

See also:

QEMU

Multiple

bhyve

For BSD


Proxmox

Basically the combination of:

  • LXC for containers
  • KVM for fuller virtualization


Kubernetes

Qubes OS