Virtualization, emulation, simulation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Notes related to (mostly whole-computer) virtualization, emulation and simulation.

Some overview · Docker notes · Qemu notes

Virtualization, jails, application containers, OS containers

chroot jails

chroot() changes a process's apparent filesystem-root directory to a given path.

So it only really affects how path lookups work.

This is useful for programs that want to run in an isolated environment of files, such as a clean build system, which seems to be its original use.

At the same time, if a program wants to break out of a chroot, it will generally manage. It was never designed as a security feature, so it isn't a security feature. See chroot for more words about that.
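To make the "only affects path lookups" point concrete, here is a toy model in Python of how an apparent path maps to a host path. It is only a sketch: real lookups happen in the kernel, symlinks are ignored, and the jail directory /srv/jail is a made-up example.

```python
import posixpath


def resolve_in_chroot(jail_root, apparent_path):
    """Map a path as seen inside the jail to the corresponding host path.

    Toy model: ignores symlinks and everything else the kernel does.
    """
    # Resolve relative to "/" first, so that ".." cannot climb above the
    # apparent root (just like "/.." is "/" on a normal system).
    inside = posixpath.normpath(posixpath.join("/", apparent_path))
    # Then graft the result onto the jail directory on the host.
    return posixpath.normpath(posixpath.join(jail_root, inside.lstrip("/")))


if __name__ == "__main__":
    # /srv/jail is a hypothetical jail directory, used for illustration only
    for p in ("/etc/passwd", "../../etc/shadow", "/"):
        print(p, "->", resolve_in_chroot("/srv/jail", p))
```

Note that this models only well-behaved lookups. It says nothing about the ways a privileged process can step outside, which is exactly why chroot is not a security boundary.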

FreeBSD Jails

Inspired by early chroot() and the need to compartmentalize, BSD jails actually did have some security in mind.

A jail only lets processes talk to others in the same jail, and also restricts syscalls, sockets/networking and some other things.

It's been around for a while, is mature, and allows decently fine-grained control.

Solaris/illumos Zones

If you come from a Linux angle: these are much like LXC (verify), and were mature before it.

Due to the Solaris pedigree, this combines well with things like ZFS's snapshots and cloning.

Linux containers

There are a few kernel features that isolate, monitor, and limit processes.

So when we say that processes running in containers are really just processes on the host, that's meant literally. The main way in which they differ is that by default they can share/communicate nothing with processes that are not part of the same container.

While these isolations are a "pick which you want" sliding scale, 'Linux containers' basically refers to using all of them to isolate things in all the ways that matter to security, often with a toolset to keep your admin person sane, and usually further convenience tools. Docker is one such toolset.

The main two building blocks that make this possible are namespaces and cgroups.

Just those two would probably still be easier to break out of than a classical VM, so when you care about security you typically supplement them with:

  • capabilities, intuitively understood as fine-grained superuser rights
  • seccomp, which filters allowed syscalls
  • SELinux / AppArmor / other mandatory access control. Often not necessary, but still good for peace of mind.
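As an illustration of capabilities being "fine-grained superuser rights": the kernel exposes a process's effective capability set as a hex bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask. The name table is deliberately a small subset of the bit numbers in linux/capability.h, and the cap_N fallback naming is made up here.

```python
# Small subset of capability bit numbers from <linux/capability.h>
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Turn a hex capability mask into a list of capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits are set
        if mask & (1 << bit):
            # bits not in our subset table get a placeholder name
            names.append(CAP_NAMES.get(bit, "cap_%d" % bit))
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Decode this process's CapEff line; empty list if it is unavailable."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return decode_caps(line.split()[1])
    except OSError:
        pass
    return []


if __name__ == "__main__":
    print(effective_caps())
```

An unprivileged process typically shows an empty set, root on the host shows nearly everything, and root inside a container usually sits somewhere in between because the runtime drops most capabilities.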

(Note that a good number of these building blocks resemble BSD jails and illumos zones)


Various other names around this area - docker, LXC, kubernetes, rkt, runC, systemd-nspawn, OpenVZ - are tools around the above - runtimes, management, provisioning, configuring, etc.

Some of them are the gears and duct tape, some of them are the nice clicky interfaces around that, and some of them are aimed at different scales or different purposes.

rkt and runC focus just on running containers: they are lightweight container runtimes (runC being a fairly minimal wrapper around libcontainer)

containerd - manages image transfer, lifecycle, storage, networking, execution via runc

docker is built on top of containerd (and indirectly runc), and is aimed at making single-purpose containers, at building them, and at portability.

LXC shares history, libraries, and other code with docker, but is aimed more at being a multitenant machine virtualisation thing.

While you can e.g. do microservices with LXC, and a fleshed-out system in docker, that's more work and bother and tweaking to get quite right (e.g. docker avoids /sbin/init not only because it's not necessary, but also because init does some things that you would have to work around, like setting the default gateway route).

kubernetes focuses almost purely on orchestrating (automated deployment, scaling, and other management of) systems across one or more hosts.

See also:

Linux OpenVZ

Hardware emulation, simulation, and virtualization - VM stuff


simulation means imitating surface behaviour at the highest level you can get away with.
emulation means replicating how the internals work, as exactly as sensible or required, often at a much lower level - either out of necessity (e.g. emulating how an old CPU works so you can run anything on it) or out of preference (e.g. emulating imperfections rather than just the specs)

And yes, this is regularly something of a sliding scale.

When hardware is compatible, and software environment is quite similar

When software environment is quite different

When hardware, or software environment, is completely different

Same-architecture emulation, virtualization - virtual machines

Hardware virtualization

LXC notes

A few notes on...


Namespaces limit what you can see of a specific type of resource, by implementing a mapping between within-a-container resources and the host's.

This is a cheap way to let each container see its own set of resources, and to let the host manage these as distinct subsets.

Linux has grown a handful of these so far. The six classic ones are listed here (cgroup and time namespaces were added later):

  • PID - allows containers to have distinct process trees
  • user - user and group IDs. (E.g. allows UID 0 (root) inside a container to be non-root on the host)
  • mount - can have its own filesystem root (chroot-alike), can have own mounts (e.g. useful for /tmp)
  • network - network devices, addresses and routing, sockets, ports, etc.
some interesting variations/uses
  • UTS - mostly for nodename and domainname
  • IPC - for SysV IPC and POSIX message queues
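The user namespace item above is a good example of what "a mapping" means in practice. The kernel describes it in /proc/<pid>/uid_map (and gid_map), one range per line as "ID-inside-namespace ID-outside-namespace length". The following sketch applies such a mapping. The function names are made up here, and the 100000 base is just a commonly seen example value, not something this page specifies.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(x) for x in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_id(entries, inside_id):
    """Translate an ID as seen inside the namespace to the outside ID.

    Returns None when the ID falls outside every mapped range.
    """
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


if __name__ == "__main__":
    # Example mapping: root inside is an unprivileged UID on the host
    example = parse_id_map("         0     100000      65536\n")
    print(to_host_id(example, 0))     # container root
    print(to_host_id(example, 1000))  # a regular user in the container
```

With a mapping like this, a process that is UID 0 inside the container is an ordinary unprivileged UID as far as the host's permission checks are concerned.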

For example, you can

sudo unshare --fork --pid --mount-proc bash

which separates the processes' PID namespace (plus the mount namespace, which --mount-proc needs in order to mount a fresh /proc). You can run top (because the rest of your filesystem is still there as before), you can talk to the network as before, etc. -- but no matter what you do you'll only see the processes you started under this bash.
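To check which namespaces a process actually sits in, you can look at the symlinks under /proc/<pid>/ns, whose targets look like pid:[4026531836]. Two processes are in the same namespace exactly when those numbers match. A minimal sketch follows (function names are made up here; it returns an empty result on systems without that directory):

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into (type, inode)."""
    m = NS_LINK.match(target)
    if not m:
        raise ValueError("not a namespace link: %r" % (target,))
    return m.group(1), int(m.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: namespace inode} for a process, {} if unavailable."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(namespaces_of().items()):
        print(name, inode)
```

Running this both outside and inside the unshare'd bash should show a different number for pid and mnt, and the same numbers for the rest.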

See also namespaces [1]


The control groups (cgroups) concept is about limiting, balancing and metering resources.

Note that cgroups apply to all of the host and not just containers. They just effectively default to no limits.

They're a useful tool to limit how crazy a known-to-misbehave app can go, without going near anything resembling namespaces or containers.

Resource types (which cgroups calls subsystems, or controllers) include:

  • memory - heap and stack and more (interacts with page cache(verify))
      allows the OOM killer to work per group
      this makes sense for containers, but also for isolating likely/known miscreants, so the OOM killer doesn't accidentally shoot someone else in the foot
  • cpu - mostly sets relative weights rather than limits (hard caps are possible via CFS bandwidth quotas)
  • cpuset - pin to specific CPUs (and memory nodes)
  • blkio - block IO limits and/or weights
  • network - does little more than tagging packets for egress
  • devices
  • hugetlb
  • freezer - allows pausing all processes in a cgroup
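The difference between a weight and a limit in the cpu item above is easiest to see with numbers. Under contention each group gets CPU time in proportion to its weight, and when a group is idle its share goes to whoever is busy. The sketch below is only that arithmetic, not the scheduler itself, and the group names are invented.

```python
def cpu_shares(weights, busy=None):
    """Fraction of CPU time per group, given relative weights.

    weights: {group name: weight}
    busy:    optional set of groups that currently want CPU time;
             idle groups do not count, which is why a weight is not a cap.
    """
    active = {g: w for g, w in weights.items() if busy is None or g in busy}
    total = sum(active.values())
    if not total:
        return {}
    return {g: w / total for g, w in active.items()}


if __name__ == "__main__":
    weights = {"web": 100, "batch": 300}       # hypothetical groups
    print(cpu_shares(weights))                 # both busy
    print(cpu_shares(weights, busy={"web"}))   # batch idle: web gets it all
```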

Cgroups set up a hierarchy for each of these (in cgroup v1; v2 uses a single unified hierarchy). Nodes in each hierarchy refer to (a group of) processes.

The API for cgroups is a filesystem, typically mounted at /sys/fs/cgroup. That interface being verbose and finicky, there are now nicer tools for it.
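A quick way to see where a process sits in those hierarchies is /proc/<pid>/cgroup. Each line is "hierarchy-ID:controller-list:path", where the path is relative to the cgroup filesystem mount, and cgroup v2 shows up as a single line starting with "0::". A minimal parser sketch (function names are made up here, and the example paths in the comments are invented):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # e.g. "5:cpu,cpuacct:/some/group" (v1) or "0::/some/group" (v2)
        hier_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hier_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def own_cgroups(proc_file="/proc/self/cgroup"):
    """Cgroup membership of the current process; [] if unavailable."""
    try:
        with open(proc_file) as f:
            return parse_proc_cgroup(f.read())
    except OSError:
        return []


if __name__ == "__main__":
    for entry in own_cgroups():
        print(entry)
```

The path found here corresponds to a directory under the cgroup mount, and that directory's files are where limits and statistics are read and written.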

See also cgroups

Hardware level

Protection rings

Protection rings are the general idea of "you must be this privileged (or better) to do the thing."

For example, x86 processors have four such levels (rings 0 through 3). They gate privileged instructions and memory access, and also port IO.

For more concrete implementation details of the IO part, look for IOPL.

There, the CPU restricts IO that its boss (the kernel) says shouldn't happen (complementary to the MMU's or MPU's job of preventing memory accesses that shouldn't happen).
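As a rough model of the IOPL check in x86 protected mode: code may do port IO freely when its current privilege level (CPL, the ring number, lower being more privileged) is at most the IOPL, and otherwise only on ports that the task's IO permission bitmap allows. The sketch below models the bitmap as a plain set of allowed ports, which is a simplification, and the function name is made up.

```python
def port_io_allowed(cpl, iopl, port, allowed_ports=frozenset()):
    """Model of the x86 protected-mode check for a port IO instruction.

    cpl:  current privilege level, 0 (kernel) to 3 (user programs)
    iopl: IO privilege level set by the OS, 0 to 3
    allowed_ports: stand-in for the per-task IO permission bitmap
    """
    if not (0 <= cpl <= 3 and 0 <= iopl <= 3):
        raise ValueError("privilege levels are 0..3")
    if cpl <= iopl:
        # privileged enough: no per-port check needed
        return True
    # otherwise only ports explicitly opened up in the bitmap are allowed
    return port in allowed_ports


if __name__ == "__main__":
    COM1 = 0x3F8  # classic first serial port, just an example port number
    print(port_io_allowed(0, 0, COM1))          # kernel
    print(port_io_allowed(3, 0, COM1))          # ordinary program
    print(port_io_allowed(3, 0, COM1, {COM1}))  # program granted that port
```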

Most OSes only split between two:

  • "stuff the kernel can do", i.e. everything - often in ring 0
  • programs, which have to go via the OS to speak to most hardware anyway - often in ring 3

A few OSes have experimented with using more, and e.g. put drivers in a third ring (less privileged than the kernel, more than programs), but most decided this is not worth the extra bother, for both technical and practical reasons.


AMD-V and Intel VT-x and other hardware-assisted virtualization are regularly referred to as Ring -1.

It isn't really a ring, because it's an entirely different mechanism from IOPL, but the name gets the point across: it allows multiple OSes to run, unmodified, all assuming they are Ring 0 on the CPU, without running into issues.

There is also a System Management Mode (SMM), referring to a CPU mode where the regular OS is suspended and firmware code runs instead.

This was initially (386 era) meant for hardware debugging, later also used to implement some BIOS-level/assisted features like Advanced Power Management, USB legacy support, and a handful of other relatively specific things.

It is sometimes referred to as Ring -2. Also not accurate, though it could perhaps be used for something like that.

There was a specific exploit of a specific chipset[2] that did something similar to SMM, and is referred to as Ring -3. That is even less accurate, and instead seems to be a reference to the CPU sleep state this could be exploited in.(verify)

Some footnotes:

  • VT-x and such are often disabled by default on e.g. most home PCs
      partly because people theorize about Ring -1 malware: "if you don't need it, why risk it?"
      on certain server hardware it's enabled by default because of how likely it is you will need it
  • there is no point for software to use rings that the OS doesn't, because you would be using a security model that the OS doesn't enforce.

Memory protection

On most computers, memory is both virtualized and protected. It is the MMU that handles both.

On some platforms memory is not virtualised, but still protected, and there is an MPU doing the latter.
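A toy model may help separate the two jobs. Below, translation (virtual page to physical frame) and protection (may this access happen at all) are both done per page table entry, which is roughly what an MMU does in hardware. The page size, the entry layout and all names are simplifications chosen for this sketch.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Raised when an access is unmapped or not permitted."""


def translate(page_table, vaddr, write=False, user=True):
    """Translate a virtual address, enforcing per-page protection.

    page_table: {virtual page number: (frame number, writable, user_ok)}
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table.get(vpn)
    if entry is None:
        raise PageFault("page %d is not mapped" % vpn)
    frame, writable, user_ok = entry
    if write and not writable:
        raise PageFault("page %d is read-only" % vpn)
    if user and not user_ok:
        raise PageFault("page %d is kernel-only" % vpn)
    return frame * PAGE_SIZE + offset


def can_access(page_table, vaddr, write=False, user=True):
    """True if the access would succeed, False if it would fault."""
    try:
        translate(page_table, vaddr, write=write, user=user)
        return True
    except PageFault:
        return False


if __name__ == "__main__":
    table = {0: (5, False, True),   # read-only page, visible to programs
             1: (9, True, False)}   # writable page, kernel-only
    print(translate(table, 0x10))
    print(can_access(table, 0x10, write=True))
```

An MPU corresponds to keeping only the permission checks: addresses are used as-is (frame equals page), but accesses can still be refused.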

Cloudy environments

OpenStack notes

OpenStack was designed (originally by Rackspace and NASA) as a coherent set of components for setting up a compute cloud.

libcloud notes