Virtualization, emulation, simulation

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
Notes related to (mostly whole-computer) virtualization, emulation and simulation.


Hardware level - emulation, simulation, and hardware virtualization


simulation means imitating the resulting behaviour
sometimes just the surface level is enough (but that's often so specific-purpose that it's too much work)
so you often go deeper, stop at the highest level you can get away with and that does not actually make things harder

emulation means replicating how the internals work
as precisely as sensible or required, often at a much lower level
either out of necessity (e.g. emulating how an old CPU works so you can run anything on it)
or out of preference (e.g. emulating imperfections as well as the specs)

Theory often presents these as two alternatives, but in practice it is more of a sliding scale (a bimodal one, sure).

When hardware is compatible, and software environment is quite similar

When software environment is quite different

When hardware, or software environment, is completely different

Same-architecture emulation, virtualization - virtual machines

Tangent: protection

Protection is not virtualization, but plays a useful part in it - and arguably a pretty necessary one in practice.

Memory protection

Early CPUs (and various microcontrollers and some comparably simple processors today) ran with all code having access to everything. Why not, you know?

Also, early computers tended to run exactly one thing at a time. There is no reason to distrust what else might be happening when nothing else is happening.

When you want to run distinct processes, this is awkward for many reasons: processes have full rights to trample over each other (meaning stability and security are impossible to guarantee), and they can't easily know where the others are in memory, so couldn't easily avoid each other even if they wanted to.

So e.g. Intel thought up protected mode (the mode before it was then named real mode, apparently because its addresses were real, where later they were always virtual). Protected mode allowed things like

virtual memory - letting each process deal with its own private memory space, and having the OS map those to actual memory in a way where they are kept separate
safe multi-tasking - as a result of the above, and some other details
paging - put specific parts of memory on secondary storage (this made that so much simpler to do at OS level)

Protected mode was introduced with the 286 but was awkward, so only really started seeing adoption around the 386.

For efficiency reasons, the mapping (and usually protection) step is assisted by a hardware element called the MMU (memory management unit).

See also On computer memory#Intro

On most computers, memory is both virtualized, and protected.

It is the MMU that handles both.

On some platforms memory is not virtualised, but still protected, and there is an MPU doing the latter.


Protection rings


Basically all platforms designed for memory isolation will have some sort of privilege system, and that system has to go beyond just memory (for mildly complex, lower-level reasons).

For example, on x86 there are multiple privilege levels, called protection rings, which implement the still-fairly-general idea of "you must be this privileged (or better) to do the thing."

It is effectively the CPU being told (by the kernel) when to say no to IO accesses. This is complementary to the kernel (its virtual memory manager) telling the MMU (and thereby the CPU) when to say no to memory accesses.

x86 processors have four such levels, and diagrams always make a point of this, but as far as I can tell, most OSes only actually use two of them:

"stuff the kernel can do", i.e. everything - often in ring 0
programs, which have to go via the OS to speak to most hardware anyway - often in ring 3

A few OSes have experimented with using more, and e.g. put drivers in a third ring (less privileged than the kernel, more than programs), but most have apparently decided this is not worth the extra bother, for both technical and practical reasons.

OpenVMS apparently uses four.

For more concrete implementation details, look for IOPL.

ARM processors have no rings like that, but have a fairly regular user mode, and a supervisor mode that gives more control, that can be used in much the same way(verify).


AMD-V and Intel VT-x and other hardware-assisted virtualization are regularly referred to as Ring -1.

Which they are not, because it's an entirely different mechanism from IOPL.

Yet it gets the point across: it allows multiple OSes to run unmodified, meaning they will issue privileged instructions directly as if they are in Ring 0 on the CPU, and these hardware features ensure that this functions at all, and doesn't cause conflicts or big security holes. (The alternative is paravirtualization, where OSes are aware they are running under a hypervisor, and call into that hypervisor instead of issuing privileged instructions.)

There is also a System Management Mode (SMM), referring to a CPU mode where regular OS is suspended and external firmware is running instead.

This was initially (386 era) meant for hardware debugging, later also used to implement some BIOS-level/assisted features like Advanced Power Management, USB legacy support, and a handful of other relatively specific things.

SMM is sometimes referred to as Ring -2. Also not accurate, though it could potentially be used for something resembling that.

There was a specific exploit of a specific chipset[1] that did something similar to SMM, and is referred to as Ring -3. This is even less accurate, and instead seems to be a reference to the CPU sleep state this could be exploited in(verify) and not related to IO protection.

Some footnotes:

  • VT-x and such are often disabled by default on e.g. most home PCs
partly because people theorize about Ring -1 malware: "if you don't need it, why risk it?"
on certain server hardware it's enabled by default because of how likely it is you will need it
  • there is no point for software to use rings that the OS doesn't, because you would be using a security model that the OS doesn't enforce.

Hardware virtualization


Nested virtualization

OS level - Virtualization, jails, application containers, OS containers

chroot jails

chroot() changes a process's apparent filesystem-root directory to a given path.

So only really affects how path lookups work.

This is useful for programs that want to run in an isolated environment of files. Such as a clean build system, which seems to be its original use.

At the same time, if a program wants to break out of one of these, it will manage. It was never designed as a security feature, so it isn't a security feature. See chroot for more words about that.

FreeBSD Jails

Inspired by early chroot() and the need to compartmentalize, BSD jails actually did have some security in mind.

It only lets processes talk to others in the same jail, and considers syscalls, sockets/networking and some other things.

It's been around for a while, is mature, and allows decently fine-grained control.

Solaris/illumos Zones

If you come from a Linux angle: these are much like LXC (verify), and were mature before it.

Due to the Solaris pedigree, this combines well with things like ZFS's snapshots and cloning.

Linux containers

There are a few kernel features that isolate, monitor, and limit processes.

So when we say that processes running in containers are really just processes on the host, that's meant literally. The main way in which they differ is that by default they can share/communicate nothing with processes that are not part of the same container.

While these isolations are a "pick which you want" sliding scale, 'Linux containers' basically refers to using all of them to isolate things in all the ways that matter to security, often with a toolset to keep your admin person sane, and usually with further convenience tools. Docker is one such toolset.

The main two building blocks that make this possible are namespaces and cgroups.

Just those two would probably still be easier to break out of than a classical VM, so when you care about security you typically supplement that with

capabilities, intuitively understood as fine-grained superuser rights
seccomp, which filters allowed syscalls,
SELinux / AppArmor / other mandatory access control. Often not necessary, but still good for peace of mind.
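
To make "fine-grained superuser rights" a little more concrete: on Linux, a process's capability sets show up as hex bitmasks in /proc/<pid>/status (the CapEff line is the effective set), and each capability is one bit in that mask. The sketch below tests a single bit. The helper name has_cap is made up for this example, and bit 21 being CAP_SYS_ADMIN is taken from the kernel's linux/capability.h; treat it as an illustration of the idea, not as a replacement for proper tools.

```shell
# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask MASK.
# (has_cap is a hypothetical helper name, not an existing tool.)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capability mask as inherited from this shell, e.g. all zeros for
# an ordinary unprivileged user. Empty if /proc is not available.
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)

# Bit 21 is CAP_SYS_ADMIN in linux/capability.h.
if has_cap "${cap_eff:-0}" 21; then
    echo "this shell has CAP_SYS_ADMIN"
else
    echo "this shell lacks CAP_SYS_ADMIN"
fi
```

Inside a typical container, root tends to have a noticeably smaller mask than root on the host, which is the point of the mechanism.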

(Note that a good number of these fundamentals resemble BSD jails and illumos zones)


Various other names around this area - docker, LXC, kubernetes, rkt, runC, systemd-nspawn, OpenVZ - are tools around the above - runtimes, management, provisioning, configuring, etc.

Some of them the gears and duct tape, some of them the nice clicky interfaces around that, some of them aimed at different scales, or different purposes.

rkt and runC focus just on running containers (lightweight container runtimes; runC is a fairly minimal wrapper around libcontainer)

containerd - manages image transfer, lifecycle, storage, networking, execution via runc

docker is built on top of containerd (and indirectly runc). It is aimed at making single-purpose containers, at building them, and at portability.

LXC shares history, libraries, and other code with docker, but is aimed more at being a multitenant machine virtualisation thing.

While you could e.g. do microservices with LXC, and a fleshed-out OS in docker, both are more work and bother and tweaking to get them quite right (e.g. docker avoids /sbin/init not only because it's not necessary, but also because standard init does some stuff, like setting the default gateway route, that the inside of a container would have to work around).

kubernetes focuses almost purely on orchestrating (automated deployment, scaling, and other management of) systems across one or more hosts.

See also:


Linux KVM

From a hosting perspective

A few notes on...


Namespaces limit what you can see of a specific type of resource, by implementing a mapping from within-a-container resources to the host's.

This is a cheap way to have each container see only its own resources, and to have the host manage these as distinct subsets.

Linux has grown approximately six of these so far (later kernels add a few more, such as the cgroup and time namespaces):

  • PID - allows containers to have distinct process trees
  • user - user and group IDs. (E.g. allows UID 0 (root) inside a container to be non-root on the host)
  • mount - can have its own filesystem root (chroot-alike), can have own mounts (e.g. useful for /tmp)
  • network - network devices, addresses and routing, sockets, ports, etc.
some interesting variations/uses
  • UTS - mostly for nodename and domainname
  • IPC - for SysV IPC and POSIX message queues

For example, you can

sudo unshare --fork --pid --mount-proc bash

which separates the processes' PID namespace (and, because --mount-proc has to remount /proc, the mount namespace). You can run top (because your filesystem is still there as before), you can talk to the network as before, etc. -- but no matter what you do you'll only see the processes you started under this bash.
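
One way to check what a command like that actually separated: every process exposes its namespaces as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the link targets are identical. A minimal sketch, assuming a Linux /proc (ns_id is a hypothetical helper name):

```shell
# ns_id TYPE [PID]: print the identifier of a namespace a process belongs to.
# TYPE is one of pid, mnt, net, uts, ipc, user; PID defaults to this shell.
ns_id() {
    readlink "/proc/${2:-$$}/ns/$1"
}

# List this shell's namespaces, one per line, in the form "pid   pid:[NUMBER]".
for ns_type in pid mnt net uts ipc user; do
    printf '%-5s %s\n' "$ns_type" "$(ns_id "$ns_type")"
done
```

Run it once in a normal shell and once inside the unshare'd bash: the lines that differ are the namespaces that were separated, the rest are still shared with the host.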

See also namespaces [2]


The control groups (cgroups) concept is about balancing and metering resources.

While they became widely known through containers, cgroups apply to all of the host and not just containers; they just effectively default to no limits.

They're a useful tool to limit how crazy a known-to-misbehave app can go, without going near anything resembling namespaces or containers.

And/or just to report how much a process or group of them has used.

Resource types (which cgroups calls subsystems, or controllers) include:

  • Memory - heap and stack and more (interacts with page cache(verify))
allows OOM killer to work per group
this makes sense for containers, but also for isolating likely/known miscreants and not have OOM killer accidentally shoot someone else in the foot
  • cpu - mostly sets weights (shares); hard limits are possible via CFS bandwidth quotas
  • cpuset - pin to specific CPUs
  • blkio (block IO) - limits and/or weights
also network, but does little more than tagging for egress
  • devices
  • hugeTLB
  • freezer - allows pausing in a cgroup

Cgroups set up a hierarchy for each of these (cgroup v2 instead uses one unified hierarchy). Nodes in each hierarchy refer to (a group of) processes.

The API for cgroups is a filesystem, /sys/fs/cgroup. That being verbose and finicky, there are now nicer tools for it.
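
As a small illustration of that filesystem API, assuming cgroup v2 (the unified hierarchy; that is an assumption about your system, v1 hosts lay things out per subsystem): a process's group is listed in /proc/<pid>/cgroup as a line of the form 0::/some/path, and that path is relative to the cgroup mount. The helper name cgroup_v2_path is made up for this sketch.

```shell
# cgroup_v2_path: read the contents of a /proc/<pid>/cgroup file on stdin and
# print the cgroup v2 path, i.e. whatever follows the "0::" prefix.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Which group does this shell live in? (Empty if there is no v2 entry.)
cg_path=$(cgroup_v2_path 2>/dev/null < /proc/self/cgroup)
echo "cgroup: ${cg_path:-(no cgroup v2 entry found)}"

# The controller files for that group sit under the cgroup mount, e.g.
# memory.current (bytes in use) and memory.max (the limit, "max" if none).
cg_dir="/sys/fs/cgroup${cg_path}"
for f in memory.current memory.max; do
    if [ -r "$cg_dir/$f" ]; then
        printf '%s: %s\n' "$f" "$(cat "$cg_dir/$f")"
    fi
done
```

Reading is enough for the metering use; setting limits means writing to files like these, which generally needs root or a delegated subtree.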

See also cgroups


More specific software

(...rather than things that mostly just name the building blocks)




Linux OpenVZ

See also:

Qemu notes


Usage notes:

  • VNC is handy (while installing and in general), since you get a view on the entire machine, including BIOS and bootup.
  • If the mouse in your VNC window seems to be wrong (offset from the real mouse position), this is likely because of motion acceleration applied by the guest OS. An easy way to fix this is to get absolute positioning by making the mouse device report itself tablet-style: use -usbdevice tablet when running qemu.
  • The easiest VM-to-host networking option is probably -net nic,vlan=1 -net user,vlan=1 (see networking notes below on what that actually means)
  • -no-acpi may be useful for XP or anything else that tries to use ACPI a lot if present (but will run fine without it); it seems that emulating ACPI the way XP and some other things use it takes an unnecessarily large amount of host CPU.

On networking

Qemu can emulate network cards in the guest (with NIC model and MAC address configurable) and connects each to its own VLAN - which is basically like a virtual router, giving you flexibility when interconnecting. You can, for example, connect each guest to other guests and/or to the host networking.

Of course, many of us just have a single guest and want to give it internet access. This means the default, which is equivalent to -net nic,vlan=1 -net user,vlan=1, is good. That means "emulate a network card in the guest, connect it to VLAN1; connect qemu's usermode networking (to the host) to VLAN1 as well". In other words, this setup makes VLAN1 a simple interconnect between the VM and the host.

You can get various possible interconnections. Some of the options:

  • Qemu VLAN to host via usermode network stack
    • easy gateway to the outside (includes things like a basic in-qemu DHCP server), without needing any interface bother on the host side, nor much bother in the client
    • allows port redirects to the inside, which can be handy
  • Qemu VLAN to host via tun/tap
  • Qemu VLAN to host via VDE (usermode tool that manages tun/tap)
  • Qemu VLAN to other Qemu VLAN

See also:

On images

Some older versions of Qemu have a few problems with qcow type images, and will report that they can't open the disk image (which would normally point to permission problems). Use another type, or use a different version of qemu. If you have such a problematic image, you can fix it by converting it to another type(verify) (using qemu-img).

Note that formats like qcow and qcow2 are sparse, but things like defragmenting and wiping in the guest OS will cause space not actually used by the guest filesystem data to be used on the host side. If you want to crimp it again, you can often shrink the image file by doing something like:

qemu-img convert -f qcow2 windows.img -O qcow2 windows-shrunken.img

(...and then swap them)

Growing or shrinking guest disk size

Possible, but some work as you're both resizing the filesystem and the disk that the guest OS sees.

Also, while most systems besides Windows seem to deal with this well enough, Windows copies part of the disk geometry into its boot code to identify the boot disk later, which causes problems in some cases (possibly when you resize across certain sizes, seemingly two-power sizes?(verify)). The error you'll get is the rather uninformative "A disk read error occurred". You can fix this by editing the MBR (probably easiest while it's still raw format, depending a little on how you do it), and there are probably some ways tools like fdisk might fix this too.

As to the actual resizing, the easiest option is probably to go via raw format. For example:

  • (if shrinking:) shrink the partition within the VM so you know for sure that only the first so-many bytes of the (virtual) disk is being used. You'll need something which actually moves the data stored on the partition (gparted is an easy enough and free option).
  • convert the image to a raw-type image
  • truncate or extend the raw file (representing-a-disk) as you want. It's probably easiest to use dd, truncating and/or adding as necessary (look at dd's bs, count, and such)
  • convert this new raw image to your favorite image format
  • use gparted within the VM to resize the filesystem to the new disk size

For example, I started with a 6-gig image (initially created using qemu-img create windows.qcow2 -f qcow2 6G).

After installing windows, and stripping Windows down to take ~1.6GB I decided that a 3GB virtual disk would be enough, so:

  • used gparted (within qemu, starting its liveCD using -boot d -cdrom gparted.iso) to resize the partition to ~2.7GB (under 3GB to avoid problems from rounded numbers - I could've worked exactly instead).
  • converted the image to raw:
qemu-img convert -f qcow2 windows.qcow2 -O raw windows.raw
  • Copied the first ~3G to a new raw image. (Note that the following actually specifies 3000MB, ~2.9G, but this is still very comfortably more than what is actually used, since the gparted we did means that only the first ~2.7GB of that 6G raw image is used by the partition)
dd if=windows.raw of=win3g.raw bs=10M count=300
  • converted this truncated raw to a qcow2 image again using qemu-img
qemu-img convert -f raw win3g.raw -O qcow2 win3g.qcow2
  • started the VM with this new disk image, and booted the gparted liveCD again to grow the partition to the actual new disk size.
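
The dd step is the one where a size mistake is easiest to make, so here is the same truncate-and-extend idea on a scratch file, scaled down from gigabytes to megabytes so it runs instantly. It only touches plain files (no qemu needed). The file names, and the use of truncate for the growing case, are choices made for this sketch rather than something the steps above prescribe.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Stand-in for the converted raw image: a sparse 6 MiB file of zeros.
truncate -s 6M "$workdir/disk.raw"

# Shrinking: copy only the first bs*count bytes (here 1 MiB * 3 = 3 MiB).
dd if="$workdir/disk.raw" of="$workdir/small.raw" bs=1M count=3 2>/dev/null

# Growing: extend a raw file in place; the added space reads as zeros.
cp "$workdir/small.raw" "$workdir/grown.raw"
truncate -s 8M "$workdir/grown.raw"

# Show the resulting sizes in bytes.
wc -c "$workdir"/*.raw
```

The same arithmetic applies to the real command: bs=10M count=300 copies 10 MiB 300 times, so anything the partition keeps beyond that offset would be cut off.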

Experiment with XP


Choose the memory you'll need ahead of time. Windows can be picky about hardware changes, deciding it's now in a different computer. The below uses 384MB, which is comfortable enough for basic use.

  • create drive image (max 5GB), here of type qcow2
qemu-img create windows.img -f qcow2 5G
  • While installing windows:
qemu -no-acpi -localtime -usbdevice tablet -net nic,vlan=1 -net user,vlan=1 windows.img -m 384 -vnc :2 -boot d -cdrom winxpsetup.iso

Now you can connect a VNC client to :2 and watch it boot, install windows, and such.

Once windows is installed, the CDROM isn't really necessary (arguably it can't hurt as windows can always find any files needed for extra installation).

You may wish to configure XP to listen for Remote Desktop and use that instead of (or in addition to) VNC (note: you may wish to add a password to VNC access). To use remote desktop you'll also need to forward a port to the inside.

The command I use for regular runs of this VM:

qemu -no-acpi -localtime -usbdevice tablet -net nic,vlan=1 -net user,vlan=1 windows.img -m 384 -redir tcp:3389::3389
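
The options in these commands are old-style, and newer qemu releases have deprecated or removed several of them (-net ...,vlan=, -redir, -localtime, -no-acpi), so check against your version. A rough newer-style equivalent of the regular-run command is sketched below. The function name, the e1000 NIC model, the net0 id and the qemu-system-i386 binary name are assumptions for illustration, and the function prints the command instead of running it so it can be compared with your version's documentation first.

```shell
# qemu_xp_cmd IMAGE: print (not run) a command line for the XP guest using
# newer-style options. The name qemu_xp_cmd is made up for this sketch.
#   -machine acpi=off            instead of -no-acpi
#   -rtc base=localtime          instead of -localtime
#   -device usb-tablet           instead of -usbdevice tablet
#   -netdev user + -device       instead of -net nic,vlan=1 -net user,vlan=1
#   hostfwd=tcp::3389-:3389      instead of -redir tcp:3389::3389
qemu_xp_cmd() {
    image=$1
    echo qemu-system-i386 \
        -machine acpi=off \
        -rtc base=localtime \
        -usb -device usb-tablet \
        -m 384 \
        -netdev "user,id=net0,hostfwd=tcp::3389-:3389" \
        -device "e1000,netdev=net0" \
        -drive "file=${image},format=qcow2"
}

qemu_xp_cmd windows.img
```

Once the printed line looks right for your setup, it can be run as-is or pasted into a start script.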





Basically the combination of:

  • LXC for containers
  • KVM for fuller virtualization



Qubes OS