Virtualization, emulation, simulation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Notes related to (mostly whole-computer) virtualization, emulation and simulation.

Some overview · Docker notes · Qemu notes

Virtualization, jails, application containers, OS containers

chroot jails

chroot() changes a process's apparent filesystem-root directory to a given path.

So it really only affects how path lookups work.

This is useful for programs that want to run against an isolated set of files, such as a clean build environment, which seems to have been its original use.
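
For example, here is a rough sketch of entering a chroot (it assumes you have a statically linked shell such as busybox to copy in; the paths are made up for illustration):

# build a tiny root with just a static busybox in it
mkdir -p /tmp/tinyroot/bin
cp /bin/busybox /tmp/tinyroot/bin/
# everything run under this chroot sees /tmp/tinyroot as /
sudo chroot /tmp/tinyroot /bin/busybox sh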


At the same time, if a program wants to break out of this, it will manage. It was never designed as a security feature, so it isn't one. See chroot for more words about that.


FreeBSD Jails

Inspired by early chroot() and the need to compartmentalize, BSD jails actually did have some security in mind.

It only lets processes talk to others in the same jail, and it restricts syscalls, sockets/networking, and some other things.

It's been around for a while, is mature, and allows decently fine-grained control.
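
A rough sketch of starting one with the jail(8) parameter syntax (the name, path, and address here are made up, and the path is assumed to already contain an installed FreeBSD userland):

# start a shell inside a new jail rooted at /srv/jails/test
jail -c name=test path=/srv/jails/test host.hostname=test.example ip4.addr=192.0.2.10 command=/bin/sh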


Solaris/illumos Zones

If you come from a Linux angle: these are much like LXC (verify), and were mature before it.

Due to the Solaris pedigree, this combines well with things like ZFS's snapshots and cloning.


Linux containers

There are a few kernel features that isolate, monitor, and limit processes.


So when we say that processes running in containers are really just processes on the host, that's meant literally. The main way in which they differ is that by default they can share/communicate nothing with processes that are not part of the same container.
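
A quick way to see that for yourself (a sketch; it assumes docker and the alpine image, and the container name is arbitrary):

# start a container that just sleeps
docker run -d --rm --name demo alpine sleep 600
# from the host, that sleep shows up as an ordinary process in the host's process table
ps -ef | grep 'sleep 600'
# clean up
docker stop demo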


While these isolations are a "pick which you want" sliding scale, 'Linux containers' basically refers to using all of them, to isolate things in all the ways that matter to security ...and often with a toolset to keep your admin person sane, and usually further convenience tools. Docker is one such toolset.


The main two building blocks that make this possible are namespaces and cgroups.

Just those two would probably still be easier to break out of than a classical VM, so when you care about security you typically supplement that with

  • capabilities, intuitively understood as fine-grained superuser rights
  • seccomp, which filters allowed syscalls
  • SELinux / AppArmor / other mandatory access control. Often not necessary, but still good for peace of mind.

(Note that a good number of these fundamentals resemble BSD jails and illumos zones.)
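
As one hedged example of how toolsets expose these, docker lets you set capabilities and a seccomp profile per container (these are standard docker run options, but the profile file and the exact choices here are purely illustrative):

docker run --rm \
  --cap-drop ALL --cap-add NET_BIND_SERVICE \
  --security-opt seccomp=myprofile.json \
  alpine id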


Uses

Various other names around this area - docker, LXC, kubernetes, rkt, runC, systemd-nspawn, OpenVZ - are tools around the above - runtimes, management, provisioning, configuring, etc.

Some of them are the gears and duct tape, some the nice clicky interfaces around that, and some are aimed at different scales or different purposes.


rkt and runC focus just on running (lightweight container runtimes, fairly minimal wrappers around libcontainer)

containerd - manages image transfer, lifecycle, storage, networking, execution via runc

docker is built on top of containerd (and indirectly runc), and is aimed at single-purpose containers, at building them, and at portability.

LXC shares history, libraries, and other code with docker, but is aimed more at being a multitenant machine virtualisation thing.

While you can e.g. do microservices with LXC, and a fleshed-out system in docker, that's more work and bother and tweaking to get quite right (e.g. docker avoids /sbin/init not only because it's not necessary, but also because init does some stuff that you have to work around, like setting the default gateway route).

kubernetes focuses almost purely on orchestrating (automated deployment, scaling, and other management) systems within one or more hosts.


See also:


Linux OpenVZ

See also:

Hardware emulation, simulation, and virtualization - VM stuff

"Here is a thing that acts like hardware X. Put on that whatever you want" can be done in various ways.


Simulation and emulation

Whenever you can't run the code directly on the silicon, you're talking about simulation and emulation.


Same-architecture emulation, virtualization

Virtual machines

See Comparison of virtual machines (wikipedia).


LXC notes

A few notes on...

namespaces

Namespaces limit what you can see of a specific type of resource, by implementing a mapping between within-a-container resources and the host's.


This is a cheap way to let each container see its own set of resources, and have the host manage these as distinct subsets.

Linux has grown approximately six of these so far:

  • PID - allows containers to have distinct process trees
  • user - user and group IDs. (E.g. allows UID 0 (root) inside a container to be non-root on the host)
  • mount - can have its own filesystem root (chroot-alike), can have own mounts (e.g. useful for /tmp)
  • network - network devices, addresses and routing, sockets, ports, etc. (has some interesting variations/uses)
  • UTS - mostly for nodename and domainname
  • IPC - for SysV IPC and POSIX message queues


For example, you can

sudo unshare --fork --pid --mount-proc bash

which separates only that process's PID namespace. You can run top (because your filesystem is still there as before), you can talk to the network as before, etc. -- but no matter what you do, you'll only see the processes you started under this bash.
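
You can also inspect which namespaces a process is in via the standard proc interface, or with lsns (part of util-linux):

# each namespace a process belongs to shows up as a symlink with an id
ls -l /proc/$$/ns
# lsns lists namespaces system-wide, or for one process with -p
lsns -p $$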


See also namespaces [1]

cgroups

The control groups concept is about balancing, limiting, and metering resources.


Note that cgroups apply to the whole host and not just containers (they just effectively default to no limits).

They're a useful tool to limit how crazy a known-to-misbehave app can go, without going near anything resembling namespaces or containers.
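
On systemd systems, a convenient way to do that is systemd-run, which puts a command in its own transient cgroup scope (the property names are from systemd.resource-control(5); the command and the limit values here are made up):

# run a command in a transient scope with a memory ceiling and a CPU quota
sudo systemd-run --scope -p MemoryMax=500M -p CPUQuota=50% ./some-misbehaving-app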


Resource types (which cgroups call subsystems) include:

  • memory - heap and stack and more (interacts with the page cache(verify))
allows the OOM killer to work per group
which is part of the "put one service in a container" suggestion
  • cpu - sets weights, not limits
  • cpuset - pins processes to specific CPUs
  • blkio (block IO) - limits and/or weights
there are also network ones (net_cls, net_prio), but they do little more than tagging for egress
  • devices
  • hugetlb
  • freezer - allows pausing all processes in a cgroup


Cgroups set up a hierarchy for each of these. Nodes in each hierarchy refer to (a group of) processes.

The API for cgroups is a filesystem, /sys/fs/cgroup, which, being verbose and a little finicky, there are now nicer tools for.
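
A minimal sketch of that filesystem interface, assuming a cgroup-v1 style layout (the group name is arbitrary; on a unified/v2 hierarchy the paths and file names differ):

# create a group under the memory controller and give it a limit
sudo mkdir /sys/fs/cgroup/memory/demo
echo 100M | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# move the current shell (and anything it starts) into that group
echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs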


See also

https://en.wikipedia.org/wiki/Cgroups cgroups
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
https://www.youtube.com/watch?v=sK5i-N34im8


Hardware level

Protection rings

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Protection rings can be understood as the general idea of "you must be this privileged (or better) to do the thing."


For example, x86 processors have four such levels (ring 0 through ring 3). One concrete place they show up is port IO: the IOPL mechanism decides which rings may do IO directly.

For more concrete implementation details, look for IOPL.

In other words, the CPU is restricting IO that its boss (the kernel) says shouldn't happen (complementary to the VMM/MMU/MPU's job of protecting against memory accesses that shouldn't happen).


Most OSes only make a split between:

  • "stuff the kernel can do", i.e. everything
  • programs, which have to go via the OS to speak to most hardware anyway

A few OSes have experimented with more, and put drivers in a third ring (less privileged than the kernel, more than programs), but most decided this is not worth the extra bother (for both technical and practical reasons).

So most OSes have landed on using only two: typically the kernel in ring 0, userspace programs in ring 3, and no other rings used.


not-really-rings

AMD-V and Intel VT-x and other hardware-assisted virtualization are regularly referred to as Ring -1.

Which it isn't, because it's an entirely different mechanism from the rings, but it gets the point across: it allows multiple operating systems to run as if each were Ring 0 on the CPU (and to do so unmodified).


There is also System Management Mode (SMM), referring to a CPU mode in which the regular OS is suspended and firmware code runs instead.

This was initially (386 era) meant for hardware debugging, later also used to implement some BIOS-level/assisted features like Advanced Power Management, USB legacy support, and a handful of other relatively small things.

It is sometimes referred to as Ring -2.

Also not accurate, though it could theoretically be used for something like that.


There was a specific exploit of a specific chipset[2] that did something similar to SMM, and is referred to as Ring -3. That's even less accurate, and instead seems to be a reference to the CPU sleep state this could be exploited in.(verify)



Some footnotes:

  • VT-x and such are often disabled by default on e.g. most home PCs
partly because people theorize about Ring -1 malware, "...so if you don't need it, why risk it?"
on certain types of servers it's so likely you need it that it's enabled by default
  • there is no point for software to use rings that the OS doesn't, because you would be using a different security model that the OS wouldn't enforce.


Memory protection

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

On most computers, memory is both virtualized and protected; it is the MMU that handles both.

(On some platforms memory is not virtualised, but still protected, and there is an MPU doing the latter)

Cloudy environments

Openstack notes

OpenStack is designed (by Rackspace and NASA) as a coherent set of components to set up a compute cloud.


libcloud notes