Virtualization, emulation, simulation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
Notes related to (mostly whole-computer) virtualization, emulation and simulation.


Hardware level - emulation, simulation, and hardware virtualization

Generally,

  • simulation means imitating the resulting behaviour
    • sometimes just the surface level is enough (but that's often so specific-purpose that it's too much work)
    • so you often go deeper - stopping at the highest level that you can get away with and that doesn't actually make things harder


  • emulation means replicating how the internals work
    • as precisely as sensible or required, often at a much lower level
    • either out of necessity (e.g. emulating how an old CPU works so you can run anything on it)
    • or out of preference (e.g. emulating imperfections as well as the specs)



Yes, this is regularly something of a sliding scale, but it turns out a lot of cases fall mostly on one side or the other.





When hardware is compatible, and software environment is quite similar

When software environment is quite different

When hardware, or software environment, is completely different

Same-architecture emulation, virtualization - virtual machines

Tangent: protection

Protection is not virtualization, but plays a useful part in it - and arguably a pretty necessary one in practice.

Memory protection

Protection rings

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Protection rings are the general idea of "you must be this privileged (or better) to do the thing."


Basically all platforms designed for memory isolation will have some sort of privilege system.


For example, on x86 there are multiple privilege levels, called rings.

It is effectively the CPU being told (by the kernel) when to say no to IO accesses - which is complementary to the kernel (VMM) telling the MMU (and then CPU) when to say no to memory accesses.


x86 processors have four such levels, and diagrams always make a point of this, but as far as I can tell, most OSes only actually use two of them:

  • "stuff the kernel can do", i.e. everything - often in ring 0
  • programs, which have to go via the OS to speak to most hardware anyway - often in ring 3


A few OSes have experimented with using more, and e.g. put drivers in a third ring (less privileged than the kernel, more than programs), but most have apparently decided this is not worth the extra bother, for both technical and practical reasons.

OpenVMS apparently uses four.


For more concrete implementation details, look for IOPL.


ARM processors have no rings like that, but have a fairly regular user mode, and a supervisor mode that gives more control, which can be used in much the same way(verify).

not-really-rings

AMD-V and Intel VT-x and other hardware-assisted virtualization are regularly referred to as Ring -1.

Which they are not, because it's an entirely different mechanism from IOPL.

Yet it gets the point across: it allows multiple OSes to run, unmodified, while each can assume and act as if it is Ring 0 on the CPU, and the hardware still ensures this doesn't cause conflicts, or big security holes.


There is also a System Management Mode (SMM), referring to a CPU mode where the regular OS is suspended and external firmware is running instead.

This was initially (386 era) meant for hardware debugging, later also used to implement some BIOS-level/assisted features like Advanced Power Management, USB legacy support, and a handful of other relatively specific things.

SMM is sometimes referred to as Ring -2. Also not accurate, though it could potentially be used for something resembling that.


There was a specific exploit of a specific chipset[1] that did something similar to SMM, and is referred to as Ring -3. This is even less accurate, and instead seems to be a reference to the CPU sleep state this could be exploited in(verify), and not related to IO protection.


Some footnotes:

  • VT-x and such are often disabled by default on e.g. most home PCs
    • partly because people theorize about Ring -1 malware, "...so if you don't need it, why risk it?"
    • on certain server hardware it's enabled by default, because of how likely it is you will need it
    • (a quick check of what your CPU advertises follows below)
  • there is no point for software to use rings that the OS doesn't, because you would be using a security model that the OS doesn't enforce.
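
On Linux, a rough check of whether the CPU advertises hardware-assisted virtualization (a sketch; note that the flag being present doesn't necessarily mean firmware has it enabled - if it's off, loading the kvm module typically complains in dmesg):

grep -cE 'vmx|svm' /proc/cpuinfo    # nonzero means VT-x (vmx) or AMD-V (svm) is advertised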

Hardware virtualization

Paravirtualization

Nested virtualization

OS level - Virtualization, jails, application containers, OS containers

chroot jails

chroot() changes a process's apparent filesystem-root directory to a given path.

So it only really affects how path lookups work.

This is useful for programs that want to run in an isolated environment of files - such as a clean build system, which seems to be its original use.


At the same time, if a program wants to break out of this, it will manage. It was never designed as a security feature, so it isn't a security feature. See chroot for more words about that.
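
A minimal sketch of using it (the paths here are just examples, and it assumes a statically linked busybox so there are no libraries to copy in):

mkdir -p /srv/buildroot/bin
cp /bin/busybox /srv/buildroot/bin/           # static binary, so nothing else to copy
sudo chroot /srv/buildroot /bin/busybox sh    # path lookups in this shell now resolve under /srv/buildroot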


FreeBSD Jails

Inspired by early chroot() and the need to compartmentalize, BSD jails actually did have some security in mind.

It only lets processes talk to others in the same jail, and considers syscalls, sockets/networking and some other things.

It's been around for a while, is mature, and allows decently fine-grained control.
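
A rough sketch of starting one (path, hostname, and address are just examples):

jail -c path=/jails/www host.hostname=www.example.org ip4.addr=192.168.0.10 command=/bin/sh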


Solaris/illumos Zones

If you come from a linux angle: these are much like LXC (verify), and were mature before it.

Due to the Solaris pedigree, this combines well with things like ZFS's snapshots and cloning.


Linux containers

There are a few kernel features that isolate, monitor, and limit processes.


So when we say that processes running in containers are really just processes on the host, that's meant literally. The main way in which they differ is that by default they can share/communicate nothing with processes that are not part of the same container.


While these isolations are a "pick which you want" sliding scale, 'Linux containers' basically refers to using all of them, to isolate things in all the ways that matter to security - and often with a toolset to keep your admin person sane, and usually further convenience tools. Docker is one such toolset.


The main two building blocks that make this possible are namespaces and cgroups.

Just those two would probably still be easier to break out of than a classical VM, so when you care about security you typically supplement that with

capabilities, intuitively understood as fine-grained superuser rights
seccomp, which filters allowed syscalls,
SELinux / AppArmor / other mandatory access control. Often not necessary, but still good for peace of mind.

(Note that a good number of these fundamentals resemble BSD jails and illumos zones)
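
Toolsets like Docker expose a fair few of these knobs fairly directly. A rough sketch (the image name and the limits are just examples):

# --memory / --cpus / --pids-limit go via cgroups, --cap-drop uses capabilities,
# and docker applies a default seccomp profile on top
docker run --rm -it --memory=512m --cpus=1 --pids-limit=100 --cap-drop=ALL alpine sh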


Uses

Various other names around this area - docker, LXC, kubernetes, rkt, runC, systemd-nspawn, OpenVZ - are tools around the above - runtimes, management, provisioning, configuring, etc.

Some of them are the gears and duct tape, some the nice clicky interfaces around that, some aimed at different scales, or different purposes.


rkt and runC focus just on running containers (lightweight container runtimes, fairly minimal wrappers around libcontainer)

containerd - manages image transfer, lifecycle, storage, networking, execution via runc

docker is built on top of containerd (and indirectly runc), and is aimed at making single-purpose containers, at building them, and at portability.

LXC shares history, libraries, and other code with docker, but is aimed more at being a multitenant machine virtualisation thing.

While you could e.g. do microservices with LXC, and a fleshed-out OS in docker, both take more work, bother, and tweaking to get quite right (e.g. docker avoids /sbin/init not only because it's not necessary, but also because standard init does some stuff, like setting the default gateway route, that the inside of a container would have to work around).

kubernetes focuses almost purely on orchestrating (automated deployment, scaling, and other management) systems within one or more hosts.


See also:

Xen

Linux KVM

From a hosting perspective

A few notes on...

namespaces

Namespaces limit what you can see of a specific type of resource, by implementing a mapping between within-a-container resources and the host's.


This allows a cheap way to have containers see their own - and have the host manage these as distinct subsets.

Linux has grown approximately six of these so far:

  • PID - allows containers to have distinct process trees
  • user - user and group IDs. (E.g. allows UID 0 (root) inside a container to be non-root on the host)
  • mount - can have its own filesystem root (chroot-alike), can have own mounts (e.g. useful for /tmp)
  • network - network devices, addresses and routing, sockets, ports, etc. (with some interesting variations/uses)
  • UTS - mostly for nodename and domainname
  • IPC - for SysV IPC and POSIX message queues


For example, you can

sudo unshare --fork --pid --mount-proc bash

which separates off only the new process's PID namespace. You can run top (because your filesystem is still there as before), you can talk to the network as before, etc. - but no matter what you do, you'll only see the processes you started under this bash.
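
Similarly, a user namespace lets you appear to be root inside while staying an unprivileged user on the host - a sketch, using util-linux's unshare:

unshare --user --map-root-user bash
id    # reports uid 0 inside, but it maps to your ordinary user on the host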


See also namespaces [2]

cgroups

The control groups concept is about balancing and metering resources.

While they became commonly known through containers, cgroups apply to all of the host and not just containers - they just effectively default to no limits.


They're a useful tool to limit how crazy a known-to-misbehave app can go, without going near anything resembling namespaces or containers.

And/or just to report how much a process or group of them has used.


Resource types (which cgroups calls subsystems) include:

  • Memory - heap and stack and more (interacts with page cache(verify))
    • allows the OOM killer to work per group
    • this makes sense for containers, but also for isolating likely/known miscreants without having the OOM killer accidentally shoot someone else in the foot
  • cpu - sets weights, not limits
  • cpuset - pin to a CPU
  • blockIO - limits and/or weights
  • network - also possible, but does little more than tagging for egress
  • devices
  • hugeTLB
  • freezer - allows pausing the processes in a cgroup


Cgroups set up a hierarchy for each of these. Nodes in each hierarchy refer to (a group of) processes.

The API for cgroups is a filesystem, /sys/fs/cgroup - which, being verbose and finicky, there are now nicer tools for.
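
As a rough sketch of using that filesystem directly to cap a program's memory (this assumes a v1-style memory controller mounted at /sys/fs/cgroup/memory; 'leaky' and the app name are just examples):

sudo mkdir /sys/fs/cgroup/memory/leaky
echo 512M | sudo tee /sys/fs/cgroup/memory/leaky/memory.limit_in_bytes
echo $$   | sudo tee /sys/fs/cgroup/memory/leaky/cgroup.procs    # move this shell (and its children) into the group
./leaky-app    # now capped at ~512MB, and OOM-killed per group if it goes over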


See also

https://en.wikipedia.org/wiki/Cgroups
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
https://www.youtube.com/watch?v=sK5i-N34im8

LXC

More specific software

(...rather than things that mostly just name thing building blocks)


Virtualbox

VMware

Parallels

Linux OpenVZ

See also:

https://en.wikipedia.org/wiki/OpenVZ

Qemu

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Usage notes:

  • VNC is handy (while installing and in general), since you get a view on the entire machine, including BIOS and bootup.
  • If the mouse in your VNC window seems to be wrong (offset from the real mouse position), this is likely because of motion acceleration applied by the guest OS. An easy way to fix this is to get absolute positioning by making the mouse device report itself tablet-style: use -usbdevice tablet when running qemu.
  • The easiest VM-to-host networking option is probably -net nic,vlan=1 -net user,vlan=1 (see networking notes below on what that actually means)
  • -no-acpi may be useful for XP or anything else that tries to use ACPI a lot if present (but will run fine without it); it seems that emulating ACPI for the way XP and some other things use it takes unnecessarily much host CPU.


On networking

Qemu can emulate network cards in the guest (with NIC model and MAC address configurable) and connects them to its own VLAN - which is basically like a virtual router, giving you flexibility when interconnecting. You can, for example, connect each guest to other guests and/or to the host networking.


Of course, many of us just have a single guest and want to give it internet access. This means the default, which is equivalent to -net nic,vlan=1 -net user,vlan=1, is good. That means "emulate a network card in the guest, connect it to VLAN1; connect qemu's usermode networking (to the host) to VLAN1 as well". In other words, this setup makes VLAN1 a simple interconnect between the VM and the host.


You can get various possible interconnections. Some of the options:

  • Qemu VLAN to host via usermode network stack
    • easy gateway to the outside (includes things like a basic in-qemu DHCP server), without needing any interface bother on the host side, nor much bother in the client
    • allows port redirects to the inside, which can be handy
  • Qemu VLAN to host via tun/tap (a rough sketch of this follows below)
  • Qemu VLAN to host via VDE (usermode tool that manages tun/tap)
  • Qemu VLAN to other Qemu VLAN
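
A rough sketch of the tun/tap option (interface name and addresses are just examples; host side shown with iproute2):

# on the host:
sudo ip tuntap add dev tap0 mode tap user $USER
sudo ip addr add 192.168.50.1/24 dev tap0
sudo ip link set tap0 up
# then hand that interface to qemu:
qemu -net nic,vlan=1 -net tap,vlan=1,ifname=tap0,script=no,downscript=no windows.img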


See also:

http://en.wikibooks.org/wiki/QEMU/Networking
http://en.opensuse.org/Qemu_networking

On images

http://en.wikibooks.org/wiki/QEMU/Images


Some older versions of Qemu have a few problems with qcow type images, and will report that they can't open the disk image (which would normally point to permission problems). Use another type, or use a different version of qemu. If you have such a problematic image, you can fix it by converting it to another type(verify) (using qemu-img).


Note that formats like qcow and qcow2 are sparse, but things like defragmenting and wiping in the guest OS will cause space not actually used by the guest filesystem to be used on the host side. If you want to compact it again, you can often shrink the image file by doing something like:

qemu-img convert -f qcow2 windows.img -O qcow2 windows-shrunken.img

(...and then swap them)
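
To see what an image claims versus what it actually occupies on the host, qemu-img can report both:

qemu-img info windows-shrunken.img    # shows virtual size and disk size
du -h windows-shrunken.img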


Growing or shrinking guest disk size
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Possible, but it's some work, as you're resizing both the filesystem and the disk that the guest OS sees.

Also, while most systems besides windows seem to deal with this well enough, windows copies part of the disk geometry into its boot code to identify the boot disk later, which causes problems in some cases (possibly when you resize across certain sizes, seemingly two-power sizes?(verify)). The error you'll get is the rather uninformative "A disk read error occurred". You can fix this by editing the MBR (probably easiest while it's still in raw format, depending a little on how you do it), and there are probably some ways tools like fdisk might fix this too.


As to the actual resizing, the easiest option is probably to go via raw format. For example:

  • (if shrinking:) shrink the partition within the VM so you know for sure that only the first so-many bytes of the (virtual) disk are being used. You'll need something which actually moves the data stored on the partition (gparted is an easy enough and free option).
  • convert the image to a raw-type image
  • truncate or extend the raw file (representing a disk) as you want. It's probably easiest to use dd, truncating and/or appending as necessary (look at dd's bs, count, and such; a short growing example follows after this list)
  • convert this new raw image to your favorite image format
  • use gparted within the VM to resize the filesystem to the new disk size
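
Growing a raw image is mostly a matter of appending zeroes (the size here is just an example; the walkthrough below shows shrinking):

dd if=/dev/zero bs=1M count=2048 >> windows.raw    # appends 2GB of zeroes to the end of the disk image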



For example, I started with a 6-gig image (initially created using qemu-img create windows.qcow2 -f qcow2 6G).

After installing windows, and stripping it down to take ~1.6GB, I decided that a 3GB virtual disk would be enough, so:

  • used gparted (within qemu, starting its liveCD using -boot d -cdrom gparted.iso) to resize the partition to ~2.7GB (under 3GB to avoid problems from rounded numbers - I could've worked exactly instead).
  • converted the image to raw:
qemu-img convert -f qcow2 windows.qcow2 -O raw windows.raw
  • Copied the first ~3G to a new raw image. (Note that the following actually specifies 3000MB, ~2.9G, but this is still very comfortably more than what is actually used, since the gparted resize we did means that only the first ~2.7GB of that 6G raw image is used by the partition)
dd if=windows.raw of=win3g.raw bs=10M count=300
  • converted this truncated raw to a qcow2 image again using qemu-img
qemu-img convert -f raw win3g.raw -O qcow2 win3g.qcow2
  • started the VM with this new disk image, and booted the gparted liveCD again to grow the partition to the actual new disk size.


Experiment with XP

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Choose the memory you'll need ahead of time. Windows can be picky about hardware changes, deciding it's now in a different computer. The below uses 384MB, which is comfortable enough for basic use.

  • create drive image (max 5GB), here of type qcow2
qemu-img create windows.img -f qcow2 5G
  • While installing windows:
qemu -no-acpi -localtime -usbdevice tablet -net nic,vlan=1 -net user,vlan=1 windows.img -m 384 -vnc :2 -boot d -cdrom winxpsetup.iso


Now you can connect a VNC client to :2 and watch it boot, install windows, and such.

Once windows is installed, the CDROM isn't really necessary (arguably it can't hurt as windows can always find any files needed for extra installation).

You may wish to configure XP to listen for Remote Desktop and use that instead of (or in addition to) VNC (note: you may wish to add a password to VNC access). To use remote desktop you'll also need to forward a port to the inside.
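
Adding a VNC password amounts to starting the VNC server in password mode and then setting the password from the qemu monitor - a sketch, assuming you expose the monitor on stdio:

qemu <your usual options> -vnc :2,password -monitor stdio
(qemu) change vnc password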

The command I use for regular runs of this VM:

qemu -no-acpi -localtime -usbdevice tablet -net nic,vlan=1 -net user,vlan=1 windows.img -m 384 -redir tcp:3389::3389


Multiple

bhyve

For BSD


Proxmox

Basically the combination of:

  • LXC for containers
  • KVM for fuller virtualization


Kubernetes

Qubes OS