Introduction
What is ToyVMM?
ToyVMM is a project being developed for the purpose of learning virtualization technology. ToyVMM aims to accomplish the following:
- Code-based understanding of KVM-based virtualization technologies
- Learn about the modern virtualization technology stack by using libraries managed by rust-vmm
  - The rust-vmm libraries are also used as a base for well-known OSS such as firecracker and provide the functionality needed to create custom VMMs.
Disclaimer
While every effort has been made to provide correct information in this publication, the authors do not guarantee that all information is accurate. Therefore, the authors cannot be held responsible for the results of development, prototyping, or operation based on this information. If you find any errors in the contents of this document, please report or correct them via a PR or Issue.
What's Next?
If you would like to try ToyVMM first, please refer to QuickStart. To learn more about KVM-based virtualization through ToyVMM, please refer to 01. Running Tiny Code in VM
QuickStart
This quickstart is based on the commit ID 58cf0f68a561ee34a28ae4e73481f397f2690b51.
Architecture & OS
ToyVMM only supports x86_64 Linux for Guest OS.
ToyVMM has been confirmed to work with Rocky Linux 8.6, 9.1 and Ubuntu 18.04, 22.04 as the Hypervisor OS.
Prerequisites
ToyVMM requires the KVM Linux kernel module.
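As a quick sanity check (this snippet is illustrative and not part of ToyVMM), you can verify that /dev/kvm is present and usable with the kvm-ioctls crate that ToyVMM builds on:

use kvm_ioctls::Kvm;

fn main() {
    // Fails if the kvm kernel module is not loaded or /dev/kvm is not accessible.
    let kvm = Kvm::new().expect("failed to open /dev/kvm");
    println!("KVM API version: {}", kvm.get_api_version());
}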
Run Virtual Machine using ToyVMM
The following commands build ToyVMM from source, download the kernel binary and rootfs needed to start a VM, and then start the VM.
# download and build toyvmm from source.
git clone https://github.com/aztecher/toyvmm.git
cd toyvmm
mkdir build
CARGO_TARGET_DIR=./build cargo build --release
# Download a linux kernel binary.
wget https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin
# Download a rootfs.
wget https://s3.amazonaws.com/spec.ccfc.min/ci-artifacts/disks/x86_64/ubuntu-18.04.ext4
# Run virtual machine based on ToyVMM!
sudo ./build/release/toyvmm vm run --config examples/vm_config.json
After the guest OS boot messages are printed, a login prompt is displayed. Enter 'root' for both the username and password to log in.
Disk I/O in the Virtual Machine
Since virtio-blk is implemented, the virtual machine can operate block devices.
It recognizes the ubuntu-18.04.ext4 disk image as a block device and mounts it as the root filesystem.
lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> vda 254:0 0 384M 0 disk /
Therefore, if you create a file in the VM and then recreate the VM using the same image, the file you created will still be there. This behavior is significantly different from an initramfs (a rootfs that is extracted into RAM).
# Create 'hello.txt' in VM.
echo "hello virtual machine" > hello.txt
cat hello.txt
> hello virtual machine
# Rebooting will cause the ToyVMM process to terminate.
reboot -f
# On the host, restart the VM and log in again.
# Afterward, you can find the file you created in the VM during its previous run.
cat hello.txt
> hello virtual machine
Network I/O in the Virtual Machine
Since virtio-net is implemented, the virtual machine can operate network devices.
It now recognizes the eth0 network interface.
ip link show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
> link/ether 52:5f:7f:b3:f8:81 brd ff:ff:ff:ff:ff:ff
ToyVMM also creates a host-side tap device named vmtap0 that connects to the virtual machine's interface.
ip link show vmtap0
> 334: vmtap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
> link/ether 26:e9:5c:02:3c:19 brd ff:ff:ff:ff:ff:ff
Therefore, by assigning appropriate IP addresses to the interfaces on both the VM side and the host side, communication can be established between the host and the VM.
# Assign ip address 192.168.0.10/24 to 'eth0' in vm.
ip addr add 192.168.0.10/24 dev eth0
# Assign ip address 192.168.0.1/24 to 'vmtap0' in host.
sudo ip addr add 192.168.0.1/24 dev vmtap0
# Host -> VM. ping to VM interface ip from host.
ping -c 1 192.168.0.10
# VM -> Host. Ping to Host interface ip from vm.
ping -c 1 192.168.0.1
Additionally, by setting the default route on the VM side, and configuring iptables and enabling IP forwarding on the host side, you can also allow the VM to access the Internet.
However, this will not be covered in detail here.
What's next?
If you are not familiar with KVM-based VMs, I suggest you start reading from 01. Running Tiny Code in VM. If not, please read the topics that interest you.
01. Running Tiny Code in VM
Tiny code execution is no longer supported in the latest commit.
You may be able to verify it by checking out past commits, but please be aware that resolving package dependencies may be challenging.
This chapter is written so that you can get a sense of the behavior without actually running the code, so don't worry about that.
Deep dive into ToyVMM and how to run tiny code in a VM
The main function is a program that starts a VM using the KVM mechanism and executes the following small piece of code inside the VM:
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, (%dx) */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, (%dx) */
    0xf4,             /* hlt */
];
This code performs several register operations, and the initial state of the CPU registers for this VM is set as follows.
regs.rip = 0x1000;
regs.rax = 2;
regs.rbx = 2;
regs.rflags = 0x2;
vcpu.set_sregs(&sregs).unwrap();
vcpu.set_regs(&regs).unwrap();
This will output the result of the calculation (2 + 2) performed inside the VM through the I/O port, followed by a newline character.
As you can see from the result of running ToyVMM, the hex values 0x34 (= '4') and 0xa (= newline) are caught from the I/O port.
How the above code works with the rust-vmm libraries
Now, the following crates provided by rust-vmm are used to run this code (please see Cargo.toml):
- kvm-bindings
- kvm-ioctls
- vmm-sys-util
- vm-memory

I omit the description of vmm-sys-util because it is only used to create a temporary file at this point, so there is nothing special to mention about it.
I will go through the code in order and describe how each crate relates to it.
In this explanation, we will focus primarily on which ioctl is issued as a result of each function call (this is because the interface for manipulating KVM from user space relies on the ioctl system call).
Also, please note that explanations of unimportant variables may be omitted.
It should be noted that what is described here applies not only to the ToyVMM implementation but also to the firecracker implementation in a similar form.
First, we need to open /dev/kvm and acquire a file descriptor. This can be done with Kvm::new() of the kvm_ioctls crate. Under the hood, the Kvm::open_with_cloexec function issues an open system call as follows and returns the file descriptor wrapped in a Kvm structure:
let ret = unsafe { open("/dev/kvm\0".as_ptr() as *const c_char, open_flags) };
The result obtained above is used to call the create_vm method, which results in the following ioctl being issued:
vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0)  /* kvmfd: the file descriptor obtained from /dev/kvm */
Please keep in mind that the file descriptor returned from above function will be used later when preparing the CPU.
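For reference, a minimal sketch of these first two steps using the kvm-ioctls crate (error handling simplified; this is not ToyVMM's exact code):

use kvm_ioctls::Kvm;

fn main() {
    // Opens /dev/kvm internally (the open() call shown above).
    let kvm = Kvm::new().unwrap();
    // Issues ioctl(kvmfd, KVM_CREATE_VM, 0) and wraps the returned fd in a VmFd.
    let _vm = kvm.create_vm().unwrap();
}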
Anyway, we have now created a VM, but it has no memory or CPU yet.
Now, the next step is to prepare memory!
In the kvm_ioctls crate's example, memory is prepared as follows:
// First, set up the guest memory using mmap.
let load_addr: *mut u8 = unsafe {
    libc::mmap(
        null_mut(),
        mem_size, // 0x4000
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
        -1,
        0,
    ) as *mut u8
};

// Second, set up the kvm_userspace_memory_region structure using the memory above.
// kvm_userspace_memory_region is defined in the kvm-bindings crate.
let mem_region = kvm_userspace_memory_region {
    slot,
    guest_phys_addr: guest_addr,  // 0x1000
    memory_size: mem_size as u64, // 0x4000
    userspace_addr: load_addr as u64,
    flags: KVM_MEM_LOG_DIRTY_PAGES,
};
unsafe { vm.set_user_memory_region(mem_region).unwrap() };

// Retrieve a slice from the pointer and length (slice::from_raw_parts_mut)
// > https://doc.rust-lang.org/beta/std/slice/fn.from_raw_parts_mut.html
// and write asm_code into this slice (&[u8], &mut [u8], Vec<u8> implement the Write trait!)
// > https://doc.rust-lang.org/std/primitive.slice.html#impl-Write
unsafe {
    let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
    slice.write(&asm_code).unwrap();
}
Note the call to set_user_memory_region. This function issues the following ioctl as a result, attaching the memory to the VM:
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mem_region)
ToyVMM, on the other hand, provides utility functions for memory preparation.
This difference comes from the fact that ToyVMM's implementation follows firecracker's, but essentially it does the same thing.
Let's look at the whole implementation first:
// The following create_region function operates based on a file descriptor,
// so first create a temporary file and write asm_code to it.
let mut file = TempFile::new().unwrap().into_file();
assert_eq!(unsafe { libc::ftruncate(file.as_raw_fd(), 4096 * 10) }, 0);
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, %dx */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, %dx */
    0xf4,             /* hlt */
];
file.write_all(code).expect("Failed to write code to tempfile");

// The create_region function creates a GuestRegion (the details are described below).
let mut mmap_regions = Vec::with_capacity(1);
let region = create_region(
    Some(FileOffset::new(file, 0)),
    0x1000,
    libc::PROT_READ | libc::PROT_WRITE,
    libc::MAP_NORESERVE | libc::MAP_PRIVATE,
    false,
).unwrap();

// The Vec named 'mmap_regions' contains the GuestRegionMmap entries.
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());

// guest_memory represents the guest memory as a Vec of GuestRegions.
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();

let track_dirty_page = false;
// Set up the guest memory.
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
The create_region function consequently performs an mmap in the following way and returns the structure (GuestMmapRegion) representing a part of the GuestMemory:
pub fn create_region(
    maybe_file_offset: Option<FileOffset>,
    size: usize,
    prot: i32,
    flags: i32,
    track_dirty_pages: bool,
) -> Result<GuestMmapRegion, MmapRegionError> {
    ...
    let region_addr = unsafe {
        libc::mmap(
            region_start_addr as *mut libc::c_void,
            size,
            prot,
            flags | libc::MAP_FIXED,
            fd,
            offset as libc::off_t,
        )
    };
    let bitmap = match track_dirty_pages {
        true => Some(AtomicBitmap::with_len(size)),
        false => None,
    };
    unsafe {
        MmapRegionBuilder::new_with_bitmap(size, bitmap)
            .with_raw_mmap_pointer(region_addr as *mut u8)
            .with_mmap_prot(prot)
            .with_mmap_flags(flags)
            .build()
    }
}
Let's check the memory-related structures here.
In src/kvm/memory.rs, the following memory types are defined based on the vm-memory crate:
pub type GuestMemoryMmap = vm_memory::GuestMemoryMmap<Option<AtomicBitmap>>;
pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
pub type GuestMmapRegion = vm_memory::MmapRegion<Option<AtomicBitmap>>;
The MmapRegionBuilder is also defined in the vm-memory crate, and its build method creates the MmapRegion.
This time, since we have performed the mmap ourselves in advance and passed that address to with_raw_mmap_pointer, that area is used for initialization; otherwise, the mmap is performed inside the build method. In either case, the build method produces the MmapRegion structure, for which the synonym defined above is used, and it is returned as a GuestMmapRegion. By calling the create_region function once, you can allocate and obtain one region of GuestMemory based on the information (size, flags, etc.) specified in the arguments.
The region allocated here is only mmapped into the virtual address space of the VMM process; nothing more has been set up yet. To use this area as guest memory, a GuestRegionMmap structure is created from it. This is simple: specify the corresponding GuestAddress for this region and initialize a GuestRegionMmap with the tuple of the mmapped area and the GuestAddress. In the following code, the initialized GuestRegionMmap is pushed to a Vec for subsequent processing.
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());
Now, the mmap_regions: Vec<GuestRegionMmap> created above represents the entire memory of the Guest VM; each region that makes up the guest memory holds both the area allocated by the VMM for it and its start address on the guest side.
The GuestMemoryMmap structure representing the guest memory is initialized from this Vec and set on the VM by the memory_init method.
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
Next, let's check the operation of this memory_init. It calls set_kvm_memory_regions, where the actual processing is implemented:
pub fn set_kvm_memory_regions(
    &self,
    guest_mem: &GuestMemoryMmap,
    track_dirty_pages: bool,
) -> Result<()> {
    let mut flags = 0u32;
    if track_dirty_pages {
        flags |= KVM_MEM_LOG_DIRTY_PAGES;
    }
    guest_mem
        .iter()
        .enumerate()
        .try_for_each(|(index, region)| {
            let memory_region = kvm_userspace_memory_region {
                slot: index as u32,
                guest_phys_addr: region.start_addr().raw_value() as u64,
                memory_size: region.len() as u64,
                userspace_addr: guest_mem.get_host_address(region.start_addr()).unwrap() as u64,
                flags,
            };
            unsafe { self.fd.set_user_memory_region(memory_region) }
        })
        .map_err(Error::SetUserMemoryRegion)?;
    Ok(())
}
Here we can see that set_user_memory_region is called with the necessary information while iterating over the regions.
In other words, it does the same thing as the example code, except that there may be more than one region.
Now that we've gone through the explanation of memory preparation, let's take a look at the vm-memory crate!
The information presented here is only the minimum required, so please refer to the crate's Design document or other sources for more details.
This is also related to the iteration above, where we were able to call methods such as start_addr() and len() to construct the information needed for set_user_memory_region.
GuestAddress (struct)     : Represents a Guest Physical Address (GPA)
FileOffset (struct)       : Represents the start point within a 'File' that backs a 'GuestMemoryRegion'
GuestMemoryRegion (trait) : Represents a continuous region of guest physical memory
GuestMemory (trait)       : Represents a container for an immutable collection of GuestMemoryRegion objects
MmapRegion (struct)       : Helper structure for working with mmapped memory regions
GuestRegionMmap (struct, implements GuestMemoryRegion) : Represents a continuous region of the guest's physical memory that is backed by a mapping in the virtual address space of the calling process
GuestMemoryMmap (struct, implements GuestMemory)       : Represents the entire physical memory of the guest by tracking all its memory regions
Since GuestRegionMmap implements the GuestMemoryRegion trait, it provides implementations of functions such as start_addr() and len(), which were used in the iteration above.
The following figure briefly summarizes the relationship between these structures.
As you can see, what is being done is essentially the same.
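To get a feel for the vm-memory API itself, here is a small sketch that is independent of ToyVMM's helpers: it builds a GuestMemoryMmap (using the crate's default bitmap type rather than ToyVMM's alias) from a single anonymous region and accesses it through the Bytes trait. The addresses and sizes are arbitrary example values:

use vm_memory::{Bytes, GuestAddress, GuestMemoryMmap};

fn main() {
    // One anonymous region: 16 KiB of guest memory starting at GPA 0x1000.
    let guest_memory: GuestMemoryMmap =
        GuestMemoryMmap::from_ranges(&[(GuestAddress(0x1000), 0x4000)]).unwrap();

    // Write a few bytes at GPA 0x1000 and read them back as a u32.
    guest_memory
        .write_slice(&[0xde, 0xad, 0xbe, 0xef], GuestAddress(0x1000))
        .unwrap();
    let value: u32 = guest_memory.read_obj(GuestAddress(0x1000)).unwrap();
    println!("value at 0x1000 = {:#x}", value); // 0xefbeadde on little-endian hosts
}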
The final step is preparing the vCPU (a vCPU is a virtual CPU to be attached to the virtual machine).
Currently, a VM has been created and memory containing instructions has been inserted, but there is no CPU, so the instructions can't be executed. Therefore, let's create a vCPU, associate it with the VM, and execute the instructions by running the vCPU!
Using the file descriptor obtained during VM creation (vmfd), the following ioctl will be issued:
vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0)
The create_vm method that was issued earlier to obtain the vmfd returns a kvm_ioctls::VmFd structure as its result, and by executing the create_vcpu method of this structure, the above ioctl is issued and the result is returned as a kvm_ioctls::VcpuFd structure.
VcpuFd provides utilities for getting and setting various CPU states.
For example, if you want to get/set a register set from the vCPU, you would normally issue the following ioctls:
ioctl(vcpufd, KVM_GET_SREGS, &sregs);
ioctl(vcpufd, KVM_SET_SREGS, &sregs);
For these, the following methods are available in kvm_ioctls::VcpuFd
get_sregs(&self) -> Result<kvm_sregs>
set_sregs(&self, sregs: &kvm_sregs) -> Result<()>
VcpuFd also provides a method called run, which issues the following ioctl to actually run the vCPU:
ioctl(vcpufd, KVM_RUN, NULL)
We can then acquire a return value of type Result<VcpuExit> from this method.
While the vCPU is running, exits occur for various reasons, for example when the guest executes an instruction that the CPU cannot handle by itself; on a physical machine, the OS usually deals with such events by invoking the corresponding handler.
Similarly, when such an exit comes back from the VM's vCPU, the VMM needs to contain appropriate code to handle the situation.
VcpuExit is defined as an enum in kvm_ioctls::VcpuExit.
When exits occur for various reasons while running the vCPU, the exit reasons defined in kvm.h in the Linux kernel are wrapped into VcpuExit.
Therefore, it is sufficient to pattern match on this result and handle each case appropriately.
Now, our code contains an instruction that outputs a value through an I/O port, which causes a KVM_EXIT_IO_OUT exit.
VcpuExit wraps this exit reason as IoOut.
Originally (in a C program, for example), we would have to calculate the appropriate offset to get the output data from the I/O port, but this processing is implemented in the run method and the necessary values are returned as part of VcpuExit.
So we don't have to write that unsafe code (pointer offset calculation) ourselves and can handle these exits as we like.
loop {
    match vcpu.run().expect("vcpu run failed") {
        kvm_ioctls::VcpuExit::IoOut(addr, data) => {
            println!(
                "Received I/O out exit. \
                Address: {:#x}, Data(hex): {:#x}",
                addr, data[0],
            );
        },
        kvm_ioctls::VcpuExit::Hlt => {
            break;
        }
        exit => panic!("unexpected exit reason: {:?}", exit),
    }
}
In the above, only KVM_EXIT_IO_OUT and KVM_EXIT_HLT are handled, and all other exits cause a panic. (Although all exits should be handled, I want to focus on describing the KVM API example and keep it simple.)
Since we are here, let's take a look at the processing of the run method in some detail, checking how KVM_EXIT_IO_OUT is handled.
If you look at the LWN article, you will see that it calculates the offset and outputs the necessary information in the following way.
case KVM_EXIT_IO:
    if (run->io.direction == KVM_EXIT_IO_OUT &&
        run->io.size == 1 &&
        run->io.port == 0x3f8 &&
        run->io.count == 1)
        putchar(*(((char *)run) + run->io.data_offset));
    else
        errx(1, "unhandled KVM_EXIT_IO");
    break;
On the other hand, the run method implemented in kvm_ioctls::VcpuFd looks like this:
...
let run = self.kvm_run_ptr.as_mut_ref();
match run.exit_reason {
    ...
    KVM_EXIT_IO => {
        let run_start = run as *mut kvm_run as *mut u8;
        // Safe because the exit_reason (which comes from the kernel) told us which
        // union field to use.
        let io = unsafe { run.__bindgen_anon_1.io };
        let port = io.port;
        let data_size = io.count as usize * io.size as usize;
        // The data_offset is defined by the kernel to be some number of bytes into the
        // kvm_run structure, which we have fully mmap'd.
        let data_ptr = unsafe { run_start.offset(io.data_offset as isize) };
        // The slice's lifetime is limited to the lifetime of this vCPU, which is equal
        // to the mmap of the `kvm_run` struct that this is slicing from.
        let data_slice = unsafe {
            std::slice::from_raw_parts_mut::<u8>(data_ptr as *mut u8, data_size)
        };
        match u32::from(io.direction) {
            KVM_EXIT_IO_IN => Ok(VcpuExit::IoIn(port, data_slice)),
            KVM_EXIT_IO_OUT => Ok(VcpuExit::IoOut(port, data_slice)),
            _ => Err(errno::Error::new(EINVAL)),
        }
    }
    ...
Let me explain a little. The kvm_run structure is provided by the kvm-bindings crate; it is automatically generated from the header file using bindgen, so it is essentially the Linux kernel's kvm_run structure converted directly to Rust.
First, kvm_run is obtained in the form of a pointer, using a common Rust pattern for obtaining pointers.
This corresponds to the first address of the kvm_run structure, which is bound to the run_start variable.
The information corresponding to run->io(.member) can be obtained from run.__bindgen_anon_1.io, although it is a bit tricky; the field named __bindgen_anon_1 is an artifact of automatic generation by bindgen.
The data we want is at the first address of kvm_run plus io.data_offset. This calculation is performed by run_start.offset(io.data_offset as isize). The data size can be calculated from io->size and io->count (in the LWN example it is 1 byte, so it is read directly from the offset with putchar). This value is computed and stored in data_size, and std::slice::from_raw_parts_mut then retrieves the data using this size.
Finally, by checking io.direction, the result is wrapped as either IoIn (for KVM_EXIT_IO_IN) or IoOut (for KVM_EXIT_IO_OUT), and the desired information such as port and data_slice is returned together.
As can be seen from the above, what is being done is clear.
However, it still contains many unsafe operations because it involves pointer manipulation.
By using these libraries, we can build our VMM on top of a stable foundation instead of writing such code ourselves.
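To tie the pieces of this chapter together, the following is a minimal, self-contained sketch that uses the kvm-ioctls, kvm-bindings, and libc crates directly (rather than ToyVMM's code) to run the tiny program shown at the beginning of this chapter. The memory size, slot number, and guest address are arbitrary example values:

use std::io::Write;
use std::ptr::null_mut;
use std::slice;

use kvm_bindings::{kvm_userspace_memory_region, KVM_MEM_LOG_DIRTY_PAGES};
use kvm_ioctls::{Kvm, VcpuExit};

fn main() {
    let mem_size: usize = 0x4000;
    let guest_addr: u64 = 0x1000;
    let asm_code: &[u8] = &[
        0xba, 0xf8, 0x03, // mov $0x3f8, %dx
        0x00, 0xd8,       // add %bl, %al
        0x04, b'0',       // add $'0', %al
        0xee,             // out %al, (%dx)
        0xb0, b'\n',      // mov $'\n', %al
        0xee,             // out %al, (%dx)
        0xf4,             // hlt
    ];

    // 1. Open /dev/kvm and create a VM.
    let kvm = Kvm::new().unwrap();
    let vm = kvm.create_vm().unwrap();

    // 2. Prepare guest memory with mmap and register it via KVM_SET_USER_MEMORY_REGION.
    let load_addr: *mut u8 = unsafe {
        libc::mmap(
            null_mut(),
            mem_size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
            -1,
            0,
        ) as *mut u8
    };
    let mem_region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: guest_addr,
        memory_size: mem_size as u64,
        userspace_addr: load_addr as u64,
        flags: KVM_MEM_LOG_DIRTY_PAGES,
    };
    unsafe { vm.set_user_memory_region(mem_region).unwrap() };
    unsafe {
        let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
        slice.write_all(asm_code).unwrap();
    }

    // 3. Create a vCPU and set up its registers.
    let vcpu = vm.create_vcpu(0).unwrap();
    let mut sregs = vcpu.get_sregs().unwrap();
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    vcpu.set_sregs(&sregs).unwrap();
    let mut regs = vcpu.get_regs().unwrap();
    regs.rip = guest_addr;
    regs.rax = 2;
    regs.rbx = 2;
    regs.rflags = 2;
    vcpu.set_regs(&regs).unwrap();

    // 4. Run the vCPU and handle exits.
    loop {
        match vcpu.run().unwrap() {
            VcpuExit::IoOut(port, data) => {
                println!("IoOut: port={:#x}, data={:#x?}", port, data);
            }
            VcpuExit::Hlt => break,
            exit => panic!("unexpected exit reason: {:?}", exit),
        }
    }
}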
Well, it's been a long time coming, but let's take a look back at the rust-vmm crates we're using once again.
kvm-bindings : Library that contains structures automatically generated from kvm.h by bindgen.
kvm-ioctls   : Library that hides the ioctl and unsafe processing related to KVM operations and provides user-friendly structures, functions and methods.
vm-memory    : Library that provides structures and operations for the memory.
This knowledge will come up again and again in future discussion and is basic and important.
Load Linux Kernel
In this section, we will explain the implementation of launching a Guest VM as the first step of the VMM. While our VMM has only minimal functionality, booting the Linux kernel demands a variety of knowledge.
In this section, we will explain the essential aspects of launching a Guest VM and delve into how it is implemented in ToyVMM. To achieve this, we will divide it into several detailed chapters and provide explanations for each topic.
The topics are as follows:
- 02-1. Overview of Booting Linux
- 02-2. ELF binary format and vmlinux structure
- 02-3. Loading initrd
- 02-4. Setup registers of vcpu
- 02-5. Serial console implementation
- 02-6. ToyVMM implementation
Additionally, this document is based on the following commit numbers:
- ToyVMM:
27fdf196dfb31938f24785ca64e7233a6dc8fceb
- Firecracker:
4bf121fc032cc2d94a20a3625f2af3918545154a
If you refer to this document while inspecting ToyVMM's code, it may be beneficial.
Overview of Booting Linux
General Booting Mechanism
In Linux, the operating system starts by executing programs in the following order:
- BIOS
- Boot Loader (GRUB)
- Linux Kernel (vmlinuz)
- init
The BIOS program is stored in the ROM on the motherboard. When you power on your computer, the CPU is instructed to start executing code from a specific address mapped to this ROM area. The BIOS performs hardware detection and initialization, then searches for the OS boot drive (HDD/SSD, USB flash drive, etc.). During this process, the boot drive needs to be formatted in either MBR or GPT format, depending on the BIOS type, as shown in the table below:
BIOS \ DISK Format | MBR | GPT |
---|---|---|
Legacy BIOS | ◯ | - |
UEFI | ◯ * | ◯ |
* UEFI supports Legacy Boot Mode and thus supports MBR.
Next, I will explain the process of searching for the OS when using MBR. But before going into details, let's briefly review the structure of MBR. The MBR structure explained here assumes HDD/SSD or USB flash memory and implicitly assumes the presence of the Partition Entry, as described later. Please note that this document uses the terms provided on Wikipedia, so keep that in mind.
MBR is a 512-byte sector located at the beginning of the boot drive and consists of three main parts:
- Bootstrap code area (446 bytes)
- Partition Entry (64 bytes = 16 bytes * 4)
- Boot Signature (2 bytes)
I won't go into the details of MBR here, but the Boot code area contains machine code programs (Boot Loaders) to boot the OS, and the Partition Entry stores information about the logical partitions on that disk. It's worth noting that the Boot code area is only 446 bytes in size, so Boot Loaders are typically stored elsewhere. A minimal program is placed in the Boot code area to load the actual Boot Loader into memory.
The critical part here is the "Boot Signature," which contains a 2-byte value used to ensure that the drive is a bootable device. When the BIOS searches for the OS boot drive, it reads the first sector (512 bytes), checks whether the last 2 bytes (the Boot Signature) are 0x55 and 0xAA, and identifies the drive as a bootable disk. It loads the first sector (512 bytes) from that disk into memory at 0x7c00 - 0x7fff and begins executing the program from 0x7c00.
Now, as a simple validation, let's check the Boot Signature on your machine. In this example, a virtual machine is used and its boot drive is labeled vda; on a regular machine, it might be something like sda. By writing the first sector's content to a file and examining the 2 bytes at an offset of 510 bytes, you should see the expected 0x55 0xAA signature.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 2M 0 rom
vda 252:0 0 300G 0 disk
├─vda1 252:1 0 1M 0 part
└─vda2 252:2 0 300G 0 part /
$ sudo dd if=/dev/vda of=mbr bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000214802 s, 2.4 MB/s
$ hexdump -s 510 -C mbr
000001fe 55 aa |U.|
00000200
Now, back to our discussion. After confirming the Boot Signature, the BIOS identifies the disk as bootable and loads its first sector (512 bytes) into memory at address 0x7c00. Program execution then starts from 0x7c00.
Moving on, once the Boot Loader is loaded into memory, it takes on the responsibility of loading the Linux Kernel and initramfs from the disk and starting the kernel. In recent years, GRUB has become a common choice as a Boot Loader. I'll skip the detailed workings of the Boot Loader for now. The essential point is that the Boot Loader needs to load the specified kernel and initrd from the disk.
To achieve this, one straightforward method would be to inform the Boot Loader of the location of the kernel file on the disk. However, if you look at the contents of grub.cfg, you'll notice that the kernel and initrd locations are specified in the form of file paths. This means that the Boot Loader must have the ability to interpret the file system. In practice, several Boot Loaders can interpret various file systems and locate the kernel based on directory path information; note, however, that Boot Loaders only support specific file system formats and cannot interpret others. The Boot Loader loads the kernel and RAM disk specified in grub.cfg, and by jumping to the kernel's entry point, it hands execution over to the kernel, completing its own processing.
Before delving into the details of the kernel's processing, let's briefly organize some information about the kernel file. The kernel file is generally named vmlinuz*. You might be familiar with the kernel file located at /boot/vmlinuz-*; this file, however, is in the bzImage format, which you can easily check using the file command. The bzImage includes the actual kernel binary along with several other files used for low-level initialization. In this document, I'll refer to the kernel file in the bzImage format as vmlinuz, and to the actual kernel binary in executable format as vmlinux.bin.
When control is handed over from the BootLoader to vmlinuz, vmlinuz performs low-level initialization, then decompresses the kernel core, loads it into memory, and transfers control to the kernel's entry routine. Once all initialization processes are completed, the kernel creates a tmpfs filesystem, unpacks the initramfs placed in RAM by the BootLoader into it, and starts the init script located in its root directory.

This init script prepares to mount the main filesystem stored on the disk and mounts other important filesystems. initramfs contains various device drivers and allows mounting root filesystems in different formats. After this is done, the root is switched to the main root filesystem, and the /sbin/init binary stored there is executed.
/sbin/init is the first process launched on the system (PID=1), and it serves as the parent of all other processes, responsible for starting them. There are various implementations of init, such as SysVinit and Upstart, but what is commonly used in recent systems like CentOS and Ubuntu is Systemd. The ultimate responsibility of init is to further prepare the system, ensure that the necessary services are running, and bring the system to a state where users can log in by the time the boot process completes.
This is a very high-level overview of the process from powering on to the OS booting up.
initrd and initramfs
In the previously discussed Linux boot process, we introduced initramfs, a file system that is unpacked into memory. However, what we often encounter is /boot/initrd.img. Here, we will explain the differences between initrd and initramfs.
initrd stands for "initial RAM disk", while initramfs stands for "initial RAM File System". Although they are different in nature, they serve the same purpose: to provide the commands, libraries, and modules needed to mount the root file system and launch the /sbin/init script located in it.
The challenge that both initrd and initramfs address is that the system you want to boot originally resides on some storage device, and to load it you need the appropriate device drivers and a file system driver for mounting it.
initrd and initramfs both address this issue, but with different methods. As their names suggest, initrd uses a block device, while initramfs uses a RAM file system based on tmpfs. Traditionally, initrd was used, but starting from kernel 2.6, initramfs became available, and it is now the more common choice.
The shift from initrd to initramfs occurred because initrd had several issues:
- A RAM disk is a mechanism that creates a pseudo block device in RAM, treating it as if it were a secondary storage device. However, because of this behavior, it inadvertently consumes memory cache, just like regular block devices, leading to unnecessary memory usage. Furthermore, mechanisms such as paging come into play, consuming more memory capacity.
- A RAM disk requires a file system driver, such as ext2, to format and interpret its data.
- RAM disks have a fixed size, which can lead to problems: if they are too small, they may not accommodate all the necessary scripts, and if they are too large, they waste memory.
To address these issues, initramfs was developed. It is a lightweight, memory-based file system that can be flexibly sized and is based on tmpfs. It is not a block device, so it doesn't interfere with memory caching or paging, and it doesn't require file system drivers for block devices. Additionally, it resolves the fixed size problem.
Whether using initrd or initramfs, both methods provide the tools inside them to mount the root file system and switch to it. The startup script /sbin/init located in that file system is then executed.
Inspecting the contents of initramfs
Let's unpack and examine the contents of an initramfs. We'll use an Ubuntu 20.04.2 LTS initrd for this example. (Note: the file named initrd is actually a proper initramfs.) An initramfs consists of several files concatenated in CPIO format. When you extract it directly using the cpio command, you'll see only the initial files (like AuthenticAMD.bin), as follows:
$ mkdir initrd-work && cd initrd-work
$ sudo cp /boot/initrd.img ./
$ cat initrd.img | cpio -idvm
.
kernel
kernel/x86
kernel/x86/microcode
kernel/x86/microcode/AuthenticAMD.bin
62 blocks
You can extract all the files using a combination of dd and cpio, but there's a handy tool called unmkinitramfs that can do this for you:
$ mkdir extract
$ unmkinitramfs initrd.img extract
$ ls extract
early early2 main
After extracting, you'll see directories like early, early2, and main. For instance, early contains the same files that were seen when extracting with cpio. The most crucial part is under main, where the contents of the file system root are stored:
$ ls extract/early/kernel/x86/microcode
AuthenticAMD.bin
$ ls extract/early2/kernel/x86/microcode
GenuineIntel.bin
$ ls extract/main
bin conf cryptroot etc init lib lib32 lib64 libx32 run sbin scripts usr var
By chrooting into this extracted content, you can pseudo-operate the Linux boot-time RAM filesystem and understand what operations can be performed:
$ sudo chroot extract/main /bin/sh
BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.3) built-in shell (ash)
Enter 'help' for a list of built-in commands.
# ls
scripts init run etc var usr conf
lib64 bin lib libx32 lib32 sbin cryptroot
# pwd
/
# which mount
/usr/bin/mount
# exit
As shown above, there is an init script file in the root directory, which is the script executed after the initramfs is extracted. The init script reads the contents of /proc/cmdline and extracts disk information (e.g., root=/dev/sda1) to perform the necessary mounting operations. If this information is missing, this init script from the Ubuntu 20.04 LTS initrd would encounter an error.
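As a purely illustrative sketch of this mechanism (the real logic lives in the shell init script inside the initramfs; this Rust snippet is not part of any of the tools discussed here), extracting the root= parameter from the kernel command line looks conceptually like this:

use std::fs;

// Scan the kernel command line for a `root=` parameter, as the init script does.
fn find_root_device() -> Option<String> {
    let cmdline = fs::read_to_string("/proc/cmdline").ok()?;
    cmdline
        .split_whitespace()
        .find_map(|param| param.strip_prefix("root=").map(str::to_string))
}

fn main() {
    match find_root_device() {
        Some(root) => println!("root filesystem device: {}", root),
        None => println!("no root= parameter found"),
    }
}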
In the case of ToyVMM, we use an initramfs based on firecracker-initrd, so the behavior might differ slightly.
About firecracker-initrd
In ToyVMM, we use firecracker-initrd. Firecracker-initrd creates an initrd.img (initramfs) based on Alpine Linux. Unlike the Ubuntu initrd we discussed earlier, it does not include additional CPIO files like microcode, so you can simply extract it to see the root filesystem:
$ cat initrd.img | cpio -idv
$ ls
bin dev etc home init initrd.img lib media mnt opt proc root run sbin srv sys tmp usr var
Alpine Linux normally unpacks its filesystem into RAM during boot, and then the OS starts; whether to then write the OS to disk using setup-alpine depends on your specific needs. In ToyVMM, when you boot a VM using this initramfs, it doesn't immediately mount a root file system from disk by default. Instead, it simply unpacks the file system into RAM, and Alpine Linux starts. This differs from the traditional approach of loading the root filesystem from secondary storage and informing the init script about it via /proc/cmdline.
Boot Sequence of Linux Kernel in ToyVMM
Now, let's compare what we've discussed so far with the Linux boot process in ToyVMM:
Boot Process (on Linux) | ToyVMM |
---|---|
BIOS | Not implemented yet |
Boot Loader | Requires implementation: Loading vmlinux/initrd.img, basic setup |
Linux Kernel | Processed by vmlinux.bin |
init | Processed by init scripts (from firecracker-initrd's initrd.img ) |
The current implementation of ToyVMM does not support loading bzImage and instead uses the ELF binary vmlinux.bin. It currently omits BIOS-related functions.

For the Boot Loader's tasks, such as loading vmlinux.bin and initrd.img into memory, ToyVMM needs its own implementation. The Linux kernel itself is handled by vmlinux.bin, while the init process is handled by the init scripts found in the initrd.img from firecracker-initrd.
For more detailed implementation instructions, you can refer to 02-6_minimal_vmm_implementation.
References
- MBR(Master Boot Records)の構造
- Initrd(4) - Linux man page
- Initramfsのしくみ
- Initramfs/ガイド
- Kernel Boot Process
- What's the Difference Between initrd and initramfs
- bzImage
- Initデーモンを理解する
- Linuxがブートするまで
- filesystems/ramfs-rootfs-initramfs.txt
ELF binary format and vmlinux structure
At the time of writing this document, the kernel used to boot a VM in ToyVMM assumes an ELF-formatted vmlinux.bin
. Therefore, within the VMM, it's necessary to interpret the ELF format and load the kernel into the memory area prepared for the VM appropriately. This process is implemented in the rust-vmm/linux-loader crate. While ToyVMM abstracts this implementation by using the crate, it is essential to understand how it works. Hence, this section provides an explanation of loading ELF binaries.
ELF Binary Format
The ELF file format consists of the following components:
As shown above, the ELF file format primarily consists of an ELF Header, Program Header Table, Segments (Sections), and Section Header Table. When used by a system loader, ELF files treat the entries in the Program Header Table as a collection of Segments, while compilers, assemblers, and linkers treat entries in the Section Header Table as a collection of Sections.
The ELF Header contains overall information about the ELF file. Each entry in the Program Header Table, known as a Program Header, holds header information about the corresponding Segment. Therefore, the number of Program Headers corresponds to the number of Segments. Furthermore, each Segment can be divided into multiple Sections, and the Section Header Table contains header information for these Sections.
The ELF Header always starts at the beginning of the file and holds the information necessary for reading the ELF data. Here are some excerpts from the ELF Header; for a comprehensive overview, please refer to the Man page of ELF.
Attribute | Meaning |
---|---|
e_entry | Virtual address representing the entry point to start this ELF process |
e_phoff | File offset value to the location of the Program Header Table |
e_shoff | File offset value to the location of the Section Header Table |
e_phentsize | Size of one entry in the Program Header Table |
e_phnum | Number of entries in the Program Header Table |
e_shentsize | Size of one entry in the Section Header Table |
e_shnum | Number of entries in the Section Header Table |
From the above excerpts, you can see that it's possible to extract information about each entry in the Program Header and Section Header. Now, let's focus on the contents of the Program Header.
Attribute | Meaning |
---|---|
p_type | Represents the type of the Segment pointed to by this Program Header , providing hints on how to interpret it |
p_offset | File offset value to the Segment pointed to by this Program Header |
p_paddr | In systems where physical addresses are meaningful, this value points to the physical address of the Segment pointed to by this Program Header |
p_filesz | Byte size of the file image of the Segment pointed to by this Program Header |
p_memsz | Byte size of the memory image of the Segment pointed to by this Program Header |
p_flags | Flags that indicate information about the Segment pointed to by this Program Header , such as executable, writable, and readable |
As mentioned earlier, by interpreting the contents of the Program Header, you can obtain the position and size of the corresponding segment and how to interpret it. For our purposes, understanding the structure of the Program Header is sufficient, so we will omit details about the Section Header and other components.
Now, the vmlinux.bin we will be working with has five Program Header entries, with the first four having a p_type value of PT_LOAD and the last one having PT_NOTE. Let's extract some details about PT_LOAD and PT_NOTE from the Man page of ELF:
p_type | Meaning |
---|---|
PT_LOAD | Represents a loadable Segment described by p_filesz and p_memsz . |
PT_NOTE | Contains auxiliary information for location and size. |
In the case of PT_LOAD, the byte sequence of the file is associated with the beginning of the memory segment. You can load the segment's contents into memory by copying p_filesz bytes of data starting at file offset p_offset to the segment's memory address (the in-memory size of the segment is p_memsz).
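To make this concrete, here is a hedged sketch of copying a single PT_LOAD segment into a flat byte buffer standing in for guest memory; the p_offset, p_filesz, and p_paddr parameters are placeholders that would normally come from a parsed Program Header:

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Copy one PT_LOAD segment from the ELF file into `guest_mem`,
// placing it at the segment's physical address (p_paddr).
fn load_segment(
    kernel: &mut File,
    guest_mem: &mut [u8],
    p_offset: u64,
    p_filesz: usize,
    p_paddr: usize,
) -> std::io::Result<()> {
    kernel.seek(SeekFrom::Start(p_offset))?;
    kernel.read_exact(&mut guest_mem[p_paddr..p_paddr + p_filesz])?;
    // Bytes between p_filesz and p_memsz (e.g. .bss) are expected to be zero,
    // which is already the case if guest_mem was zero-initialized.
    Ok(())
}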
With this minimal knowledge of ELF, let's proceed to analyze the content of vmlinux.bin.
Analyzing vmlinux
Let's analyze the content of vmlinux now. Some of the information we extract here will be crucial for future tasks. The readelf command is a powerful tool for dumping ELF-formatted files in a human-readable format. In this section, we will display the ELF Header (-h) and Program Headers (-l) of vmlinux.bin.
$ readelf -h -l vmlinux.bin
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1000000
Start of program headers: 64 (bytes into file)
Start of section headers: 21439000 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 5
Size of section headers: 64 (bytes)
Number of section headers: 36
Section header string table index: 35
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000
0x0000000000b72000 0x0000000000b72000 R E 0x200000
LOAD 0x0000000000e00000 0xffffffff81c00000 0x0000000001c00000
0x00000000000b0000 0x00000000000b0000 RW 0x200000
LOAD 0x0000000001000000 0x0000000000000000 0x0000000001cb0000
0x000000000001f658 0x000000000001f658 RW 0x200000
LOAD 0x00000000010d0000 0xffffffff81cd0000 0x0000000001cd0000
0x0000000000133000 0x0000000000413000 RWE 0x200000
NOTE 0x0000000000a031d4 0xffffffff818031d4 0x00000000018031d4
0x0000000000000024 0x0000000000000024 0x4
Section to Segment mapping:
Segment Sections...
00 .text .notes __ex_table .rodata .pci_fixup __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl __ksymtab_strings __param __modver
01 .data __bug_table .vvar
02 .data..percpu
03 .init.text .altinstr_aux .init.data .x86_cpu_dev.init .parainstructions .altinstructions .altinstr_replacement .iommu_table .apicdrivers .exit.text .smp_locks .data_nosave .bss .brk
04 .notes
From the ELF Header, we can see that the "Entry point address" (the e_entry value) represents the address (0x0100_0000) where this ELF process starts, which is essential information. This value is returned as the result of loading the kernel using rust-vmm/linux-loader, and it's also the value to set in the vCPU's RIP (instruction pointer) to start execution.
The e_phnum value in the ELF Header ("Number of program headers") is 5, which matches the number of Program Headers (Program Header Table entries). The Program Headers are displayed next, with the first four having a Type of LOAD and the last one being NOTE. Additionally, the first and fourth LOAD entries are marked as executable, indicating that executable code is present in these segments. The first entry is especially important as it likely corresponds to the entry point of the kernel's executable code.
Implementation in ToyVMM.
In ToyVMM, the loading of vmlinux is done within the load_kernel function in src/builder.rs. This function takes boot_config information, which includes the path to the kernel file, and the memory (guest_memory) allocated for the VM.

Within load_kernel, rust-vmm/linux-loader's Elf structure (imported as Loader) is used. This structure implements the KernelLoader trait, and its load function is responsible for loading ELF-formatted kernels into guest_memory. Here's an excerpt from the code:
use linux_loader::elf::Elf as Loader;

let entry_addr = Loader::load::<File, memory::GuestMemoryMmap>(
    guest_memory,
    None,
    &mut kernel_file,
    Some(GuestAddress(arch::x86_64::get_kernel_start())),
).map_err(StartVmError::KernelLoader)?;
Now, let's delve deeper into the implementation of linux-loader. In linux-loader, the KernelLoader trait is defined, and its definition looks like this:
/// Trait that specifies kernel image loading support.
pub trait KernelLoader {
    /// How to load a specific kernel image format into the guest memory.
    ///
    /// # Arguments
    ///
    /// * `guest_mem`: [`GuestMemory`] to load the kernel in.
    /// * `kernel_offset`: Usage varies between implementations.
    /// * `kernel_image`: Kernel image to be loaded.
    /// * `highmem_start_address`: Address where high memory starts.
    ///
    /// [`GuestMemory`]: https://docs.rs/vm-memory/latest/vm_memory/guest_memory/trait.GuestMemory.html
    fn load<F, M: GuestMemory>(
        guest_mem: &M,
        kernel_offset: Option<GuestAddress>,
        kernel_image: &mut F,
        highmem_start_address: Option<GuestAddress>,
    ) -> Result<KernelLoaderResult>
    where
        F: Read + Seek;
}
As inferred from the comments, this trait requires the load function to be implemented, which should load a specific kernel image format into the guest memory. In the case of linux-loader, there are x86_64 implementations that support loading ELF-format kernels, and it also has an implementation for bzImage-format kernels. However, for this discussion, let's focus on the ELF implementation.
The load function, which is implemented for ELF, performs the following steps:
- Extract the data from the beginning of the ELF file up to the size of the ELF header.
- Create an instance of the KernelLoaderResult struct named loader_result and store the value of the ELF header's e_entry field in its kernel_load member. This value represents the address where the system will initially transfer control, which is essentially the starting point of the process.
- Seek within the ELF file to the address where the program header table is located (determined by e_phoff), and then loop over all program headers (up to e_phnum) in the ELF file.
- While looping over the program headers, perform the following actions:
  - Seek within the ELF file to the location of the segment corresponding to the currently inspected program header (determined by p_offset).
  - Write the data from kernel_image (which has already been seeked to the beginning of the segment's data) into the guest memory, starting from the address calculated from mem_offset, for the size of the segment (p_filesz).
  - Update the value of kernel_end (the address of the end of the loaded segment in GuestMemory) and store the larger of loader_result.kernel_end and the newly calculated value in loader_result.kernel_end.
- After looping through all program headers, return loader_result as the final result.
This code essentially interprets and loads ELF files according to the ELF format. The returned KernelLoaderResult struct contains important information about the starting and ending positions of the kernel in GuestMemory, with the starting position being particularly crucial for use in Setup registers of vCPU.
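For reference, a small sketch of how the returned value is typically consumed afterwards (entry_addr is the KernelLoaderResult returned by the load call shown earlier; the surrounding code is an assumption, not an excerpt from ToyVMM):

// kernel_load holds the GuestAddress of the kernel entry point (from e_entry).
let kernel_entry = entry_addr.kernel_load;
// Later, this address is written into the vCPU's RIP register
// (see "Setup registers of vCPU").
println!("kernel entry point: {:#x}", kernel_entry.raw_value());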
References
Loading initrd
In this document, we will discuss loading and configuring initrd (initramfs) in order to boot a VM. When we mention initrd in the following sections, we are implicitly referring to initramfs. A detailed explanation of initramfs itself can be found in Overview of booting Linux, so please refer to that section for more information.
Loading initrd and setting up kernel header parameters
The function responsible for loading initrd is implemented as load_initrd. It takes two arguments: the memory allocated for the guest and a mutable reference to the File structure representing the opened initrd file (implementing the Read and Seek traits).
fn load_initrd<F>(
    vm_memory: &memory::GuestMemoryMmap,
    image: &mut F,
) -> std::result::Result<InitrdConfig, StartVmError>
where
    F: Read + Seek,
{
    let size: usize;
    // Get image size
    match image.seek(SeekFrom::End(0)) {
        Err(e) => return Err(StartVmError::InitrdRead(e)),
        Ok(0) => {
            return Err(StartVmError::InitrdRead(io::Error::new(
                io::ErrorKind::InvalidData,
                "Initrd image seek returned a size of zero",
            )))
        }
        Ok(s) => size = s as usize,
    };
    // Go back to the image start
    image.seek(SeekFrom::Start(0)).map_err(StartVmError::InitrdRead)?;
    // Get the target address
    let address = arch::initrd_load_addr(vm_memory, size)
        .map_err(|_| StartVmError::InitrdLoad)?;
    // Load the image into memory
    // - read_from is defined as a trait method of Bytes<A>,
    //   and GuestMemoryMmap implements this trait.
    vm_memory
        .read_from(GuestAddress(address), image, size)
        .map_err(|_| StartVmError::InitrdLoad)?;

    Ok(InitrdConfig {
        address: GuestAddress(address),
        size,
    })
}
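Before walking through the steps, here is a hedged usage sketch of how this function might be called; the initrd path and the surrounding error handling are illustrative assumptions, not ToyVMM's actual call site:

use std::fs::File;

// Hypothetical call site: open the initrd image and load it into guest memory.
let mut initrd_file = File::open("initrd.img").map_err(StartVmError::InitrdRead)?;
let initrd = load_initrd(&guest_memory, &mut initrd_file)?;
println!(
    "initrd loaded at {:#x} ({} bytes)",
    initrd.address.raw_value(),
    initrd.size
);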
The function performs the following steps:
- Retrieves the size of the initrd by seeking to the end of the file and then returning to the start.
- Calculates the target address in guest memory where the initrd should be loaded.
- Loads the contents of the initrd file into the specified guest memory address.
- Returns an InitrdConfig structure containing the guest memory address and size of the loaded initrd.
Once the initrd is loaded into memory, we need to configure the kernel's setup header. This header information is defined by the Boot Protocol. In ToyVMM, these settings are primarily configured in the configure_system function. The table below outlines the relevant settings, which are documented in the Boot Protocol:
Offset/Size | Name | Meaning | ToyVMM value |
---|---|---|---|
01FE/2 | boot_flag | 0xAA55 magic number | 0xaa55 |
0202/4 | header | Magic signature "HdrS" (0x53726448) | 0x5372_6448 |
0210/1 | type_of_loader | Boot loader identifier | 0xff (undefined) |
0218/4 | ramdisk_image | initrd load address (set by boot loader) | GUEST ADDRESS OF INITRD |
021C/4 | ramdisk_size | initrd size (set by boot loader) | SIZE OF INITRD |
0228/4 | cmd_line_ptr | 32-bit pointer to the kernel command line | 0x20000 |
0230/4 | kernel_alignment | Physical addr alignment required for kernel | 0x0100_0000 |
0238/4 | cmdline_size | Maximum size of the kernel command line | SIZE OF CMDLINE STRING |
These values are written to guest memory starting at address 0x7000. The 0x7000 address is also stored in the RSI vCPU register so it can be referenced during VM startup. For details on vCPU register setup, please refer to Setup registers of vCPU.
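As an illustration of what "writing these values starting at 0x7000" means, here is a hedged sketch that writes a few of the header fields directly through the vm-memory API. ToyVMM actually fills a boot_params structure and writes it in one go, so this function and its name are assumptions for illustration only; the field offsets (0x1FE, 0x202, 0x218, 0x21C) come from the Boot Protocol table above:

use vm_memory::{Bytes, GuestAddress};

const ZERO_PAGE_START: u64 = 0x7000;

// Sketch only: write a few of the setup-header fields listed in the table above.
fn write_boot_header_sketch(
    guest_memory: &memory::GuestMemoryMmap,
    initrd_addr: u32,
    initrd_size: u32,
) -> Result<(), vm_memory::GuestMemoryError> {
    // 0x1FE: boot_flag = 0xAA55
    guest_memory.write_obj(0xAA55u16, GuestAddress(ZERO_PAGE_START + 0x1FE))?;
    // 0x202: header magic "HdrS"
    guest_memory.write_obj(0x5372_6448u32, GuestAddress(ZERO_PAGE_START + 0x202))?;
    // 0x218 / 0x21C: ramdisk_image / ramdisk_size
    guest_memory.write_obj(initrd_addr, GuestAddress(ZERO_PAGE_START + 0x218))?;
    guest_memory.write_obj(initrd_size, GuestAddress(ZERO_PAGE_START + 0x21C))?;
    Ok(())
}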
Setup E820
Configuring the E820 for the Guest OS allows reporting of available memory regions to the OS and BootLoader. The settings for this are aligned with the implementation in Firecracker. The following code illustrates how the E820 entries are added based on the Guest memory configuration:
add_e820_entry(&mut params, 0, EBDA_START, E820_RAM)?;

let first_addr_past_32bits = GuestAddress(FIRST_ADDR_PAST_32BITS);
let end_32bit_gap_start = GuestAddress(MMIO_MEM_START);
let himem_start = GuestAddress(HIGH_MEMORY_START);
let last_addr = guest_mem.last_addr();
if last_addr < end_32bit_gap_start {
    add_e820_entry(
        &mut params,
        himem_start.raw_value() as u64,
        last_addr.unchecked_offset_from(himem_start) as u64 + 1,
        E820_RAM,
    )?;
} else {
    add_e820_entry(
        &mut params,
        himem_start.raw_value(),
        end_32bit_gap_start.unchecked_offset_from(himem_start),
        E820_RAM,
    )?;
    if last_addr > first_addr_past_32bits {
        add_e820_entry(
            &mut params,
            first_addr_past_32bits.raw_value(),
            last_addr.unchecked_offset_from(first_addr_past_32bits) + 1,
            E820_RAM,
        )?;
    }
}
To make sense of this code, it helps to understand the design of the entire guest address space, together with the code for starting a Guest VM in ToyVMM. The table below lists the current memory layout for the guest; please note that this may change in the future.
Guest Address | Contents | Note |
---|---|---|
0x0 - 0x9FBFF | E820 | |
0x7000 - 0x7FFF | Boot Params (Header) | ZERO_PAGE_START(=0x7000) |
0x9000 - 0x9FFF | PML4 | Now only 1 entry (8byte), maybe expand later |
0xA000 - 0xAFFF | PDPTE | Now only 1 entry (8byte), maybe expand later |
0xB000 - 0xBFFF | PDE | Now 512 entry (4096byte) |
0x20000 - | CMDLINE | Size depends on cmdline parameter len |
0x100000 | HIGH_MEMORY_START | |
0x100000 - 0x7FFFFFF | E820 | |
0x100000 - 0x20E3000 | vmlinux.bin | Size depends on vmlinux.bin's size |
0x6612000 - 0x7FFF834 | initrd.img | Size depends on initrd.img's size |
0x7FFFFFF | GuestMemory last address | based on (128 << 20 = 128MB = 0x8000000) - 1 |
0xD0000000 | MMIO_MEM_START(4GB - 768MB) | |
0xD0000000 - 0xFFFFFFFF | MMIO_MEM_START - FIRST_ADDR_PAST_32BIT | |
0x100000000 | FIRST_ADDR_PAST_32BIT (4GB~) |
Upon examining the code, you can see that the address range designed independently of the GuestMemory size (roughly 0x0 ~ HIGH_MEMORY_START) is always registered as "Usable" in the E820, from 0 to EBDA_START (0x9FBFF).
Subsequently, the range registered in the E820 changes depending on how much GuestMemory is allocated. In the current implementation, GuestMemory is set to reserve 128MB of memory by default, so the Guest Memory ranges from 0x0 to 0x7FF_FFFF. The vmlinux.bin content and initrd.img are mapped into this range.
In other words, the condition guest_mem.last_addr() = 0x7FF_FFFF < 0xD000_0000 = end_32bit_gap_start applies, so only the range HIGH_MEMORY_START ~ guest_mem.last_addr() is additionally registered. In the future, if the GuestMemory size exceeds 4GB, the ranges 0x10_0000 ~ 0xD000_0000 and 0x1_0000_0000 ~ guest_mem.last_addr() will be registered instead.
You will be able to confirm the console output when starting the VM shortly. Here, I've provided part of the output to show that the E820 entries you configured are registered:
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007ffffff] usable
References
- Linuxのブートシーケンスの基礎まとめ
- Linuxカーネルユーザ・管理者ガイド - 初期RAMdディスクを使用する
- initrd
- initramfs(initrd)のinitをbusyboxだけで書いてみた
- [initramfsとinitrdについて](https://blog.goo.ne.jp/pepolinux/e/4d1f4b6e0f5b5ed389f
Setup registers of vCPU
In this document, we will describe the configuration of vCPU registers. While registers are commonly discussed collectively, there are various types of registers, making it complex to determine how to set each of them. The content related to registers explained in this document focuses solely on the aspect of starting a virtual machine (VM). Additionally, as we want to boot the Guest OS in 64-bit mode, we will briefly explain some settings required for transitioning to 64-bit mode and the associated paging.
Setup vCPU general-purpose registers
Configuration of the vCPU's general-purpose registers can be done through the KVM set_regs
API. For this example, we will set the values of the registers as follows (detailed explanations of each register are omitted):
Register | Value | Meaning |
---|---|---|
RFLAGS | 2 | The bit at 0x02 must be set as a reserved bit |
RIP | KERNEL START ADDRESS (0x0100_0000 ) | Address of the entry point obtained from the ELF |
RSP | BOOT STACK POINTER (0x8ff0 ) | Address of the Stack Pointer used during boot |
RBP | BOOT STACK POINTER (0x8ff0 ) | Set to match RSP before boot processing |
RSI | boot_params ADDRESS (0x7000 ) | Address where boot_params information is stored |
The RIP should store the instruction start address when the vCPU is launched. In this case, we specify the address of the kernel's entry point. Since we plan to execute in 64-bit Long Mode, RIP's address will be treated as a virtual memory address; however, because paging is set up as an identity mapping, the virtual memory address is equal to the physical memory address. For RSP and RBP, we put the addresses necessary for the boot stack; these values can be taken from available memory. RSI should contain the address where the boot_params structure is stored. ToyVMM mimics Firecracker here, so the address values stored in RSP, RBP, and RSI are taken from Firecracker.
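A hedged sketch of what this looks like with kvm-ioctls and kvm-bindings (the variable names are assumptions, and the addresses are the example values from the table above):

use kvm_bindings::kvm_regs;

// `vcpu` is a kvm_ioctls::VcpuFd; `kernel_entry_addr` comes from the kernel loader.
let regs = kvm_regs {
    rflags: 0x2,            // reserved bit 1 must always be set
    rip: kernel_entry_addr, // e.g. 0x0100_0000, the kernel entry point
    rsp: 0x8ff0,            // boot stack pointer
    rbp: 0x8ff0,            // matches RSP before the boot process starts
    rsi: 0x7000,            // address where boot_params is stored (zero page)
    ..Default::default()
};
// Issues ioctl(vcpufd, KVM_SET_REGS, &regs) under the hood.
vcpu.set_regs(&regs).unwrap();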
Setup vCPU special registers
Configuration of vCPU special registers can be done through the KVM set_sregs
API. In this section, we will focus on the registers that are actually configured while briefly mentioning the background. The following explanations may introduce some unfamiliar terms. If you encounter such terms, please take the time to look them up.
IDT (Interrupt Descriptor Table)
The IDT (Interrupt Descriptor Table) is a data structure that holds information about interrupts and exceptions in Protected Mode and Long Mode. Originally, in Real Mode, there was the Interrupt Vector Table (IVT), which served the purpose of informing the CPU where the Interrupt Service Routines (ISRs) were located. In other words, it held handlers for each interrupt or exception, allowing the system to determine which handler to invoke when they occurred.
In Protected Mode and Long Mode, the address representation is different from Real Mode, so the IDT is a mechanism that provides the same capability adapted to these modes. The IDT is a table with up to 256 entries, and its address needs to be set in the IDTR register. When an interrupt occurs, the CPU references the IDT through the IDTR value and executes the specified interrupt handler.
According to the 64-bit Boot Protocol, interrupts should be set to "Disabled." Therefore, the IDT-related configuration is omitted in the ToyVMM (Firecracker) implementation, and we won't delve into the details of the IDT here.
Segmentation, GDT (Global Descriptor Table), LDT (Local Descriptor Table)
Before discussing GDT, let's briefly introduce segmentation. Memory segmentation is a memory management method where programs and data are managed in variable-sized blocks called segments. Segments are groups of information categorized by attributes in memory, and they are one of the memory management methods used to implement virtual memory and memory protection. In Linux, segmentation is used in conjunction with paging, assuming a flat memory model. For the rest of this discussion, we will proceed with this assumption.
The GDT (Global Descriptor Table) is a data structure used to manage memory segments. This structure closely resembles the IDT. The GDT is a table with multiple entries called Segment Descriptors, and the GDT's address needs to be set in the GDTR register. The entries in this table are accessed through a Segment Selector and describe which address range the segment covers, what operations are allowed in that region, and other details. Segment Selectors appear in the segment registers and in structures such as the Gate Descriptors of the IDT and the Task State Segment. We will omit detailed explanations here, so please research further if needed.
The LDT (Local Descriptor Table) is a data structure used to manage segments, similar to the GDT. However, an LDT can be held separately for each task or thread, which distinguishes it from the GDT. Having a separate LDT for each task allows segments to be shared among a task's own programs while keeping them separate from segments used by different tasks, enhancing isolation between tasks. Since the LDT is not relevant to this implementation, we will also skip detailed explanations about it.
GDT setup for 64-bit mode
As specified in the 64-bit Boot Protocol, in 64-bit mode each Segment Descriptor must be set up as a 4G flat segment, with the Code and Data Segments given the appropriate permissions. The Global Descriptor Table page notes that in 64-bit mode, base and limit are essentially ignored and each descriptor covers the entire linear address space, so apart from the flags the exact field values are not critical. Nonetheless, in this example the descriptors are explicitly set up as flat segments. Additionally, the Boot Protocol requires DS, ES, and SS to hold the same data segment value, and this is implemented accordingly.
Next, let's examine how these settings are configured in ToyVMM (which you can read as Firecracker). This is done in the configure_segments_and_sregs function. To make it easier to follow, some comments have been added:
#![allow(unused)]
fn main() {
fn configure_segments_and_sregs(sregs: &mut kvm_sregs, mem: &GuestMemoryMmap) -> Result<(), RegError> {
    let gdt_table: [u64; BOOT_GDT_MAX as usize] = [
        gdt::gdt_entry(0, 0, 0),            // NULL
        gdt::gdt_entry(0xa09b, 0, 0xfffff), // CODE
        gdt::gdt_entry(0xc093, 0, 0xfffff), // DATA
        gdt::gdt_entry(0x808b, 0, 0xfffff), // TSS
    ];
    // > https://wiki.osdev.org/Global_Descriptor_Table
    //
    //        55  52 47        40 39                          16 15                 0
    // CODE: 0b0..._1010_1111_1001_1011_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111
    //             <-f->     <-Access-><---------------------------> <----- limit ----->
    // - Flags  : 1010      => G(limit is in 4KiB), L(Long mode)
    // - Access : 1001_1011 => P(must 1), S(code/data type), E(executable), RW(readable/writable), A(CPU access allowed)
    // - 0xa09b of A,9,B represents above values
    //
    // DATA: 0b0..._1100_1111_1001_0011_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111
    // - Flags  : 1100      => G(limit is in 4KiB), DB(32-bit protected mode)
    // - Access : 1001_0011 => P(must 1), S(code/data type), RW(readable/writable), A(CPU access allowed)
    //
    // TSS
    // - Flags  : 1000      => G(limit is in 4KiB)
    // - Access : 1000_1011 => P(must 1), E(executable), RW(readable/writable), A(CPU access allowed)
    //   - TSS requires to support Intel VT
    let code_seg = gdt::kvm_segment_from_gdt(gdt_table[1], 1);
    let data_seg = gdt::kvm_segment_from_gdt(gdt_table[2], 2);
    let tss_seg = gdt::kvm_segment_from_gdt(gdt_table[3], 3);

    // Write segments
    write_gdt_table(&gdt_table[..], mem)?;
    sregs.gdt.base = BOOT_GDT_OFFSET as u64;
    sregs.gdt.limit = mem::size_of_val(&gdt_table) as u16 - 1;

    write_idt_value(0, mem)?;
    sregs.idt.base = BOOT_IDT_OFFSET as u64;
    sregs.idt.limit = mem::size_of::<u64>() as u16 - 1;

    sregs.cs = code_seg;
    sregs.ds = data_seg;
    sregs.es = data_seg;
    sregs.fs = data_seg;
    sregs.gs = data_seg;
    sregs.ss = data_seg;
    sregs.tr = tss_seg;

    // 64-bit protected mode
    sregs.cr0 |= X86_CR0_PE;
    sregs.efer |= EFER_LME | EFER_LMA;
    Ok(())
}
}
In the above code, a table with 4 entries is created as the GDT to set up. The first entry must be Null as required by the GDT. For the rest, it can be seen that settings for the CODE Segment, DATA Segment, and TSS Segment are made for the entire memory region. The TSS setting is done to meet the requirements of Intel VT, and it's not substantially used within the scope of this document.
Now, when creating this GDT, a function called gdt_entry
is called to create each entry. Here's the code for this function:
#![allow(unused)]
fn main() {
pub fn gdt_entry(flags: u16, base: u32, limit: u32) -> u64 {
    ((u64::from(base) & 0xff00_0000u64) << (56 - 24))
        | ((u64::from(flags) & 0x0000_f0ffu64) << 40)
        | ((u64::from(limit) & 0x000f_0000u64) << (48 - 16))
        | ((u64::from(base) & 0x00ff_ffffu64) << 16)
        | (u64::from(limit) & 0x0000_ffffu64)
}
}
For this function, all entries have 0x0 as the base and 0xFFFFF as the limit (a 20-bit limit; with the G flag set the granularity is 4KiB, so 2^20 × 4KiB = 4GiB), which makes it a flat segmentation. The flags argument for each entry is configured individually and corresponds to the values of the GDT's Flags and Access Byte. If you look at the comments in the code, you can see the value returned by gdt_entry for each entry and what that value represents when parsed. According to the comments, as required by the 64-bit Boot Protocol, the CODE Segment has Execute/Read permission and the "long mode (64-bit code segment)" flag, while the DATA Segment has Read/Write permission.
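If you want to convince yourself of the bit layout, the descriptor value can be checked in isolation. The snippet below is a standalone sketch: gdt_entry is copied verbatim from the listing above, and the assertion encodes the CODE line from the comments (Flags = 0xA, Access = 0x9B, limit = 0xFFFFF, base = 0).

pub fn gdt_entry(flags: u16, base: u32, limit: u32) -> u64 {
    ((u64::from(base) & 0xff00_0000u64) << (56 - 24))
        | ((u64::from(flags) & 0x0000_f0ffu64) << 40)
        | ((u64::from(limit) & 0x000f_0000u64) << (48 - 16))
        | ((u64::from(base) & 0x00ff_ffffu64) << 16)
        | (u64::from(limit) & 0x0000_ffffu64)
}

fn main() {
    let code = gdt_entry(0xa09b, 0, 0xfffff);
    // Same value as the CODE bit pattern spelled out in the comments above.
    assert_eq!(code, 0x00af_9b00_0000_ffff);
    println!("{:#066b}", code);
}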
The GDT created as mentioned above is written to GuestMemory using the write_gdt_table
function, and the starting address of that is stored in sregs.gdt.base
.
Regarding the subsequent IDT settings, as mentioned earlier, it appears to be disabled. Therefore, nothing is written to memory. However, the code decides on which address in GuestMemory to use and stores that address in sregs.idt.base
.
Continuing, other register values are set. As mentioned earlier, CS
is set with information about the CODE Segment, and DS
, ES
, SS
are set with information about the DATA Segment, while TR
is set with information about the TSS Segment. In the code above, FS
and GS
are also set with information about the DATA Segment, but these segment values may not need to be configured.
Finally, settings are made for CR0 and EFER registers, which will be explained later.
64-bit protected mode
The Long mode
is the native mode for x86_64 processors, offering several additional features compared to the legacy x86 mode. However, we won't go into the details of these additional features here. Long mode
consists of two submodes: 64-bit mode
and compatibility mode
.
To switch to 64-bit mode, you need to perform the following steps:
- Set CR4.PAE to enable Physical Address Extension (PAE).
- Create the Page Table and load the address of the top-level page table into the CR3 register.
- Set CR0.PG to enable Paging.
- Set EFER.LME to enable Long Mode.
Setting the values in the registers involves updating the corresponding fields in the kvm_sregs
structure and then configuring them using set_sregs
. The key part is creating the Page Table.
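To see where these pieces fit together, the following sketch shows an assumed wrapper (not necessarily ToyVMM's exact function) that reads the current special registers, applies the segment setup from the previous section and the page-table setup from the next section, and writes the result back via set_sregs.

use kvm_bindings::kvm_sregs;
use kvm_ioctls::VcpuFd;
use vm_memory::GuestMemoryMmap;

fn setup_sregs(vcpu: &VcpuFd, mem: &GuestMemoryMmap) {
    let mut sregs: kvm_sregs = vcpu.get_sregs().expect("KVM_GET_SREGS failed");
    // GDT/IDT, segment registers, CR0.PE and EFER.LME/LMA (previous section).
    configure_segments_and_sregs(&mut sregs, mem).expect("failed to configure segments");
    // CR3, CR4.PAE and CR0.PG via the identity-mapped page table (next section).
    setup_page_tables(&mut sregs, mem).expect("failed to set up page tables");
    vcpu.set_sregs(&sregs).expect("KVM_SET_SREGS failed");
}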
4-Level Page Table for entering 64-bit mode
The processes related to booting the Linux Kernel are categorized into several stages based on the available memory address space. Immediately after booting, the CPU runs in 16-bit Real Mode (x16 Real-Mode), where software sets up and works directly with physical memory addresses.
On the other hand, as many readers are aware, familiar operating systems like ours can be either 32-bit or 64-bit. These distinctions are made possible through a feature known as CPU mode switching, which transitions the CPU into modes called x32 Protected Mode
and x64 Long Mode
. Once switched to these modes, the CPU can only utilize virtual memory addresses.
Especially in the x64 CPU architecture, a 4-level page table is typically used to translate 64-bit virtual addresses into physical addresses. This means that before switching to x64 Long Mode
, a 4-level page table must be constructed and conveyed to the CPU. This process is implemented as part of the BootLoader's functionality.
Now, another crucial point to consider is that while the RIP value currently contains the physical address value indicating the kernel's entry point, when handling it in x64 Long Mode
, this address is used as a virtual address. Therefore, if this address were to be mapped to a different physical address, the OS would fail to boot.
Hence, at this stage, a simple page table is created where virtual memory addresses map to the same physical memory addresses. This is often referred to as Identity Mapping and addresses the issue mentioned above.
Note: It's important to note that the page table created by the BootLoader for x64 is a temporary requirement for executing the kernel. When we typically think of virtual memory addresses and page tables, we often associate them with user-space processes. However, the paging mechanism for user processes is implemented within the kernel and is configured when the kernel boots. Therefore, the mechanism for translating BootLoader's page table, whether it's Identity Mapping or not, has no impact on the paging mechanism for individual processes after the OS boots.
Page Table implementation in ToyVMM
Let's dive into the specific implementation of ToyVMM to understand the Page Table configuration better. This implementation closely follows that of Firecracker.
Let's briefly discuss the structure of the 4-Level Page Table. Essentially, at each level, there exists a table with its own designation:
- Level 4: Page Map Level 4 (PML4)
- Level 3: Page Directory Pointer Table (PDPT)
- Level 2: Page Directory Table (PDT)
- Level 1: Page Table (PT)

Each table can hold 512 entries, and each entry is 8 bytes (64 bits). The entire table is therefore 512 entries × 8 bytes = 4096 bytes, which conveniently fits into a single page (4KB).
The structure of each level's entry is as follows:
Source: x86 Initial Boot Sequence and OSdev/Paging
From the above, it seems that the setup should satisfy the following conditions:
- Consider the data within CR3, which serves as the address of PML4, as ranging from bits 12 to 32+ in order to design the PML4 address.
- To enable the PML4, set the 0th bit, and design the address of PDPT within the range of bits 12 to 32+.
- To utilize the layout of PDPTE page directory, do not set the 7th bit of PDPTE, and design the address of PD within the range of bits 12 to 32+.
- To allow 2MB pages in PDE, set the 7th bit and design the Physical Address within the range of bits 21 to 32+.
- In Firecracker, it appears that 2MiB paging is implemented without using Level 1 Page Tables (i.e., without using 4KiB pages). ToyVMM's implementation follows suit.
Now, let's extract the actual code from the implementation based on the above.
#![allow(unused)]
fn main() {
fn setup_page_tables(sregs: &mut kvm_sregs, mem: &GuestMemoryMmap) -> Result<(), RegError> {
    let boot_pml4_addr = GuestAddress(PML4_START);
    let boot_pdpte_addr = GuestAddress(PDPTE_START);
    let boot_pde_addr = GuestAddress(PDE_START);

    // Entry converting VA [0..512GB)
    mem.write_obj(boot_pdpte_addr.raw_value() as u64 | 0x03, boot_pml4_addr)
        .map_err(|_| RegError::WritePdpteAddress)?;
    // Entry covering VA [0..1GB)
    mem.write_obj(boot_pde_addr.raw_value() as u64 | 0x03, boot_pdpte_addr)
        .map_err(|_| RegError::WritePdpteAddress)?;
    // 512 2MB entries together covering VA [0..1GB).
    // Note we are assuming CPU support 2MB pages (/proc/cpuinfo has 'pse').
    for i in 0..512 {
        mem.write_obj((i << 21) + 0x83u64, boot_pde_addr.unchecked_add(i * 8))
            .map_err(|_| RegError::WritePdeAddress)?;
    }
    sregs.cr3 = boot_pml4_addr.raw_value() as u64;
    sregs.cr4 |= X86_CR4_PAE;
    sregs.cr0 |= X86_CR0_PG;
    Ok(())
}
}
As seen, the implementation is quite simple.
PML4_START
, PDPTE_START
, and PDE_START
have hardcoded address values, which are PML4_START=0x9000
, PDPTE_START=0xa000
, and PDE_START=0xb000
, respectively, meeting the requirements of the address designs mentioned above.
From the code, it's clear that there is only one PML4
and one PDPT
Table, and only the initial entry is set up. This is sufficient in this implementation because the kernel's address being translated by these page tables is 0x0100_0000
. These tables, specifically PML4
and PDPT
, will always look at the first entry (as described later), making this implementation suitable.
In PML4
, the information about the starting address of PDPT
is written by taking the logical OR of that address with 0x03
. Similarly, in PDPT
, the starting address of PD
is written by taking the logical OR of that address with 0x03
. The reason for using 0x03
here is to set the 0th and 1st bits of PML4E
and PDPTE
, which correspond to the R/W permission flag and the existence flag of that entry. These bits are essential in this case.
For PD, a loop is used to create 512 entries. For each loop index, the value obtained by shifting the index left by 21 bits, ORed with 0x83, is written at 8-byte (one entry) intervals from the start of PD's address. The reason for using 0x83 here is to set the R/W permission flag, the present flag, and the flag that treats the entry as a 2MB page frame. With this flag set, the value placed at bit 21 and above of the entry is used directly as the page-frame address (utilizing the layout of PDE 2MB page in the diagram). Therefore, for the PDEs, the entry at index 0 corresponds to address 0x0000_0000, the entry at index 1 corresponds to address 0x0020_0000 (2MB), and so on.
Now, let's check whether the kernel's address stored in EIP (0x0100_0000
) is correctly converted using the Page Table we just created! As mentioned earlier, when transitioning to x64 Long Mode
, this kernel address is treated as a 64-bit virtual address. Currently, ToyVMM (and Firecracker) loads the kernel at physical address 0x0100_0000
, and this value is stored in the eip
register.
Therefore, by treating 0x0100_0000
as a virtual address and using the conversion table mentioned above, we expect the result of the address translation to be 0x0100_0000
.
Let's calculate it explicitly. When converting a 64-bit virtual address with 4-Level Page Table, you split the lower 48 bits of the virtual address into groups of 9 + 9 + 9 + 9 + 12
bits each. These four groups of 9 bits are used as the index values for each Page table entry. You look up the layout of the identified entry in this way, then check the physical address of the next Page Table, and similarly determine the entry to be used in the next Page Table based on the physical address and virtual address. Continuing this process will eventually yield the desired physical address. Since Pages are at least 4KB in size, the address value is also in multiples of 4KB, so the final 12 bits of the virtual address serve as the offset (2^12 = 4KB
).
Let's remember that in this case, we have set the flag in PDE to treat it as a 2MB page frame. In this scenario, the result obtained from PDE is used directly as the physical address mapping. The 9 bits that are not used for PTE are treated as an offset, adding up to a total offset of 21 bits when combined with the original 12 bits. This 21-bit offset corresponds to the 2MB size. Similarly, when you set the flag in PDPTE, it is treated as a 1GB page frame.
Based on the above discussion, let's convert 0x0100_0000
. In binary representation for clarity, it is 0b0..._0000_0001_0000_0000_0000_0000_0000_0000
. Following the virtual address conversion method, it breaks down as follows:
Entry index for | Range of Virtual Address | Value |
---|---|---|
Page Map Level4 (PML4) | 47 ~ 39 bit | 0b0_0000_0000 |
Page Directory Pointer Table (PDPT) | 38 ~ 30 bit | 0b0_0000_0000 |
Page Directory Table (PDT) | 29 ~ 21 bit | 0b0_0000_1000 |
Page Tables (PT) | 20 ~ 12 bit | 0b0_0000_0000 |
- | 11 ~ 0 bit (offset) | 0b0_0000_0000 |
From this breakdown, you can see that the index values for PML4E
and PDPTE
are 0
, so you'll check the 64 bits directly from the beginning of each table. As implemented, PML4E
at index 0 contains the address of PDPT
, and PDPTE
at index 0 contains the address of PDT
. So, you follow this structure to reach PDT
.
Now, the PDT index value taken from the virtual address above is 0b0_0000_1000 (= 8), so you will check the 8th entry in the PDT. The 2MB page-frame address field of this entry also holds 0b0...0000_1000 (= 8). Placing this value at bit 21 and adding the 21-bit page offset (zero here) gives 0b1_0000_0000_0000_0000_0000_0000 = 0x100_0000 as the resulting physical address after conversion. This matches the input virtual address.
Hence, even after the conversion, the kernel's entry point will still be pointed to, and the kernel will begin execution in 64-bit long mode.
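The same calculation can be checked mechanically. The following standalone sketch (not ToyVMM code) extracts the 4-level page table indices and the 2MB-page offset for 0x0100_0000 and reproduces the identity-mapped result.

fn main() {
    let va: u64 = 0x0100_0000;
    let pml4_idx = (va >> 39) & 0x1ff; // bits 47..39
    let pdpt_idx = (va >> 30) & 0x1ff; // bits 38..30
    let pd_idx = (va >> 21) & 0x1ff;   // bits 29..21
    let offset_2mb = va & 0x1f_ffff;   // low 21 bits (2MB page offset)

    // Matches the breakdown table above: only the PDT index is non-zero.
    assert_eq!((pml4_idx, pdpt_idx, pd_idx, offset_2mb), (0, 0, 8, 0));

    // With the identity-mapped PDE written as (i << 21) | 0x83, the physical
    // address is simply (pd_idx << 21) + offset, i.e. 0x0100_0000 again.
    let pa = (pd_idx << 21) + offset_2mb;
    println!("PA = {:#x}", pa); // PA = 0x1000000
}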
It's worth noting that this Page Table, as designed in this implementation, effectively provides Identity Mapping over the range 0 ~ 2^30 - 1, in 2MB (2^21) units.
Note
Upon revisiting the Page Table created this time, it's important to note that there is only one entry each in the PML4 and the PDPT. As a result, the virtual address range that can be covered is at most 2^30 - 1 (1GB). Going beyond this range would require PDPT entries at indices other than 0 (and, beyond 512GB, PML4 entries other than 0).
Additionally, the 2MB page frame flag is enabled in the PD entries, so the lower 21 bits of the virtual address are treated as an offset. Furthermore, since each PDE's address field is derived from its index, this Page Table effectively provides Identity Mapping over the range 0 to 2^30 - 1 in 2MB units.
What to do next?
Up to this point, it's possible to start a Guest VM just by combining the discussed concepts. However, in this state, the Guest VM can be started but cannot be interacted with, leaving the setup somewhat incomplete. To ensure that the started Guest VM is configured as expected and for further interactions, we need to create an interface to control the Guest VM. In the next chapter, we will discuss the use of Serial
and how to implement it within ToyVMM to allow keyboard interactions after starting the Guest VM!
References
- The Linux/x86 Boot Protocol - 64-bit Boot Protocol
- Linux Insides: カーネル起動プロセス part4
- Global Descriptor Table (wiki)
- Interrupt Descriptor Table (wiki)
- Segmentation (wiki)
- Control register (wiki)
- Long mode (wiki)
- x86 initial boot sequence
- Virtual Memory - Intro to Paging Tables
- Writing an OS in Rust - Introduction to Paging
- Intel 64 and IA-32 Architectures Software Developer's Manual
Serial Console implementation
About Serial UART and ttyS0
UART (Universal Asynchronous Receiver/Transmitter) is an asynchronous serial communication standard used to connect computers and microcontrollers to peripheral devices. UART converts between parallel and serial signals, turning parallel input data into serial data and transmitting it to the other side over a communication line. Integrated circuits designed for this purpose were manufactured as the 8250 UART family, followed by various successor families.
Now, in this case, we are attempting to boot the Guest OS (Linux), and having a serial console is quite useful for debugging and other purposes. A serial console sends all console outputs of the Guest to the serial port. With the serial terminal properly configured, you can remotely monitor the system's boot status or log in to the system via the serial port. In this instance, we will use this method to check the state of a Guest VM running on ToyVMM and perform operations within the Guest.
To output console messages to the serial port, it is necessary to set console=ttyS0
as a kernel boot parameter. In the current implementation of ToyVMM, this value is provided as the default.
The challenge lies on the receiving side: the serial terminal. Since the I/O port addresses corresponding to the serial port are fixed, ToyVMM's layer will receive KVM_EXIT_IO for addresses in that range. In other words, it needs to properly handle the console output issued by the Guest OS as well as the other setup requests sent to those ports, and this can be achieved by emulating the UART device. Furthermore, if the emulated device forwards console output to standard output and feeds our standard input back into the Guest VM, then when starting the VM from ToyVMM we can watch the boot messages and operate the Guest from our local terminal.
In summary, we need to create something like the conceptual diagram below:
We will explain this in detail in the following sections.
Serial UART
For detailed information about the Serial UART, you can refer to the following resources by Lammert Bies and Wikibooks, which provide rich information:
The following figures are based on Lammert's document, with a brief explanation of each bit of each register. Although these diagrams were created by me personally while writing this document, they are attached in the hope that they will help readers understand the meaning of each register and bit. However, the meaning of each register and bit is not explained in this document, so please refer to the documents above for details:
Basically, UART operations are performed by manipulating the registers and bits shown above. In our case, we need to emulate this in software, and we plan to do this using rust-vmm/vm-superio. In the following sections, we'll briefly compare the implementation of rust-vmm/vm-superio with the above specifications.
Software Implementation of Serial Device using rust-vmm/vm-superio
Initial Value Settings/RW Implementation
Here, we will review the implementation of the serial device using rust-vmm/vm-superio while comparing it with the above specifications. I encourage you to obtain the code from the link provided and inspect it for yourself. The following content is based on version vm-superio-0.6.0
, so please note that it may have changed in the latest code.
First, let's organize some initial values for certain variables. rust-vmm/vm-superio was originally designed for VMM usage, so it initializes certain register values and doesn't anticipate changes.
Variable | DEFAULT VALUE | Meaning | REGISTER |
---|---|---|---|
baud_divisor_low | 0x0c | Baud rate 9600 bps | |
baud_divisor_high | 0x00 | Baud rate 9600 bps | |
interrupt_enable | 0x00 | No interrupts enabled | IER |
interrupt_identification | 0b0000_0001 | No pending interrupt | IIR |
line_control | 0b0000_0011 | 8-bit word length | LCR |
line_status | 0b0110_0000 | (1) | LSR |
modem_control | 0b0000_1000 | (2) | MCR |
modem_status | 0b1011_0000 | (3) | MSR |
scratch | 0b0000_0000 | - | SCR |
in_buffer | Vec::new() | Vector values (buffer) | - |
- (1) Setting THR empty-related bits. Setting these bits means that data can be received at any time. This represents the assumption that it will be used as a virtual device.
- (2) Many UARTs enable interrupts by default by setting Auxiliary Output 2 to 1.
- (3) Connected state and hardware data flow initialization.
Now, let's look at the processing when a write request is received. As a result of KVM_EXIT_IO
, we receive the address where IO occurred and the data to be written. On the ToyVMM side, we calculate the appropriate device (in this case, the Serial UART device) and its offset from the base address based on these values and call the write
function defined in vm-superio
. The following content is a simplified table representing the processing of Serial::write
. In general, it involves straightforward register value modification, with a few exceptions:
Variable | OFFSET(u8) | Additional Conditions | Write |
---|---|---|---|
DLAB_LOW_OFFSET | 0 | is_dlab_set = true | Modify self.baud_divisor_low |
DLAB_HIGH_OFFSET | 1 | is_dlab_set = true | Modify self.baud_divisor_high |
DATA_OFFSET | 0 | - (is_dlab_set = false) | (1) |
IER_OFFSET | 1 | - (is_dlab_set = false) | (2) |
LCR_OFFSET | 3 | - | Modify self.line_control |
MCR_OFFSET | 4 | - | Modify self.modem_control |
SCR_OFFSET | 7 | - | Modify self.scratch |
- (1) Depending on the current state of the Serial, we handle the cases where LOOP_BACK_MODE (MCR bit 4) is enabled and where it is not enabled.
  - If it is enabled, it simulates passing what is written to the transmit register directly to the receive register (loopback), which is not important in this context.
  - If it is not enabled, it writes the data to the output and, depending on the existing configuration, generates interrupts.
    - As shown in the table above, changing IIR through writes from outside is not supported, and its default value is 0b0000_0001.
    - If the THR-empty interrupt bit of IER is set, the corresponding THR-empty flag is set in IIR and an interrupt is triggered.
- (2) Among the bits of IER, only bits 0-3 are masked, and the result is written back to self.interrupt_enable.
Next, let's look at the processing when a read request is received. Similarly, we present the processing of Serial::read
in a simplified table. Unlike write, in the case of read, it mainly involves returning data as the result.
Variable | OFFSET(u8) | Additional Conditions | Read |
---|---|---|---|
DLAB_LOW_OFFSET | 0 | is_dlab_set = true | Read self.baud_divisor_low |
DLAB_HIGH_OFFSET | 1 | is_dlab_set = true | Read self.baud_divisor_high |
DATA_OFFSET | 0 | - (is_dlab_set = false) | (1) |
IER_OFFSET | 1 | - (is_dlab_set = false) | Read self.interrupt_enable |
IIR_OFFSET | 2 | - | (2) |
LCR_OFFSET | 3 | - | Read self.line_control |
MCR_OFFSET | 4 | - | Read self.modem_control |
LSR_OFFSET | 5 | - | Read self.line_status |
MSR_OFFSET | 6 | - | (3) |
SCR_OFFSET | 7 | - | Read self.scratch |
- (1) Reads data from the buffer held by the Serial structure. In the current implementation, this buffer is only filled by writes in loopback mode, so read operations on this region are not issued during the boot sequence of the OS.
- (2) Returns the result of self.interrupt_identification | 0b1100_0000 (FIFO enabled) and resets it to the default value.
- (3) The handling depends on whether the current state is loopback mode.
  - In the case of loopback, it adjusts appropriately (not important for this context).
  - In the case of non-loopback, it straightforwardly returns the value of self.modem_status.
Usage of rust-vmm/vm-superio in ToyVMM
In ToyVMM, we use rust-vmm/vm-superio to handle KVM_EXIT_IO
contents. Additionally, two things need to be considered:
- Outputting console output destined for the serial port to the standard output to allow monitoring of the boot sequence and internal state of the Guest VM.
- Passing the content of standard input to the Guest VM.
In the following sections, we'll go through each of these in order.
Outputting Console Output Destined for the Serial Port to Standard Output
To monitor the boot sequence and internal state of the Guest VM, we will redirect console output destined for the serial port to the standard output. "Console output destined for the serial port" corresponds to the case of KVM_EXIT_IO_OUT
where KVM_EXIT_IO
is issued for the "IO Port address for Serial". The code section below handles this:
#![allow(unused)]
fn main() {
...
loop {
    match vcpu.run() {
        Ok(run) => match run {
            ...
            VcpuExit::IoOut(addr, data) => {
                io_bus.write(addr as u64, data);
            }
            ...
        }
    }
}
...
}
Here, as a result of KVM_EXIT_IO_OUT
, we receive the address and data to be written. On the ToyVMM side, we simply call io_bus.write
with these values. The setup for this io_bus
is done as follows:
#![allow(unused)]
fn main() {
let mut io_bus = IoBus::new();
let com_evt_1_3 = EventFdTrigger::new(EventFd::new(libc::EFD_NONBLOCK).unwrap());
let stdio_serial = Arc::new(Mutex::new(SerialDevice {
    serial: serial::Serial::with_events(
        com_evt_1_3.try_clone().unwrap(),
        SerialEventsWrapper {
            buffer_read_event_fd: None,
        },
        Box::new(std::io::stdout()),
    ),
}));
io_bus.insert(stdio_serial.clone(), 0x3f8, 0x8).unwrap();
vm.fd().register_irqfd(&com_evt_1_3, 4).unwrap();
}
The setup above requires some explanation, so let's go through it step by step. In essence, it accomplishes the following:
- Initializes an I/O Bus represented by IoBus and initializes the eventfd for interrupts.
- Initializes the Serial Device. During initialization, we provide an eventfd for generating interrupts in the Guest and an FD (std::io::stdout()) for standard output.
- Registers the Serial Device we initialized with the IoBus. During registration, we specify 0x3f8 as the base and 0x8 as the range.
  - This means that the range of 0x8 starting from the base 0x3f8 represents the address space used by this Serial Device.
Handling the I/O Bus
The address value passed via KVM_EXIT_IO
becomes the value within the entire address space. On the other hand, the read/write
implementation in rust-vmm/vm-superio works based on an offset value from the Serial Device's base address. Therefore, there's a need for processing to bridge this gap.
You could simply calculate the offset, but in Firecracker, considering future extensibility (using I/O Ports for devices other than Serial), there's a Bus
structure representing the I/O Bus. This structure allows devices to be registered along with BusRange
(a structure representing the base address and address range for devices on the bus). Furthermore, when an I/O at a specific address occurs, the mechanism checks that address, retrieves the device registered in the corresponding address range, and performs I/O on that device using the offset from the base address.
For instance, the write
function is implemented as follows, where it retrieves the registered device and its offset based on the address information using the get_device
function, and then calls the write
function implemented in that device with the offset.
#![allow(unused)]
fn main() {
pub fn write(&self, addr: u64, data: &[u8]) -> bool {
    if let Some((offset, dev)) = self.get_device(addr) {
        // OK to unwrap as lock() failing is a serious error condition and should panic.
        dev.lock()
            .expect("Failed to acquire device lock")
            .write(offset, data);
        true
    } else {
        false
    }
}
}
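The get_device lookup itself is not shown in the excerpt above. As a rough idea of how such a lookup can work, here is a simplified sketch (illustrative types, not ToyVMM's exact definitions) that keys a BTreeMap by BusRange and resolves an absolute port address into an (offset, device) pair; the example resolves 0x3fb to offset 0x3, i.e. the LCR case described below.

use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct BusRange {
    base: u64,
    len: u64,
}

struct Bus<D> {
    devices: BTreeMap<BusRange, D>,
}

impl<D> Bus<D> {
    fn get_device(&self, addr: u64) -> Option<(u64, &D)> {
        // Find the registered range with the greatest base <= addr,
        // then check that it actually covers addr.
        self.devices
            .range(..=BusRange { base: addr, len: u64::MAX })
            .next_back()
            .filter(|(range, _)| addr < range.base + range.len)
            .map(|(range, dev)| (addr - range.base, dev))
    }
}

fn main() {
    let mut bus = Bus { devices: BTreeMap::new() };
    bus.devices.insert(BusRange { base: 0x3f8, len: 8 }, "serial");
    // 0x3fb falls inside [0x3f8, 0x400), so the offset is 0x3 (the LCR register).
    assert_eq!(bus.get_device(0x3fb), Some((0x3, &"serial")));
}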
Let's consider the Serial device as an example. As mentioned earlier, KVM_EXIT_IO_OUT
for the Serial device from the Guest VM occurs within an address range of 8 bytes with a base address of 0x3f8
. ToyVMM's IoBus also registers the Serial Device with the same address base and range. For example, when you trap an instruction that writes 0b1001_0011
to 0x3fb
as KVM_EXIT_IO_OUT
, it interprets this instruction as writing 0b1001_0011
to LCR
at the position 0x3
from the base address 0x3f8
.
Interrupt Notification to Guest VM via eventfd/irqfd
Now, let's discuss KVM and interrupts. We will reference some Linux source code, mainly from version v4.18
.
:warning: The following information is mainly based on source code and may not capture all the details of state transitions. If you find any inaccuracies, please let me know in the comments.
In rust-vmm/vm-superio, during Serial initialization, it requires an EventFd
as its first argument. This is a wrapper for eventfd in Linux. Eventfd allows inter-process and process-to-kernel event notifications.
Next is irqfd. irqfd is a mechanism based on eventfd that allows injecting interrupts into a VM. In simple terms, it's like having one end of eventfd held by KVM, and the other end's notifications are interpreted as interrupts to the Guest VM. This irqfd-based interrupt is meant to emulate interrupts from the external world to the Guest VM, which corresponds to regular system interrupts from peripheral devices in a typical system. Notifications in the reverse direction (from the Guest VM to the host) are handled using the ioeventfd mechanism, which we'll omit for now.
Let's examine how irqfd is connected to Guest VM interrupts by looking at the source code. When you perform an ioctl with KVM_IRQFD
against KVM, it goes through the KVM processing with the data passed to kvm_irqfd
and kvm_irqfd_assign
. In the kvm_irqfd_assign
function, an instance of the kvm_kernel_irqfd
structure is created. At this point, settings are made based on additional information passed during the ioctl. Particularly, the gsi
field in the kvm_kernel_irqfd
structure is set based on the value passed as an argument during the ioctl. This gsi
corresponds to the index of the interrupt table for the Guest, so when making the ioctl, you specify which interrupt table entry you want to use along with the eventfd. ToyVMM sets this up with a line like this:
#![allow(unused)] fn main() { vm.fd().register_irqfd(&com_evt_1_3, 4).unwrap(); }
This is defined as a method in the kvm_ioctl::VmFd
structure.
#![allow(unused)]
fn main() {
pub fn register_irqfd(&self, fd: &EventFd, gsi: u32) -> Result<()> {
    let irqfd = kvm_irqfd {
        fd: fd.as_raw_fd() as u32,
        gsi,
        ..Default::default()
    };
    // Safe because we know that our file is a VM fd, we know the kernel will only read
    // the correct amount of memory from our pointer, and we verify the return result.
    let ret = unsafe { ioctl_with_ref(self, KVM_IRQFD(), &irqfd) };
    if ret == 0 {
        Ok(())
    } else {
        Err(errno::Error::last())
    }
}
}
In other words, in the aforementioned setup, the eventfd (com_evt_1_3
) used by the Serial device has been configured with GSI=4 (the Guest VM's interrupt table index for the COM1 port). Therefore, any write
operation performed on com_evt_1_3
results in an interrupt being sent to the Guest VM as if it were generated from COM1. From the Guest's perspective, this means that an interrupt originated from the Serial device downstream of COM1, leading to the invocation of the Guest VM's COM1 interrupt handler.
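In code, the device side of this arrangement is nothing more than a write to that eventfd. The fragment below is a tiny sketch using vmm-sys-util's EventFd directly (ToyVMM wraps it in EventFdTrigger, but the effect is the same), assuming the eventfd has already been registered with register_irqfd as shown earlier.

use vmm_sys_util::eventfd::EventFd;

fn main() {
    // In the real setup the other end of this eventfd is handed to KVM through
    // vm.fd().register_irqfd(&com_evt, 4); after that, a single write() is all
    // the emulated device needs to do for the guest to see an IRQ 4 (COM1) interrupt.
    let com_evt = EventFd::new(libc::EFD_NONBLOCK).unwrap();
    com_evt.write(1).unwrap();
}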
Now, let's discuss the setup of the Guest-side Interrupt Table (GSI: Global System Interrupt Table) and how and when it's established. In short, these tables are set up by issuing an ioctl to KVM with KVM_CREATE_IRQCHIP
. This operation creates two interrupt controllers, the PIC
and IOAPIC
(internally, the kvm_pic_init
function handles PIC initialization, registers read/write ops, and sets it in kvm->arch.vpic
. Similarly, kvm_ioapic_init
initializes the IOAPIC, registers read/write ops, and sets it in kvm->arch.vioapic
). These hardware components, such as the PIC and IOAPIC, are implemented within KVM for the purpose of acceleration, so there's no need to emulate them separately. While you could delegate this task to qemu, we'll omit this detail here since we're not using it.
Furthermore, the kvm_setup_default_irq_routing
function sets up default IRQ routing. This process determines which handler will be invoked for each GSI-based interrupt. Let's take a closer look at the contents of kvm_setup_default_irq_routing
. This function calls kvm_set_irq_routing
, where the essential processing takes place. Here, a kvm_irq_routing_table
is created and populated with kvm_kernel_irq_routing_entry
structures that represent the mapping from GSI to IRQ.
The kvm_kernel_irq_routing_entry
structures are populated using a loop that iterates through a default_routing
array. Here's how default_routing
is defined along with related macros:
#define SELECT_PIC(irq) \
((irq) < 8 ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE)
#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
#define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
#define PIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip = { .irqchip = SELECT_PIC(irq), .pin = (irq) % 8 } }
#define ROUTING_ENTRY2(irq) \
IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
static const struct kvm_irq_routing_entry default_routing[] = {
ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
};
As you can see, IRQ numbers 0-15 are passed to ROUTING_ENTRY2
, and IRQ numbers 16-23 are passed to ROUTING_ENTRY1
. ROUTING_ENTRY2
calls both IOAPIC_ROUTING_ENTRY
and PIC_ROUTING_ENTRY
, while ROUTING_ENTRY1
calls IOAPIC_ROUTING_ENTRY
only, creating structures with the necessary information.
These structures are used to set up each .u.irqchip.irqchip
value (KVM_IRQCHIP_PIC_SLAVE
, KVM_IRQCHIP_PIC_MASTER
, KVM_IRQCHIP_IOAPIC
) appropriately in the kvm_set_routing_entry
function, depending on the IRQ. This function performs callbacks (kvm_set_pic_irq
, kvm_set_ioapic_irq
) and any necessary configurations when an interrupt occurs. We'll discuss these callbacks in more detail later.
int kvm_set_routing_entry(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
/* We can't check irqchip_in_kernel() here as some callers are
* currently initializing the irqchip. Other callers should therefore
* check kvm_arch_can_set_irq_routing() before calling this function.
*/
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
if (irqchip_split(kvm))
return -EINVAL;
e->irqchip.pin = ue->u.irqchip.pin;
switch (ue->u.irqchip.irqchip) {
case KVM_IRQCHIP_PIC_SLAVE:
e->irqchip.pin += PIC_NUM_PINS / 2;
/* fall through */
case KVM_IRQCHIP_PIC_MASTER:
if (ue->u.irqchip.pin >= PIC_NUM_PINS / 2)
return -EINVAL;
e->set = kvm_set_pic_irq;
break;
case KVM_IRQCHIP_IOAPIC:
if (ue->u.irqchip.pin >= KVM_IOAPIC_NUM_PINS)
return -EINVAL;
e->set = kvm_set_ioapic_irq;
break;
default:
return -EINVAL;
}
e->irqchip.irqchip = ue->u.irqchip.irqchip;
break;
...
Now, let's return to the discussion of irqfd
. Although not mentioned earlier, the kvm_irqfd_assign
function includes the init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup)
process, registering irqfd_wakeup
with &irqfd->wait->func
. This function is called when an interrupt occurs, and it invokes schedule_work(&irqfd->inject)
.
The inject
field is also initialized within the kvm_irqfd_assign
function, resulting in a call to the irqfd_inject
function. Inside irqfd_inject
, the kvm_set_irq
function is called.
The kvm_set_irq
function lists entries with the incoming IRQ number and calls their set
callbacks. This means that functions like kvm_set_pic_irq
and kvm_set_ioapic_irq
, as described earlier, will be called based on the routing information.
The following explanation goes into a little more depth on interrupt processing; since it is not strictly necessary for understanding ToyVMM, you may skip ahead to ToyVMM serial console.
Let's take a closer look at the kvm_set_pic_irq
handler, which is responsible for handling interrupts. While this discussion slightly deviates from the main topic, it's a good opportunity to explore it more thoroughly.
kvm_set_pic_irq
simply utilizes the kvm_pic_set_irq
function, passing the relevant parameters.
static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
{
struct kvm_pic *pic = kvm->arch.vpic;
return kvm_pic_set_irq(pic, e->irqchip.pin, irq_source_id, level);
}
Let's inspect the implementation of kvm_pic_set_irq
:
int kvm_pic_set_irq(struct kvm_pic *s, int irq, int irq_source_id, int level)
{
int ret, irq_level;
BUG_ON(irq < 0 || irq >= PIC_NUM_PINS);
pic_lock(s);
irq_level = __kvm_irq_line_state(&s->irq_states[irq],
irq_source_id, level);
ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, irq_level);
pic_update_irq(s);
trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
s->pics[irq >> 3].imr, ret == 0);
pic_unlock(s);
return ret;
}
In pic_set_irq1, the IRQ level is set; pic_update_irq then calls pic_irq_request and updates the kvm->arch.vpic structure.
/*
* raise irq to CPU if necessary. must be called every time the active
 * irq may change
*/
static void pic_update_irq(struct kvm_pic *s)
{
int irq2, irq;
irq2 = pic_get_irq(&s->pics[1]);
if (irq2 >= 0) {
/*
* if irq request by slave pic, signal master PIC
*/
pic_set_irq1(&s->pics[0], 2, 1);
pic_set_irq1(&s->pics[0], 2, 0);
}
irq = pic_get_irq(&s->pics[0]);
pic_irq_request(s->kvm, irq >= 0);
}
/*
* callback when PIC0 irq status changed
*/
static void pic_irq_request(struct kvm *kvm, int level)
{
struct kvm_pic *s = kvm->arch.vpic;
if (!s->output)
s->wakeup_needed = true;
s->output = level;
}
After that, kvm_pic_set_irq invokes the pic_unlock function.
This function is worth a closer look because, if the wakeup_needed field is true, it invokes the kvm_vcpu_kick function for the vCPU.
static void pic_unlock(struct kvm_pic *s)
__releases(&s->lock)
{
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu;
int i;
s->wakeup_needed = false;
spin_unlock(&s->lock);
if (wakeup) {
kvm_for_each_vcpu(i, vcpu, s->kvm) {
if (kvm_apic_accept_pic_intr(vcpu)) {
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
return;
}
}
}
}
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
int me;
int cpu = vcpu->cpu;
if (kvm_vcpu_wake_up(vcpu))
return;
me = get_cpu();
if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
put_cpu();
}
As a result of kvm_vcpu_kick invoking the smp_send_reschedule function, the native_smp_send_reschedule function is called.
static void native_smp_send_reschedule(int cpu)
{
if (unlikely(cpu_is_offline(cpu))) {
WARN_ON(1);
return;
}
apic->send_IPI(cpu, RESCHEDULE_VECTOR);
}
By invoking smp_send_reschedule
, an IPI (Inter-Processor Interrupt) is sent to another CPU, prompting it to reschedule. This results in an interrupt being inserted into the vCPU, causing a VMExit
. Consequently, the vCPU is scheduled when the interrupt is delivered.
Now, let's briefly review the process of how interrupts are inserted. When KVM_RUN
is executed, the following steps are performed (focusing solely on interrupt insertion, omitting other extensive processing):
kvm_arch_vcpu_ioctl_run
-> vcpu_run
-> vcpu_enter_guest
-> inject_pending_event
-> kvm_cpu_has_injectable_intr
Within kvm_cpu_has_injectable_intr
, the kvm_cpu_has_extint
function is called. In this case, it likely returns 1
, probably based on the value of s->output
set by pic_irq_request
.
Therefore, the following part of the inject_pending_event
function is reached:
} else if (kvm_cpu_has_injectable_intr(vcpu)) {
/*
* Because interrupts can be injected asynchronously, we are
* calling check_nested_events again here to avoid a race condition.
* See https://lkml.org/lkml/2014/7/2/60 for discussion about this
* proposal and current concerns. Perhaps we should be setting
* KVM_REQ_EVENT only on certain events and not unconditionally?
*/
if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
if (r != 0)
return r;
}
if (kvm_x86_ops->interrupt_allowed(vcpu)) {
kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
false);
kvm_x86_ops->set_irq(vcpu);
}
}
Finally, kvm_x86_ops->set_irq(vcpu)
is called, and this triggers the vmx_inject_irq
callback function. In this process, it inserts the interrupt by setting VMCS
(Virtual Machine Control Structure
) with VMX_ENTRY_INTR_INFO_FIELD
. While not elaborated on here, explaining VMCS
would require delving into hypervisor implementation details, which is beyond the scope of this discussion. It may be added as supplementary information in the documentation in the future.
In summary, this is the flow of interrupt processing using the PIC as an example.
ToyVMM serial console
Now, at this point, let's temporarily set aside the exploration of interrupts and return to discussing the implementation of ToyVMM. Considering the previous discussions, let's organize what processes are being executed within ToyVMM and what happens behind the scenes.
In ToyVMM, before performing register_irqfd
as mentioned earlier, a function called setup_irqchip
is actually executed. This function acts as a thin wrapper and internally makes calls to create_irq_chip
and create_pit2
.
#![allow(unused)]
fn main() {
#[cfg(target_arch = "x86_64")]
pub fn setup_irqchip(&self) -> Result<()> {
    self.fd.create_irq_chip().map_err(Error::VmSetup)?;
    let pit_config = kvm_pit_config {
        flags: KVM_PIT_SPEAKER_DUMMY,
        ..Default::default()
    };
    self.fd.create_pit2(pit_config).map_err(Error::VmSetup)
}
}
What's important here is the create_irq_chip
function. Internally, it calls the KVM_CREATE_IRQCHIP
API, as mentioned earlier, to initialize the interrupt controller and IRQ routing. Following this setup, register_irqfd(&com_evt_1_3, 4)
is executed on the configured Guest VM, which, as explained earlier, calls functions like kvm_irqfd_assign
to set up interrupt handlers. This completes the setup of interrupt-related configurations using the KVM API.
Now, let's revisit the interrupts coming from com_evt_1_3
. As previously discussed, one end of this eventfd is handed to KVM together with GSI=4
through register_irqfd
. Consequently, any write
issued from the other end is treated as an interrupt to the Guest VM as if it were sent to the COM1 port. On the other hand, the other end of com_evt_1_3
is passed to the Serial Device, making writes to the eventfd on the Serial Device side (occurring after processing through Serial::write
or through the invocation of Serial::enqueue_raw_bytes
) the actual interrupt triggers. In essence, this setup enables the Guest VM and the software-implemented Serial Device to interact in a manner similar to regular server and Serial Device communication.
Furthermore, to represent a Serial Console, we've configured stdout
as the destination for writes corresponding to the Serial Device's output in this case. Therefore, when handling KVM_EXIT_IO_OUT
and writing to THR, the data is passed to stdout
, resulting in console messages being output to standard output. This effectively realizes the desired Serial Console functionality.
Controlling the Guest VM via Standard Input
Finally, to manipulate the Guest VM using standard input, we want to reflect the contents of standard input into the Guest VM. The Serial
struct provided by rust-vmm/vm-superio offers a helper function called enqueue_raw_bytes
. This helper function allows us to send data to the Guest VM without needing to handle low-level register operations or interrupts explicitly, as the function handles these operations internally.
To achieve this, we need to read input from the program and pass it directly to this method. We can set up standard input in raw mode, and the main thread can poll it while waiting for input. When input is received, we can use enqueue_raw_bytes
to send it to the Guest VM. Since each vCPU of the Guest VM is executed in a separate thread, polling standard input in the main thread won't affect the processing of the Guest VM.
Here is a basic implementation:
#![allow(unused)]
fn main() {
let stdin_handle = io::stdin();
let stdin_lock = stdin_handle.lock();
stdin_lock
    .set_raw_mode()
    .expect("failed to set terminal raw mode");
let ctx: PollContext<Token> = PollContext::new().unwrap();
ctx.add(&exit_evt, Token::Exit).unwrap();
ctx.add(&stdin_lock, Token::Stdin).unwrap();
'poll: loop {
    let pollevents: PollEvents<Token> = ctx.wait().unwrap();
    let tokens: Vec<Token> = pollevents.iter_readable().map(|e| e.token()).collect();
    for &token in tokens.iter() {
        match token {
            Token::Exit => {
                println!("vcpu requested shutdown");
                break 'poll;
            }
            Token::Stdin => {
                let mut out = [0u8; 64];
                tx.send(true).unwrap();
                match stdin_lock.read_raw(&mut out[..]) {
                    Ok(0) => {
                        println!("eof!");
                    }
                    Ok(count) => {
                        stdio_serial
                            .lock()
                            .unwrap()
                            .serial
                            .enqueue_raw_bytes(&out[..count])
                            .expect("failed to enqueue bytes");
                    }
                    Err(e) => {
                        println!("error while reading stdin: {:?}", e);
                    }
                }
            }
            _ => {}
        }
    }
}
}
This is a straightforward implementation, but it achieves the desired functionality.
Check UART Request When Booting the Linux Kernel
In the previous sections, we discussed the software implementation of the Serial UART and how it's used internally within ToyVMM. While it works effectively, it's important to examine the UART communication during the Linux Kernel boot process.
Fortunately, due to the VMM's architecture, we need to handle KVM_EXIT_IO
, which allows us to intercept all requests sent to the serial port by injecting debug code into this handling process.
I won't go into detail about the code inserted for debugging purposes here, as it's quite straightforward to insert debug code at the appropriate locations. Instead, I'll provide annotations in three specific formats to make it clear and understandable when looking at requests made to the 0x3f8 (COM1)
register during OS startup.
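For reference, debug code of this kind can be as small as the following sketch. The helper name, the register-name table, and the output format are illustrative and do not match ToyVMM's actual debug code or the exact annotation format defined next.

const SERIAL_BASE: u64 = 0x3f8;
const REG_NAMES: [&str; 8] = [
    "THR/RBR/DLL", "IER/DLM", "IIR/FCR", "LCR", "MCR", "LSR", "MSR", "SCR",
];

// Call this from the KVM_EXIT_IO handling shown earlier to dump serial-port accesses.
fn dump_serial_io(is_write: bool, addr: u64, data: &[u8]) {
    if (SERIAL_BASE..SERIAL_BASE + 8).contains(&addr) && !data.is_empty() {
        let offset = (addr - SERIAL_BASE) as usize;
        let dir = if is_write { "w" } else { "r" };
        println!("{}({}) = {:08b}", dir, REG_NAMES[offset], data[0]);
    }
}

fn main() {
    // e.g. the guest writing 0b1001_0011 to 0x3fb would be logged as: w(LCR) = 10010011
    dump_serial_io(true, 0x3fb, &[0b1001_0011]);
}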
[Format 1 - Read]
r($register) = $data
- Description
- r = Read operation
- $register = The register corresponding to the offset calculated using the device's address (0x3f8)
- $data = Data read from $register
- Description = Explanation
[Format 2 - Write]
w($register = $data)
- Description
- w = Write operation
- $register = The register corresponding to the offset calculated using the device's address (0x3f8)
- $data = Data to be written to $register
- Description = Explanation
[Format 3 - Write (character)]
w(THR = $data = 0xYY) -> 'CHAR'
- w(THR ...) = Write operation to THR
- $data = Binary data to be written to $register
- 0xYY = $data converted to hexadecimal
- 'CHAR' = 0xYY converted to a character based on the ASCII code table
Now, the following is a somewhat lengthy representation of requests made to the 0x3f8 (COM1)
register during OS startup, formatted according to the above annotations:
# Initial setup, configuring baud rate, etc.
w(IER = 0)
w(LCR = 10010011)
- DLAB = 1 (DLAB: DLL and DLM accessible)
- Break signal = 0 (Break signal disabled)
- Parity = 010 (No parity)
- Stop bits = 0 (1 stop bit)
- Data bits = 11 (8 data bits)
w(DLL = 00001100)
w(DLM = 0)
- DLL = 0x0C, DLM = 0x00 (Speed = 9600 bps)
w(LCR = 00010011)
- DLAB = 0 (DLAB: RBR, THR, and IER accessible)
- Break signal = 0 (Break signal disabled)
- Parity = 010 (No parity)
- Stop bits = 0 (1 stop bit)
- Data bits = 11 (8 data bits)
w(FCR = 0)
w(MCR = 00000001)
- Reserved = 00
- Autoflow control = 0
- Loopback mode = 0
- Auxiliary output 2 = 0
- Auxiliary output 1 = 0
- Request to send = 0
- Data terminal ready = 1
r(IER) = 0
w(IER = 0)
# From here, the actual console output is being received through the serial port,
# and write operations (in this case, writing to stdout) are happening.
# Checking the content of r(LSR) to determine whether to write the next character
r(LSR) = 01100000
- Errornous data in FIFO = 0
- THR is empty, and line is idle = 1
- THR is empty = 1
- Break signal received = 0
- Framing error = 0
- Parity error = 0
- Overrun error = 0
- Data available = 0
- Bits 5 and 6 are related to character transmission and used by UART
- If bits 5 and 6 are set, it means UART is ready to accept a new character
- Bit 6 = '1' means that all characters have been transmitted
- Bit 5 = '1' means that UART is capable of receiving more characters
# Since the next character write is accepted here, we write the character we want to output.
w(THR = 01011011 = 0x5b) -> '['
# Following this, the same pattern repeats:
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
# The above operation repeats 3 more times.
# ...
r(LSR) = 01100000
w(THR = 00110000 = 0x30) -> '0'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e) -> '.'
r(LSR) = 01100000
w(THR = 00110000 = 0x30) -> '0'
# The above operation repeats 5 more times
r(LSR) = 01100000
w(THR = 01011101 = 0x5d) -> ']'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 01001100 = 0x4c) -> 'L'
r(LSR) = 01100000
w(THR = 01101001 = 0x69) -> 'i'
r(LSR) = 01100000
w(THR = 01101110 = 0x6e) -> 'n'
r(LSR) = 01100000
w(THR = 01110101 = 0x75) -> 'u'
r(LSR) = 01100000
w(THR = 01111000 = 0x78) -> 'x'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 01110110 = 0x76) -> 'v'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01110010 = 0x72) -> 'r'
r(LSR) = 01100000
w(THR = 01110011 = 0x73) -> 's'
r(LSR) = 01100000
w(THR = 01101001 = 0x69) -> 'i'
r(LSR) = 01100000
w(THR = 01101111 = 0x6f) -> 'o'
r(LSR) = 01100000
w(THR = 01101110 = 0x6e) -> 'n'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e)-> '.'
r(LSR) = 01100000
w(THR = 00110001 = 0x31) -> '1'
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e) -> '.'
r(LSR) = 01100000
w(THR = 00110001 = 0x31) -> '1'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 00101000 = 0x28) -> '('
r(LSR) = 01100000
w(THR = 01000000 = 0x40) -> '@'
r(LSR) = 01100000
w(THR = 00110101 = 0x35) -> '5'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01100100 = 0x64) -> 'd'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 00111001 = 0x39) -> '9'
r(LSR) = 01100000
w(THR = 00111001 = 0x39) -> '9'
r(LSR) = 01100000
w(THR = 01100100 = 0x64) -> 'd'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 00101001 = 0x29) -> ')'
# Concatenating the output, we get the following line:
[ 0.000000] Linux version 4.14.174 (@57edebb99db7)
# This matches the content of the first line output during OS boot.
Of course, Linux Kernel startup UART requests continue beyond this, and more complex operations take place. However, I won't delve further into these requests here. If you are interested, I encourage you to explore them in detail.
Reference
- Serial UART information
- Wikibooks : Serial Programming / 8250 UART Programming
- rust-vmm/vm-superio
- Interrupt request(PC architecture)
- Linux Serial Console
- KVM IRQFD Implementation
- KVMのなかみ(KVM internals)
- ハイパーバイザーの作り方~ちゃんと理解する仮想化技術~ 第2回 intel VT-xの概要とメモリ仮想化
- External Interrupts in the x86 system. Part1. Interrupt controller evolution
ToyVMM Implementation
To summarize our previous discussions, we have successfully created a minimal VMM with essential features. This ToyVMM is a straightforward VMM with the following functionalities:
- It can boot a Guest OS using vmlinuz and initrd.
- After the Guest OS boots, it can handle input and output as a Serial Terminal, allowing you to monitor and interact with the Guest's state.
Run the Linux Kernel!
Let's actually boot the Linux Kernel.
First, prepare `vmlinux.bin` and `initrd.img`, and place them in the root directory of the ToyVMM repository. You can download `vmlinux.bin` as follows:
# Download vmlinux.bin
wget https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin
cp vmlinux.bin <TOYVMM WORKING DIRECTORY>
For `initrd.img`, you can create it using marcov/firecracker-initrd, which includes an Alpine Linux root filesystem:
# Create initrd.img
# Using marcov/firecracker-initrd (https://github.com/marcov/firecracker-initrd)
git clone https://github.com/marcov/firecracker-initrd.git
cd firecracker-initrd
bash ./build.sh
# After the above commands, the initrd.img file will be located in build/initrd.img.
# Please move it to the working directory of ToyVMM.
cp build/initrd.img <TOYVMM WORKING DIRECTORY>
With these preparations completed, let's launch the Guest VM:
$ make run_linux
Here, we'll skip the output of the boot sequence, which will be displayed on the standard output. Once the boot is complete, you'll see the Alpine Linux screen, and it will prompt you for login credentials. You can log in using `root` as both the username and password:
Welcome to Alpine Linux 3.15
Kernel 4.14.174 on an x86_64 (ttyS0)
(none) login: root
Password:
Welcome to Alpine!
The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.
You can set up the system with the command: setup-alpine
You may change this message by editing /etc/motd.
login[1058]: root login on 'ttyS0'
(none):~#
Great! You have successfully booted the Guest VM and can operate it. You can also execute commands within the Guest VM. For example, running the basic `ls` command results in the following output:
(none):~# ls -lat /
total 0
drwx------ 3 root root 80 Sep 23 06:44 root
drwxr-xr-x 5 root root 200 Sep 23 06:44 run
drwxr-xr-x 19 root root 400 Sep 23 06:44 .
drwxr-xr-x 19 root root 400 Sep 23 06:44 ..
drwxr-xr-x 7 root root 2120 Sep 23 06:44 dev
dr-xr-xr-x 12 root root 0 Sep 23 06:44 sys
dr-xr-xr-x 55 root root 0 Sep 23 06:44 proc
drwxr-xr-x 2 root root 1780 May 7 00:55 bin
drwxr-xr-x 26 root root 1040 May 7 00:55 etc
lrwxrwxrwx 1 root root 10 May 7 00:55 init -> /sbin/init
drwxr-xr-x 2 root root 3460 May 7 00:55 sbin
drwxr-xr-x 10 root root 700 May 7 00:55 lib
drwxr-xr-x 9 root root 180 May 7 00:54 usr
drwxr-xr-x 2 root root 40 May 7 00:54 home
drwxr-xr-x 5 root root 100 May 7 00:54 media
drwxr-xr-x 2 root root 40 May 7 00:54 mnt
drwxr-xr-x 2 root root 40 May 7 00:54 opt
drwxr-xr-x 2 root root 40 May 7 00:54 srv
drwxr-xr-x 12 root root 260 May 7 00:54 var
drwxrwxrwt 2 root root 40 May 7 00:54 tmp
Well done! At this point, you have created a minimal VMM. However, there are some limitations:
- It can only be operated through a serial console; networking is not available until virtio-net is implemented.
- Block devices are not available until virtio-blk is implemented.
- Handling PCI devices is not yet supported.
The creation of ToyVMM served several personal objectives, including:
- Deepening the understanding of virtualization.
- Gaining a better understanding of virtio.
- Learning about PCI passthrough:
- Exploring technologies like VFIO.
- Understanding peripheral technologies like mdev, libvfio, and VDPA.
While we have completed the creation of a minimal VMM, the direction you take it from here is up to you. ToyVMM is a great starting point, and you can choose to extend it in various ways. If you're reading this and you're an enthusiastic geek, I encourage you to give it a try! And if possible, I'd be delighted to receive feedback on ToyVMM.
Virtual I/O Device (Virtio)
In this section, as the second step of VMM, we will delve into the implementation of Virtio.
The Virtio specification is maintained by OASIS.
The latest version appears to be version 1.2, which was published on July 1, 2022.
The terminology related to Virtio in this document follows the definitions in version 1.2, so if you want to confirm the meaning of specific terms, please refer to the OASIS page.
In this section, we will cover fundamental knowledge about Virtio and its implementation.
Additionally, as concrete implementations based on Virtio, we will work on `virtio-net` and `virtio-blk`.
Once `virtio-net` is implemented, you will be able to communicate with a booted Guest VM over the network, enabling SSH login and internet connectivity.
Moreover, with `virtio-blk` implemented, you will be able to handle block devices, meaning disk I/O, within the virtual machine.
With these two functionalities in place, you will have most of the requirements for a typical "virtual machine", making the Virtio implementation highly significant.
The topics in this section are structured as follows:
This document is based on the following commit numbers:
- ToyVMM: 58cf0f68a561ee34a28ae4e73481f397f2690b51
- Firecracker: cfd4063620cfde8ab6be87ad0212ea1e05344f5c
From this point onwards, we will explain the implemented source code using file names.
Here are the actual file paths referred to by the file names mentioned in the explanations:
| File Name Mentioned in Explanations | File Path |
|---|---|
| mod.rs | src/vmm/src/devices/virtio/mod.rs |
| queue.rs | src/vmm/src/devices/virtio/queue.rs |
| mmio.rs | src/vmm/src/devices/virtio/mmio.rs |
| status.rs | src/vmm/src/devices/virtio/status.rs |
| virtio_device.rs | src/vmm/src/devices/virtio/virtio_device.rs |
| net.rs | src/vmm/src/devices/virtio/net.rs |
| block.rs | src/vmm/src/devices/virtio/block.rs |
Please note that these file paths may change in the future as source code is updated.
Consider these file paths to be associated with the commit numbers mentioned earlier.
Virtio
What is Virtual I/O Device (Virtio)?
Virtio is a specification for virtual devices standardized by OASIS. It provides a virtual device interface for efficient data transfer and communication between the host system and guest systems (virtual machines).
Based on Virtio, there are implementations like `virtio-net` (virtual network device) and `virtio-blk` (virtual block device). As their names suggest, these implementations mimic the behavior of network and block devices, allowing guest operating systems to perform I/O operations as if they were using real network and block devices.
Virtio is compatible with major virtualization technologies such as KVM and is supported by a wide range of guest operating systems, including Linux, Windows, and FreeBSD. As a result, it has become an industry-standard specification widely adopted in virtualization environments.
Why is Virtio Necessary?
When it comes to generating I/O within a virtual machine (VM), how should the hypervisor handle it? First and foremost, the hypervisor needs to make the VM recognize the device at VM startup, which requires emulating various PCI devices. Additionally, when I/O is generated for those devices, the hypervisor must mimic the behavior of those devices. A well-known and widely used software for this kind of hardware emulation is QEMU.
The advantage of fully emulating real hardware using software is that you can use device drivers designed for physical hardware that come with the guest OS. However, this approach incurs significant overhead because it involves a VMExit each time an I/O request occurs within the VM. The hypervisor must perform emulation and then return control to the VM.
One framework proposed and standardized to reduce the overhead of virtualization in device I/O is `Virtio`. Virtio establishes a queue structure called the `Virtqueue` in shared memory between the hypervisor and the VM. This mechanism minimizes the number of mode transitions caused by VMExit. However, Virtio requires device drivers implemented specifically for it, whose availability depends on the kernel build configuration; many modern OS distributions come with Virtio device drivers installed by default.
Components of Virtio
Virtio mainly consists of the following components:
- Virtqueue: A queue built in shared memory between the host and guest for performing data input and output.
- Virtio driver: The guest-side driver for Virtio-based devices.
- Virtio device: The host-side emulation of devices.
As depicted in the diagram, I/O requests initiated by the guest pass through Virtqueue to the host and responses are also mediated through Virtqueue back to the guest. Detailed behaviors and implementations will be discussed in the next section.
Additionally, when exposing Virtio devices to guests, it's possible to choose specific transport methods. The two common methods are "Virtio Over PCI Bus", which uses PCI (Peripheral Component Interconnect), and "Virtio Over MMIO Bus", which uses MMIO (Memory-Mapped I/O). Guests have corresponding drivers such as `virtio-pci` and `virtio-mmio` for the specific transports, along with Virtio drivers (`virtio-net`, `virtio-blk`) for the particular device types.
In ToyVMM, we'll initially adopt `virtio-mmio` as the transport and proceed to implement a Network device as `virtio-net` and a Block device as `virtio-blk`.
References
- OASIS
- Virtio: An I/O virtualization framework for Linux
- virtio: Towards a De-Facto Standard For Virtual I/O Devices
- Introduction to VirtIO
- Virtio on Linux
Implementing Virtio in ToyVMM
In this section, we will delve into the implementation of Virtio in ToyVMM. There are three main topics covered in this discussion:
- Implementation of Virtqueue
- Implementation of lightweight notifications between the guest and host using irqfd and ioeventfd
- Implementation of the MMIO Transport
As mentioned in the previous section, ToyVMM initially utilizes MMIO as the transport method for Virtio. Before diving into the detailed explanation, let's start by illustrating an overview of the Virtio implementation in this context.
By referring to this diagram as needed, we can better understand the explanations and code that follow.
Implementation Approach
In the implementation, the VirtioDevice itself is represented as an abstract concept (a `Trait`), and concrete devices like `Net` and `Block` are implemented to fulfill this trait. Similarly, since there are multiple transport options such as `PCI` and `MMIO` (with MMIO being used here), transport is also treated as an abstraction, with `MMIO` as the concrete implementation in this case.
Finally, we need to implement Virtqueues. While the number and usage of Virtqueues can vary depending on the implemented Virtio device, the structure of the Virtqueue remains consistent. We'll provide more details on this later.
Virtqueue Implementation
Virtqueue Deep-Dive
Before delving into the implementation of the Virtqueue, let's gain a more detailed understanding of the typical Virtqueue structure. A Virtqueue is composed of three main elements: the `Descriptor Table`, the `Available Ring`, and the `Used Ring`. Here's what each of them does:

- `Descriptor Table`: A table that holds entries (`Descriptor`) storing information such as the address and size of the data to be shared between the Host and the Guest.
- `Available Ring`: A structure that manages the `Descriptor`s holding information the Guest wants to notify the Host about.
- `Used Ring`: A structure that manages the `Descriptor`s holding information the Host wants to notify the Guest about.

We'll explore each of these elements in detail while understanding how they cooperate. First, the `Descriptor Table` gathers data structures like the `Descriptor` shown below (as indicated in the diagram):
struct virtq_desc {
    /* Address (guest-physical). */
    le64 addr;
    /* Length. */
    le32 len;

    /* This marks a buffer as continuing via the next field. */
    #define VIRTQ_DESC_F_NEXT       1
    /* This marks a buffer as device write-only (otherwise device read-only). */
    #define VIRTQ_DESC_F_WRITE      2
    /* This means the buffer contains a list of buffer descriptors. */
    #define VIRTQ_DESC_F_INDIRECT   4

    /* The flags as indicated above. */
    le16 flags;
    /* Next field if flags & NEXT */
    le16 next;
};
A `Descriptor` represents a piece of data to be transferred and the location of the next descriptor in the chain.

- `addr` is the actual address of the data (a guest physical address), and the length of the data can be obtained from `len`.
- `flags` indicates whether there is a next descriptor, whether the buffer is write-only, and so on.
- `next` holds the index of the next descriptor, allowing the Descriptor Table to be traversed as a chain.

Usually, one Descriptor is used to send one piece of data. Note, however, that even if you allocate contiguous memory in the virtual address space, the underlying physical addresses may not be contiguous; in that case, one Descriptor is needed per physical page, and the data is sent as a chain of multiple Descriptors.
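To make the chaining concrete, here is a small, self-contained sketch (not ToyVMM code) that walks a descriptor chain by following `next` while `VIRTQ_DESC_F_NEXT` is set. The `Vec` stands in for the guest's Descriptor Table, which a real VMM would instead read from guest memory:

```rust
const VIRTQ_DESC_F_NEXT: u16 = 1;

#[derive(Clone, Copy)]
struct Desc {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

// Sums the data length described by the chain starting at descriptor `head`.
fn chain_total_len(table: &[Desc], head: u16) -> u32 {
    let mut total = 0u32;
    let mut idx = head as usize;
    loop {
        let d = table[idx];
        total += d.len;
        if d.flags & VIRTQ_DESC_F_NEXT == 0 {
            break;
        }
        idx = d.next as usize;
    }
    total
}

fn main() {
    // A 2-descriptor chain (0 -> 1) describing data that spans two physical pages.
    let table = vec![
        Desc { addr: 0x1000, len: 4096, flags: VIRTQ_DESC_F_NEXT, next: 1 },
        Desc { addr: 0x3000, len: 512, flags: 0, next: 0 },
    ];
    assert_eq!(chain_total_len(&table, 0), 4608);
}
```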
Next is the `Available Ring`, which is structured as follows:
struct virtq_avail {
    #define VIRTQ_AVAIL_F_NO_INTERRUPT  1
    le16 flags;
    le16 idx;
    le16 ring[ /* Queue Size */ ];
    le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
}
The `Available Ring` is used to specify the Descriptors that the guest wants to notify the host about.

- `flags` is used for temporary interrupt suppression and other purposes.
- `idx` points to the index of the newest entry in the `ring`.
- `ring` is the ring body itself, holding Descriptor indices.
- `used_event` is also used for interrupt suppression but is only necessary if `VIRTIO_F_EVENT_IDX` is enabled.
The guest writes the location of the actual data to a Descriptor and writes that Descriptor's index into the `Available Ring` (specifically into the `ring` field). It's important to note that the host needs to remember the index of the last `ring` entry it processed; the guest only provides the current state of the ring and the latest index (the `idx` field). Therefore, the host compares the last processed entry number with the latest index (`idx`); if they differ, there are new entries to process. The host then reads the ring entries covering that difference, retrieves the Descriptor indices, obtains the data from the Descriptors, and processes it according to the specific device's implementation.
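A tiny sketch of this bookkeeping: the host computes the number of unprocessed entries with wrapping 16-bit arithmetic, since the driver lets `idx` wrap around naturally (illustrative, not ToyVMM's exact code):

```rust
// Number of new Available Ring entries between the host's last processed index
// and the driver-written `idx`.
fn new_entries(last_processed: u16, driver_idx: u16) -> u16 {
    driver_idx.wrapping_sub(last_processed)
}

fn main() {
    assert_eq!(new_entries(3, 5), 2);     // two new descriptors to process
    assert_eq!(new_entries(65534, 1), 3); // the index wraps around at u16::MAX
}
```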
Finally, there's the `Used Ring`, which is the reverse of the `Available Ring`: it is used to specify Descriptors that the host wants to notify the guest about.
struct virtq_used {
    #define VIRTQ_USED_F_NO_NOTIFY  1
    le16 flags;
    le16 idx;
    struct virtq_used_elem ring[ /* Queue Size */ ];
    le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */
};

/* le32 is used here for ids for padding reasons. */
struct virtq_used_elem {
    /* Index of start of used descriptor chain. */
    le32 id;
    /*
     * The number of bytes written into the device writable portion of
     * the buffer described by the descriptor chain.
     */
    le32 len;
};
Source: 2.7.8 The Virtqueue Used Ring
- `flags` is used for temporary interrupt suppression and other purposes.
- `idx` points to the index of the newest entry in the `ring`.
- `ring` is the ring body itself; each entry is a `virtq_used_elem` structure.
- `avail_event` is also used for interrupt suppression but is only necessary if `VIRTIO_F_EVENT_IDX` is enabled.
When returning notifications from the host to the guest, a Descriptor is used to indicate the location of the reply data. The index of that Descriptor is stored in the `ring` of the `Used Ring`, and the `idx` value is updated to point to the newest entry in the `ring` before control is returned to the guest.
However, unlike the `Available Ring`, the elements of the `ring` are accompanied by a structure (`virtq_used_elem`).

- `id` is the index of the head of the used descriptor chain (the same descriptor index that the driver previously placed in `virtq_avail.ring`).
- `len` stores information such as the total amount of I/O performed by the host on the descriptor chain referred to by `id`.
The following diagram summarizes what has been explained so far.
This concludes the necessary knowledge for implementing Virtqueue.
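As a quick sanity check of these layouts, the byte sizes of the three regions follow directly from the structures above (a sketch assuming the split-virtqueue layout with the `used_event`/`avail_event` fields included):

```rust
// Region sizes for a given queue size: 16-byte descriptors, u16 avail entries,
// and 8-byte used elements, each ring prefixed by flags/idx and followed by an event field.
fn virtqueue_region_sizes(queue_size: u64) -> (u64, u64, u64) {
    let desc_table = 16 * queue_size;
    let avail_ring = 6 + 2 * queue_size;
    let used_ring = 6 + 8 * queue_size;
    (desc_table, avail_ring, used_ring)
}

fn main() {
    // For a 256-entry queue (the size negotiated in the MMIO trace later in this chapter),
    // the Descriptor Table alone occupies 4096 bytes (0x1000).
    assert_eq!(virtqueue_region_sizes(256), (4096, 518, 2054));
}
```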
Virtqueue implementation on ToyVMM
In ToyVMM, the implementation of Virtqueues is located in `queue.rs`.
The concrete addresses of the `Descriptor Table`, `Available Ring`, and `Used Ring` in guest memory are configured through interactions with the guest-side device driver during guest VM startup. We'll look at this exchange when we examine actual I/O requests from the guest; for now, just keep this fact in mind.
ToyVMM needs to perform address accesses based on these starting addresses and the Virtio specification. In essence, it operates on a per-descriptor basis (where each descriptor points to the address of the actual data), and while processing data it updates the `Available Ring` and `Used Ring`.
Now, let's explore the code. The `Queue` structure in ToyVMM represents a Virtqueue and is defined as follows:
#[derive(Clone)]
/// A virtio queue's parameters
pub struct Queue {
/// The maximal size in elements offered by the device
max_size: u16,
/// The queue size in elements the driver selected
pub size: u16,
/// Indicates if the queue is finished with configuration
pub ready: bool,
/// Guest physical address of descriptor table
pub desc_table: GuestAddress,
/// Guest physical address of the available ring
pub avail_ring: GuestAddress,
/// Guest physical address of the used ring
pub used_ring: GuestAddress,
next_avail: Wrapping<u16>,
next_used: Wrapping<u16>,
}
In this structure, you can see the fields for the `Descriptor Table`, `Available Ring`, and `Used Ring`, which hold the corresponding addresses in guest memory. These addresses are initialized during interactions with the guest's device driver, as mentioned earlier. From ToyVMM's perspective, they are merely physical memory addresses belonging to the guest, and ToyVMM accesses them according to the Virtio specification.
Now, let's delve into address access using the code. ToyVMM hides the sequence of operations that fetches a `Descriptor` based on the state of the `Available Ring` behind a Virtqueue iterator, so actual device implementations that use Virtqueues contain code structured like this:
// queue: 'Queue' struct
// desc_chain: 'DescriptorChain' struct
for desc_chain in queue.iter(mem) {
    // 'desc_chain' contains the 'addr', 'len', 'flags', and 'next' values of the descriptor.
    // Behind the iteration, data related to 'queue.avail_ring' is adjusted.
}
Let's explain what's happening behind the scenes of this iteration. First, the `iter` function is implemented on the `Queue` structure and creates an `AvailIter` structure. To create this `AvailIter`, it fetches the latest `idx` of the `Available Ring` from `GuestMemory`, starting from the `avail_ring` base address.
/// A consuming iterator over all available descriptor chain heads offered by the driver
pub fn iter<'a, 'b>(&'b mut self, mem: &'a GuestMemoryMmap) -> AvailIter<'a, 'b> {
... // validation codes
let queue_size = self.actual_size();
let avail_ring = self.avail_ring;
// Access the 'idx' fields of available ring
// skip 2 bytes (= u16 / 'flags' member) from avail_ring address
// and get 2 bytes (= u16 / 'idx' member representing the newest index of avail_ring) from that address.
let index_addr = mem.checked_offset(avail_ring, 2).unwrap();
let last_index: u16 = mem.read_obj(index_addr).unwrap();
AvailIter {
mem,
desc_table: self.desc_table,
avail_ring: self.avail_ring,
next_index: self.next_avail,
last_index: Wrapping(last_index),
queue_size,
next_avail: &mut self.next_avail,
}
}
As you can see, the `iter` function returns an `AvailIter`. Inside the `next` function of `AvailIter`, if `self.next_index` equals `self.last_index`, it returns `None`, indicating the end of iteration. The `next_index` tracks the index values that have already been processed.
Inside the `next` function, the element pointed to by `self.next_index` in the `Available Ring` (which corresponds to a descriptor index) is retrieved. The `DescriptorChain::checked_new` function is called with this retrieved value, and its result is returned as the element of the iteration.
The `checked_new` function calculates the address of the element pointed to by the index value, accesses it, and extracts information such as the descriptor's `addr`, `len`, `flags`, and `next`. Finally, it constructs a `DescriptorChain` structure from this information.
fn checked_new(
mem: &GuestMemoryMmap,
desc_table: GuestAddress,
queue_size: u16,
index: u16,
) -> Option<DescriptorChain> {
if index >= queue_size {
return None;
}
// The size of each element of the descriptor table is 16 bytes
// - le64 addr = 8 bytes
// - le32 len = 4 bytes
// - le16 flags = 2 bytes
// - le16 next = 2 bytes
// So, the calculation of the offset of the address
// indicated by desc_index is 'index * 16'
let desc_head = match mem.checked_offset(desc_table, (index as usize) * 16) {
Some(a) => a,
None => return None,
};
// These reads can't fail unless Guest memory is hopelessly broken
let addr = GuestAddress(mem.read_obj(desc_head).unwrap());
mem.checked_offset(desc_head, 16)?;
let len = mem.read_obj(desc_head.unchecked_add(8)).unwrap();
let flags: u16 = mem.read_obj(desc_head.unchecked_add(12)).unwrap();
let next: u16 = mem.read_obj(desc_head.unchecked_add(14)).unwrap();
let chain = DescriptorChain {
mem,
desc_table,
queue_size,
ttl: queue_size,
index,
addr,
len,
flags,
next,
};
if chain.is_valid() {
Some(chain)
} else {
None
}
}
Since the `next` function returns a `DescriptorChain`, code inside the loop accesses the descriptor's information through the relevant members of the `DescriptorChain` structure.
Although I have not mentioned it much so far, the `Used Ring` also needs to be updated on the host side. However, this is not a difficult process and can be implemented by defining a function like the following and calling it as necessary:
/// Puts an available descriptor head into the used ring for use by the guest
pub fn add_used(&mut self, mem: &GuestMemoryMmap, desc_index: u16, len: u32) {
    if desc_index >= self.actual_size() {
        // TODO error
        return;
    }
    let used_ring = self.used_ring;
    let next_used = (self.next_used.0 % self.actual_size()) as u64;
    // virtq_used structure has 4 byte entry before `ring` fields, so skip 4 byte.
    // And each ring entry has 8 bytes, so skip 8 * index.
    let used_elem = used_ring.unchecked_add(4 + next_used * 8);
    // write the descriptor index to virtq_used_elem.id
    mem.write_obj(desc_index, used_elem).unwrap();
    // write the data length to the virtq_used_elem.len
    mem.write_obj(len, used_elem.unchecked_add(4)).unwrap();
    // increment the used index that is the last processed in host side.
    self.next_used += Wrapping(1);
    // This fence ensures all descriptor writes are visible before the index update is.
    fence(Ordering::Release);
    mem.write_obj(self.next_used.0, used_ring.unchecked_add(2))
        .unwrap();
}
Please remember this underlying mechanism as it forms the basis for the actual I/O implementation in the virtio-net and virtio-blk devices, which we will explain in the following sections.
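Putting the two pieces together, the typical consumption pattern looks roughly like the following. This is a hedged sketch, not ToyVMM's actual device code: it assumes the `Queue`/`DescriptorChain` types shown above and vm-memory's `GuestMemoryMmap`, and `signal_used_queue` stands in for writing to the device's irqfd (explained in the next section).

```rust
fn drain_queue(queue: &mut Queue, mem: &GuestMemoryMmap) {
    let mut used: Vec<(u16, u32)> = Vec::new();
    for desc_chain in queue.iter(mem) {
        // A real device would read `desc_chain.len` bytes from `desc_chain.addr` here
        // (following `next` while VIRTQ_DESC_F_NEXT is set) and emulate the I/O.
        used.push((desc_chain.index, desc_chain.len));
    }
    // Return the processed descriptors to the guest through the Used Ring ...
    for (index, len) in used {
        queue.add_used(mem, index, len);
    }
    // ... and then notify the guest, e.g. signal_used_queue() -> write to the irqfd.
}
```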
Implementation of Lightweight Communication between Guest and Host using irqfd and ioeventfd
So far, we've discussed the implementation of Virtqueues, but now let's delve into another crucial aspect related to Virtqueues: the "notification" mechanism required for communication between the host and guest when using Virtqueues. In Virtio, after filling Virtqueues with data, a mechanism for notifying the host from the guest or the guest from the host becomes necessary. Understanding how this notification is realized is essential.
In essence, notifications between the guest and host are achieved using the `ioeventfd` and `irqfd` mechanisms, both of which are provided through the KVM API.
First, for notifications from the guest to the host, we use `ioeventfd`. `ioeventfd` transforms memory writes caused by PIO/MMIO operations in the guest VM into eventfd notifications. `KVM_IOEVENTFD` is the KVM API used here: you provide the eventfd to be notified and the MMIO address, and writes to that MMIO address are converted into notifications to the specified eventfd. As a result, software on the host side (in this case, ToyVMM) can receive notifications from the guest via the eventfd. This mechanism enhances event notification efficiency, making it more lightweight than traditional polling or interrupt-handler-based methods.
Next, for notifications from the host to the guest, we use the `irqfd` mechanism. Although we've used `irqfd` in previous implementations as well, here we employ `KVM_IRQFD`. By passing the eventfd to be used for notifications and the IRQ number corresponding to the desired guest IRQ to `KVM_IRQFD`, writes to that eventfd on the ToyVMM side are converted into hardware interrupts for the specified guest IRQ.
Using the notification features based on the KVM API mentioned above, we achieve communication between the guest and host. Specific usage details will be discussed in the following section, "Implementation of MMIO Transport."
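To make this concrete, here is a minimal sketch of the two registrations using the rust-vmm crates (`kvm-ioctls` and `vmm-sys-util`). The MMIO address and IRQ number are example values, and the exact API surface may differ between crate versions:

```rust
use kvm_ioctls::{IoEventAddress, Kvm, NoDatamatch};
use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let kvm = Kvm::new()?;
    let vm = kvm.create_vm()?;
    // An in-kernel irqchip is required before KVM_IRQFD can be used (x86_64).
    vm.create_irq_chip()?;

    // Guest -> Host: writes to this MMIO address (e.g. QueueNotify at base + 0x50)
    // are turned into notifications on `queue_evt` instead of being handled in userspace.
    let queue_evt = EventFd::new(EFD_NONBLOCK)?;
    vm.register_ioevent(&queue_evt, &IoEventAddress::Mmio(0xd000_0050), NoDatamatch)?;

    // Host -> Guest: writing to `irq_evt` injects an interrupt on the guest's GSI 5.
    let irq_evt = EventFd::new(EFD_NONBLOCK)?;
    vm.register_irqfd(&irq_evt, 5)?;

    // A ToyVMM-like flow would now wait on `queue_evt` (via epoll) in the device
    // emulation thread and write to `irq_evt` after filling the Used Ring.
    Ok(())
}
```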
Implementation of MMIO Transport
Now, let's delve into the implementation of MMIO Transport.
Virtio Over MMIO provides the official specification for MMIO Transport, and you may want to refer to it as needed.
MMIO Transport is a method that can be easily used in virtual environments without PCI support, and it appears that Firecracker primarily supports MMIO Transport. MMIO Transport operates by performing device operations through Read/Write to specific memory regions.
MMIO Transport does not utilize a generic Device discovery mechanism like PCI. Therefore, Device discovery in MMIO involves providing information about the memory-mapped device's location and interrupt position to the guest OS, as described in MMIO Device Discovery. While the official documentation suggests documenting this in the Device Tree, an alternative method is to embed it in the kernel's command-line arguments during startup, as documented here. This latter method is used in this context because ToyVMM can dynamically adjust these command-line arguments at guest VM startup.
With this method, you can provide information to the guest VM to perform device discovery. The following format is used to describe MMIO device discovery:
(format)
virtio_mmio.device=<size>@<baseaddr>:<irq>
(example)
virtio_mmio.device=4K@0xd0000000:5
In this case, the guest VM uses the address `0xd0000000` as the base and performs Read/Write at predetermined offsets (register positions) to initialize and configure the device. The details are described in the MMIO Device Register Layout.
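As a small aside, the command-line fragment shown above is easy to generate programmatically. The helper below only illustrates the format; it is not ToyVMM's actual cmdline-building code:

```rust
// Builds the "virtio_mmio.device=<size>@<baseaddr>:<irq>" fragment for one device.
fn virtio_mmio_cmdline_fragment(size_kib: u64, base_addr: u64, irq: u32) -> String {
    format!("virtio_mmio.device={}K@0x{:x}:{}", size_kib, base_addr, irq)
}

fn main() {
    assert_eq!(
        virtio_mmio_cmdline_fragment(4, 0xd000_0000, 5),
        "virtio_mmio.device=4K@0xd0000000:5"
    );
}
```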
From ToyVMM's perspective, it's crucial to ensure that processing according to the specification is carried out for each register when Read/Write operations occur. This is the core of the MMIO Transport implementation. Typically, I/O to the MMIO region is handled as `KVM_EXIT_MMIO`, and handling this correctly allows initialization and configuration of the device through this flow.
On the other hand, notifications for I/O via the Virtqueue, which we've discussed so far, are managed using ioeventfd and irqfd. In MMIO Transport, writing to offset `0x050` from the base address corresponds to notifying the device that "data to be processed is present in the buffer." In other words, by associating this address with an eventfd using `KVM_IOEVENTFD`, and then writing code that handles the desired Virtio device's processing through this eventfd, Notify events (writes to MMIO) generated by the guest can be delivered directly as notifications on the eventfd.
Additionally, since IRQ information is provided to the guest via the command line, the guest sets itself up to invoke the corresponding device handler when an interrupt occurs on the specified IRQ. Conversely, when you want to trigger an interrupt in the Virtio device presented to the Guest VM (that is, when you want to delegate processing to the Guest VM), you do so through this IRQ. Concretely, by creating an eventfd and registering it with the IRQ using `KVM_IRQFD`, you can trigger interrupts by writing to this eventfd from the ToyVMM side.
The figure below summarizes the above discussion, and ToyVMM implements this scheme:
MMIO Transport - Implementation Corresponding to MMIO Device Register Layout
The implementation of MMIO Transport can be found in the `mmio.rs` file. The `MmioTransport` structure, like the `I/O Bus` we discussed in the Serial Console implementation, implements the `BusDevice` trait and is registered within the `Bus` structure. It allows handling MMIO I/O in response to `KVM_EXIT_MMIO`, similar to how `VcpuExit::IoIn` and `VcpuExit::IoOut` are processed.
Therefore, `MmioTransport` implements the `read` and `write` functions required by `BusDevice`. These functions contain the specific logic for handling register accesses, which is essentially the device emulation process. Naturally, this implementation follows the MMIO Device Register Layout specification. Here's a portion of the `read` function as an example:
impl BusDevice for MmioTransport {
// OASIS: MMIO Device Register Layout
#[allow(clippy::bool_to_int_with_if)]
fn read(&mut self, offset: u64, data: &mut [u8]) {
match offset {
0x00..=0xff if data.len() == 4 => {
let v = match offset {
0x0 => MMIO_MAGIC_VALUE,
0x04 => MMIO_VERSION,
0x08 => self.device.device_type(),
0x0c => VENDOR_ID,
0x10 => {
self.device.features(self.features_select)
| if self.features_select == 1 { 0x1 } else { 0x0 }
}
0x34 => self.with_queue(0, |q| q.get_max_size() as u32),
0x44 => self.with_queue(0, |q| q.ready as u32),
0x60 => self.interrupt_status.load(Ordering::SeqCst) as u32,
0x70 => self.driver_status,
0xfc => self.config_generation,
_ => {
println!("unknown virtio mmio register read: 0x{:x}", offset);
return;
}
};
LittleEndian::write_u32(data, v);
}
0x100..=0xfff => self.device.read_config(offset - 0x100, data),
_ => {
// WARN!
println!(
"invalid virtio mmio read: 0x{:x}:0x{:x}",
offset,
data.len()
);
}
}
}
A detailed explanation of the `read` and `write` functions would be quite extensive, so I'll skip it here. However, as you can see, the implementation is straightforward, and you can easily understand it by examining the source code alongside the specification.
The processing in this part handles the initialization and configuration of Virtio devices during the initialization sequence driven by the device driver in the Guest VM at startup. By adding debugging code here, you can observe the device initialization sequence initiated by the guest.
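The trace shown in the next section was produced with debug output of roughly the following shape. This is an illustrative sketch, not ToyVMM's exact code:

```rust
// Prints one line per trapped MMIO read, mirroring the "MmioRead: addr = ..., data = [...]"
// lines in the trace below; a matching helper would exist for writes.
fn debug_mmio_read(base: u64, offset: u64, data: &[u8]) {
    println!("MmioRead: addr = 0x{:x}, data = {:?}", base + offset, data);
}

fn main() {
    debug_mmio_read(0xd000_0000, 0x0, &[118, 105, 114, 116]);
    // -> MmioRead: addr = 0xd0000000, data = [118, 105, 114, 116]
}
```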
Observing MMIO Device Initialization during Guest VM Startup Sequence
The Guest OS includes the Virtio device driver (on the guest side), which is expected to perform Virtio device initialization according to the specification. In this MMIO-based implementation, the guest VM performs R/W operations during startup on the MMIO range of the Virtio device, based on the MMIO range information specified in the kernel command line. As the hypervisor, ToyVMM must trap these accesses (they are the points where VMExit occurs during the guest VM's boot) and handle them appropriately, and debugging code can easily be added there for observation.
Before we examine the specific processing flow, let's organize the initialization process according to the specification. In the following discussion, we will use the initialization of a Virtio network device (`virtio-net`) as an example. The device initialization specification is split across three parts: the MMIO-specific Initialization And Device Operation, the General Initialization And Device Operation, and the device-specific initialization. Combining these, the flow is generally as follows:
- Read the Magic Number. Read Device ID, Device Type, Vendor ID, and other information.
- Reset the device.
- Set the ACKNOWLEDGE status bit.
- Read the Device feature bits and set the device with feature bits that the OS and driver can interpret.
- Set the FEATURES_OK status bit.
- Perform device-specific settings (detecting and configuring Virtqueues, writing configuration).
- Set the DRIVER_OK status bit, and at this point, the device is in a live state.
Now, keeping this in mind, let's examine the actual processing during the guest VM's startup. Below is an example where I have added debugging code, and the output generated by the debugging code is annotated with comments for explanation.
# Read the magic number from offset 0x00
# Since it's Little Endian, the original values are 116, 114, 105, 118
# 116(10) = 74(16)
# 114(10) = 72(16)
# 105(10) = 69(16)
# 118(10) = 76(16)
# Therefore, it's 0x74726976 (magic number)
MmioRead: addr = 0xd0000000, data = [118, 105, 114, 116]
# Read device id (0x02) from offset 0x04
MmioRead: addr = 0xd0000004, data = [2, 0, 0, 0]
# Read device type (net = 0x01) from offset 0x08
MmioRead: addr = 0xd0000008, data = [1, 0, 0, 0]
# Read vendor id (virtio vendor id = 0x00) from offset 0x0c
MmioRead: addr = 0xd000000c, data = [0, 0, 0, 0]
# This part is Device Initialization Phase (3.1.1 Driver Requirements: Device Initialization)
# Write 0 to offset 0x70 (= Status) to reset the device status
MmioWrite: addr = 0xd0000070, data = [0, 0, 0, 0]
# Read from offset 0x70, and now the device is reset
MmioRead: addr = 0xd0000070, data = [0, 0, 0, 0]
# Write 0x01 to offset 0x70 (= Status) to set the ACKNOWLEDGE bit
MmioWrite: addr = 0xd0000070, data = [1, 0, 0, 0]
# Read from offset 0x70, perhaps for confirmation?
MmioRead: addr = 0xd0000070, data = [1, 0, 0, 0]
# Add 0x02 = DRIVER(2) to offset 0x70 (= Status), so the Status is now 0x03
MmioWrite: addr = 0xd0000070, data = [3, 0, 0, 0]
# Processing for Device/Driver Feature bits.
# The device provides its own feature set (feature bits),
# and the driver reads it and instructs the device which feature subset to accept.
#
# First, the Virtio device driver in the Guest OS reads the feature bits
# Write 0x01 to offset 0x14 (= DeviceFeatureSel) to select the next operation
MmioWrite: addr = 0xd0000014, data = [1, 0, 0, 0]
# Read from offset 0x10 (= DeviceFeatures).
# It reads DeviceFeatures bit, and it returns (DeviceFeatureSel * 32) + 31 bits.
# Now DeviceFeatureSel=1, so it returns the DeviceFeatures bits of 64~32 bits.
# For virtio-net, DeviceFeatures = 0x0000_0001_0000_4c83 (64-bit),
# so it returns 0x0000_0001 in Little Endian.
MmioRead: addr = 0xd0000010, data = [1, 0, 0, 0]
# Write 0x00 to offset 0x14 (= DeviceFeatureSel) for the next operation
MmioWrite: addr = 0xd0000014, data = [0, 0, 0, 0]
# Read from offset 0x10 (= DeviceFeatures).
# Now DeviceFeatureSel=0, so it returns the lower 32 bits of DeviceFeatures.
# For virtio-net, DeviceFeatures = 0x0000_0001_0000_4c83 (64-bit),
# so it returns 0x0000_4c83 in Little Endian.
# Now, Confirmation of Values of 0x0000_4c83
# Reversed Little Endian: 0,0,76,131
# 76(10) = 4c
# 131(10) = 83
# 0x00004c83 -> Ignoring the bit set by VIRTIO_F_VERSION_1 (0x100000000) in avail_features (0x100004c83)
# In other words,
# * virtio_net_sys::VIRTIO_NET_F_GUEST_CSUM
# * virtio_net_sys::VIRTIO_NET_F_CSUM
# * virtio_net_sys::VIRTIO_NET_F_GUEST_TSO4
# * virtio_net_sys::VIRTIO_NET_F_GUEST_UFO
# * virtio_net_sys::VIRTIO_NET_F_HOST_TSO4
# * virtio_net_sys::VIRTIO_NET_F_HOST_UFO
# The feature bits of this information are returned.
MmioRead: addr = 0xd0000010, data = [131, 76, 0, 0]
# The reading of feature bits is done here, and from here, it instructs the device about the feature subset to accept.
# The process is similar to reading, where you write 0x00/0x01 to DriverFeatureSel bit
# and then write the feature bits you want to set to DriverFeature.
# First, write 0x01 to offset 0x24 (DriverFeatureSel/activate guest feature) to set 'acked_features' to 0x01
MmioWrite: addr = 0xd0000024, data = [1, 0, 0, 0]
# Write 0x01 (= 0x0000_0001, one of the values read earlier) to offset 0x20 (DriverFeatures).
# Since DriverFeatureSel is set to 0x01, a 32-bit shift occurs, and 0x0000_0001_0000_0000 is actually set.
MmioWrite: addr = 0xd0000020, data = [1, 0, 0, 0]
# Write 0x00 to offset 0x24 (DriverFeatureSel/activate guest feature) to set 'acked_features' to 0x00
MmioWrite: addr = 0xd0000024, data = [0, 0, 0, 0]
# Write 0x0000_4c83 (the other value read earlier) to offset 0x20 (DriverFeatures).
MmioWrite: addr = 0xd0000020, data = [131, 76, 0, 0]
# The processing of Feature bits is completed here.
# Read offset 0x70(= Status) -> Since 0x03 was specified most recently, returning 0x03 is good.
MmioRead: addr = 0xd0000070, data = [3, 0, 0, 0]
# Write the value (3 + 8 = 11) obtained by 'adding' 0x08 = FEATURES_OK(8) to offset 0x70(= Status)
MmioWrite: addr = 0xd0000070, data = [11, 0, 0, 0]
# Read from offset 0x70(= Status). Naturally, 11 is returned.
MmioRead: addr = 0xd0000070, data = [11, 0, 0, 0]
# Device-specific setup starts from here (4.2.3.2 Virtqueue Configuration)
# Write 0x00 to offset 0x30 (= QueueSel) to select self.queue_select
MmioWrite: addr = 0xd0000030, data = [0, 0, 0, 0]
# Read from offset 0x44 (= QueueReady), and it's not ready yet, so it returns 0x0 as expected
MmioRead: addr = 0xd0000044, data = [0, 0, 0, 0]
# Read from offset 0x34 (= QueueNumMax) to check the queue size
MmioRead: addr = 0xd0000034, data = [0, 1, 0, 0]
# Write the previously read QueueNum to offset 0x38 (= QueueNum)
MmioWrite: addr = 0xd0000038, data = [0, 1, 0, 0]
# Virtual queue's 'descriptor' area 64-bit long physical address
# Write the location of the descriptor area of the selected queue (0)
# to offset 0x80 (= QueueDescLow = lo(q.desc_table) / lower 32 bits of the address)
MmioWrite: addr = 0xd0000080, data = [0, 64, 209, 122]
# Same as above, but set the remaining part of 0x84 (QueueDescHigh = hi(q.desc_table) / higher 32 bits of the address)
MmioWrite: addr = 0xd0000084, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_4000 (q.desc_table) as the base address
# Virtual queue's 'driver' area 64-bit long physical address
# Write the location of the driver area (avail_ring) of the selected queue (0)
# to offset 0x90 (= QueueDriverLow = lo(q.avail_ring) / lower 32 bits of the address)
MmioWrite: addr = 0xd0000090, data = [0, 80, 209, 122]
# Same as above, but set the remaining part of 0x94 (QueueDriverHigh = hi(q.avail_ring) / higher 32 bits of the address)
MmioWrite: addr = 0xd0000094, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_5000 (q.avail_ring)
# Address range of q.desc_table: q.avail_ring - q.desc_table = 0x1000 = 4096(10)
# Virtual queue's 'device' area 64-bit long physical address
# Write the location of the device area (used_ring) of the selected queue (0) to offset 0xa0 (= QueueDeviceLow = lo(q.used_ring) / lower 32bits of the address)
MmioWrite: addr = 0xd00000a0, data = [0, 96, 209, 122]
# Same as above, but set the remaining part of 0xa4 (QueueDeviceHigh = hi(q.used_ring) / higher 32 bits of the address)
MmioWrite: addr = 0xd00000a4, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_6000 (q.used_ring)
# Address range of q.avail_ring: q.used_ring - q.avail_ring = 0x1000 = 4096(10)
# Write 0x1 to offset 0x44 (QueueReady = q.ready) to make it Ready
MmioWrite: addr = 0xd0000044, data = [1, 0, 0, 0]
# The same process is performed for the other queue (1)
MmioWrite: addr = 0xd0000030, data = [1, 0, 0, 0]
MmioRead: addr = 0xd0000044, data = [0, 0, 0, 0]
MmioRead: addr = 0xd0000034, data = [0, 1, 0, 0]
MmioWrite: addr = 0xd0000038, data = [0, 1, 0, 0]
MmioWrite: addr = 0xd0000080, data = [0, 128, 196, 122]
MmioWrite: addr = 0xd0000084, data = [0, 0, 0, 0] # q.desc_table = 0x0000_0000_7ac4_8000
MmioWrite: addr = 0xd0000090, data = [0, 144, 196, 122]
MmioWrite: addr = 0xd0000094, data = [0, 0, 0, 0] # q.avail_ring = 0x0000_0000_7ac4_9000
MmioWrite: addr = 0xd00000a0, data = [0, 160, 196, 122]
MmioWrite: addr = 0xd00000a4, data = [0, 0, 0, 0] # q.used_ring = 0x0000_0000_7ac4_a000
MmioWrite: addr = 0xd0000044, data = [1, 0, 0, 0]
# Device-specific setup (setup of two queues for virtio-net) is completed here
# Read from offset 0x70 (= Status); it returns 11 (0x0b), which was written most recently
MmioRead: addr = 0xd0000070, data = [11, 0, 0, 0]
# Write 0x04 (DRIVER_OK(4)) to offset 0x70 (= Status) to 'add' it to the current value (11 + 4 = 15)
MmioWrite: addr = 0xd0000070, data = [15, 0, 0, 0]
# Read from offset 0x70 (= Status), and naturally, it returns 15
MmioRead: addr = 0xd0000070, data = [15, 0, 0, 0]
# Device Initialization Phase (3.1.1 Driver Requirements: Device Initialization) is completed here
When interpreted carefully, it becomes evident that the behavior aligns with the specification. For reads and writes of device-specific configuration, the appropriate function of the `VirtioDevice` associated with the `MmioTransport` at initialization is executed. In other words, the `VirtioDevice` trait requires implementations to provide the information needed for these operations.
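For example, the `DeviceFeatureSel`/`DeviceFeatures` accesses in the trace expose a 64-bit feature set 32 bits at a time. A minimal sketch of that selection logic, using the feature value from the trace as an assumed input:

```rust
// Returns the 32-bit half of the 64-bit feature set selected by DeviceFeatureSel (0 or 1).
fn device_features(avail_features: u64, features_select: u32) -> u32 {
    (avail_features >> (32 * (features_select as u64 & 1))) as u32
}

fn main() {
    let avail = 0x0000_0001_0000_4c83u64; // value seen in the trace above
    assert_eq!(device_features(avail, 1), 0x0000_0001); // upper half (VIRTIO_F_VERSION_1)
    assert_eq!(device_features(avail, 0), 0x0000_4c83); // lower half (virtio-net feature bits)
}
```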
Additionally, during the initialization process there are multiple MMIO writes to offset 0x70. These correspond to status updates as the initialization sequence progresses. ToyVMM tracks these updates and confirms the `ACKNOWLEDGE` -> `DRIVER` -> `FEATURES_OK` -> `DRIVER_OK` transition. After the `DRIVER_OK` status update, ToyVMM calls the `activate` function to perform device-specific activation procedures (e.g., setting up epoll and its handlers). The specifics of this activation process are delegated to the individual device implementations.
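The status values seen in the trace (1 -> 3 -> 11 -> 15) are simply combinations of the status bits defined in the Virtio specification. The following sketch shows the kind of check that gates activation; the constants match the spec, while the function itself is only illustrative:

```rust
// Virtio device status bits (values from the Virtio specification).
const ACKNOWLEDGE: u32 = 1;
const DRIVER: u32 = 2;
const DRIVER_OK: u32 = 4;
const FEATURES_OK: u32 = 8;

// The device goes live once the driver has set all four bits (0x0f = 15).
fn should_activate(driver_status: u32) -> bool {
    driver_status == ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK
}

fn main() {
    assert!(!should_activate(11)); // ACKNOWLEDGE | DRIVER | FEATURES_OK
    assert!(should_activate(15));  // ... | DRIVER_OK -> time to call activate()
}
```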
Summary
In this section, we provided a detailed explanation of the `Virtio` mechanisms within ToyVMM. In the following sections, we will introduce concrete implementations of actual devices that were not covered here, specifically Network Devices and Block Devices, and we will verify that specific I/O operations work using the code implemented according to the Virtio principles.
Reference
- OASIS
- Creating a Hypervisor
- The Definitive KVM (Kernel-based Virtual Machine) API Documentation
- QEMU
- Introduction to VirtIO
- Virtqueues and Virtio Ring: How the Data Travels
Implement virtio-net device
In this section, we will proceed with the implementation of a Network Device as a specific Virtio Device. While the specification can be found in the official OASIS documentation Network Device, please note that this implementation may not align perfectly with the specification. If you haven't read the previous sections, be sure to review them before continuing with this section.
virtio-net Mechanism
In `virtio-net`, three types of Virtqueues are typically used: the Transmit Queue, the Receive Queue, and the Control Queue.
The Transmit Queue is used for data transmission from the guest to the host, and the Receive Queue for data transmission from the host to the guest. The Control Queue is used for guest-to-host operations related to NIC settings, such as setting promiscuous mode and enabling/disabling broadcast or multicast reception.
For the sake of brevity, we omit the implementation of the Control Queue in this section. Also note that although the specification allows scaling the number of Virtqueues, we have not implemented that for simplicity.
In the following sections, we will provide detailed implementation-based explanations of the network device.
Network Device Implementation Details
The implementation of `virtio-net` can be found in `net.rs`.
We will break down and explain the initialization phase and the post-initialization phase.
The following diagram primarily focuses on the initialization process of the Network Device:
The `Net` struct implements the `VirtioDevice` trait and is associated with the `MmioTransport`. As mentioned earlier, the device-specific operations during MMIO Transport initialization depend on the implementation of this `Net` struct.
For example, during initialization, a query for the `Device Type` occurs. According to the specification, a `Net` device should return `0x01` for this, and the `Net` struct implements it as follows:
impl VirtioDevice for Net {
fn device_type(&self) -> u32 {
// types::NETWORK_CARD:u32 = 0x01
types::NETWORK_CARD
}
...
}
Similarly, queries about the `Device Features` should be implemented to return device-specific values. Additionally, during initialization, the guest OS sets up the `Descriptor Table`, `Available Ring`, and `Used Ring` of each Virtqueue and notifies the device of their addresses. These addresses are stored per queue so they can be referenced during actual processing.
Once the initialization steps are completed and the status is updated to the specific value, ToyVMM executes the `activate` function implemented by the device. In the case of the `Net` device, this `activate` function registers various file descriptors with `epoll` and sets up the handler (`NetEpollHandler`) that is triggered by `epoll` events. The `Net` device emulates I/O by creating a Tap device on the host side, writing data received from the guest via the Virtqueue to the Tap device for transmission (Tx), and writing incoming data from the Tap device to the Virtqueue to notify the guest (Rx). Four file descriptors are registered with `epoll`: the fd of the `tap` device, an `eventfd` for notification of the Tx Virtqueue, an `eventfd` for notification of the Rx Virtqueue, and an `eventfd` for halting in unexpected situations.
Next, we will provide a detailed diagram of the Network Device in its activated state:
When one of the file descriptors registered with `epoll` triggers an event, the `NetEpollHandler` is dispatched to process it. `NetEpollHandler` varies its actions based on the event that was triggered, but in every case it references the Virtqueue and performs I/O emulation.
One important point to note is that the device initialization process, driven by `KVM_EXIT_MMIO`, is called from the thread handling the vCPU, which in ToyVMM runs separately from the main thread. However, the thread responsible for executing I/O is also separate from the vCPU processing thread (currently the main thread). To facilitate communication between these threads, a channel is used to send the initialized `NetEpollHandler`. This allows I/O to be processed while the guest VM is running and CPU emulation is conducted in a separate thread.
As mentioned earlier, communication between the host and guest is primarily triggered by events related to Virtqueue Eventfds and Tap device file descriptors. In the following sections, we will provide more detailed explanations of how processing occurs in both the Tx and Rx cases.
Tx (Guest -> Host)
Let's start by examining the implementation of communication in the Guest -> Host direction (Tx) in detail. Once again, for Tx, the `Descriptor Table`, `Available Ring`, and `Used Ring` function as follows:

- `Descriptor Table`: Contains descriptors that point to the data the Guest is trying to transmit.
- `Available Ring`: Stores the indices of the descriptors pointing to the transmit data. The Host reads these indices and performs the Tx emulation.
- `Used Ring`: Stores the indices of descriptors that have been processed on the Host side. The Guest reads these indices to reclaim processed descriptors.
Tx is initiated when the guest (the guest device driver) prepares a packet, and control is transferred to ToyVMM when a write to `QueueNotify` occurs.
Specifically, the guest is expected to take the following steps:

1. The guest sets the data address and length in the first Descriptor's `addr` and `len` fields.
2. The guest stores the index of the `Descriptor` pointing to the transmit data in the `Available Ring` entry pointed to by the `Available Ring` index.
3. The guest increments the `Available Ring` index.
4. To notify the host of unprocessed data, the guest writes to the MMIO `QueueNotify`.
Now, let's shift our focus to the host side, which is handled by ToyVMM. The eventfd triggered by the write to the MMIO `QueueNotify` is picked up by the epoll monitoring, which invokes the `NetEpollHandler`'s handler processing, specifically the operation corresponding to `TX_QUEUE_EVENT`. The implementation calls the `process_tx` function.
In `process_tx`, the processing proceeds as follows:

1. Initialize the necessary variables, including:
   - `frame[0u8; 65562]`: A buffer into which the data prepared by the guest is copied.
   - `used_desc_heads[0u16; 256]`: Storage for the indices of processed Descriptors, used to update the Used Ring at the end.
   - `used_count`: A counter that keeps track of how much data has been read from the guest.
2. Iterate over the Tx Virtqueue until it is exhausted, repeating steps 3 to 5.
3. Read the data (located at `addr`) pointed to by the Descriptor and load it into the buffer. If the `next` field points to another Descriptor, follow it and continue reading.
4. Write the read data to the Tap device.
5. Store the index of the processed Descriptor (the Descriptor pointed to by the Available Ring) in `used_desc_heads`.
6. Update the `Used Ring` with the indices of the processed Descriptors and the total amount of data stored.
7. Write to the `eventfd` associated with the IRQ to trigger an interrupt and delegate processing to the guest.

On the guest side, the following steps are expected:

1. Check the index of the `Used Ring`; if it differs from the previously recorded index, check and process the Descriptor indices that fill this gap.
2. The Descriptors pointed to by these indices have been processed on the host side, so return them to the chain of free Descriptors and update the recorded Descriptor numbers.
3. Repeat steps 1 and 2 until there is no difference between the index of the `Used Ring` and the recorded index position.
This completes the Tx processing.
Rx (Host -> Guest)
Next, let's explain the communication from the host to the guest (Rx) while referring to the implementation.
In the case of Rx, the `Descriptor Table`, `Available Ring`, and `Used Ring` function as follows:

- `Descriptor Table`: Contains descriptors that point to received data, allowing the Guest to access the data received from the Tap.
- `Available Ring`: Used by the Guest to hand over emptied descriptors that the Host can fill with received data.
- `Used Ring`: Stores the indices of descriptors pointing to received data, which the Guest reads to process the relevant descriptors.

Comparing Rx to Tx, you can see that the roles of the `Available Ring` and `Used Ring` are reversed.
Unlike Tx, Rx requires handling two types of event triggers: incoming packets from the Tap device and completion notifications from the guest for the Rx Virtqueue. Handling Rx is more complex compared to Tx due to the need to manage these two types of event triggers.
First, let's discuss the basic Rx processing flow, followed by considerations for cooperative behavior.
Basic Rx Processing Flow
The host receives data from the Tap device and needs to notify the guest by filling the Rx Virtqueue with data. To do this, some basic setup is required for the Rx Virtqueue, such as knowing where to place the data. It's important to remember that, from the perspective of ToyVMM, each element of the Virtqueue consists only of guest memory addresses, and necessary operations are performed based on Virtqueue memory access.
Returning to the guest, the following steps are expected:

1. After initializing the Descriptor chains and other settings, the guest writes the index of the head of a free Descriptor chain into the empty entry pointed to by the `Available Ring` index.
2. The guest increments the `Available Ring` index.
3. To notify the host, the guest writes to the MMIO `QueueNotify`.
On the host side, when the Rx Virtqueue notification is received from the guest, it is interpreted as meaning that the Rx data buffers are ready to be accessed.
Suppose the Tap device receives a packet at this point. By detecting the trigger on the Tap's file descriptor, the `NetEpollHandler` is dispatched and performs the event processing for `RX_TAP_EVENT`. This processing mainly involves calling the `process_rx` function, although there are certain conditions under which this does not happen, which we will discuss later.
`process_rx` proceeds as follows:

1. `process_rx` processes as many frames received from the Tap as possible, looping until no more data can be read.
2. If a read from the Tap succeeds, the size of the read data is stored in `self.rx_count`, and the `rx_single_frame` function, which processes a single frame, is called.
3. In `rx_single_frame`, the first entry is taken from the Available Ring, and the head of the free Descriptor chain that this entry points to is extracted.
4. The received frame's data is stored in the Descriptor, tracking the size along the way. If the frame cannot fit into a single Descriptor, the `next` field is followed to continue storing data.
5. The `Used Ring` of the Rx Virtqueue is updated with the index of the Descriptor containing the Rx data and the total amount of data stored.
6. An interrupt is triggered by writing to the `eventfd` associated with the IRQ to delegate processing to the guest.
The following diagram illustrates the process of writing the received data into the Descriptor chain using the Available Ring:
Once the data from the Tap device has been written, the `Used Ring` is updated, and an interrupt is sent to the guest.
On the guest side, the guest checks the `Used Ring` index, references the Descriptors pointed to by the new entries, retrieves and processes the Rx data, and performs any necessary operations. It then updates the Available Ring, signaling to the host that it is ready to accept new data.
When Tap Trigger Occurs Without Rx Virtqueue Preparation
It is expected that there may be cases where Tap receives a packet when Rx Virtqueue is not ready. In such cases, even if data is extracted from Tap, it is impossible to obtain information about where to store it, preventing further processing.
To address this, a mechanism is required to delay Tap device processing until the Rx Virtqueue is prepared. In the ToyVMM code, this is controlled using a flag called `deferred_rx`.
When this flag is set, ToyVMM's Rx-related processing follows this strategy:

- When `RX_QUEUE_EVENT` is triggered, indicating that the guest has made the Rx Virtqueue ready to receive data, data is immediately retrieved from the Tap device and processing continues. If processing completes at this point, the flag is cleared.
- When `TAP_RX_EVENT` is triggered, processing first checks the status of the Rx Virtqueue. If processing can proceed, it continues, and if it completes, the flag is cleared. If processing cannot proceed, or the capacity of the Virtqueue is smaller than the received data, the flag remains set and processing waits for the Rx Virtqueue to become ready again.
When Tap Reception Exceeds the Prepared Rx Virtqueue
Another case to consider is when the data received by the Tap exceeds the capacity of the prepared Virtqueue, as briefly mentioned above. In this case, the strategy is essentially the same: processing is temporarily interrupted, controlled by the `deferred_rx` flag, until the next Virtqueue entries are prepared, and when the Virtqueue is ready, processing resumes.
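The control flow around this flag can be summarized with a small, self-contained sketch. All names here are illustrative stand-ins rather than ToyVMM's actual types:

```rust
struct RxPath {
    deferred_rx: bool,
}

impl RxPath {
    // Called when the tap device becomes readable; `deliver` returns true if the
    // frame fit into the guest's currently available Rx buffers.
    fn on_tap_readable(&mut self, deliver: impl FnOnce() -> bool) {
        self.deferred_rx = !deliver();
    }

    // Called when the guest signals the Rx Virtqueue (i.e. it refilled the ring).
    fn on_rx_queue_notify(&mut self, deliver: impl FnOnce() -> bool) {
        if self.deferred_rx && deliver() {
            self.deferred_rx = false;
        }
    }
}

fn main() {
    let mut rx = RxPath { deferred_rx: false };
    rx.on_tap_readable(|| false);   // no Rx buffers yet -> defer the frame
    assert!(rx.deferred_rx);
    rx.on_rx_queue_notify(|| true); // guest refilled the ring -> frame delivered
    assert!(!rx.deferred_rx);
}
```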
Verification of virtio-net Operation
Let's test whether communication between the host and guest is possible using the implemented `Virtio` mechanism and the Network Device. Below is the result of executing the `ip addr` command inside the guest; `eth0` is recognized as a virtual NIC.
localhost:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 02:32:22:01:57:33 brd ff:ff:ff:ff:ff:ff
inet6 fe80::32:22ff:fe01:5733/64 scope link
valid_lft forever preferred_lft forever
Let's also check on the host side. ToyVMM creates a Tap device on the host, so assign an IP address (`192.168.0.10/24`) to it.
140: vmtap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether c6:69:6d:65:05:cf brd ff:ff:ff:ff:ff:ff
inet 192.168.0.10/24 brd 192.168.0.255 scope global vmtap0
valid_lft forever preferred_lft forever
Additionally, assign an IP address to the guest side. Here, an address within the same subnet range as the host is assigned.
localhost:~# ip addr add 192.168.0.11/24 dev eth0
Now that everything is set up, let's ping the IP address of the host's Tap interface from within the guest. You should receive responses as follows:
localhost:~# ping -c 5 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: seq=0 ttl=64 time=0.394 ms
64 bytes from 192.168.0.10: seq=1 ttl=64 time=0.335 ms
64 bytes from 192.168.0.10: seq=2 ttl=64 time=0.334 ms
64 bytes from 192.168.0.10: seq=3 ttl=64 time=0.321 ms
64 bytes from 192.168.0.10: seq=4 ttl=64 time=0.330 ms
--- 192.168.0.10 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.321/0.342/0.394 ms
Conversely, if you ping the IP address of the guest's virtio-net interface from the host, you should also receive responses:
[mmichish@mmichish ~]$ ping -c 5 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.410 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.366 ms
64 bytes from 192.168.0.11: icmp_seq=3 ttl=64 time=0.385 ms
64 bytes from 192.168.0.11: icmp_seq=4 ttl=64 time=0.356 ms
64 bytes from 192.168.0.11: icmp_seq=5 ttl=64 time=0.376 ms
--- 192.168.0.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4114ms
rtt min/avg/max/mdev = 0.356/0.378/0.410/0.028 ms
Although this is only a simple check using ICMP, it confirms that communication between the host and the guest works properly!
Implement virtio-blk device
In this section, we will implement the Block device that the guest uses via virtio-blk. As with Virtio itself, the specification is officially published by OASIS in the Block Device section. However, please note that this implementation may not be fully compliant with that specification.
Before proceeding, make sure you have read the previous sections, as the concepts introduced earlier will be used here without further explanation. In particular, please read Implement virtio-net device from the previous section, as details that overlap with the virtio-net implementation may be omitted.
Mechanism of virtio-blk
In virtio-blk, a single Virtqueue is used to represent disk Read/Write requests from the guest. Unlike virtio-net, there are no external factors (such as receiving data from Tap); processing is purely driven by I/O requests from the guest, so it operates with a minimum of one Virtqueue. Although the specification allows scaling the number of Virtqueues, we have not implemented that for simplicity.
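For reference, this queue configuration can be written down as a few constants. The values below are assumptions for illustration (a queue size of 256 is a common choice); ToyVMM's actual constants may differ.

// Illustrative queue configuration for the Block device (values are assumptions).
pub const QUEUE_SIZE: u16 = 256;              // descriptors per Virtqueue
pub const NUM_QUEUES: usize = 1;              // virtio-blk needs only one request queue here
pub const QUEUE_SIZES: &[u16] = &[QUEUE_SIZE];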
In the following sections, we will explain the implementation details based on code examples.
Implementation Details of virtio-blk
The implementation of virtio-blk can be found in block.rs. The roles and relationships of the various structures are shown in the following diagram:
As mentioned earlier, the concrete implementation depends on the specific device, but it is abstracted by the VirtioDevice trait, so everything other than the device-specific details works the same as shown for virtio-net. Therefore, this diagram differs from the virtio-net one mainly in the internal details of the Block device.
During initialization, queries such as Device Type and Features are answered by the concrete implementation of the Block device. As with the Net device, the addresses of the Virtqueue in the guest's address space are set up and provided. Once the initialization steps are completed, the activate function is executed.
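As a rough sketch, answering those initialization queries might look like the following. The trait shape and field names here are assumptions modeled on the virtio-net discussion, not ToyVMM's exact code; the device ID of 2 for a block device does come from the virtio specification.

const TYPE_BLOCK: u32 = 2; // virtio device ID for a block device (per the virtio spec)

trait VirtioDevice {
    fn device_type(&self) -> u32;
    fn features(&self, page: u32) -> u32;
    fn ack_features(&mut self, page: u32, value: u32);
}

struct Block {
    avail_features: u64, // features the device offers
    acked_features: u64, // features the guest driver has acknowledged
}

impl VirtioDevice for Block {
    fn device_type(&self) -> u32 {
        TYPE_BLOCK
    }
    // Feature bits are reported 32 bits at a time, selected by page.
    fn features(&self, page: u32) -> u32 {
        match page {
            0 => self.avail_features as u32,
            1 => (self.avail_features >> 32) as u32,
            _ => 0,
        }
    }
    fn ack_features(&mut self, page: u32, value: u32) {
        let shift = if page == 1 { 32 } else { 0 };
        self.acked_features |= (value as u64) << shift;
    }
}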
For the Block device, like the Net device, various file descriptors are registered with epoll during initialization, and a handler (BlockEpollHandler) is set up to be executed when epoll triggers. In the Block device, to emulate I/O, a host-side file (operated as the BlockDevice) is opened, and read/write requests from the guest are performed against it. The file descriptors registered with epoll are an eventfd for the Virtqueue and another eventfd for stopping the system in case of unexpected situations, making a total of two file descriptors.
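The dispatch on those two descriptors might look roughly like this. The token values and helper methods are illustrative placeholders; only the overall shape (process the queue, then signal the guest) reflects the explanation given here.

const QUEUE_AVAIL_EVENT: u64 = 0; // the guest wrote to QueueNotify
const KILL_EVENT: u64 = 1;        // used to stop processing in unexpected situations

struct BlockEpollHandler;

impl BlockEpollHandler {
    fn handle_event(&mut self, token: u64) {
        match token {
            QUEUE_AVAIL_EVENT => {
                // New requests are queued; process them and notify the guest if needed.
                if self.process_queue() {
                    self.signal_used_queue();
                }
            }
            KILL_EVENT => {
                // Tear the device down.
            }
            _ => {}
        }
    }
    fn process_queue(&mut self) -> bool { true } // placeholder
    fn signal_used_queue(&self) {}               // placeholder
}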
In comparison to the Net device, the Tap device has been replaced by a file and the number of eventfds has changed; apart from that, there are no significant differences in behavior.
For the Block device, a single Virtqueue is associated with the firing of an eventfd, so we will focus on this process in the following sections.
I/O Requests in virtio-blk
Before delving into the implementation details, let's explain the I/O requests in virtio-blk.
As mentioned earlier, virtio-blk handles I/O requests from the guest through a single Virtqueue. However, guest-originated I/O requests fall broadly into two types, Read and Write, and the processing required for each is significantly different. The host must determine which kind of request it is dealing with and emulate the I/O correctly.
To explain this, we need to understand how the Descriptor Table is used in virtio-blk. The data sent by the guest to the Virtqueue follows the structure below:
struct virtio_blk_req {
    le32 type;
    le32 reserved;
    le64 sector;
    u8 data[];
    u8 status;
};
Source: Block Device: Device Operation
In practice, this is laid out as three entries in the Descriptor Table, with the entries linked by the next field.
- The first Descriptor entry points to the address containing the type, reserved, and sector data.
- The second Descriptor entry points to the beginning of the data area where the request's data is read or written.
- The third Descriptor entry points to the address where the status will be written by the host.
The type field indicates the type of I/O (e.g., read, write, or other I/O requests). By examining this value, the host can determine how to handle the request.
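For reference, the virtio-blk specification defines the following values for the type field (only the read and write requests are discussed here):

// Request types defined by the virtio-blk specification.
const VIRTIO_BLK_T_IN: u32 = 0;     // read: the device fills the data area
const VIRTIO_BLK_T_OUT: u32 = 1;    // write: the device writes the data area to disk
const VIRTIO_BLK_T_FLUSH: u32 = 4;  // flush request
const VIRTIO_BLK_T_GET_ID: u32 = 8; // request for the device identifier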
In the case of a read, the second Descriptor entry points to the area where the host should store the data it reads from the disk. The host determines which sector to read from based on the sector value and reads the necessary amount of data (the desc.len of the second Descriptor).
In the case of a write, the second Descriptor entry contains the data that should be written to the disk. The host reads this data and writes it to the sector specified by the sector value.
The third Descriptor entry is used to write status information indicating whether the I/O succeeded or failed.
In summary, the type of Disk I/O and the necessary data or buffers are provided through Virtqueue. It is the responsibility of the host to interpret this according to the specification, emulate the I/O correctly, and provide the appropriate status.
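In other words, the host ends up extracting the following pieces of information from the three-descriptor chain. The struct below is only an illustration of that mapping; the field names are assumptions and ToyVMM's actual Request structure may differ.

// Information the host gathers from the three-descriptor chain (illustrative).
struct Request {
    request_type: u32, // from descriptor 1: the type field (read / write / ...)
    sector: u64,       // from descriptor 1: the starting sector of the I/O
    data_addr: u64,    // from descriptor 2: guest address of the data buffer
    data_len: u32,     // from descriptor 2: length of the data buffer
    status_addr: u64,  // from descriptor 3: guest address where the status byte is written
}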
Implementation of Disk I/O in ToyVMM
Let's explain guest-originated disk I/O requests in the context of the implementation. Everything else is essentially the same as the Tx case of the Net device, so let's start from the point where processing is delegated to the host through QueueNotify.
Writing to MMIO's QueueNotify triggers an EventFd, which is picked up by epoll monitoring; specifically, the handler for QUEUE_AVAIL_EVENT is executed. In practice, the process_queue function is called, and if its return value is true, the signal_used_queue function is called.
The signal_used_queue function simply sends an interrupt to the guest, so the important part to examine is the process_queue function.
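"Sending an interrupt to the guest" here amounts to recording the reason in the interrupt status and signaling the interrupt eventfd. The sketch below only illustrates that idea; the field names are assumptions, and a closure stands in for the eventfd that would actually be wired to KVM. The VIRTIO_MMIO_INT_VRING bit (the "used buffer" notification) is defined by the virtio MMIO specification.

use std::sync::atomic::{AtomicU32, Ordering};

const VIRTIO_MMIO_INT_VRING: u32 = 0x01; // "used buffer" notification bit

struct Interrupt {
    interrupt_status: AtomicU32,
    notify: Box<dyn Fn()>, // stand-in for the irq eventfd
}

impl Interrupt {
    fn signal_used_queue(&self) {
        // Record why the interrupt was raised so the guest driver can read it
        // back from the MMIO InterruptStatus register...
        self.interrupt_status
            .fetch_or(VIRTIO_MMIO_INT_VRING, Ordering::SeqCst);
        // ...then inject the interrupt into the guest.
        (self.notify)();
    }
}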
In the process_queue function, the following steps are performed:
1. Initialize the necessary variables:
   - used_desc_heads[(u16, u32); 256]: stores the index and data length of processed Descriptors. This is used to populate the used_ring at the end of process_queue.
   - used_count: keeps track of how many I/O requests from the guest have been processed.
2. Iterate over the Virtqueue until it is exhausted, repeating steps 3 to 5.
3. Retrieve the Descriptor pointed to by the Available Ring, parse it according to the virtio-blk specification, and create a Request structure. The Request structure contains the parsed information: request type, sector, data address, data length, and status address.
4. Call the execute function, which performs the I/O request based on the content of the Request structure (see the sketch after this list). For successful I/O, it returns the length of the data read (for Read) or 0 (for Write and other types). This value is later written to the used_ring.
5. Write the status (success or failure of the I/O) to the status address and write the necessary information to the used_ring.
6. If one or more requests have been processed, return true as the function's return value.
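A minimal sketch of what the execute step might look like follows, assuming a 512-byte sector size and representing the guest data buffer as a plain byte slice instead of ToyVMM's guest memory abstraction; the function and type names are illustrative.

use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

const SECTOR_SIZE: u64 = 512;

enum RequestType {
    Read,
    Write,
}

// Performs the disk I/O for one request. Returns the value to report in the
// used ring: the number of bytes read for a Read, or 0 for a Write.
fn execute(
    disk: &mut File,
    request_type: RequestType,
    sector: u64,
    data: &mut [u8],
) -> std::io::Result<u32> {
    // Position the backing file at the requested sector.
    disk.seek(SeekFrom::Start(sector * SECTOR_SIZE))?;
    match request_type {
        RequestType::Read => {
            // Fill the guest-provided buffer from the disk image.
            disk.read_exact(data)?;
            Ok(data.len() as u32)
        }
        RequestType::Write => {
            // Write the guest-provided buffer out to the disk image.
            disk.write_all(data)?;
            Ok(0)
        }
    }
}

After execute returns, the status byte is written to the status address and the descriptor head index and returned length are pushed onto the used_ring, as described in step 5.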
The following diagrams illustrate the process when the guest-originated I/O request is a Read:
And here's the process when the guest-originated I/O request is a Write:
Verification of virtio-blk Operation
Now, let's perform a practical verification to demonstrate the functionality. Instead of using initrd.img, we will use an Ubuntu rootfs image, as Firecracker does, allowing us to boot the Ubuntu OS directly. With the virtio-blk BlockDevice implemented, the Ubuntu rootfs image is recognized as /dev/vda in the VM. To boot from this Ubuntu image, we need to specify root=/dev/vda in the VM's kernel cmdline.
# Run ToyVMM with kernel and rootfs (no initrd.img)
$ sudo -E cargo run -- boot_kernel -k vmlinux.bin -r ubuntu-18.04.ext4
...
# You can verify that the launched VM is ubuntu-based.
root@7e47bb8f2f0a:~# uname -r
4.14.174
root@7e47bb8f2f0a:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
# And you can also find that this VM mount /dev/vda as rootfs.
root@7e47bb8f2f0a:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 384M 0 disk /
root@7e47bb8f2f0a:~# ls -lat /
total 36
drwxr-xr-x 12 root root 360 Aug 14 13:47 run
drwxr-xr-x 11 root root 2460 Aug 14 13:46 dev
dr-xr-xr-x 12 root root 0 Aug 14 13:46 sys
drwxrwxrwt 7 root root 1024 Aug 14 13:46 tmp
dr-xr-xr-x 57 root root 0 Aug 14 13:46 proc
drwxr-xr-x 2 root root 3072 Jul 20 2021 sbin
drwxr-xr-x 2 root root 1024 Dec 16 2020 home
drwxr-xr-x 48 root root 4096 Dec 16 2020 etc
drwxr-xr-x 2 root root 1024 Dec 16 2020 lib64
drwxr-xr-x 2 root root 5120 May 28 2020 bin
drwxr-xr-x 20 root root 1024 May 13 2020 .
drwxr-xr-x 20 root root 1024 May 13 2020 ..
drwxr-xr-x 2 root root 1024 May 13 2020 mnt
drwx------ 4 root root 1024 Apr 7 2020 root
drwxr-xr-x 2 root root 1024 Apr 3 2019 srv
drwxr-xr-x 6 root root 1024 Apr 3 2019 var
drwxr-xr-x 10 root root 1024 Apr 3 2019 usr
drwxr-xr-x 9 root root 1024 Apr 3 2019 lib
drwx------ 2 root root 12288 Apr 3 2019 lost+found
drwxr-xr-x 2 root root 1024 Aug 21 2018 opt
As shown above, the VM boots from the Ubuntu-based image passed through as /dev/vda, and after logging in we can confirm that it is an Ubuntu-based OS with the rootfs mounted as intended. Furthermore, unlike the earlier initrd.img, whose rootfs was volatile, the rootfs here is persisted on disk, so files created within the VM are retained across VM reboots.
# Create a sample file (hello.txt) in the first VM boot and reboot.
root@7e47bb8f2f0a:~# echo "HELLO UBUNTU" > ./hello.txt
root@7e47bb8f2f0a:~# cat hello.txt
HELLO UBUNTU
root@7e47bb8f2f0a:~# reboot -f
Rebooting.
# After the second boot, you can also find 'hello.txt'.
Ubuntu 18.04.1 LTS 7e47bb8f2f0a ttyS0
7e47bb8f2f0a login: root
Password:
Last login: Mon Aug 14 13:57:27 UTC 2023 on ttyS0
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.14.174 x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
root@7e47bb8f2f0a:~# cat hello.txt
HELLO UBUNTU
With the implementation of both virtio-net and virtio-blk devices, you have successfully created a minimal VM with the necessary functionality.