Introduction
What is ToyVMM?
ToyVMM is a project being developed for the purpose of learning virtualization technology. ToyVMM aims to accomplish the following:
- Code-based understanding of KVM-based virtualization technologies
- Learn about the modern virtualization technology stack by using libraries managed by rust-vmm
  - The rust-vmm libraries are also used as a base for well-known OSS such as firecracker and provide the functionality needed to create custom VMMs.
Disclaimer
While every effort has been made to provide correct information in this publication, the authors do not guarantee that all information is accurate. Therefore, the authors cannot be held responsible for the results of development, prototyping, or operation based on this information. If you find any errors in the contents of this document, please report or correct them via a PR or Issue.
What's Next?
If you would like to try ToyVMM first, please refer to QuickStart. To learn more about KVM-based virtualization through ToyVMM, please refer to 01. Running Tiny Code in VM
QuickStart
This quickstart is based on the commit ID 58cf0f68a561ee34a28ae4e73481f397f2690b51.
Architecture & OS
ToyVMM only supports x86_64 Linux for Guest OS.
ToyVMM has been confirmed to work with Rocky Linux 8.6, 9.1 and Ubuntu 18.04, 22.04 as the Hypervisor OS.
Prerequisites
ToyVMM requires the KVM Linux kernel module.
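As a quick sanity check (this snippet is illustrative and not part of ToyVMM), you can verify that /dev/kvm is present and usable with the kvm-ioctls crate that ToyVMM builds on:

use kvm_ioctls::Kvm;

fn main() {
    // Fails if the kvm kernel module is not loaded or /dev/kvm is not accessible.
    let kvm = Kvm::new().expect("failed to open /dev/kvm");
    println!("KVM API version: {}", kvm.get_api_version());
}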
Run Virtual Machine using ToyVMM
The following commands build ToyVMM from source, download the kernel binary and rootfs needed to start a VM, and then start the VM.
# download and build toyvmm from source.
git clone https://github.com/aztecher/toyvmm.git
cd toyvmm
mkdir build
CARGO_TARGET_DIR=./build cargo build --release
# Download a linux kernel binary.
wget https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin
# Download a rootfs.
wget https://s3.amazonaws.com/spec.ccfc.min/ci-artifacts/disks/x86_64/ubuntu-18.04.ext4
# Run virtual machine based on ToyVMM!
sudo ./build/release/toyvmm vm run --config examples/vm_config.json
After the guest OS boot messages are printed, a login prompt is displayed. Enter 'root' for both the username and password to log in.
Disk I/O in the Virtual Machine
Since virtio-blk is implemented, the virtual machine can operate block devices.
It recognizes the ubuntu-18.04.ext4 disk image as a block device and mounts it as the root filesystem.
lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> vda 254:0 0 384M 0 disk /
Therefore, if you create a file in the VM and then recreate the VM using the same image, the file you created will still be there. This behavior is significantly different from an initramfs (a rootfs that is extracted into RAM).
# Create 'hello.txt' in VM.
echo "hello virtual machine" > hello.txt
cat hello.txt
> hello virtual machine
# Rebooting will cause the ToyVMM process to terminate.
reboot -f
# On the host, restart the VM and log in again.
# Afterward, you can find the file you created in the VM during its previous run.
cat hello.txt
> hello virtual machine
Network I/O in the Virtual Machine
Since virtio-net is implemented, the virtual machine can operate network devices.
It now recognizes the eth0 network interface.
ip link show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
> link/ether 52:5f:7f:b3:f8:81 brd ff:ff:ff:ff:ff:ff
ToyVMM also creates a host-side tap device named vmtap0 that connects to the virtual machine's interface.
ip link show vmtap0
> 334: vmtap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
> link/ether 26:e9:5c:02:3c:19 brd ff:ff:ff:ff:ff:ff
Therefore, by assigning appropriate IP addresses to the interfaces on both the VM side and the host side, communication can be established between the host and the VM.
# Assign ip address 192.168.0.10/24 to 'eth0' in vm.
ip addr add 192.168.0.10/24 dev eth0
# Assign ip address 192.168.0.1/24 to 'vmtap0' in host.
sudo ip addr add 192.168.0.1/24 dev vmtap0
# Host -> VM. ping to VM interface ip from host.
ping -c 1 192.168.0.10
# VM -> Host. Ping to Host interface ip from vm.
ping -c 1 192.168.0.1
Additionally, by setting the default route on the VM side, and configuring iptables and enabling IP forwarding on the host side, you can also allow the VM to access the Internet.
However, this will not be covered in detail here.
What's next?
If you are not familiar with KVM-based VMs, I suggest you start reading from 01. Running Tiny Code in VM. If not, please read the topics that interest you.
01. Running Tiny Code in VM
Tiny code execution is no longer supported in the latest commit.
You may be able to verify it by checking out past commits, but please be aware that resolving package dependencies may be challenging.
This chapter is written so that you can get a sense of the behavior without actually running the code, so don't worry about that.
Deep dive into ToyVMM and how to run tiny code in a VM
The main function is a program that starts a VM using the KVM mechanism and executes the following small piece of code inside the VM:
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, (%dx) */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, (%dx) */
    0xf4,             /* hlt */
];
This code performs several register operations, and the initial state of the CPU registers for this VM is set as follows.
regs.rip = 0x1000;
regs.rax = 2;
regs.rbx = 2;
regs.rflags = 0x2;
vcpu.set_sregs(&sregs).unwrap();
vcpu.set_regs(&regs).unwrap();
This will output the result of the calculation (2 + 2) performed inside the VM through the I/O port, followed by a newline character.
As you can see from the result of running ToyVMM, the hex values 0x34 (= '4') and 0xa (= newline) are caught from the I/O port.
How the above code works with the rust-vmm libraries
Now, the following crates provided by rust-vmm are used to run this code (please see Cargo.toml):
- kvm-bindings
- kvm-ioctls
- vmm-sys-util
- vm-memory

I omit the description of vmm-sys-util because it is only used to create a temporary file at this point, so there is nothing special to mention about it.
I will go through the code in order and describe how each crate relates to it.
In this explanation, we will focus primarily on which ioctl is issued as a result of each function call (this is because the interface for manipulating KVM from user space relies on the ioctl system call).
Also, please note that explanations of unimportant variables may be omitted.
It should be noted that what is described here applies not only to the ToyVMM implementation but also to the firecracker implementation in a similar form.
First, we need to open /dev/kvm and acquire a file descriptor. This can be done with Kvm::new() of the kvm_ioctls crate. Under the hood, the Kvm::open_with_cloexec function issues an open system call as follows and returns the file descriptor wrapped in a Kvm structure:
let ret = unsafe { open("/dev/kvm\0".as_ptr() as *const c_char, open_flags) };
The result obtained above is used to call the create_vm method, which results in the following ioctl being issued:
vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0)  /* kvmfd: the file descriptor obtained from /dev/kvm */
Please keep in mind that the file descriptor returned from above function will be used later when preparing the CPU.
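For reference, a minimal sketch of these first two steps using the kvm-ioctls crate (error handling simplified; this is not ToyVMM's exact code):

use kvm_ioctls::Kvm;

fn main() {
    // Opens /dev/kvm internally (the open() call shown above).
    let kvm = Kvm::new().unwrap();
    // Issues ioctl(kvmfd, KVM_CREATE_VM, 0) and wraps the returned fd in a VmFd.
    let _vm = kvm.create_vm().unwrap();
}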
Anyway, we have now created a VM, but it has no memory or CPU yet.
Now, the next step is to prepare memory!
In the kvm_ioctls crate's example, memory is prepared as follows:
// First, set up the guest memory using mmap.
let load_addr: *mut u8 = unsafe {
    libc::mmap(
        null_mut(),
        mem_size, // 0x4000
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
        -1,
        0,
    ) as *mut u8
};

// Second, set up the kvm_userspace_memory_region structure using the memory above.
// kvm_userspace_memory_region is defined in the kvm-bindings crate.
let mem_region = kvm_userspace_memory_region {
    slot,
    guest_phys_addr: guest_addr,  // 0x1000
    memory_size: mem_size as u64, // 0x4000
    userspace_addr: load_addr as u64,
    flags: KVM_MEM_LOG_DIRTY_PAGES,
};
unsafe { vm.set_user_memory_region(mem_region).unwrap() };

// Retrieve a slice from the pointer and length (slice::from_raw_parts_mut)
// > https://doc.rust-lang.org/beta/std/slice/fn.from_raw_parts_mut.html
// and write asm_code into this slice (&[u8], &mut [u8], Vec<u8> implement the Write trait!)
// > https://doc.rust-lang.org/std/primitive.slice.html#impl-Write
unsafe {
    let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
    slice.write(&asm_code).unwrap();
}
Note the call to set_user_memory_region. This function issues the following ioctl as a result, attaching the memory to the VM:
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mem_region)
ToyVMM, on the other hand, provides utility functions for memory preparation.
This difference comes from the fact that ToyVMM's implementation follows firecracker's, but essentially it does the same thing.
Let's look at the whole implementation first:
// The following create_region function operates based on a file descriptor,
// so first create a temporary file and write asm_code to it.
let mut file = TempFile::new().unwrap().into_file();
assert_eq!(unsafe { libc::ftruncate(file.as_raw_fd(), 4096 * 10) }, 0);
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, %dx */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, %dx */
    0xf4,             /* hlt */
];
file.write_all(code).expect("Failed to write code to tempfile");

// The create_region function creates a GuestRegion (the details are described below).
let mut mmap_regions = Vec::with_capacity(1);
let region = create_region(
    Some(FileOffset::new(file, 0)),
    0x1000,
    libc::PROT_READ | libc::PROT_WRITE,
    libc::MAP_NORESERVE | libc::MAP_PRIVATE,
    false,
).unwrap();

// The Vec named 'mmap_regions' contains the GuestRegionMmap entries.
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());

// guest_memory represents the guest memory as a Vec of GuestRegions.
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();

let track_dirty_page = false;
// Set up the guest memory.
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
The create_region function consequently performs an mmap in the following way and returns the structure (GuestMmapRegion) representing a part of the GuestMemory:
pub fn create_region(
    maybe_file_offset: Option<FileOffset>,
    size: usize,
    prot: i32,
    flags: i32,
    track_dirty_pages: bool,
) -> Result<GuestMmapRegion, MmapRegionError> {
    ...
    let region_addr = unsafe {
        libc::mmap(
            region_start_addr as *mut libc::c_void,
            size,
            prot,
            flags | libc::MAP_FIXED,
            fd,
            offset as libc::off_t,
        )
    };
    let bitmap = match track_dirty_pages {
        true => Some(AtomicBitmap::with_len(size)),
        false => None,
    };
    unsafe {
        MmapRegionBuilder::new_with_bitmap(size, bitmap)
            .with_raw_mmap_pointer(region_addr as *mut u8)
            .with_mmap_prot(prot)
            .with_mmap_flags(flags)
            .build()
    }
}
Let's check the memory-related structures here.
In src/kvm/memory.rs, the following memory types are defined based on the vm-memory crate:
pub type GuestMemoryMmap = vm_memory::GuestMemoryMmap<Option<AtomicBitmap>>;
pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
pub type GuestMmapRegion = vm_memory::MmapRegion<Option<AtomicBitmap>>;
The MmapRegionBuilder is also defined in the vm-memory crate, and its build method creates the MmapRegion.
This time, since we have performed the mmap ourselves in advance and passed that address to with_raw_mmap_pointer, that area is used for initialization; otherwise, the mmap is performed inside the build method. In either case, the build method produces the MmapRegion structure, for which the synonym defined above is used, and it is returned as a GuestMmapRegion. By calling the create_region function once, you can allocate and obtain one region of GuestMemory based on the information (size, flags, etc.) specified in the arguments.
The region allocated here is only mmapped into the virtual address space of the VMM process; nothing more has been set up yet. To use this area as guest memory, a GuestRegionMmap structure is created from it. This is simple: specify the corresponding GuestAddress for this region and initialize a GuestRegionMmap with the tuple of the mmapped area and the GuestAddress. In the following code, the initialized GuestRegionMmap is pushed to a Vec for subsequent processing.
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());
Now, the mmap_regions: Vec<GuestRegionMmap> created above represents the entire memory of the Guest VM; each region that makes up the guest memory holds both the area allocated by the VMM for it and its start address on the guest side.
The GuestMemoryMmap structure representing the guest memory is initialized from this Vec and set on the VM by the memory_init method.
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
Next, let's check the operation of this memory_init. It calls set_kvm_memory_regions, where the actual processing is implemented:
pub fn set_kvm_memory_regions(
    &self,
    guest_mem: &GuestMemoryMmap,
    track_dirty_pages: bool,
) -> Result<()> {
    let mut flags = 0u32;
    if track_dirty_pages {
        flags |= KVM_MEM_LOG_DIRTY_PAGES;
    }
    guest_mem
        .iter()
        .enumerate()
        .try_for_each(|(index, region)| {
            let memory_region = kvm_userspace_memory_region {
                slot: index as u32,
                guest_phys_addr: region.start_addr().raw_value() as u64,
                memory_size: region.len() as u64,
                userspace_addr: guest_mem.get_host_address(region.start_addr()).unwrap() as u64,
                flags,
            };
            unsafe { self.fd.set_user_memory_region(memory_region) }
        })
        .map_err(Error::SetUserMemoryRegion)?;
    Ok(())
}
Here we can see that set_user_memory_region is called with the necessary information while iterating over the regions.
In other words, it does the same thing as the example code, except that there may be more than one region.
Now that we've gone through the explanation of memory preparation, let's take a look at the vm-memory crate!
The information presented here is only the minimum required, so please refer to the crate's Design document or other sources for more details.
This is also related to the iteration above, where we were able to call methods such as start_addr() and len() to construct the information needed for set_user_memory_region.
GuestAddress (struct)     : Represents a Guest Physical Address (GPA)
FileOffset (struct)       : Represents the start point within a 'File' that backs a 'GuestMemoryRegion'
GuestMemoryRegion (trait) : Represents a continuous region of guest physical memory
GuestMemory (trait)       : Represents a container for an immutable collection of GuestMemoryRegion objects
MmapRegion (struct)       : Helper structure for working with mmapped memory regions
GuestRegionMmap (struct, implements GuestMemoryRegion) : Represents a continuous region of the guest's physical memory that is backed by a mapping in the virtual address space of the calling process
GuestMemoryMmap (struct, implements GuestMemory)       : Represents the entire physical memory of the guest by tracking all its memory regions
Since GuestRegionMmap implements the GuestMemoryRegion trait, it provides implementations of functions such as start_addr() and len(), which were used in the iteration above.
The following figure briefly summarizes the relationship between these structures.
As you can see, what is being done is essentially the same.
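To get a feel for the vm-memory API itself, here is a small sketch that is independent of ToyVMM's helpers: it builds a GuestMemoryMmap (using the crate's default bitmap type rather than ToyVMM's alias) from a single anonymous region and accesses it through the Bytes trait. The addresses and sizes are arbitrary example values:

use vm_memory::{Bytes, GuestAddress, GuestMemoryMmap};

fn main() {
    // One anonymous region: 16 KiB of guest memory starting at GPA 0x1000.
    let guest_memory: GuestMemoryMmap =
        GuestMemoryMmap::from_ranges(&[(GuestAddress(0x1000), 0x4000)]).unwrap();

    // Write a few bytes at GPA 0x1000 and read them back as a u32.
    guest_memory
        .write_slice(&[0xde, 0xad, 0xbe, 0xef], GuestAddress(0x1000))
        .unwrap();
    let value: u32 = guest_memory.read_obj(GuestAddress(0x1000)).unwrap();
    println!("value at 0x1000 = {:#x}", value); // 0xefbeadde on little-endian hosts
}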
The final step is preparing the vCPU (a vCPU is a virtual CPU to be attached to the virtual machine).
Currently, a VM has been created and memory containing instructions has been inserted, but there is no CPU, so the instructions can't be executed. Therefore, let's create a vCPU, associate it with the VM, and execute the instructions by running the vCPU!
Using the file descriptor obtained during VM creation (vmfd), the following ioctl will be issued:
vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0)
The create_vm method that was issued earlier to obtain the vmfd returns a kvm_ioctls::VmFd structure as its result, and by executing the create_vcpu method of this structure, the above ioctl is issued and the result is returned as a kvm_ioctls::VcpuFd structure.
VcpuFd provides utilities for getting and setting various CPU states.
For example, if you want to get/set a register set from the vCPU, you would normally issue the following ioctls:
ioctl(vcpufd, KVM_GET_SREGS, &sregs);
ioctl(vcpufd, KVM_SET_SREGS, &sregs);
For these, the following methods are available in kvm_ioctls::VcpuFd
get_sregs(&self) -> Result<kvm_sregs>
set_sregs(&self, sregs: &kvm_sregs) -> Result<()>
VcpuFd also provides a method called run, which issues the following ioctl to actually run the vCPU:
ioctl(vcpufd, KVM_RUN, NULL)
We can then acquire a return value of type Result<VcpuExit> from this method.
While the vCPU is running, exits occur for various reasons, for example when the guest executes an instruction that the CPU cannot handle by itself; on a physical machine, the OS usually deals with such events by invoking the corresponding handler.
Similarly, when such an exit comes back from the VM's vCPU, the VMM needs to contain appropriate code to handle the situation.
VcpuExit is defined as an enum in kvm_ioctls::VcpuExit.
When exits occur for various reasons while running the vCPU, the exit reasons defined in kvm.h in the Linux kernel are wrapped into VcpuExit.
Therefore, it is sufficient to pattern match on this result and handle each case appropriately.
Now, our code contains an instruction that outputs a value through an I/O port, which causes a KVM_EXIT_IO_OUT exit.
VcpuExit wraps this exit reason as IoOut.
Originally (in a C program, for example), we would have to calculate the appropriate offset to get the output data from the I/O port, but this processing is implemented in the run method and the necessary values are returned as part of VcpuExit.
So we don't have to write that unsafe code (pointer offset calculation) ourselves and can handle these exits as we like.
loop {
    match vcpu.run().expect("vcpu run failed") {
        kvm_ioctls::VcpuExit::IoOut(addr, data) => {
            println!(
                "Received I/O out exit. \
                Address: {:#x}, Data(hex): {:#x}",
                addr, data[0],
            );
        },
        kvm_ioctls::VcpuExit::Hlt => {
            break;
        }
        exit => panic!("unexpected exit reason: {:?}", exit),
    }
}
In the above, only KVM_EXIT_IO_OUT and KVM_EXIT_HLT are handled, and all other exits cause a panic. (Although all exits should be handled, I want to focus on describing the KVM API example and keep it simple.)
Since we are here, let's take a look at the processing of the run method in some detail, checking how KVM_EXIT_IO_OUT is handled.
If you look at the LWN article, you will see that it calculates the offset and outputs the necessary information in the following way.
case KVM_EXIT_IO:
    if (run->io.direction == KVM_EXIT_IO_OUT &&
        run->io.size == 1 &&
        run->io.port == 0x3f8 &&
        run->io.count == 1)
        putchar(*(((char *)run) + run->io.data_offset));
    else
        errx(1, "unhandled KVM_EXIT_IO");
    break;
On the other hand, the run method implemented in kvm_ioctls::VcpuFd looks like this:
...
let run = self.kvm_run_ptr.as_mut_ref();
match run.exit_reason {
    ...
    KVM_EXIT_IO => {
        let run_start = run as *mut kvm_run as *mut u8;
        // Safe because the exit_reason (which comes from the kernel) told us which
        // union field to use.
        let io = unsafe { run.__bindgen_anon_1.io };
        let port = io.port;
        let data_size = io.count as usize * io.size as usize;
        // The data_offset is defined by the kernel to be some number of bytes into the
        // kvm_run structure, which we have fully mmap'd.
        let data_ptr = unsafe { run_start.offset(io.data_offset as isize) };
        // The slice's lifetime is limited to the lifetime of this vCPU, which is equal
        // to the mmap of the `kvm_run` struct that this is slicing from.
        let data_slice = unsafe {
            std::slice::from_raw_parts_mut::<u8>(data_ptr as *mut u8, data_size)
        };
        match u32::from(io.direction) {
            KVM_EXIT_IO_IN => Ok(VcpuExit::IoIn(port, data_slice)),
            KVM_EXIT_IO_OUT => Ok(VcpuExit::IoOut(port, data_slice)),
            _ => Err(errno::Error::new(EINVAL)),
        }
    }
    ...
Let me explain a little. The kvm_run structure is provided by the kvm-bindings crate; it is automatically generated from the header file using bindgen, so it is essentially the Linux kernel's kvm_run structure converted directly to Rust.
First, kvm_run is obtained in the form of a pointer, using a common Rust pattern for obtaining pointers.
This corresponds to the first address of the kvm_run structure, which is bound to the run_start variable.
The information corresponding to run->io(.member) can be obtained from run.__bindgen_anon_1.io, although it is a bit tricky; the field named __bindgen_anon_1 is an artifact of automatic generation by bindgen.
The data we want is at the first address of kvm_run plus io.data_offset. This calculation is performed by run_start.offset(io.data_offset as isize). The data size can be calculated from io->size and io->count (in the LWN example it is 1 byte, so it is read directly from the offset with putchar). This value is computed and stored in data_size, and std::slice::from_raw_parts_mut then retrieves the data using this size.
Finally, by checking io.direction, the result is wrapped as either IoIn (for KVM_EXIT_IO_IN) or IoOut (for KVM_EXIT_IO_OUT), and the desired information such as port and data_slice is returned together.
As can be seen from the above, what is being done is clear.
However, it still contains many unsafe operations because it involves pointer manipulation.
By using these libraries, we can build our VMM on top of a stable foundation instead of writing such code ourselves.
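To tie the pieces of this chapter together, the following is a minimal, self-contained sketch that uses the kvm-ioctls, kvm-bindings, and libc crates directly (rather than ToyVMM's code) to run the tiny program shown at the beginning of this chapter. The memory size, slot number, and guest address are arbitrary example values:

use std::io::Write;
use std::ptr::null_mut;
use std::slice;

use kvm_bindings::{kvm_userspace_memory_region, KVM_MEM_LOG_DIRTY_PAGES};
use kvm_ioctls::{Kvm, VcpuExit};

fn main() {
    let mem_size: usize = 0x4000;
    let guest_addr: u64 = 0x1000;
    let asm_code: &[u8] = &[
        0xba, 0xf8, 0x03, // mov $0x3f8, %dx
        0x00, 0xd8,       // add %bl, %al
        0x04, b'0',       // add $'0', %al
        0xee,             // out %al, (%dx)
        0xb0, b'\n',      // mov $'\n', %al
        0xee,             // out %al, (%dx)
        0xf4,             // hlt
    ];

    // 1. Open /dev/kvm and create a VM.
    let kvm = Kvm::new().unwrap();
    let vm = kvm.create_vm().unwrap();

    // 2. Prepare guest memory with mmap and register it via KVM_SET_USER_MEMORY_REGION.
    let load_addr: *mut u8 = unsafe {
        libc::mmap(
            null_mut(),
            mem_size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
            -1,
            0,
        ) as *mut u8
    };
    let mem_region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: guest_addr,
        memory_size: mem_size as u64,
        userspace_addr: load_addr as u64,
        flags: KVM_MEM_LOG_DIRTY_PAGES,
    };
    unsafe { vm.set_user_memory_region(mem_region).unwrap() };
    unsafe {
        let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
        slice.write_all(asm_code).unwrap();
    }

    // 3. Create a vCPU and set up its registers.
    let vcpu = vm.create_vcpu(0).unwrap();
    let mut sregs = vcpu.get_sregs().unwrap();
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    vcpu.set_sregs(&sregs).unwrap();
    let mut regs = vcpu.get_regs().unwrap();
    regs.rip = guest_addr;
    regs.rax = 2;
    regs.rbx = 2;
    regs.rflags = 2;
    vcpu.set_regs(&regs).unwrap();

    // 4. Run the vCPU and handle exits.
    loop {
        match vcpu.run().unwrap() {
            VcpuExit::IoOut(port, data) => {
                println!("IoOut: port={:#x}, data={:#x?}", port, data);
            }
            VcpuExit::Hlt => break,
            exit => panic!("unexpected exit reason: {:?}", exit),
        }
    }
}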
Well, it's been a long time coming, but let's take a look back at the rust-vmm crates we're using once again.
kvm-bindings : Library that contains structures automatically generated from kvm.h by bindgen.
kvm-ioctls   : Library that hides the ioctl and unsafe processing related to KVM operations and provides user-friendly structures, functions and methods.
vm-memory    : Library that provides structures and operations for the memory.
This knowledge will come up again and again in future discussion and is basic and important.
Load Linux Kernel
In this section, we will explain the implementation of launching a Guest VM as the first step of the VMM. While our VMM has only minimal functionality, booting the Linux kernel demands a variety of knowledge.
In this section, we will explain the essential aspects of launching a Guest VM and delve into how it is implemented in ToyVMM. To achieve this, we will divide it into several detailed chapters and provide explanations for each topic.
The topics are as follows:
- 02-1. Overview of Booting Linux
- 02-2. ELF binary format and vmlinux structure
- 02-3. Loading initrd
- 02-4. Setup registers of vcpu
- 02-5. Serial console implementation
- 02-6. ToyVMM implementation
Additionally, this document is based on the following commit numbers:
- ToyVMM:
27fdf196dfb31938f24785ca64e7233a6dc8fceb
- Firecracker:
4bf121fc032cc2d94a20a3625f2af3918545154a
If you refer to this document while inspecting ToyVMM's code, it may be beneficial.
Overview of Booting Linux
General Booting Mechanism
In Linux, the operating system starts by executing programs in the following order:
- BIOS
- Boot Loader (GRUB)
- Linux Kernel (vmlinuz)
- init
The BIOS program is stored in the ROM on the motherboard. When you power on your computer, the CPU is instructed to start executing code from a specific address mapped to this ROM area. The BIOS performs hardware detection and initialization, then searches for the OS boot drive (HDD/SSD, USB flash drive, etc.). During this process, the boot drive needs to be formatted in either MBR or GPT format, depending on the BIOS type, as shown in the table below:
BIOS \ DISK Format | MBR | GPT |
---|---|---|
Legacy BIOS | ◯ | - |
UEFI | ◯ * | ◯ |
* UEFI supports Legacy Boot Mode and thus supports MBR.
Next, I will explain the process of searching for the OS when using MBR. But before going into details, let's briefly review the structure of MBR. The MBR structure explained here assumes HDD/SSD or USB flash memory and implicitly assumes the presence of the Partition Entry, as described later. Please note that this document uses the terms provided on Wikipedia, so keep that in mind.
MBR is a 512-byte sector located at the beginning of the boot drive and consists of three main parts:
- Bootstrap code area (446 bytes)
- Partition Entry (64 bytes = 16 bytes * 4)
- Boot Signature (2 bytes)
I won't go into the details of MBR here, but the Boot code area contains machine code programs (Boot Loaders) to boot the OS, and the Partition Entry stores information about the logical partitions on that disk. It's worth noting that the Boot code area is only 446 bytes in size, so Boot Loaders are typically stored elsewhere. A minimal program is placed in the Boot code area to load the actual Boot Loader into memory.
The critical part here is the "Boot Signature," which contains a 2-byte value used to ensure that the drive is a bootable device. When the BIOS searches for the OS boot drive, it reads the first sector (512 bytes), checks whether the last 2 bytes (the Boot Signature) are 0x55 and 0xAA, and identifies the drive as a bootable disk. It loads the first sector (512 bytes) from that disk into memory at 0x7c00 - 0x7fff and begins executing the program from 0x7c00.
Now, as a simple validation, let's check the Boot Signature on your machine. In this example, a virtual machine is used and its boot drive is labeled vda; on a regular machine, it might be something like sda. By writing the first sector's content to a file and examining the 2 bytes at an offset of 510 bytes, you should see the expected 0x55 0xAA signature.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 2M 0 rom
vda 252:0 0 300G 0 disk
├─vda1 252:1 0 1M 0 part
└─vda2 252:2 0 300G 0 part /
$ sudo dd if=/dev/vda of=mbr bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000214802 s, 2.4 MB/s
$ hexdump -s 510 -C mbr
000001fe 55 aa |U.|
00000200
Now, back to our discussion. After confirming the Boot Signature, the BIOS identifies the disk as bootable and loads its first sector (512 bytes) into memory at address 0x7c00. Program execution then starts from 0x7c00.
Moving on, once the Boot Loader is loaded into memory, it takes on the responsibility of loading the Linux Kernel and initramfs from the disk and starting the kernel. In recent years, GRUB has become a common choice as a Boot Loader. I'll skip the detailed workings of the Boot Loader for now. The essential point is that the Boot Loader needs to load the specified kernel and initrd from the disk.
To achieve this, one straightforward method would be to inform the Boot Loader of the location of the kernel file on the disk. However, if you look at the contents of grub.cfg, you'll notice that the kernel and initrd locations are specified in the form of file paths. This means that the Boot Loader must have the ability to interpret the file system. In practice, several Boot Loaders can interpret various file systems and locate the kernel based on directory path information; note, however, that Boot Loaders only support specific file system formats and cannot interpret others. The Boot Loader loads the kernel and RAM disk specified in grub.cfg, and by jumping to the kernel's entry point, it hands execution over to the kernel, completing its own processing.
Before delving into the details of the kernel's processing, let's briefly organize some information about the kernel file. The kernel file is generally named vmlinuz*. You might be familiar with the kernel file located at /boot/vmlinuz-*; this file, however, is in the bzImage format, which you can easily check using the file command. The bzImage includes the actual kernel binary along with several other files used for low-level initialization. In this document, I'll refer to the kernel file in the bzImage format as vmlinuz, and to the actual kernel binary in executable format as vmlinux.bin.
When control is handed over from the BootLoader to vmlinuz, vmlinuz performs low-level initialization, then decompresses the kernel core, loads it into memory, and transfers control to the kernel's entry routine. Once all initialization processes are completed, the kernel creates a tmpfs filesystem, unpacks the initramfs placed in RAM by the BootLoader into it, and starts the init script located in its root directory.

This init script prepares to mount the main filesystem stored on the disk and mounts other important filesystems. initramfs contains various device drivers and allows mounting root filesystems in different formats. After this is done, the root is switched to the main root filesystem, and the /sbin/init binary stored there is executed.
/sbin/init is the first process launched on the system (PID=1), and it serves as the parent of all other processes, responsible for starting them. There are various implementations of init, such as SysVinit and Upstart, but what is commonly used in recent systems like CentOS and Ubuntu is Systemd. The ultimate responsibility of init is to further prepare the system, ensure that the necessary services are running, and bring the system to a state where users can log in by the time the boot process completes.
This is a very high-level overview of the process from powering on to the OS booting up.
initrd and initramfs
In the previously discussed Linux boot process, we introduced initramfs, a file system that is unpacked into memory. However, what we often encounter is /boot/initrd.img. Here, we will explain the differences between initrd and initramfs.
initrd stands for "initial RAM disk", while initramfs stands for "initial RAM File System". Although they are different in nature, they serve the same purpose: to provide the commands, libraries, and modules needed to mount the root file system and launch the /sbin/init script located in it.
The challenge that both initrd and initramfs address is that the system you want to boot originally resides on some storage device, and to load it you need the appropriate device drivers and a file system driver for mounting it.
initrd and initramfs both address this issue, but with different methods. As their names suggest, initrd uses a block device, while initramfs uses a RAM file system based on tmpfs. Traditionally, initrd was used, but starting from kernel 2.6, initramfs became available, and it is now the more common choice.
The shift from initrd to initramfs occurred because initrd had several issues:
- A RAM disk is a mechanism that creates a pseudo block device in RAM, treating it as if it were a secondary storage device. However, because of this behavior, it inadvertently consumes memory cache, just like regular block devices, leading to unnecessary memory usage. Furthermore, mechanisms such as paging come into play, consuming more memory capacity.
- A RAM disk requires a file system driver, such as ext2, to format and interpret its data.
- RAM disks have a fixed size, which can lead to problems: if they are too small, they may not accommodate all the necessary scripts, and if they are too large, they waste memory.
To address these issues, initramfs was developed. It is a lightweight, memory-based file system that can be flexibly sized and is based on tmpfs. It is not a block device, so it doesn't interfere with memory caching or paging, and it doesn't require file system drivers for block devices. Additionally, it resolves the fixed size problem.
Whether using initrd or initramfs, both methods provide the tools inside them to mount the root file system and switch to it. The startup script /sbin/init located in that file system is then executed.
Inspecting the contents of initramfs
Let's unpack and examine the contents of an initramfs. We'll use an Ubuntu 20.04.2 LTS initrd for this example. (Note: the file named initrd is actually a proper initramfs.) An initramfs consists of several files concatenated in CPIO format. When you extract it directly using the cpio command, you'll see only the initial files (like AuthenticAMD.bin), as follows:
$ mkdir initrd-work && cd initrd-work
$ sudo cp /boot/initrd.img ./
$ cat initrd.img | cpio -idvm
.
kernel
kernel/x86
kernel/x86/microcode
kernel/x86/microcode/AuthenticAMD.bin
62 blocks
You can extract all the files using a combination of dd and cpio, but there's a handy tool called unmkinitramfs that can do this for you:
$ mkdir extract
$ unmkinitramfs initrd.img extract
$ ls extract
early early2 main
After extracting, you'll see directories like early, early2, and main. For instance, early contains the same files that were seen when extracting with cpio. The most crucial part is under main, where the contents of the file system root are stored:
$ ls extract/early/kernel/x86/microcode
AuthenticAMD.bin
$ ls extract/early2/kernel/x86/microcode
GenuineIntel.bin
$ ls extract/main
bin conf cryptroot etc init lib lib32 lib64 libx32 run sbin scripts usr var
By chrooting into this extracted content, you can pseudo-operate the Linux boot-time RAM filesystem and understand what operations can be performed:
$ sudo chroot extract/main /bin/sh
BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.3) built-in shell (ash)
Enter 'help' for a list of built-in commands.
# ls
scripts init run etc var usr conf
lib64 bin lib libx32 lib32 sbin cryptroot
# pwd
/
# which mount
/usr/bin/mount
# exit
As shown above, there is an init script file in the root directory, which is the script executed after the initramfs is extracted. The init script reads the contents of /proc/cmdline and extracts disk information (e.g., root=/dev/sda1) to perform the necessary mounting operations. If this information is missing, this init script from the Ubuntu 20.04 LTS initrd would encounter an error.
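As a purely illustrative sketch of this mechanism (the real logic lives in the shell init script inside the initramfs; this Rust snippet is not part of any of the tools discussed here), extracting the root= parameter from the kernel command line looks conceptually like this:

use std::fs;

// Scan the kernel command line for a `root=` parameter, as the init script does.
fn find_root_device() -> Option<String> {
    let cmdline = fs::read_to_string("/proc/cmdline").ok()?;
    cmdline
        .split_whitespace()
        .find_map(|param| param.strip_prefix("root=").map(str::to_string))
}

fn main() {
    match find_root_device() {
        Some(root) => println!("root filesystem device: {}", root),
        None => println!("no root= parameter found"),
    }
}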
In the case of ToyVMM, we use an initramfs based on firecracker-initrd, so the behavior might differ slightly.
About firecracker-initrd
In ToyVMM, we use firecracker-initrd. Firecracker-initrd creates an initrd.img (initramfs) based on Alpine Linux. Unlike the Ubuntu initrd we discussed earlier, it does not include additional CPIO files like microcode, so you can simply extract it to see the root filesystem:
$ cat initrd.img | cpio -idv
$ ls
bin dev etc home init initrd.img lib media mnt opt proc root run sbin srv sys tmp usr var
Alpine Linux normally unpacks its filesystem into RAM during boot, and then the OS starts; whether to then write the OS to disk using setup-alpine depends on your specific needs. In ToyVMM, when you boot a VM using this initramfs, it doesn't immediately mount a root file system from disk by default. Instead, it simply unpacks the file system into RAM, and Alpine Linux starts. This differs from the traditional approach of loading the root filesystem from secondary storage and informing the init script about it via /proc/cmdline.
Boot Sequence of Linux Kernel in ToyVMM
Now, let's compare what we've discussed so far with the Linux boot process in ToyVMM:
Boot Process (on Linux) | ToyVMM |
---|---|
BIOS | Not implemented yet |
Boot Loader | Requires implementation: Loading vmlinux/initrd.img, basic setup |
Linux Kernel | Processed by vmlinux.bin |
init | Processed by init scripts (from firecracker-initrd's initrd.img ) |
The current implementation of ToyVMM does not support loading bzImage and instead uses the ELF binary vmlinux.bin. It currently omits BIOS-related functions.

For the Boot Loader's tasks, such as loading vmlinux.bin and initrd.img into memory, ToyVMM needs its own implementation. The Linux kernel itself is handled by vmlinux.bin, while the init process is handled by the init scripts found in the initrd.img from firecracker-initrd.
For more detailed implementation instructions, you can refer to 02-6_minimal_vmm_implementation.
References
- MBR(Master Boot Records)の構造
- Initrd(4) - Linux man page
- Initramfsのしくみ
- Initramfs/ガイド
- Kernel Boot Process
- What's the Difference Between initrd and initramfs
- bzImage
- Initデーモンを理解する
- Linuxがブートするまで
- filesystems/ramfs-rootfs-initramfs.txt
ELF binary format and vmlinux structure
At the time of writing this document, the kernel used to boot a VM in ToyVMM assumes an ELF-formatted vmlinux.bin
. Therefore, within the VMM, it's necessary to interpret the ELF format and load the kernel into the memory area prepared for the VM appropriately. This process is implemented in the rust-vmm/linux-loader crate. While ToyVMM abstracts this implementation by using the crate, it is essential to understand how it works. Hence, this section provides an explanation of loading ELF binaries.
ELF Binary Format
The ELF file format consists of the following components:
As shown above, the ELF file format primarily consists of an ELF Header, Program Header Table, Segments (Sections), and Section Header Table. When used by a system loader, ELF files treat the entries in the Program Header Table as a collection of Segments, while compilers, assemblers, and linkers treat entries in the Section Header Table as a collection of Sections.
The ELF Header contains overall information about the ELF file. Each entry in the Program Header Table, known as a Program Header, holds header information about the corresponding Segment. Therefore, the number of Program Headers corresponds to the number of Segments. Furthermore, each Segment can be divided into multiple Sections, and the Section Header Table contains header information for these Sections.
The ELF Header always starts at the beginning of the file and holds the information necessary for reading the ELF data. Here are some excerpts from the ELF Header; for a comprehensive overview, please refer to the Man page of ELF.
Attribute | Meaning |
---|---|
e_entry | Virtual address representing the entry point to start this ELF process |
e_phoff | File offset value to the location of the Program Header Table |
e_shoff | File offset value to the location of the Section Header Table |
e_phentsize | Size of one entry in the Program Header Table |
e_phnum | Number of entries in the Program Header Table |
e_shentsize | Size of one entry in the Section Header Table |
e_shnum | Number of entries in the Section Header Table |
From the above excerpts, you can see that it's possible to extract information about each entry in the Program Header and Section Header. Now, let's focus on the contents of the Program Header.
Attribute | Meaning |
---|---|
p_type | Represents the type of the Segment pointed to by this Program Header , providing hints on how to interpret it |
p_offset | File offset value to the Segment pointed to by this Program Header |
p_paddr | In systems where physical addresses are meaningful, this value points to the physical address of the Segment pointed to by this Program Header |
p_filesz | Byte size of the file image of the Segment pointed to by this Program Header |
p_memsz | Byte size of the memory image of the Segment pointed to by this Program Header |
p_flags | Flags that indicate information about the Segment pointed to by this Program Header , such as executable, writable, and readable |
As mentioned earlier, by interpreting the contents of the Program Header, you can obtain the position and size of the corresponding segment and how to interpret it. For our purposes, understanding the structure of the Program Header is sufficient, so we will omit details about the Section Header and other components.
Now, the vmlinux.bin we will be working with has five Program Header entries, with the first four having a p_type value of PT_LOAD and the last one having PT_NOTE. Let's extract some details about PT_LOAD and PT_NOTE from the Man page of ELF:
p_type | Meaning |
---|---|
PT_LOAD | Represents a loadable Segment described by p_filesz and p_memsz . |
PT_NOTE | Contains auxiliary information for location and size. |
In the case of PT_LOAD, the byte sequence of the file is associated with the beginning of the memory segment. You can load the segment's contents into memory by copying p_filesz bytes of data starting at file offset p_offset to the segment's memory address (the in-memory size of the segment is p_memsz).
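To make this concrete, here is a hedged sketch of copying a single PT_LOAD segment into a flat byte buffer standing in for guest memory; the p_offset, p_filesz, and p_paddr parameters are placeholders that would normally come from a parsed Program Header:

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Copy one PT_LOAD segment from the ELF file into `guest_mem`,
// placing it at the segment's physical address (p_paddr).
fn load_segment(
    kernel: &mut File,
    guest_mem: &mut [u8],
    p_offset: u64,
    p_filesz: usize,
    p_paddr: usize,
) -> std::io::Result<()> {
    kernel.seek(SeekFrom::Start(p_offset))?;
    kernel.read_exact(&mut guest_mem[p_paddr..p_paddr + p_filesz])?;
    // Bytes between p_filesz and p_memsz (e.g. .bss) are expected to be zero,
    // which is already the case if guest_mem was zero-initialized.
    Ok(())
}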
With this minimal knowledge of ELF, let's proceed to analyze the content of vmlinux.bin.
Analyzing vmlinux
Let's analyze the content of vmlinux now. Some of the information we extract here will be crucial for future tasks. The readelf command is a powerful tool for dumping ELF-formatted files in a human-readable format. In this section, we will display the ELF Header (-h) and Program Headers (-l) of vmlinux.bin.
$ readelf -h -l vmlinux.bin
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1000000
Start of program headers: 64 (bytes into file)
Start of section headers: 21439000 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 5
Size of section headers: 64 (bytes)
Number of section headers: 36
Section header string table index: 35
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000
0x0000000000b72000 0x0000000000b72000 R E 0x200000
LOAD 0x0000000000e00000 0xffffffff81c00000 0x0000000001c00000
0x00000000000b0000 0x00000000000b0000 RW 0x200000
LOAD 0x0000000001000000 0x0000000000000000 0x0000000001cb0000
0x000000000001f658 0x000000000001f658 RW 0x200000
LOAD 0x00000000010d0000 0xffffffff81cd0000 0x0000000001cd0000
0x0000000000133000 0x0000000000413000 RWE 0x200000
NOTE 0x0000000000a031d4 0xffffffff818031d4 0x00000000018031d4
0x0000000000000024 0x0000000000000024 0x4
Section to Segment mapping:
Segment Sections...
00 .text .notes __ex_table .rodata .pci_fixup __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl __ksymtab_strings __param __modver
01 .data __bug_table .vvar
02 .data..percpu
03 .init.text .altinstr_aux .init.data .x86_cpu_dev.init .parainstructions .altinstructions .altinstr_replacement .iommu_table .apicdrivers .exit.text .smp_locks .data_nosave .bss .brk
04 .notes
From the ELF Header, we can see that the "Entry point address" (the e_entry value) represents the address (0x0100_0000) where this ELF process starts, which is essential information. This value is returned as the result of loading the kernel using rust-vmm/linux-loader, and it's also the value to set in the vCPU's RIP (instruction pointer) to start execution.
The e_phnum value in the ELF Header ("Number of program headers") is 5, which matches the number of Program Headers (Program Header Table entries). The Program Headers are displayed next, with the first four having a Type of LOAD and the last one being NOTE. Additionally, the first and fourth LOAD entries are marked as executable, indicating that executable code is present in these segments. The first entry is especially important as it likely corresponds to the entry point of the kernel's executable code.
Implementation in ToyVMM.
In ToyVMM, the loading of vmlinux is done within the load_kernel function in src/builder.rs. This function takes boot_config information, which includes the path to the kernel file, and the memory (guest_memory) allocated for the VM.

Within load_kernel, rust-vmm/linux-loader's Elf structure (imported as Loader) is used. This structure implements the KernelLoader trait, and its load function is responsible for loading ELF-formatted kernels into guest_memory. Here's an excerpt from the code:
use linux_loader::elf::Elf as Loader;

let entry_addr = Loader::load::<File, memory::GuestMemoryMmap>(
    guest_memory,
    None,
    &mut kernel_file,
    Some(GuestAddress(arch::x86_64::get_kernel_start())),
).map_err(StartVmError::KernelLoader)?;
Now, let's delve deeper into the implementation of linux-loader. In linux-loader, the KernelLoader trait is defined, and its definition looks like this:
/// Trait that specifies kernel image loading support.
pub trait KernelLoader {
    /// How to load a specific kernel image format into the guest memory.
    ///
    /// # Arguments
    ///
    /// * `guest_mem`: [`GuestMemory`] to load the kernel in.
    /// * `kernel_offset`: Usage varies between implementations.
    /// * `kernel_image`: Kernel image to be loaded.
    /// * `highmem_start_address`: Address where high memory starts.
    ///
    /// [`GuestMemory`]: https://docs.rs/vm-memory/latest/vm_memory/guest_memory/trait.GuestMemory.html
    fn load<F, M: GuestMemory>(
        guest_mem: &M,
        kernel_offset: Option<GuestAddress>,
        kernel_image: &mut F,
        highmem_start_address: Option<GuestAddress>,
    ) -> Result<KernelLoaderResult>
    where
        F: Read + Seek;
}
As inferred from the comments, this trait requires the load function to be implemented, which should load a specific kernel image format into the guest memory. In the case of linux-loader, there are x86_64 implementations that support loading ELF-format kernels, and it also has an implementation for bzImage-format kernels. However, for this discussion, let's focus on the ELF implementation.
The load function, which is implemented for ELF, performs the following steps:
- Extract the data from the beginning of the ELF file up to the size of the ELF header.
- Create an instance of the KernelLoaderResult struct named loader_result and store the value of the ELF header's e_entry field in its kernel_load member. This value represents the address where the system will initially transfer control, which is essentially the starting point of the process.
- Seek within the ELF file to the address where the program header table is located (determined by e_phoff), and then loop over all program headers (up to e_phnum) in the ELF file.
- While looping over the program headers, perform the following actions:
  - Seek within the ELF file to the location of the segment corresponding to the currently inspected program header (determined by p_offset).
  - Write the data from kernel_image (which has already been seeked to the beginning of the segment's data) into the guest memory, starting from the address calculated from mem_offset, for the size of the segment (p_filesz).
  - Update the value of kernel_end (the address of the end of the loaded segment in GuestMemory) and store the larger of loader_result.kernel_end and the newly calculated value in loader_result.kernel_end.
- After looping through all program headers, return loader_result as the final result.
This code essentially interprets and loads ELF files according to the ELF format. The returned KernelLoaderResult struct contains important information about the starting and ending positions of the kernel in GuestMemory, with the starting position being particularly crucial for use in Setup registers of vCPU.
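For reference, a small sketch of how the returned value is typically consumed afterwards (entry_addr is the KernelLoaderResult returned by the load call shown earlier; the surrounding code is an assumption, not an excerpt from ToyVMM):

// kernel_load holds the GuestAddress of the kernel entry point (from e_entry).
let kernel_entry = entry_addr.kernel_load;
// Later, this address is written into the vCPU's RIP register
// (see "Setup registers of vCPU").
println!("kernel entry point: {:#x}", kernel_entry.raw_value());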
References
Loading initrd
In this document, we will discuss loading and configuring initrd (initramfs) in order to boot a VM. When we mention initrd in the following sections, we are implicitly referring to initramfs. A detailed explanation of initramfs itself can be found in Overview of booting Linux, so please refer to that section for more information.
Loading initrd and setting up kernel header parameters
The function responsible for loading initrd is implemented as load_initrd. It takes two arguments: the memory allocated for the guest and a mutable reference to the File structure representing the opened initrd file (implementing the Read and Seek traits).
fn load_initrd<F>(
    vm_memory: &memory::GuestMemoryMmap,
    image: &mut F,
) -> std::result::Result<InitrdConfig, StartVmError>
where
    F: Read + Seek,
{
    let size: usize;
    // Get image size
    match image.seek(SeekFrom::End(0)) {
        Err(e) => return Err(StartVmError::InitrdRead(e)),
        Ok(0) => {
            return Err(StartVmError::InitrdRead(io::Error::new(
                io::ErrorKind::InvalidData,
                "Initrd image seek returned a size of zero",
            )))
        }
        Ok(s) => size = s as usize,
    };
    // Go back to the image start
    image.seek(SeekFrom::Start(0)).map_err(StartVmError::InitrdRead)?;
    // Get the target address
    let address = arch::initrd_load_addr(vm_memory, size)
        .map_err(|_| StartVmError::InitrdLoad)?;
    // Load the image into memory
    // - read_from is defined as a trait method of Bytes<A>,
    //   and GuestMemoryMmap implements this trait.
    vm_memory
        .read_from(GuestAddress(address), image, size)
        .map_err(|_| StartVmError::InitrdLoad)?;

    Ok(InitrdConfig {
        address: GuestAddress(address),
        size,
    })
}
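Before walking through the steps, here is a hedged usage sketch of how this function might be called; the initrd path and the surrounding error handling are illustrative assumptions, not ToyVMM's actual call site:

use std::fs::File;

// Hypothetical call site: open the initrd image and load it into guest memory.
let mut initrd_file = File::open("initrd.img").map_err(StartVmError::InitrdRead)?;
let initrd = load_initrd(&guest_memory, &mut initrd_file)?;
println!(
    "initrd loaded at {:#x} ({} bytes)",
    initrd.address.raw_value(),
    initrd.size
);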
The function performs the following steps:
- Retrieves the size of the initrd by seeking to the end of the file and then returning to the start.
- Calculates the target address in guest memory where the initrd should be loaded.
- Loads the contents of the initrd file into the specified guest memory address.
- Returns an InitrdConfig structure containing the guest memory address and size of the loaded initrd.
Once the initrd is loaded into memory, we need to configure the kernel's setup header. This header information is defined by the Boot Protocol. In ToyVMM, these settings are primarily configured in the configure_system function. The table below outlines the relevant settings, which are documented in the Boot Protocol:
Offset/Size | Name | Meaning | ToyVMM value |
---|---|---|---|
01FE/2 | boot_flag | 0xAA55 magic number | 0xaa55 |
0202/4 | header | Magic signature "HdrS" (0x53726448) | 0x5372_6448 |
0210/1 | type_of_loader | Boot loader identifier | 0xff (undefined) |
0218/4 | ramdisk_image | initrd load address (set by boot loader) | GUEST ADDRESS OF INITRD |
021C/4 | ramdisk_size | initrd size (set by boot loader) | SIZE OF INITRD |
0228/4 | cmd_line_ptr | 32-bit pointer to the kernel command line | 0x20000 |
0230/4 | kernel_alignment | Physical addr alignment required for kernel | 0x0100_0000 |
0238/4 | cmdline_size | Maximum size of the kernel command line | SIZE OF CMDLINE STRING |
These values are written to guest memory starting at address 0x7000. The 0x7000 address is also stored in the RSI vCPU register so it can be referenced during VM startup. For details on vCPU register setup, please refer to Setup registers of vCPU.
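As an illustration of what "writing these values starting at 0x7000" means, here is a hedged sketch that writes a few of the header fields directly through the vm-memory API. ToyVMM actually fills a boot_params structure and writes it in one go, so this function and its name are assumptions for illustration only; the field offsets (0x1FE, 0x202, 0x218, 0x21C) come from the Boot Protocol table above:

use vm_memory::{Bytes, GuestAddress};

const ZERO_PAGE_START: u64 = 0x7000;

// Sketch only: write a few of the setup-header fields listed in the table above.
fn write_boot_header_sketch(
    guest_memory: &memory::GuestMemoryMmap,
    initrd_addr: u32,
    initrd_size: u32,
) -> Result<(), vm_memory::GuestMemoryError> {
    // 0x1FE: boot_flag = 0xAA55
    guest_memory.write_obj(0xAA55u16, GuestAddress(ZERO_PAGE_START + 0x1FE))?;
    // 0x202: header magic "HdrS"
    guest_memory.write_obj(0x5372_6448u32, GuestAddress(ZERO_PAGE_START + 0x202))?;
    // 0x218 / 0x21C: ramdisk_image / ramdisk_size
    guest_memory.write_obj(initrd_addr, GuestAddress(ZERO_PAGE_START + 0x218))?;
    guest_memory.write_obj(initrd_size, GuestAddress(ZERO_PAGE_START + 0x21C))?;
    Ok(())
}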
Setup E820
Configuring the E820 for the Guest OS allows reporting of available memory regions to the OS and BootLoader. The settings for this are aligned with the implementation in Firecracker. The following code illustrates how the E820 entries are added based on the Guest memory configuration:
add_e820_entry(&mut params, 0, EBDA_START, E820_RAM)?;

let first_addr_past_32bits = GuestAddress(FIRST_ADDR_PAST_32BITS);
let end_32bit_gap_start = GuestAddress(MMIO_MEM_START);
let himem_start = GuestAddress(HIGH_MEMORY_START);
let last_addr = guest_mem.last_addr();
if last_addr < end_32bit_gap_start {
    add_e820_entry(
        &mut params,
        himem_start.raw_value() as u64,
        last_addr.unchecked_offset_from(himem_start) as u64 + 1,
        E820_RAM,
    )?;
} else {
    add_e820_entry(
        &mut params,
        himem_start.raw_value(),
        end_32bit_gap_start.unchecked_offset_from(himem_start),
        E820_RAM,
    )?;
    if last_addr > first_addr_past_32bits {
        add_e820_entry(
            &mut params,
            first_addr_past_32bits.raw_value(),
            last_addr.unchecked_offset_from(first_addr_past_32bits) + 1,
            E820_RAM,
        )?;
    }
}
To make sense of this code, it helps to understand the design of the entire guest address space, together with the code for starting a Guest VM in ToyVMM. The table below lists the current memory layout for the guest; please note that this may change in the future.
Guest Address | Contents | Note |
---|---|---|
0x0 - 0x9FBFF | E820 | |
0x7000 - 0x7FFF | Boot Params (Header) | ZERO_PAGE_START(=0x7000) |
0x9000 - 0x9FFF | PML4 | Now only 1 entry (8byte), maybe expand later |
0xA000 - 0xAFFF | PDPTE | Now only 1 entry (8byte), maybe expand later |
0xB000 - 0xBFFF | PDE | Now 512 entry (4096byte) |
0x20000 - | CMDLINE | Size depends on cmdline parameter len |
0x100000 | HIGH_MEMORY_START | |
0x100000 - 0x7FFFFFF | E820 | |
0x100000 - 0x20E3000 | vmlinux.bin | Size depends on vmlinux.bin's size |
0x6612000 - 0x7FFF834 | initrd.img | Size depends on initrd.img's size |
0x7FFFFFF | GuestMemory last address | based on (128 << 20 = 128MB = 0x8000000) - 1 |
0xD0000000 | MMIO_MEM_START(4GB - 768MB) | |
0xD0000000 - 0xFFFFFFFF | MMIO_MEM_START - FIRST_ADDR_PAST_32BIT | |
0x100000000 | FIRST_ADDR_PAST_32BIT (4GB~) |
Upon examining the code, you can see that the address range designed independently of the GuestMemory size (roughly 0x0 ~ HIGH_MEMORY_START) is always registered as "Usable" in the E820, from 0 to EBDA_START (0x9FBFF).
Subsequently, the range registered in the E820 changes depending on how much GuestMemory is allocated. In the current implementation, GuestMemory is set to reserve 128MB of memory by default, so the Guest Memory ranges from 0x0 to 0x7FF_FFFF. The vmlinux.bin content and initrd.img are mapped into this range.
In other words, the condition guest_mem.last_addr() = 0x7FF_FFFF < 0xD000_0000 = end_32bit_gap_start applies, so only the range HIGH_MEMORY_START ~ guest_mem.last_addr() is additionally registered. In the future, if the GuestMemory size exceeds 4GB, the ranges 0x10_0000 ~ 0xD000_0000 and 0x1_0000_0000 ~ guest_mem.last_addr() will be registered instead.
You will be able to confirm the console output when starting the VM shortly. Here, I've provided part of the output to show that the E820 entries you configured are registered:
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007ffffff] usable
References
- Linuxのブートシーケンスの基礎まとめ
- Linuxカーネルユーザ・管理者ガイド - 初期RAMdディスクを使用する
- initrd
- initramfs(initrd)のinitをbusyboxだけで書いてみた
- [initramfsとinitrdについて](https://blog.goo.ne.jp/pepolinux/e/4d1f4b6e0f5b5ed389f
Setup registers of vCPU
In this document, we will describe the configuration of vCPU registers. While registers are commonly discussed collectively, there are various types of registers, making it complex to determine how to set each of them. The content related to registers explained in this document focuses solely on the aspect of starting a virtual machine (VM). Additionally, as we want to boot the Guest OS in 64-bit mode, we will briefly explain some settings required for transitioning to 64-bit mode and the associated paging.
Setup vCPU general-purpose registers
Configuration of the vCPU's general-purpose registers can be done through the KVM set_regs
API. For this example, we will set the values of the registers as follows (detailed explanations of each register are omitted):
Register | Value | Meaning |
---|---|---|
RFLAGS | 2 | The bit at 0x02 must be set as a reserved bit |
RIP | KERNEL START ADDRESS (0x0100_0000 ) | Address of the entry point obtained from the ELF |
RSP | BOOT STACK POINTER (0x8ff0 ) | Address of the Stack Pointer used during boot |
RBP | BOOT STACK POINTER (0x8ff0 ) | Set to match RSP before boot processing |
RSI | boot_params ADDRESS (0x7000 ) | Address where boot_params information is stored |
The RIP should store the instruction start address when the vCPU is launched. In this case, we specify the address of the kernel's entry point. Since we plan to execute in 64-bit Long Mode, RIP's address will be treated as a virtual memory address; however, because paging is set up as an identity mapping, the virtual memory address is equal to the physical memory address. For RSP and RBP, we put the addresses necessary for the boot stack; these values can be taken from available memory. RSI should contain the address where the boot_params structure is stored. ToyVMM mimics Firecracker here, so the address values stored in RSP, RBP, and RSI are taken from Firecracker.
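A hedged sketch of what this looks like with kvm-ioctls and kvm-bindings (the variable names are assumptions, and the addresses are the example values from the table above):

use kvm_bindings::kvm_regs;

// `vcpu` is a kvm_ioctls::VcpuFd; `kernel_entry_addr` comes from the kernel loader.
let regs = kvm_regs {
    rflags: 0x2,            // reserved bit 1 must always be set
    rip: kernel_entry_addr, // e.g. 0x0100_0000, the kernel entry point
    rsp: 0x8ff0,            // boot stack pointer
    rbp: 0x8ff0,            // matches RSP before the boot process starts
    rsi: 0x7000,            // address where boot_params is stored (zero page)
    ..Default::default()
};
// Issues ioctl(vcpufd, KVM_SET_REGS, &regs) under the hood.
vcpu.set_regs(&regs).unwrap();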
Setup vCPU special registers
Configuration of vCPU special registers can be done through the KVM set_sregs
API. In this section, we will focus on the registers that are actually configured while briefly mentioning the background. The following explanations may introduce some unfamiliar terms. If you encounter such terms, please take the time to look them up.
IDT (Interrupt Descriptor Table)
The IDT (Interrupt Descriptor Table) is a data structure that holds information about interrupts and exceptions in Protected Mode and Long Mode. Originally, in Real Mode, there was the Interrupt Vector Table (IVT), which served the purpose of informing the CPU where the Interrupt Service Routines (ISRs) were located. In other words, it held handlers for each interrupt or exception, allowing the system to determine which handler to invoke when they occurred.
In Protected Mode and Long Mode, the address representation is different from Real Mode, so the IDT is a mechanism that provides the same capability adapted to these modes. The IDT is a table with up to 256 entries, and its address needs to be set in the IDTR register. When an interrupt occurs, the CPU references the IDT through the IDTR value and executes the specified interrupt handler.
According to the 64-bit Boot Protocol, interrupts should be set to "Disabled." Therefore, the IDT-related configuration is omitted in the ToyVMM (Firecracker) implementation, and we won't delve into the details of the IDT here.
Segmentation, GDT (Global Descriptor Table), LDT (Local Descriptor Table)
Before discussing GDT, let's briefly introduce segmentation. Memory segmentation is a memory management method where programs and data are managed in variable-sized blocks called segments. Segments are groups of information categorized by attributes in memory, and they are one of the memory management methods used to implement virtual memory and memory protection. In Linux, segmentation is used in conjunction with paging, assuming a flat memory model. For the rest of this discussion, we will proceed with this assumption.
The GDT (Global Descriptor Table) is a data structure used to manage memory segments. This structure closely resembles the IDT. The GDT is a table with multiple entries called Segment Descriptors, and the GDT's address needs to be set in the GDTR register. The entries in this table are accessed through a Segment Selector and describe which address range the segment covers, what operations are allowed in that region, and other details. Segment Selectors appear in the segment registers and in structures such as the Gate Descriptors of the IDT and the Task State Segment. We will omit detailed explanations here, so please research further if needed.
The LDT (Local Descriptor Table) is a data structure used to manage segments, similar to the GDT. However, an LDT can be held separately for each task or thread, which distinguishes it from the GDT. Having a separate LDT for each task allows segments to be shared among a task's own programs while keeping them separate from segments used by different tasks, enhancing isolation between tasks. Since the LDT is not relevant to this implementation, we will also skip detailed explanations about it.
GDT setup for 64-bit mode
As specified in the 64-bit Boot Protocol, in 64-bit mode each Segment Descriptor must be set up as a 4G flat segment, with the Code and Data Segments given the appropriate permissions. The Global Descriptor Table page notes that in 64-bit mode, base and limit are essentially ignored and each descriptor covers the entire linear address space, so apart from the flags the exact field values are not critical. Nonetheless, in this example the descriptors are explicitly set up as flat segments. Additionally, the Boot Protocol requires DS, ES, and SS to hold the same data segment value, and this is implemented accordingly.
Next, let's examine how these settings are configured in ToyVMM (which you can read as Firecracker). This is done in the configure_segments_and_sregs function. To make it easier to follow, some comments have been added:
#![allow(unused)]
fn main() {
fn configure_segments_and_sregs(sregs: &mut kvm_sregs, mem: &GuestMemoryMmap) -> Result<(), RegError> {
    let gdt_table: [u64; BOOT_GDT_MAX as usize] = [
        gdt::gdt_entry(0, 0, 0),            // NULL
        gdt::gdt_entry(0xa09b, 0, 0xfffff), // CODE
        gdt::gdt_entry(0xc093, 0, 0xfffff), // DATA
        gdt::gdt_entry(0x808b, 0, 0xfffff), // TSS
    ];
    // > https://wiki.osdev.org/Global_Descriptor_Table
    //
    //        55  52 47        40 39                          16 15                 0
    // CODE: 0b0..._1010_1111_1001_1011_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111
    //             <-f->     <-Access-><---------------------------> <----- limit ----->
    // - Flags  : 1010      => G(limit is in 4KiB), L(Long mode)
    // - Access : 1001_1011 => P(must 1), S(code/data type), E(executable), RW(readable/writable), A(CPU access allowed)
    // - 0xa09b of A,9,B represents above values
    //
    // DATA: 0b0..._1100_1111_1001_0011_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111
    // - Flags  : 1100      => G(limit is in 4KiB), DB(32-bit protected mode)
    // - Access : 1001_0011 => P(must 1), S(code/data type), RW(readable/writable), A(CPU access allowed)
    //
    // TSS
    // - Flags  : 1000      => G(limit is in 4KiB)
    // - Access : 1000_1011 => P(must 1), E(executable), RW(readable/writable), A(CPU access allowed)
    //   - TSS requires to support Intel VT
    let code_seg = gdt::kvm_segment_from_gdt(gdt_table[1], 1);
    let data_seg = gdt::kvm_segment_from_gdt(gdt_table[2], 2);
    let tss_seg = gdt::kvm_segment_from_gdt(gdt_table[3], 3);

    // Write segments
    write_gdt_table(&gdt_table[..], mem)?;
    sregs.gdt.base = BOOT_GDT_OFFSET as u64;
    sregs.gdt.limit = mem::size_of_val(&gdt_table) as u16 - 1;

    write_idt_value(0, mem)?;
    sregs.idt.base = BOOT_IDT_OFFSET as u64;
    sregs.idt.limit = mem::size_of::<u64>() as u16 - 1;

    sregs.cs = code_seg;
    sregs.ds = data_seg;
    sregs.es = data_seg;
    sregs.fs = data_seg;
    sregs.gs = data_seg;
    sregs.ss = data_seg;
    sregs.tr = tss_seg;

    // 64-bit protected mode
    sregs.cr0 |= X86_CR0_PE;
    sregs.efer |= EFER_LME | EFER_LMA;
    Ok(())
}
}
In the above code, a table with 4 entries is created as the GDT to set up. The first entry must be Null as required by the GDT. For the rest, it can be seen that settings for the CODE Segment, DATA Segment, and TSS Segment are made for the entire memory region. The TSS setting is done to meet the requirements of Intel VT, and it's not substantially used within the scope of this document.
Now, when creating this GDT, a function called gdt_entry
is called to create each entry. Here's the code for this function:
#![allow(unused)]
fn main() {
pub fn gdt_entry(flags: u16, base: u32, limit: u32) -> u64 {
    ((u64::from(base) & 0xff00_0000u64) << (56 - 24))
        | ((u64::from(flags) & 0x0000_f0ffu64) << 40)
        | ((u64::from(limit) & 0x000f_0000u64) << (48 - 16))
        | ((u64::from(base) & 0x00ff_ffffu64) << 16)
        | (u64::from(limit) & 0x0000_ffffu64)
}
}
For this function, all entries have 0x0 as the base and 0xFFFFF as the limit (a 20-bit limit; with the G flag set the granularity is 4KiB, so 2^20 × 4KiB = 4GiB), which makes it a flat segmentation. The flags argument for each entry is configured individually and corresponds to the values of the GDT's Flags and Access Byte. If you look at the comments in the code, you can see the value returned by gdt_entry for each entry and what that value represents when parsed. According to the comments, as required by the 64-bit Boot Protocol, the CODE Segment has Execute/Read permission and the "long mode (64-bit code segment)" flag, while the DATA Segment has Read/Write permission.
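If you want to convince yourself of the bit layout, the descriptor value can be checked in isolation. The snippet below is a standalone sketch: gdt_entry is copied verbatim from the listing above, and the assertion encodes the CODE line from the comments (Flags = 0xA, Access = 0x9B, limit = 0xFFFFF, base = 0).

pub fn gdt_entry(flags: u16, base: u32, limit: u32) -> u64 {
    ((u64::from(base) & 0xff00_0000u64) << (56 - 24))
        | ((u64::from(flags) & 0x0000_f0ffu64) << 40)
        | ((u64::from(limit) & 0x000f_0000u64) << (48 - 16))
        | ((u64::from(base) & 0x00ff_ffffu64) << 16)
        | (u64::from(limit) & 0x0000_ffffu64)
}

fn main() {
    let code = gdt_entry(0xa09b, 0, 0xfffff);
    // Same value as the CODE bit pattern spelled out in the comments above.
    assert_eq!(code, 0x00af_9b00_0000_ffff);
    println!("{:#066b}", code);
}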
The GDT created as mentioned above is written to GuestMemory using the write_gdt_table
function, and the starting address of that is stored in sregs.gdt.base
.
Regarding the subsequent IDT settings, as mentioned earlier, it appears to be disabled. Therefore, nothing is written to memory. However, the code decides on which address in GuestMemory to use and stores that address in sregs.idt.base
.
Continuing, other register values are set. As mentioned earlier, CS
is set with information about the CODE Segment, and DS
, ES
, SS
are set with information about the DATA Segment, while TR
is set with information about the TSS Segment. In the code above, FS
and GS
are also set with information about the DATA Segment, but these segment values may not need to be configured.
Finally, settings are made for CR0 and EFER registers, which will be explained later.
64-bit protected mode
The Long mode
is the native mode for x86_64 processors, offering several additional features compared to the legacy x86 mode. However, we won't go into the details of these additional features here. Long mode
consists of two submodes: 64-bit mode
and compatibility mode
.
To switch to 64-bit mode, you need to perform the following steps:
- Set CR4.PAE to enable Physical Address Extension (PAE).
- Create the Page Table and load the address of the top-level page table into the CR3 register.
- Set CR0.PG to enable Paging.
- Set EFER.LME to enable Long Mode.
Setting the values in the registers involves updating the corresponding fields in the kvm_sregs
structure and then configuring them using set_sregs
. The key part is creating the Page Table.
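To see where these pieces fit together, the following sketch shows an assumed wrapper (not necessarily ToyVMM's exact function) that reads the current special registers, applies the segment setup from the previous section and the page-table setup from the next section, and writes the result back via set_sregs.

use kvm_bindings::kvm_sregs;
use kvm_ioctls::VcpuFd;
use vm_memory::GuestMemoryMmap;

fn setup_sregs(vcpu: &VcpuFd, mem: &GuestMemoryMmap) {
    let mut sregs: kvm_sregs = vcpu.get_sregs().expect("KVM_GET_SREGS failed");
    // GDT/IDT, segment registers, CR0.PE and EFER.LME/LMA (previous section).
    configure_segments_and_sregs(&mut sregs, mem).expect("failed to configure segments");
    // CR3, CR4.PAE and CR0.PG via the identity-mapped page table (next section).
    setup_page_tables(&mut sregs, mem).expect("failed to set up page tables");
    vcpu.set_sregs(&sregs).expect("KVM_SET_SREGS failed");
}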
4-Level Page Table for entering 64-bit mode
The processes related to booting the Linux Kernel are categorized into several stages based on the available memory address space. Immediately after booting, the CPU runs in 16-bit Real Mode (x16 Real-Mode), where software sets up and works directly with physical memory addresses.
On the other hand, as many readers are aware, familiar operating systems like ours can be either 32-bit or 64-bit. These distinctions are made possible through a feature known as CPU mode switching, which transitions the CPU into modes called x32 Protected Mode
and x64 Long Mode
. Once switched to these modes, the CPU can only utilize virtual memory addresses.
Especially in the x64 CPU architecture, a 4-level page table is typically used to translate 64-bit virtual addresses into physical addresses. This means that before switching to x64 Long Mode
, a 4-level page table must be constructed and conveyed to the CPU. This process is implemented as part of the BootLoader's functionality.
Now, another crucial point to consider is that while the RIP value currently contains the physical address value indicating the kernel's entry point, when handling it in x64 Long Mode
, this address is used as a virtual address. Therefore, if this address were to be mapped to a different physical address, the OS would fail to boot.
Hence, at this stage, a simple page table is created where virtual memory addresses map to the same physical memory addresses. This is often referred to as Identity Mapping and addresses the issue mentioned above.
Note: It's important to note that the page table created by the BootLoader for x64 is a temporary requirement for executing the kernel. When we typically think of virtual memory addresses and page tables, we often associate them with user-space processes. However, the paging mechanism for user processes is implemented within the kernel and is configured when the kernel boots. Therefore, the mechanism for translating BootLoader's page table, whether it's Identity Mapping or not, has no impact on the paging mechanism for individual processes after the OS boots.
Page Table implementation in ToyVMM
Let's dive into the specific implementation of ToyVMM to understand the Page Table configuration better. This implementation closely follows that of Firecracker.
Let's briefly discuss the structure of the 4-Level Page Table. Essentially, at each level, there exists a table with its own designation:
- Level 4: Page Map Level 4 (PML4)
- Level 3: Page Directory Pointer Table (PDPT)
- Level 2: Page Directory Table (PDT)
- Level 1: Page Table (PT)

Each table can hold 512 entries, and each entry is 8 bytes (64 bits). The entire table is therefore 512 entries × 8 bytes = 4096 bytes, which conveniently fits into a single page (4KB).
The structure of each level's entry is as follows:
Source: x86 Initial Boot Sequence and OSdev/Paging
From the above, it seems that the setup should satisfy the following conditions:
- Consider the data within CR3, which serves as the address of PML4, as ranging from bits 12 to 32+ in order to design the PML4 address.
- To enable the PML4, set the 0th bit, and design the address of PDPT within the range of bits 12 to 32+.
- To utilize the layout of PDPTE page directory, do not set the 7th bit of PDPTE, and design the address of PD within the range of bits 12 to 32+.
- To allow 2MB pages in PDE, set the 7th bit and design the Physical Address within the range of bits 21 to 32+.
- In Firecracker, it appears that 2MiB paging is implemented without using Level 1 Page Tables (i.e., without using 4KiB pages). ToyVMM's implementation follows suit.
Now, let's extract the actual code from the implementation based on the above.
#![allow(unused)]
fn main() {
fn setup_page_tables(sregs: &mut kvm_sregs, mem: &GuestMemoryMmap) -> Result<(), RegError> {
    let boot_pml4_addr = GuestAddress(PML4_START);
    let boot_pdpte_addr = GuestAddress(PDPTE_START);
    let boot_pde_addr = GuestAddress(PDE_START);

    // Entry converting VA [0..512GB)
    mem.write_obj(boot_pdpte_addr.raw_value() as u64 | 0x03, boot_pml4_addr)
        .map_err(|_| RegError::WritePdpteAddress)?;
    // Entry covering VA [0..1GB)
    mem.write_obj(boot_pde_addr.raw_value() as u64 | 0x03, boot_pdpte_addr)
        .map_err(|_| RegError::WritePdpteAddress)?;
    // 512 2MB entries together covering VA [0..1GB).
    // Note we are assuming CPU support 2MB pages (/proc/cpuinfo has 'pse').
    for i in 0..512 {
        mem.write_obj((i << 21) + 0x83u64, boot_pde_addr.unchecked_add(i * 8))
            .map_err(|_| RegError::WritePdeAddress)?;
    }
    sregs.cr3 = boot_pml4_addr.raw_value() as u64;
    sregs.cr4 |= X86_CR4_PAE;
    sregs.cr0 |= X86_CR0_PG;
    Ok(())
}
}
As seen, the implementation is quite simple.
PML4_START
, PDPTE_START
, and PDE_START
have hardcoded address values, which are PML4_START=0x9000
, PDPTE_START=0xa000
, and PDE_START=0xb000
, respectively, meeting the requirements of the address designs mentioned above.
From the code, it's clear that there is only one PML4
and one PDPT
Table, and only the initial entry is set up. This is sufficient in this implementation because the kernel's address being translated by these page tables is 0x0100_0000
. These tables, specifically PML4
and PDPT
, will always look at the first entry (as described later), making this implementation suitable.
In PML4
, the information about the starting address of PDPT
is written by taking the logical OR of that address with 0x03
. Similarly, in PDPT
, the starting address of PD
is written by taking the logical OR of that address with 0x03
. The reason for using 0x03
here is to set the 0th and 1st bits of PML4E
and PDPTE
, which correspond to the R/W permission flag and the existence flag of that entry. These bits are essential in this case.
For PD, a loop is used to create 512 entries. For each loop index, the value obtained by shifting the index left by 21 bits, ORed with 0x83, is written at 8-byte (one entry) intervals from the start of PD's address. The reason for using 0x83 here is to set the R/W permission flag, the present flag, and the flag that treats the entry as a 2MB page frame. With this flag set, the value placed at bit 21 and above of the entry is used directly as the page-frame address (utilizing the layout of PDE 2MB page in the diagram). Therefore, for the PDEs, the entry at index 0 corresponds to address 0x0000_0000, the entry at index 1 corresponds to address 0x0020_0000 (2MB), and so on.
Now, let's check whether the kernel's address stored in EIP (0x0100_0000
) is correctly converted using the Page Table we just created! As mentioned earlier, when transitioning to x64 Long Mode
, this kernel address is treated as a 64-bit virtual address. Currently, ToyVMM (and Firecracker) loads the kernel at physical address 0x0100_0000
, and this value is stored in the eip
register.
Therefore, by treating 0x0100_0000
as a virtual address and using the conversion table mentioned above, we expect the result of the address translation to be 0x0100_0000
.
Let's calculate it explicitly. When converting a 64-bit virtual address with 4-Level Page Table, you split the lower 48 bits of the virtual address into groups of 9 + 9 + 9 + 9 + 12
bits each. These four groups of 9 bits are used as the index values for each Page table entry. You look up the layout of the identified entry in this way, then check the physical address of the next Page Table, and similarly determine the entry to be used in the next Page Table based on the physical address and virtual address. Continuing this process will eventually yield the desired physical address. Since Pages are at least 4KB in size, the address value is also in multiples of 4KB, so the final 12 bits of the virtual address serve as the offset (2^12 = 4KB
).
Let's remember that in this case, we have set the flag in PDE to treat it as a 2MB page frame. In this scenario, the result obtained from PDE is used directly as the physical address mapping. The 9 bits that are not used for PTE are treated as an offset, adding up to a total offset of 21 bits when combined with the original 12 bits. This 21-bit offset corresponds to the 2MB size. Similarly, when you set the flag in PDPTE, it is treated as a 1GB page frame.
Based on the above discussion, let's convert 0x0100_0000
. In binary representation for clarity, it is 0b0..._0000_0001_0000_0000_0000_0000_0000_0000
. Following the virtual address conversion method, it breaks down as follows:
Entry index for | Range of Virtual Address | Value |
---|---|---|
Page Map Level4 (PML4) | 47 ~ 39 bit | 0b0_0000_0000 |
Page Directory Pointer Table (PDPT) | 38 ~ 30 bit | 0b0_0000_0000 |
Page Directory Table (PDT) | 29 ~ 21 bit | 0b0_0000_1000 |
Page Tables (PT) | 20 ~ 12 bit | 0b0_0000_0000 |
- | 11 ~ 0 bit (offset) | 0b0_0000_0000 |
From this breakdown, you can see that the index values for PML4E
and PDPTE
are 0
, so you'll check the 64 bits directly from the beginning of each table. As implemented, PML4E
at index 0 contains the address of PDPT
, and PDPTE
at index 0 contains the address of PDT
. So, you follow this structure to reach PDT
.
Now, the PDT index value taken from the virtual address above is 0b0_0000_1000 (= 8), so you will check the 8th entry in the PDT. The 2MB page-frame address field of this entry also holds 0b0...0000_1000 (= 8). Placing this value at bit 21 and adding the 21-bit page offset (zero here) gives 0b1_0000_0000_0000_0000_0000_0000 = 0x100_0000 as the resulting physical address after conversion. This matches the input virtual address.
Hence, even after the conversion, the kernel's entry point will still be pointed to, and the kernel will begin execution in 64-bit long mode.
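The same calculation can be checked mechanically. The following standalone sketch (not ToyVMM code) extracts the 4-level page table indices and the 2MB-page offset for 0x0100_0000 and reproduces the identity-mapped result.

fn main() {
    let va: u64 = 0x0100_0000;
    let pml4_idx = (va >> 39) & 0x1ff; // bits 47..39
    let pdpt_idx = (va >> 30) & 0x1ff; // bits 38..30
    let pd_idx = (va >> 21) & 0x1ff;   // bits 29..21
    let offset_2mb = va & 0x1f_ffff;   // low 21 bits (2MB page offset)

    // Matches the breakdown table above: only the PDT index is non-zero.
    assert_eq!((pml4_idx, pdpt_idx, pd_idx, offset_2mb), (0, 0, 8, 0));

    // With the identity-mapped PDE written as (i << 21) | 0x83, the physical
    // address is simply (pd_idx << 21) + offset, i.e. 0x0100_0000 again.
    let pa = (pd_idx << 21) + offset_2mb;
    println!("PA = {:#x}", pa); // PA = 0x1000000
}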
It's worth noting that this Page Table, as designed in this implementation, effectively provides Identity Mapping over the range 0 ~ 2^30 - 1, in 2MB (2^21) units.
Note
Upon revisiting the Page Table created this time, it's important to note that there is only one entry each in the PML4 and the PDPT. As a result, the virtual address range that can be covered is at most 2^30 - 1 (1GB). Going beyond this range would require PDPT entries at indices other than 0 (and, beyond 512GB, PML4 entries other than 0).
Additionally, the 2MB page frame flag is enabled in the PD entries, so the lower 21 bits of the virtual address are treated as an offset. Furthermore, since each PDE's address field is derived from its index, this Page Table effectively provides Identity Mapping over the range 0 to 2^30 - 1 in 2MB units.
What to do next?
Up to this point, it's possible to start a Guest VM just by combining the discussed concepts. However, in this state, the Guest VM can be started but cannot be interacted with, leaving the setup somewhat incomplete. To ensure that the started Guest VM is configured as expected and for further interactions, we need to create an interface to control the Guest VM. In the next chapter, we will discuss the use of Serial
and how to implement it within ToyVMM to allow keyboard interactions after starting the Guest VM!
References
- The Linux/x86 Boot Protocol - 64-bit Boot Protocol
- Linux Insides: カーネル起動プロセス part4
- Global Descriptor Table (wiki)
- Interrupt Descriptor Table (wiki)
- Segmentation (wiki)
- Control register (wiki)
- Long mode (wiki)
- x86 initial boot sequence
- Virtual Memory - Intro to Paging Tables
- Writing an OS in Rust - Introduction to Paging
- Intel 64 and IA-32 Architectures Software Developer's Manual
Serial Console implementation
About Serial UART and ttyS0
UART (Universal Asynchronous Receiver/Transmitter) is an asynchronous serial communication standard used to connect computers and microcontrollers to peripheral devices. UART converts between parallel and serial signals, turning parallel input data into serial data and transmitting it to the other side over a communication line. Integrated circuits designed for this purpose were manufactured as the 8250 UART family, followed by various successor families.
Now, in this case, we are attempting to boot the Guest OS (Linux), and having a serial console is quite useful for debugging and other purposes. A serial console sends all console outputs of the Guest to the serial port. With the serial terminal properly configured, you can remotely monitor the system's boot status or log in to the system via the serial port. In this instance, we will use this method to check the state of a Guest VM running on ToyVMM and perform operations within the Guest.
To output console messages to the serial port, it is necessary to set console=ttyS0
as a kernel boot parameter. In the current implementation of ToyVMM, this value is provided as the default.
The challenge lies on the receiving side: the serial terminal. Since the I/O port addresses corresponding to the serial port are fixed, ToyVMM's layer will receive KVM_EXIT_IO for addresses in that range. In other words, it needs to properly handle the console output issued by the Guest OS as well as the other setup requests sent to those ports, and this can be achieved by emulating the UART device. Furthermore, if the emulated device forwards console output to standard output and feeds our standard input back into the Guest VM, then when starting the VM from ToyVMM we can watch the boot messages and operate the Guest from our local terminal.
In summary, we need to create something like the conceptual diagram below:
We will explain this in detail in the following sections.
Serial UART
For detailed information about the Serial UART, you can refer to the following resources by Lammert Bies and Wikibooks, which provide rich information:
The following figures are based on Lammert's document, with a brief explanation of each bit of each register. Although these diagrams were created by me personally while writing this document, they are attached in the hope that they will help readers understand the meaning of each register and bit. However, the meaning of each register and bit is not explained in this document, so please refer to the documents above for details:
Basically, UART operations are performed by manipulating the registers and bits shown above. In our case, we need to emulate this in software, and we plan to do this using rust-vmm/vm-superio. In the following sections, we'll briefly compare the implementation of rust-vmm/vm-superio with the above specifications.
Software Implementation of Serial Device using rust-vmm/vm-superio
Initial Value Settings/RW Implementation
Here, we will review the implementation of the serial device using rust-vmm/vm-superio while comparing it with the above specifications. I encourage you to obtain the code from the link provided and inspect it for yourself. The following content is based on version vm-superio-0.6.0
, so please note that it may have changed in the latest code.
First, let's organize some initial values for certain variables. rust-vmm/vm-superio was originally designed for VMM usage, so it initializes certain register values and doesn't anticipate changes.
Variable | DEFAULT VALUE | Meaning | REGISTER |
---|---|---|---|
baud_divisor_low | 0x0c | Baud rate 9600 bps | |
baud_divisor_high | 0x00 | Baud rate 9600 bps | |
interrupt_enable | 0x00 | No interrupts enabled | IER |
interrupt_identification | 0b0000_0001 | No pending interrupt | IIR |
line_control | 0b0000_0011 | 8-bit word length | LCR |
line_status | 0b0110_0000 | (1) | LSR |
modem_control | 0b0000_1000 | (2) | MCR |
modem_status | 0b1011_0000 | (3) | MSR |
scratch | 0b0000_0000 | - | SCR |
in_buffer | Vec::new() | Vector values (buffer) | - |
- (1) Setting THR empty-related bits. Setting these bits means that data can be received at any time. This represents the assumption that it will be used as a virtual device.
- (2) Many UARTs enable interrupts by default by setting Auxiliary Output 2 to 1.
- (3) Connected state and hardware data flow initialization.
Now, let's look at the processing when a write request is received. As a result of KVM_EXIT_IO
, we receive the address where IO occurred and the data to be written. On the ToyVMM side, we calculate the appropriate device (in this case, the Serial UART device) and its offset from the base address based on these values and call the write
function defined in vm-superio
. The following content is a simplified table representing the processing of Serial::write
. In general, it involves straightforward register value modification, with a few exceptions:
Variable | OFFSET(u8) | Additional Conditions | Write |
---|---|---|---|
DLAB_LOW_OFFSET | 0 | is_dlab_set = true | Modify self.baud_divisor_low |
DLAB_HIGH_OFFSET | 1 | is_dlab_set = true | Modify self.baud_divisor_high |
DATA_OFFSET | 0 | - (is_dlab_set = false) | (1) |
IER_OFFSET | 1 | - (is_dlab_set = false) | (2) |
LCR_OFFSET | 3 | - | Modify self.line_control |
MCR_OFFSET | 4 | - | Modify self.modem_control |
SCR_OFFSET | 7 | - | Modify self.scratch |
- (1) Depending on the current state of the Serial, we handle the cases where LOOP_BACK_MODE (MCR bit 4) is enabled and where it is not enabled.
  - If it is enabled, it simulates passing what is written to the transmit register directly to the receive register (loopback), which is not important in this context.
  - If it is not enabled, it writes the data to the output and, depending on the existing configuration, generates interrupts.
    - As shown in the table above, changing IIR through writes from outside is not supported, and its default value is 0b0000_0001.
    - If the THR-empty interrupt bit of IER is set, the corresponding THR-empty flag is set in IIR and an interrupt is triggered.
- (2) Among the bits of IER, only bits 0-3 are masked, and the result is written back to self.interrupt_enable.
Next, let's look at the processing when a read request is received. Similarly, we present the processing of Serial::read
in a simplified table. Unlike write, in the case of read, it mainly involves returning data as the result.
Variable | OFFSET(u8) | Additional Conditions | Read |
---|---|---|---|
DLAB_LOW_OFFSET | 0 | is_dlab_set = true | Read self.baud_divisor_low |
DLAB_HIGH_OFFSET | 1 | is_dlab_set = true | Read self.baud_divisor_high |
DATA_OFFSET | 0 | - (is_dlab_set = false) | (1) |
IER_OFFSET | 1 | - (is_dlab_set = false) | Read self.interrupt_enable |
IIR_OFFSET | 2 | - | (2) |
LCR_OFFSET | 3 | - | Read self.line_control |
MCR_OFFSET | 4 | - | Read self.modem_control |
LSR_OFFSET | 5 | - | Read self.line_status |
MSR_OFFSET | 6 | - | (3) |
SCR_OFFSET | 7 | - | Read self.scratch |
- (1) Reads data from the buffer held by the Serial structure. In the current implementation, this buffer is only filled by writes in loopback mode, so read operations on this region are not issued during the boot sequence of the OS.
- (2) Returns the result of self.interrupt_identification | 0b1100_0000 (FIFO enabled) and resets it to the default value.
- (3) The handling depends on whether the current state is loopback mode.
  - In the case of loopback, it adjusts appropriately (not important for this context).
  - In the case of non-loopback, it straightforwardly returns the value of self.modem_status.
Usage of rust-vmm/vm-superio in ToyVMM
In ToyVMM, we use rust-vmm/vm-superio to handle KVM_EXIT_IO
contents. Additionally, two things need to be considered:
- Outputting console output destined for the serial port to the standard output to allow monitoring of the boot sequence and internal state of the Guest VM.
- Passing the content of standard input to the Guest VM.
In the following sections, we'll go through each of these in order.
Outputting Console Output Destined for the Serial Port to Standard Output
To monitor the boot sequence and internal state of the Guest VM, we will redirect console output destined for the serial port to the standard output. "Console output destined for the serial port" corresponds to the case of KVM_EXIT_IO_OUT
where KVM_EXIT_IO
is issued for the "IO Port address for Serial". The code section below handles this:
#![allow(unused)]
fn main() {
...
loop {
    match vcpu.run() {
        Ok(run) => match run {
            ...
            VcpuExit::IoOut(addr, data) => {
                io_bus.write(addr as u64, data);
            }
            ...
        }
    }
}
...
}
Here, as a result of KVM_EXIT_IO_OUT
, we receive the address and data to be written. On the ToyVMM side, we simply call io_bus.write
with these values. The setup for this io_bus
is done as follows:
#![allow(unused)]
fn main() {
let mut io_bus = IoBus::new();
let com_evt_1_3 = EventFdTrigger::new(EventFd::new(libc::EFD_NONBLOCK).unwrap());
let stdio_serial = Arc::new(Mutex::new(SerialDevice {
    serial: serial::Serial::with_events(
        com_evt_1_3.try_clone().unwrap(),
        SerialEventsWrapper {
            buffer_read_event_fd: None,
        },
        Box::new(std::io::stdout()),
    ),
}));
io_bus.insert(stdio_serial.clone(), 0x3f8, 0x8).unwrap();
vm.fd().register_irqfd(&com_evt_1_3, 4).unwrap();
}
The setup above requires some explanation, so let's go through it step by step. In essence, it accomplishes the following:
- Initializes an I/O Bus represented by IoBus and initializes the eventfd for interrupts.
- Initializes the Serial Device. During initialization, we provide an eventfd for generating interrupts in the Guest and an FD (std::io::stdout()) for standard output.
- Registers the Serial Device we initialized with the IoBus. During registration, we specify 0x3f8 as the base and 0x8 as the range.
  - This means that the range of 0x8 starting from the base 0x3f8 represents the address space used by this Serial Device.
Handling the I/O Bus
The address value passed via KVM_EXIT_IO
becomes the value within the entire address space. On the other hand, the read/write
implementation in rust-vmm/vm-superio works based on an offset value from the Serial Device's base address. Therefore, there's a need for processing to bridge this gap.
You could simply calculate the offset, but in Firecracker, considering future extensibility (using I/O Ports for devices other than Serial), there's a Bus
structure representing the I/O Bus. This structure allows devices to be registered along with BusRange
(a structure representing the base address and address range for devices on the bus). Furthermore, when an I/O at a specific address occurs, the mechanism checks that address, retrieves the device registered in the corresponding address range, and performs I/O on that device using the offset from the base address.
For instance, the write
function is implemented as follows, where it retrieves the registered device and its offset based on the address information using the get_device
function, and then calls the write
function implemented in that device with the offset.
#![allow(unused)]
fn main() {
pub fn write(&self, addr: u64, data: &[u8]) -> bool {
    if let Some((offset, dev)) = self.get_device(addr) {
        // OK to unwrap as lock() failing is a serious error condition and should panic.
        dev.lock()
            .expect("Failed to acquire device lock")
            .write(offset, data);
        true
    } else {
        false
    }
}
}
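The get_device lookup itself is not shown in the excerpt above. As a rough idea of how such a lookup can work, here is a simplified sketch (illustrative types, not ToyVMM's exact definitions) that keys a BTreeMap by BusRange and resolves an absolute port address into an (offset, device) pair; the example resolves 0x3fb to offset 0x3, i.e. the LCR case described below.

use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct BusRange {
    base: u64,
    len: u64,
}

struct Bus<D> {
    devices: BTreeMap<BusRange, D>,
}

impl<D> Bus<D> {
    fn get_device(&self, addr: u64) -> Option<(u64, &D)> {
        // Find the registered range with the greatest base <= addr,
        // then check that it actually covers addr.
        self.devices
            .range(..=BusRange { base: addr, len: u64::MAX })
            .next_back()
            .filter(|(range, _)| addr < range.base + range.len)
            .map(|(range, dev)| (addr - range.base, dev))
    }
}

fn main() {
    let mut bus = Bus { devices: BTreeMap::new() };
    bus.devices.insert(BusRange { base: 0x3f8, len: 8 }, "serial");
    // 0x3fb falls inside [0x3f8, 0x400), so the offset is 0x3 (the LCR register).
    assert_eq!(bus.get_device(0x3fb), Some((0x3, &"serial")));
}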
Let's consider the Serial device as an example. As mentioned earlier, KVM_EXIT_IO_OUT
for the Serial device from the Guest VM occurs within an address range of 8 bytes with a base address of 0x3f8
. ToyVMM's IoBus also registers the Serial Device with the same address base and range. For example, when you trap an instruction that writes 0b1001_0011
to 0x3fb
as KVM_EXIT_IO_OUT
, it interprets this instruction as writing 0b1001_0011
to LCR
at the position 0x3
from the base address 0x3f8
.
Interrupt Notification to Guest VM via eventfd/irqfd
Now, let's discuss KVM and interrupts. We will reference some Linux source code, mainly from version v4.18
.
:warning: The following information is mainly based on source code and may not capture all the details of state transitions. If you find any inaccuracies, please let me know in the comments.
In rust-vmm/vm-superio, during Serial initialization, it requires an EventFd
as its first argument. This is a wrapper for eventfd in Linux. Eventfd allows inter-process and process-to-kernel event notifications.
Next is irqfd. irqfd is a mechanism based on eventfd that allows injecting interrupts into a VM. In simple terms, it's like having one end of eventfd held by KVM, and the other end's notifications are interpreted as interrupts to the Guest VM. This irqfd-based interrupt is meant to emulate interrupts from the external world to the Guest VM, which corresponds to regular system interrupts from peripheral devices in a typical system. Notifications in the reverse direction (from the Guest VM to the host) are handled using the ioeventfd mechanism, which we'll omit for now.
Let's examine how irqfd is connected to Guest VM interrupts by looking at the source code. When you perform an ioctl with KVM_IRQFD
against KVM, it goes through the KVM processing with the data passed to kvm_irqfd
and kvm_irqfd_assign
. In the kvm_irqfd_assign
function, an instance of the kvm_kernel_irqfd
structure is created. At this point, settings are made based on additional information passed during the ioctl. Particularly, the gsi
field in the kvm_kernel_irqfd
structure is set based on the value passed as an argument during the ioctl. This gsi
corresponds to the index of the interrupt table for the Guest, so when making the ioctl, you specify which interrupt table entry you want to use along with the eventfd. ToyVMM sets this up with a line like this:
#![allow(unused)] fn main() { vm.fd().register_irqfd(&com_evt_1_3, 4).unwrap(); }
This is defined as a method in the kvm_ioctl::VmFd
structure.
#![allow(unused)]
fn main() {
pub fn register_irqfd(&self, fd: &EventFd, gsi: u32) -> Result<()> {
    let irqfd = kvm_irqfd {
        fd: fd.as_raw_fd() as u32,
        gsi,
        ..Default::default()
    };
    // Safe because we know that our file is a VM fd, we know the kernel will only read
    // the correct amount of memory from our pointer, and we verify the return result.
    let ret = unsafe { ioctl_with_ref(self, KVM_IRQFD(), &irqfd) };
    if ret == 0 {
        Ok(())
    } else {
        Err(errno::Error::last())
    }
}
}
In other words, in the aforementioned setup, the eventfd (com_evt_1_3
) used by the Serial device has been configured with GSI=4 (the Guest VM's interrupt table index for the COM1 port). Therefore, any write
operation performed on com_evt_1_3
results in an interrupt being sent to the Guest VM as if it were generated from COM1. From the Guest's perspective, this means that an interrupt originated from the Serial device downstream of COM1, leading to the invocation of the Guest VM's COM1 interrupt handler.
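In code, the device side of this arrangement is nothing more than a write to that eventfd. The fragment below is a tiny sketch using vmm-sys-util's EventFd directly (ToyVMM wraps it in EventFdTrigger, but the effect is the same), assuming the eventfd has already been registered with register_irqfd as shown earlier.

use vmm_sys_util::eventfd::EventFd;

fn main() {
    // In the real setup the other end of this eventfd is handed to KVM through
    // vm.fd().register_irqfd(&com_evt, 4); after that, a single write() is all
    // the emulated device needs to do for the guest to see an IRQ 4 (COM1) interrupt.
    let com_evt = EventFd::new(libc::EFD_NONBLOCK).unwrap();
    com_evt.write(1).unwrap();
}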
Now, let's discuss the setup of the Guest-side Interrupt Table (GSI: Global System Interrupt Table) and how and when it's established. In short, these tables are set up by issuing an ioctl to KVM with KVM_CREATE_IRQCHIP
. This operation creates two interrupt controllers, the PIC
and IOAPIC
(internally, the kvm_pic_init
function handles PIC initialization, registers read/write ops, and sets it in kvm->arch.vpic
. Similarly, kvm_ioapic_init
initializes the IOAPIC, registers read/write ops, and sets it in kvm->arch.vioapic
). These hardware components, such as the PIC and IOAPIC, are implemented within KVM for the purpose of acceleration, so there's no need to emulate them separately. While you could delegate this task to qemu, we'll omit this detail here since we're not using it.
Furthermore, the kvm_setup_default_irq_routing
function sets up default IRQ routing. This process determines which handler will be invoked for each GSI-based interrupt. Let's take a closer look at the contents of kvm_setup_default_irq_routing
. This function calls kvm_set_irq_routing
, where the essential processing takes place. Here, a kvm_irq_routing_table
is created and populated with kvm_kernel_irq_routing_entry
structures that represent the mapping from GSI to IRQ.
The kvm_kernel_irq_routing_entry
structures are populated using a loop that iterates through a default_routing
array. Here's how default_routing
is defined along with related macros:
#define SELECT_PIC(irq) \
((irq) < 8 ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE)
#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
#define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
#define PIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip = { .irqchip = SELECT_PIC(irq), .pin = (irq) % 8 } }
#define ROUTING_ENTRY2(irq) \
IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
static const struct kvm_irq_routing_entry default_routing[] = {
ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
};
As you can see, IRQ numbers 0-15 are passed to ROUTING_ENTRY2
, and IRQ numbers 16-23 are passed to ROUTING_ENTRY1
. ROUTING_ENTRY2
calls both IOAPIC_ROUTING_ENTRY
and PIC_ROUTING_ENTRY
, while ROUTING_ENTRY1
calls IOAPIC_ROUTING_ENTRY
only, creating structures with the necessary information.
These structures are used to set up each .u.irqchip.irqchip
value (KVM_IRQCHIP_PIC_SLAVE
, KVM_IRQCHIP_PIC_MASTER
, KVM_IRQCHIP_IOAPIC
) appropriately in the kvm_set_routing_entry
function, depending on the IRQ. This function performs callbacks (kvm_set_pic_irq
, kvm_set_ioapic_irq
) and any necessary configurations when an interrupt occurs. We'll discuss these callbacks in more detail later.
int kvm_set_routing_entry(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
/* We can't check irqchip_in_kernel() here as some callers are
* currently initializing the irqchip. Other callers should therefore
* check kvm_arch_can_set_irq_routing() before calling this function.
*/
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
if (irqchip_split(kvm))
return -EINVAL;
e->irqchip.pin = ue->u.irqchip.pin;
switch (ue->u.irqchip.irqchip) {
case KVM_IRQCHIP_PIC_SLAVE:
e->irqchip.pin += PIC_NUM_PINS / 2;
/* fall through */
case KVM_IRQCHIP_PIC_MASTER:
if (ue->u.irqchip.pin >= PIC_NUM_PINS / 2)
return -EINVAL;
e->set = kvm_set_pic_irq;
break;
case KVM_IRQCHIP_IOAPIC:
if (ue->u.irqchip.pin >= KVM_IOAPIC_NUM_PINS)
return -EINVAL;
e->set = kvm_set_ioapic_irq;
break;
default:
return -EINVAL;
}
e->irqchip.irqchip = ue->u.irqchip.irqchip;
break;
...
Now, let's return to the discussion of irqfd
. Although not mentioned earlier, the kvm_irqfd_assign
function includes the init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup)
process, registering irqfd_wakeup
with &irqfd->wait->func
. This function is called when an interrupt occurs, and it invokes schedule_work(&irqfd->inject)
.
The inject
field is also initialized within the kvm_irqfd_assign
function, resulting in a call to the irqfd_inject
function. Inside irqfd_inject
, the kvm_set_irq
function is called.
The kvm_set_irq
function lists entries with the incoming IRQ number and calls their set
callbacks. This means that functions like kvm_set_pic_irq
and kvm_set_ioapic_irq
, as described earlier, will be called based on the routing information.
The following explanation goes into a little more depth on interrupt processing; since it is not strictly necessary for understanding ToyVMM, you may skip ahead to ToyVMM serial console.
Let's take a closer look at the kvm_set_pic_irq
handler, which is responsible for handling interrupts. While this discussion slightly deviates from the main topic, it's a good opportunity to explore it more thoroughly.
kvm_set_pic_irq
simply utilizes the kvm_pic_set_irq
function, passing the relevant parameters.
static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
{
struct kvm_pic *pic = kvm->arch.vpic;
return kvm_pic_set_irq(pic, e->irqchip.pin, irq_source_id, level);
}
Let's inspect the implementation of kvm_pic_set_irq
:
int kvm_pic_set_irq(struct kvm_pic *s, int irq, int irq_source_id, int level)
{
int ret, irq_level;
BUG_ON(irq < 0 || irq >= PIC_NUM_PINS);
pic_lock(s);
irq_level = __kvm_irq_line_state(&s->irq_states[irq],
irq_source_id, level);
ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, irq_level);
pic_update_irq(s);
trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
s->pics[irq >> 3].imr, ret == 0);
pic_unlock(s);
return ret;
}
In pic_set_irq1, the IRQ level is set; pic_update_irq then calls pic_irq_request and updates the kvm->arch.vpic structure.
/*
* raise irq to CPU if necessary. must be called every time the active
 * irq may change
*/
static void pic_update_irq(struct kvm_pic *s)
{
int irq2, irq;
irq2 = pic_get_irq(&s->pics[1]);
if (irq2 >= 0) {
/*
* if irq request by slave pic, signal master PIC
*/
pic_set_irq1(&s->pics[0], 2, 1);
pic_set_irq1(&s->pics[0], 2, 0);
}
irq = pic_get_irq(&s->pics[0]);
pic_irq_request(s->kvm, irq >= 0);
}
/*
* callback when PIC0 irq status changed
*/
static void pic_irq_request(struct kvm *kvm, int level)
{
struct kvm_pic *s = kvm->arch.vpic;
if (!s->output)
s->wakeup_needed = true;
s->output = level;
}
After that, kvm_pic_set_irq invokes the pic_unlock function.
This function is worth a closer look because, if the wakeup_needed field is true, it invokes the kvm_vcpu_kick function for the vCPU.
static void pic_unlock(struct kvm_pic *s)
__releases(&s->lock)
{
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu;
int i;
s->wakeup_needed = false;
spin_unlock(&s->lock);
if (wakeup) {
kvm_for_each_vcpu(i, vcpu, s->kvm) {
if (kvm_apic_accept_pic_intr(vcpu)) {
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
return;
}
}
}
}
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
int me;
int cpu = vcpu->cpu;
if (kvm_vcpu_wake_up(vcpu))
return;
me = get_cpu();
if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
put_cpu();
}
As a result of kvm_vcpu_kick invoking the smp_send_reschedule function, the native_smp_send_reschedule function is called.
static void native_smp_send_reschedule(int cpu)
{
if (unlikely(cpu_is_offline(cpu))) {
WARN_ON(1);
return;
}
apic->send_IPI(cpu, RESCHEDULE_VECTOR);
}
By invoking smp_send_reschedule
, an IPI (Inter-Processor Interrupt) is sent to another CPU, prompting it to reschedule. This results in an interrupt being inserted into the vCPU, causing a VMExit
. Consequently, the vCPU is scheduled when the interrupt is delivered.
Now, let's briefly review the process of how interrupts are inserted. When KVM_RUN
is executed, the following steps are performed (focusing solely on interrupt insertion, omitting other extensive processing):
kvm_arch_vcpu_ioctl_run
-> vcpu_run
-> vcpu_enter_guest
-> inject_pending_event
-> kvm_cpu_has_injectable_intr
Within kvm_cpu_has_injectable_intr
, the kvm_cpu_has_extint
function is called. In this case, it likely returns 1
, probably based on the value of s->output
set by pic_irq_request
.
Therefore, the following part of the inject_pending_event
function is reached:
} else if (kvm_cpu_has_injectable_intr(vcpu)) {
/*
* Because interrupts can be injected asynchronously, we are
* calling check_nested_events again here to avoid a race condition.
* See https://lkml.org/lkml/2014/7/2/60 for discussion about this
* proposal and current concerns. Perhaps we should be setting
* KVM_REQ_EVENT only on certain events and not unconditionally?
*/
if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
if (r != 0)
return r;
}
if (kvm_x86_ops->interrupt_allowed(vcpu)) {
kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
false);
kvm_x86_ops->set_irq(vcpu);
}
}
Finally, kvm_x86_ops->set_irq(vcpu)
is called, and this triggers the vmx_inject_irq
callback function. In this process, it inserts the interrupt by setting VMCS
(Virtual Machine Control Structure
) with VMX_ENTRY_INTR_INFO_FIELD
. While not elaborated on here, explaining VMCS
would require delving into hypervisor implementation details, which is beyond the scope of this discussion. It may be added as supplementary information in the documentation in the future.
In summary, this is the flow of interrupt processing using the PIC as an example.
ToyVMM serial console
Now, at this point, let's temporarily set aside the exploration of interrupts and return to discussing the implementation of ToyVMM. Considering the previous discussions, let's organize what processes are being executed within ToyVMM and what happens behind the scenes.
In ToyVMM, before performing register_irqfd
as mentioned earlier, a function called setup_irqchip
is actually executed. This function acts as a thin wrapper and internally makes calls to create_irq_chip
and create_pit2
.
#![allow(unused)]
fn main() {
#[cfg(target_arch = "x86_64")]
pub fn setup_irqchip(&self) -> Result<()> {
    self.fd.create_irq_chip().map_err(Error::VmSetup)?;
    let pit_config = kvm_pit_config {
        flags: KVM_PIT_SPEAKER_DUMMY,
        ..Default::default()
    };
    self.fd.create_pit2(pit_config).map_err(Error::VmSetup)
}
}
What's important here is the create_irq_chip
function. Internally, it calls the KVM_CREATE_IRQCHIP
API, as mentioned earlier, to initialize the interrupt controller and IRQ routing. Following this setup, register_irqfd(&com_evt_1_3, 4)
is executed on the configured Guest VM, which, as explained earlier, calls functions like kvm_irqfd_assign
to set up interrupt handlers. This completes the setup of interrupt-related configurations using the KVM API.
Now, let's revisit the interrupts coming from com_evt_1_3
. As previously discussed, one end of this eventfd is handed to KVM together with GSI=4
through register_irqfd
. Consequently, any write
issued from the other end is treated as an interrupt to the Guest VM as if it were sent to the COM1 port. On the other hand, the other end of com_evt_1_3
is passed to the Serial Device, making writes to the eventfd on the Serial Device side (occurring after processing through Serial::write
or through the invocation of Serial::enqueue_raw_bytes
) the actual interrupt triggers. In essence, this setup enables the Guest VM and the software-implemented Serial Device to interact in a manner similar to regular server and Serial Device communication.
Furthermore, to represent a Serial Console, we've configured stdout
as the destination for writes corresponding to the Serial Device's output in this case. Therefore, when handling KVM_EXIT_IO_OUT
and writing to THR, the data is passed to stdout
, resulting in console messages being output to standard output. This effectively realizes the desired Serial Console functionality.
Controlling the Guest VM via Standard Input
Finally, to manipulate the Guest VM using standard input, we want to reflect the contents of standard input into the Guest VM. The Serial
struct provided by rust-vmm/vm-superio offers a helper function called enqueue_raw_bytes
. This helper function allows us to send data to the Guest VM without needing to handle low-level register operations or interrupts explicitly, as the function handles these operations internally.
To achieve this, we need to read input from the program and pass it directly to this method. We can set up standard input in raw mode, and the main thread can poll it while waiting for input. When input is received, we can use enqueue_raw_bytes
to send it to the Guest VM. Since each vCPU of the Guest VM is executed in a separate thread, polling standard input in the main thread won't affect the processing of the Guest VM.
Here is a basic implementation:
#![allow(unused)]
fn main() {
let stdin_handle = io::stdin();
let stdin_lock = stdin_handle.lock();
stdin_lock
    .set_raw_mode()
    .expect("failed to set terminal raw mode");
let ctx: PollContext<Token> = PollContext::new().unwrap();
ctx.add(&exit_evt, Token::Exit).unwrap();
ctx.add(&stdin_lock, Token::Stdin).unwrap();
'poll: loop {
    let pollevents: PollEvents<Token> = ctx.wait().unwrap();
    let tokens: Vec<Token> = pollevents.iter_readable().map(|e| e.token()).collect();
    for &token in tokens.iter() {
        match token {
            Token::Exit => {
                println!("vcpu requested shutdown");
                break 'poll;
            }
            Token::Stdin => {
                let mut out = [0u8; 64];
                tx.send(true).unwrap();
                match stdin_lock.read_raw(&mut out[..]) {
                    Ok(0) => {
                        println!("eof!");
                    }
                    Ok(count) => {
                        stdio_serial
                            .lock()
                            .unwrap()
                            .serial
                            .enqueue_raw_bytes(&out[..count])
                            .expect("failed to enqueue bytes");
                    }
                    Err(e) => {
                        println!("error while reading stdin: {:?}", e);
                    }
                }
            }
            _ => {}
        }
    }
}
}
This is a straightforward implementation, but it achieves the desired functionality.
Check UART Request When Booting the Linux Kernel
In the previous sections, we discussed the software implementation of the Serial UART and how it's used internally within ToyVMM. While it works effectively, it's important to examine the UART communication during the Linux Kernel boot process.
Fortunately, due to the VMM's architecture, we need to handle KVM_EXIT_IO
, which allows us to intercept all requests sent to the serial port by injecting debug code into this handling process.
I won't go into detail about the code inserted for debugging purposes here, as it's quite straightforward to insert debug code at the appropriate locations. Instead, I'll provide annotations in three specific formats to make it clear and understandable when looking at requests made to the 0x3f8 (COM1)
register during OS startup.
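For reference, debug code of this kind can be as small as the following sketch. The helper name, the register-name table, and the output format are illustrative and do not match ToyVMM's actual debug code or the exact annotation format defined next.

const SERIAL_BASE: u64 = 0x3f8;
const REG_NAMES: [&str; 8] = [
    "THR/RBR/DLL", "IER/DLM", "IIR/FCR", "LCR", "MCR", "LSR", "MSR", "SCR",
];

// Call this from the KVM_EXIT_IO handling shown earlier to dump serial-port accesses.
fn dump_serial_io(is_write: bool, addr: u64, data: &[u8]) {
    if (SERIAL_BASE..SERIAL_BASE + 8).contains(&addr) && !data.is_empty() {
        let offset = (addr - SERIAL_BASE) as usize;
        let dir = if is_write { "w" } else { "r" };
        println!("{}({}) = {:08b}", dir, REG_NAMES[offset], data[0]);
    }
}

fn main() {
    // e.g. the guest writing 0b1001_0011 to 0x3fb would be logged as: w(LCR) = 10010011
    dump_serial_io(true, 0x3fb, &[0b1001_0011]);
}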
[Format 1 - Read]
r($register) = $data
- Description
- r = Read operation
- $register = The register corresponding to the offset calculated using the device's address (0x3f8)
- $data = Data read from $register
- Description = Explanation
[Format 2 - Write]
w($register = $data)
- Description
- w = Write operation
- $register = The register corresponding to the offset calculated using the device's address (0x3f8)
- $data = Data to be written to $register
- Description = Explanation
[Format 3 - Write (character)]
w(THR = $data = 0xYY) -> 'CHAR'
- w(THR ...) = Write operation to THR
- $data = Binary data to be written to $register
- 0xYY = $data converted to hexadecimal
- 'CHAR' = 0xYY converted to a character based on the ASCII code table
Now, the following is a somewhat lengthy representation of requests made to the 0x3f8 (COM1)
register during OS startup, formatted according to the above annotations:
# Initial setup, configuring baud rate, etc.
w(IER = 0)
w(LCR = 10010011)
- DLAB = 1 (DLAB: DLL and DLM accessible)
- Break signal = 0 (Break signal disabled)
- Parity = 010 (No parity)
- Stop bits = 0 (1 stop bit)
- Data bits = 11 (8 data bits)
w(DLL = 00001100)
w(DLM = 0)
- DLL = 0x0C, DLM = 0x00 (Speed = 9600 bps)
w(LCR = 00010011)
- DLAB = 0 (DLAB: RBR, THR, and IER accessible)
- Break signal = 0 (Break signal disabled)
- Parity = 010 (No parity)
- Stop bits = 0 (1 stop bit)
- Data bits = 11 (8 data bits)
w(FCR = 0)
w(MCR = 00000001)
- Reserved = 00
- Autoflow control = 0
- Loopback mode = 0
- Auxiliary output 2 = 0
- Auxiliary output 1 = 0
- Request to send = 0
- Data terminal ready = 1
r(IER) = 0
w(IER = 0)
# From here, the actual console output is being received through the serial port,
# and write operations (in this case, writing to stdout) are happening.
# Checking the content of r(LSR) to determine whether to write the next character
r(LSR) = 01100000
- Errornous data in FIFO = 0
- THR is empty, and line is idle = 1
- THR is empty = 1
- Break signal received = 0
- Framing error = 0
- Parity error = 0
- Overrun error = 0
- Data available = 0
- Bits 5 and 6 are related to character transmission and used by UART
- If bits 5 and 6 are set, it means UART is ready to accept a new character
- Bit 6 = '1' means that all characters have been transmitted
- Bit 5 = '1' means that UART is capable of receiving more characters
# Since the next character write is accepted here, we write the character we want to output.
w(THR = 01011011 = 0x5b) -> '['
# Following this, the same pattern repeats:
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
# The above operation repeats 3 more times.
# ...
r(LSR) = 01100000
w(THR = 00110000 = 0x30) -> '0'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e) -> '.'
r(LSR) = 01100000
w(THR = 00110000 = 0x30) -> '0'
# The above operation repeats 5 more times
r(LSR) = 01100000
w(THR = 01011101 = 0x5d) -> ']'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 01001100 = 0x4c) -> 'L'
r(LSR) = 01100000
w(THR = 01101001 = 0x69) -> 'i'
r(LSR) = 01100000
w(THR = 01101110 = 0x6e) -> 'n'
r(LSR) = 01100000
w(THR = 01110101 = 0x75) -> 'u'
r(LSR) = 01100000
w(THR = 01111000 = 0x78) -> 'x'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 01110110 = 0x76) -> 'v'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01110010 = 0x72) -> 'r'
r(LSR) = 01100000
w(THR = 01110011 = 0x73) -> 's'
r(LSR) = 01100000
w(THR = 01101001 = 0x69) -> 'i'
r(LSR) = 01100000
w(THR = 01101111 = 0x6f) -> 'o'
r(LSR) = 01100000
w(THR = 01101110 = 0x6e) -> 'n'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e)-> '.'
r(LSR) = 01100000
w(THR = 00110001 = 0x31) -> '1'
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00101110 = 0x2e) -> '.'
r(LSR) = 01100000
w(THR = 00110001 = 0x31) -> '1'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 00110100 = 0x34) -> '4'
r(LSR) = 01100000
w(THR = 00100000 = 0x20) -> ' '
r(LSR) = 01100000
w(THR = 00101000 = 0x28) -> '('
r(LSR) = 01100000
w(THR = 01000000 = 0x40) -> '@'
r(LSR) = 01100000
w(THR = 00110101 = 0x35) -> '5'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01100100 = 0x64) -> 'd'
r(LSR) = 01100000
w(THR = 01100101 = 0x65) -> 'e'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 00111001 = 0x39) -> '9'
r(LSR) = 01100000
w(THR = 00111001 = 0x39) -> '9'
r(LSR) = 01100000
w(THR = 01100100 = 0x64) -> 'd'
r(LSR) = 01100000
w(THR = 01100010 = 0x62) -> 'b'
r(LSR) = 01100000
w(THR = 00110111 = 0x37) -> '7'
r(LSR) = 01100000
w(THR = 00101001 = 0x29) -> ')'
# Concatenating the output, we get the following line:
[ 0.000000] Linux version 4.14.174 (@57edebb99db7)
# This matches the content of the first line output during OS boot.
Of course, Linux Kernel startup UART requests continue beyond this, and more complex operations take place. However, I won't delve further into these requests here. If you are interested, I encourage you to explore them in detail.
Reference
- Serial UART information
- Wikibooks : Serial Programming / 8250 UART Programming
- rust-vmm/vm-superio
- Interrupt request(PC architecture)
- Linux Serial Console
- KVM IRQFD Implementation
- KVMのなかみ(KVM internals)
- ハイパーバイザーの作り方~ちゃんと理解する仮想化技術~ 第2回 intel VT-xの概要とメモリ仮想化
- External Interrupts in the x86 system. Part1. Interrupt controller evolution
ToyVMM Implementation
To summarize our previous discussions, we have successfully created a minimal VMM with essential features. This ToyVMM is a straightforward VMM with the following functionalities:
- It can boot a Guest OS using vmlinuz and initrd.
- After the Guest OS boots, it can handle input and output as a Serial Terminal, allowing you to monitor and interact with the Guest's state.
Run the Linux Kernel!
Let's actually boot the Linux Kernel.
First, prepare `vmlinux.bin` and `initrd.img`, and place them in the root directory of the ToyVMM repository. You can download `vmlinux.bin` as follows:
# Download vmlinux.bin
wget https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin
cp vmlinux.bin <TOYVMM WORKING DIRECTORY>
For `initrd.img`, you can create it using marcov/firecracker-initrd, which includes an Alpine Linux root filesystem:
# Create initrd.img
# Using marcov/firecracker-initrd (https://github.com/marcov/firecracker-initrd)
git clone https://github.com/marcov/firecracker-initrd.git
cd firecracker-initrd
bash ./build.sh
# After the above commands, the initrd.img file will be located in build/initrd.img.
# Please move it to the working directory of ToyVMM.
cp build/initrd.img <TOYVMM WORKING DIRECTORY>
With these preparations completed, let's launch the Guest VM:
$ make run_linux
Here, we'll skip the output of the boot sequence, which will be displayed on the standard output. Once the boot is complete, you'll see the Alpine Linux screen, and it will prompt you for login credentials. You can log in using `root` as both the username and password:
Welcome to Alpine Linux 3.15
Kernel 4.14.174 on an x86_64 (ttyS0)
(none) login: root
Password:
Welcome to Alpine!
The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.
You can set up the system with the command: setup-alpine
You may change this message by editing /etc/motd.
login[1058]: root login on 'ttyS0'
(none):~#
Great! You have successfully booted the Guest VM and can operate it. You can also execute commands within the Guest VM. For example, running the basic `ls` command results in the following output:
(none):~# ls -lat /
total 0
drwx------ 3 root root 80 Sep 23 06:44 root
drwxr-xr-x 5 root root 200 Sep 23 06:44 run
drwxr-xr-x 19 root root 400 Sep 23 06:44 .
drwxr-xr-x 19 root root 400 Sep 23 06:44 ..
drwxr-xr-x 7 root root 2120 Sep 23 06:44 dev
dr-xr-xr-x 12 root root 0 Sep 23 06:44 sys
dr-xr-xr-x 55 root root 0 Sep 23 06:44 proc
drwxr-xr-x 2 root root 1780 May 7 00:55 bin
drwxr-xr-x 26 root root 1040 May 7 00:55 etc
lrwxrwxrwx 1 root root 10 May 7 00:55 init -> /sbin/init
drwxr-xr-x 2 root root 3460 May 7 00:55 sbin
drwxr-xr-x 10 root root 700 May 7 00:55 lib
drwxr-xr-x 9 root root 180 May 7 00:54 usr
drwxr-xr-x 2 root root 40 May 7 00:54 home
drwxr-xr-x 5 root root 100 May 7 00:54 media
drwxr-xr-x 2 root root 40 May 7 00:54 mnt
drwxr-xr-x 2 root root 40 May 7 00:54 opt
drwxr-xr-x 2 root root 40 May 7 00:54 srv
drwxr-xr-x 12 root root 260 May 7 00:54 var
drwxrwxrwt 2 root root 40 May 7 00:54 tmp
Well done! At this point, you have created a minimal VMM. However, there are some limitations:
- It can only be operated through a serial console; networking is not available until virtio-net is implemented.
- Block devices are not available until virtio-blk is implemented.
- Handling PCI devices is not yet supported.
The creation of ToyVMM served several personal objectives, including:
- Deepening the understanding of virtualization.
- Gaining a better understanding of virtio.
- Learning about PCI passthrough:
- Exploring technologies like VFIO.
- Understanding peripheral technologies like mdev, libvfio, and VDPA.
While we have completed the creation of a minimal VMM, the direction you take it from here is up to you. ToyVMM is a great starting point, and you can choose to extend it in various ways. If you're reading this and you're an enthusiastic geek, I encourage you to give it a try! And if possible, I'd be delighted to receive feedback on ToyVMM.
Virtual I/O Device (Virtio)
In this section, as the second step of VMM, we will delve into the implementation of Virtio.
The Virtio specification is maintained by OASIS.
The latest version appears to be version 1.2, which was published on July 1, 2022.
The terminology related to Virtio in this document follows the definitions in version 1.2, so if you want to confirm the meaning of specific terms, please refer to the OASIS page.
In this section, we will cover fundamental knowledge about Virtio and its implementation.
Additionally, as concrete implementations based on Virtio, we will work on `virtio-net` and `virtio-blk`.
Once `virtio-net` is implemented, you will be able to communicate with a booted Guest VM over the network, enabling SSH login and internet connectivity.
Moreover, with `virtio-blk` implemented, you will be able to handle block devices, meaning disk I/O, within the virtual machine.
With these two functionalities in place, you will have most of the requirements for a typical "virtual machine", making the Virtio implementation highly significant.
The topics in this section are structured as follows:
This document is based on the following commit numbers:
- ToyVMM: 58cf0f68a561ee34a28ae4e73481f397f2690b51
- Firecracker: cfd4063620cfde8ab6be87ad0212ea1e05344f5c
From this point onwards, we will explain the implemented source code using file names.
Here are the actual file paths referred to by the file names mentioned in the explanations:
| File Name Mentioned in Explanations | File Path |
|---|---|
| mod.rs | src/vmm/src/devices/virtio/mod.rs |
| queue.rs | src/vmm/src/devices/virtio/queue.rs |
| mmio.rs | src/vmm/src/devices/virtio/mmio.rs |
| status.rs | src/vmm/src/devices/virtio/status.rs |
| virtio_device.rs | src/vmm/src/devices/virtio/virtio_device.rs |
| net.rs | src/vmm/src/devices/virtio/net.rs |
| block.rs | src/vmm/src/devices/virtio/block.rs |
Please note that these file paths may change in the future as source code is updated.
Consider these file paths to be associated with the commit numbers mentioned earlier.
Virtio
What is Virtual I/O Device (Virtio)?
Virtio is a specification for virtual devices standardized by OASIS. It provides a virtual device interface for efficient data transfer and communication between the host system and guest systems (virtual machines).
Based on Virtio, there are implementations like `virtio-net` (virtual network device) and `virtio-blk` (virtual block device). As their names suggest, these implementations mimic the behavior of network and block devices, allowing guest operating systems to perform I/O operations as if they were using real network and block devices.
Virtio is compatible with major virtualization technologies such as KVM and is supported by a wide range of guest operating systems, including Linux, Windows, and FreeBSD. As a result, it has become an industry-standard specification widely adopted in virtualization environments.
Why is Virtio Necessary?
When it comes to generating I/O within a virtual machine (VM), how should the hypervisor handle it? First and foremost, the hypervisor needs to make the VM recognize the device at VM startup, which requires emulating various PCI devices. Additionally, when I/O is generated for those devices, the hypervisor must mimic the behavior of those devices. A well-known and widely used software for this kind of hardware emulation is QEMU.
The advantage of fully emulating real hardware using software is that you can use device drivers designed for physical hardware that come with the guest OS. However, this approach incurs significant overhead because it involves a VMExit each time an I/O request occurs within the VM. The hypervisor must perform emulation and then return control to the VM.
One framework proposed and standardized to reduce the overhead of virtualization in device I/O is `Virtio`. Virtio establishes a queue structure called the `Virtqueue` in shared memory between the hypervisor and the VM. This mechanism minimizes the number of mode transitions caused by VMExit. However, Virtio requires device drivers implemented specifically for it, whose availability depends on the kernel build configuration; many modern OS distributions come with Virtio device drivers installed by default.
Components of Virtio
Virtio mainly consists of the following components:
- Virtqueue: A queue built in shared memory between the host and guest for performing data input and output.
- Virtio driver: The guest-side driver for Virtio-based devices.
- Virtio device: The host-side emulation of devices.
As depicted in the diagram, I/O requests initiated by the guest pass through Virtqueue to the host and responses are also mediated through Virtqueue back to the guest. Detailed behaviors and implementations will be discussed in the next section.
Additionally, when exposing Virtio devices to guests, it's possible to choose specific transport methods. The two common methods are "Virtio Over PCI Bus", which uses PCI (Peripheral Component Interconnect), and "Virtio Over MMIO Bus", which uses MMIO (Memory-Mapped I/O). Guests have corresponding drivers such as `virtio-pci` and `virtio-mmio` for the specific transports, along with Virtio drivers (`virtio-net`, `virtio-blk`) for the particular device types.
In ToyVMM, we'll initially adopt `virtio-mmio` as the transport and proceed to implement a Network device as `virtio-net` and a Block device as `virtio-blk`.
References
- OASIS
- Virtio: An I/O virtualization framework for Linux
- virtio: Towards a De-Facto Standard For Virtual I/O Devices
- Introduction to VirtIO
- Virtio on Linux
Implementing Virtio in ToyVMM
In this section, we will delve into the implementation of Virtio in ToyVMM. There are three main topics covered in this discussion:
- Implementation of Virtqueue
- Implementation of lightweight notifications between the guest and host using irqfd and ioeventfd
- Implementation of the MMIO Transport
As mentioned in the previous section, ToyVMM initially utilizes MMIO as the transport method for Virtio. Before diving into the detailed explanation, let's start by illustrating an overview of the Virtio implementation in this context.
By referring to this diagram as needed, we can better understand the explanations and code that follow.
Implementation Approach
In the implementation, the VirtioDevice itself is represented as an abstract concept (a `Trait`), and concrete devices like `Net` and `Block` are implemented to fulfill this trait. Similarly, since there are multiple transport options such as `PCI` and `MMIO` (with MMIO being used here), transport is also treated as an abstraction, with `MMIO` as the concrete implementation in this case.
Finally, we need to implement Virtqueues. While the number and usage of Virtqueues can vary depending on the implemented Virtio device, the structure of the Virtqueue remains consistent. We'll provide more details on this later.
Virtqueue Implementation
Virtqueue Deep-Dive
Before delving into the implementation of the Virtqueue, let's gain a more detailed understanding of the typical Virtqueue structure. A Virtqueue is composed of three main elements: the `Descriptor Table`, the `Available Ring`, and the `Used Ring`. Here's what each of them does:

- `Descriptor Table`: A table that holds entries (`Descriptor`) storing information such as the address and size of the data to be shared between the Host and the Guest.
- `Available Ring`: A structure that manages the `Descriptor`s holding information the Guest wants to notify the Host about.
- `Used Ring`: A structure that manages the `Descriptor`s holding information the Host wants to notify the Guest about.

We'll explore each of these elements in detail while understanding how they cooperate. First, the `Descriptor Table` gathers data structures like the `Descriptor` shown below (as indicated in the diagram):
struct virtq_desc {
    /* Address (guest-physical). */
    le64 addr;
    /* Length. */
    le32 len;

    /* This marks a buffer as continuing via the next field. */
    #define VIRTQ_DESC_F_NEXT       1
    /* This marks a buffer as device write-only (otherwise device read-only). */
    #define VIRTQ_DESC_F_WRITE      2
    /* This means the buffer contains a list of buffer descriptors. */
    #define VIRTQ_DESC_F_INDIRECT   4

    /* The flags as indicated above. */
    le16 flags;
    /* Next field if flags & NEXT */
    le16 next;
};
A `Descriptor` represents a piece of data to be transferred and the location of the next descriptor in the chain.

- `addr` is the actual address of the data (a guest physical address), and the length of the data can be obtained from `len`.
- `flags` indicates whether there is a next descriptor, whether the buffer is write-only, and so on.
- `next` holds the index of the next descriptor, allowing the Descriptor Table to be traversed as a chain.

Usually, one Descriptor is used to send one piece of data. Note, however, that even if you allocate contiguous memory in the virtual address space, the underlying physical addresses may not be contiguous; in that case, one Descriptor is needed per physical page, and the data is sent as a chain of multiple Descriptors.
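To make the chaining concrete, here is a small, self-contained sketch (not ToyVMM code) that walks a descriptor chain by following `next` while `VIRTQ_DESC_F_NEXT` is set. The `Vec` stands in for the guest's Descriptor Table, which a real VMM would instead read from guest memory:

```rust
const VIRTQ_DESC_F_NEXT: u16 = 1;

#[derive(Clone, Copy)]
struct Desc {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

// Sums the data length described by the chain starting at descriptor `head`.
fn chain_total_len(table: &[Desc], head: u16) -> u32 {
    let mut total = 0u32;
    let mut idx = head as usize;
    loop {
        let d = table[idx];
        total += d.len;
        if d.flags & VIRTQ_DESC_F_NEXT == 0 {
            break;
        }
        idx = d.next as usize;
    }
    total
}

fn main() {
    // A 2-descriptor chain (0 -> 1) describing data that spans two physical pages.
    let table = vec![
        Desc { addr: 0x1000, len: 4096, flags: VIRTQ_DESC_F_NEXT, next: 1 },
        Desc { addr: 0x3000, len: 512, flags: 0, next: 0 },
    ];
    assert_eq!(chain_total_len(&table, 0), 4608);
}
```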
Next is the `Available Ring`, which is structured as follows:
struct virtq_avail {
    #define VIRTQ_AVAIL_F_NO_INTERRUPT  1
    le16 flags;
    le16 idx;
    le16 ring[ /* Queue Size */ ];
    le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
}
The `Available Ring` is used to specify the Descriptors that the guest wants to notify the host about.

- `flags` is used for temporary interrupt suppression and other purposes.
- `idx` points to the index of the newest entry in the `ring`.
- `ring` is the ring body itself, holding Descriptor indices.
- `used_event` is also used for interrupt suppression but is only necessary if `VIRTIO_F_EVENT_IDX` is enabled.
The guest writes the location of the actual data to a Descriptor and writes that Descriptor's index into the `Available Ring` (specifically into the `ring` field). It's important to note that the host needs to remember the index of the last `ring` entry it processed; the guest only provides the current state of the ring and the latest index (the `idx` field). Therefore, the host compares the last processed entry number with the latest index (`idx`); if they differ, there are new entries to process. The host then reads the ring entries covering that difference, retrieves the Descriptor indices, obtains the data from the Descriptors, and processes it according to the specific device's implementation.
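A tiny sketch of this bookkeeping: the host computes the number of unprocessed entries with wrapping 16-bit arithmetic, since the driver lets `idx` wrap around naturally (illustrative, not ToyVMM's exact code):

```rust
// Number of new Available Ring entries between the host's last processed index
// and the driver-written `idx`.
fn new_entries(last_processed: u16, driver_idx: u16) -> u16 {
    driver_idx.wrapping_sub(last_processed)
}

fn main() {
    assert_eq!(new_entries(3, 5), 2);     // two new descriptors to process
    assert_eq!(new_entries(65534, 1), 3); // the index wraps around at u16::MAX
}
```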
Finally, there's the `Used Ring`, which is the reverse of the `Available Ring`: it is used to specify Descriptors that the host wants to notify the guest about.
struct virtq_used {
    #define VIRTQ_USED_F_NO_NOTIFY  1
    le16 flags;
    le16 idx;
    struct virtq_used_elem ring[ /* Queue Size */ ];
    le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */
};

/* le32 is used here for ids for padding reasons. */
struct virtq_used_elem {
    /* Index of start of used descriptor chain. */
    le32 id;
    /*
     * The number of bytes written into the device writable portion of
     * the buffer described by the descriptor chain.
     */
    le32 len;
};
Source: 2.7.8 The Virtqueue Used Ring
- `flags` is used for temporary interrupt suppression and other purposes.
- `idx` points to the index of the newest entry in the `ring`.
- `ring` is the ring body itself; each entry is a `virtq_used_elem` structure.
- `avail_event` is also used for interrupt suppression but is only necessary if `VIRTIO_F_EVENT_IDX` is enabled.
When returning notifications from the host to the guest, a Descriptor is used to indicate the location of the reply data. The index of that Descriptor is stored in the `ring` of the `Used Ring`, and the `idx` value is updated to point to the newest entry in the `ring` before control is returned to the guest.
However, unlike the `Available Ring`, the elements of the `ring` are accompanied by a structure (`virtq_used_elem`).

- `id` is the index of the head of the used descriptor chain (the same descriptor index that the driver previously placed in `virtq_avail.ring`).
- `len` stores information such as the total amount of I/O performed by the host on the descriptor chain referred to by `id`.
The following diagram summarizes what has been explained so far.
This concludes the necessary knowledge for implementing Virtqueue.
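As a quick sanity check of these layouts, the byte sizes of the three regions follow directly from the structures above (a sketch assuming the split-virtqueue layout with the `used_event`/`avail_event` fields included):

```rust
// Region sizes for a given queue size: 16-byte descriptors, u16 avail entries,
// and 8-byte used elements, each ring prefixed by flags/idx and followed by an event field.
fn virtqueue_region_sizes(queue_size: u64) -> (u64, u64, u64) {
    let desc_table = 16 * queue_size;
    let avail_ring = 6 + 2 * queue_size;
    let used_ring = 6 + 8 * queue_size;
    (desc_table, avail_ring, used_ring)
}

fn main() {
    // For a 256-entry queue (the size negotiated in the MMIO trace later in this chapter),
    // the Descriptor Table alone occupies 4096 bytes (0x1000).
    assert_eq!(virtqueue_region_sizes(256), (4096, 518, 2054));
}
```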
Virtqueue implementation on ToyVMM
In ToyVMM, the implementation of Virtqueues is located in `queue.rs`.
The concrete addresses of the `Descriptor Table`, `Available Ring`, and `Used Ring` in guest memory are configured through interactions with the guest-side device driver during guest VM startup. We'll look at this exchange when we examine actual I/O requests from the guest; for now, just keep this fact in mind.
ToyVMM needs to perform address accesses based on these starting addresses and the Virtio specification. In essence, it operates on a per-descriptor basis (where each descriptor points to the address of the actual data), and while processing data it updates the `Available Ring` and `Used Ring`.
Now, let's explore the code. The `Queue` structure in ToyVMM represents a Virtqueue and is defined as follows:
#[derive(Clone)]
/// A virtio queue's parameters
pub struct Queue {
/// The maximal size in elements offered by the device
max_size: u16,
/// The queue size in elements the driver selected
pub size: u16,
/// Indicates if the queue is finished with configuration
pub ready: bool,
/// Guest physical address of descriptor table
pub desc_table: GuestAddress,
/// Guest physical address of the available ring
pub avail_ring: GuestAddress,
/// Guest physical address of the used ring
pub used_ring: GuestAddress,
next_avail: Wrapping<u16>,
next_used: Wrapping<u16>,
}
In this structure, you can see the fields for the `Descriptor Table`, `Available Ring`, and `Used Ring`, which hold the corresponding addresses in guest memory. These addresses are initialized during interactions with the guest's device driver, as mentioned earlier. From ToyVMM's perspective, they are merely physical memory addresses belonging to the guest, and ToyVMM accesses them according to the Virtio specification.
Now, let's delve into address access using the code. ToyVMM hides the sequence of operations that fetches a `Descriptor` based on the state of the `Available Ring` behind a Virtqueue iterator, so actual device implementations that use Virtqueues contain code structured like this:
// queue: 'Queue' struct
// desc_chain: 'DescriptorChain' struct
for desc_chain in queue.iter(mem) {
    // 'desc_chain' contains the 'addr', 'len', 'flags', and 'next' values of the descriptor.
    // Behind the iteration, data related to 'queue.avail_ring' is adjusted.
}
Let's explain what's happening behind the scenes of this iteration. First, the `iter` function is implemented on the `Queue` structure and creates an `AvailIter` structure. To create this `AvailIter`, it fetches the latest `idx` of the `Available Ring` from `GuestMemory`, starting from the `avail_ring` base address.
/// A consuming iterator over all available descriptor chain heads offered by the driver
pub fn iter<'a, 'b>(&'b mut self, mem: &'a GuestMemoryMmap) -> AvailIter<'a, 'b> {
... // validation codes
let queue_size = self.actual_size();
let avail_ring = self.avail_ring;
// Access the 'idx' fields of available ring
// skip 2 bytes (= u16 / 'flags' member) from avail_ring address
// and get 2 bytes (= u16 / 'idx' member representing the newest index of avail_ring) from that address.
let index_addr = mem.checked_offset(avail_ring, 2).unwrap();
let last_index: u16 = mem.read_obj(index_addr).unwrap();
AvailIter {
mem,
desc_table: self.desc_table,
avail_ring: self.avail_ring,
next_index: self.next_avail,
last_index: Wrapping(last_index),
queue_size,
next_avail: &mut self.next_avail,
}
}
As you can see, the `iter` function returns an `AvailIter`. Inside the `next` function of `AvailIter`, if `self.next_index` equals `self.last_index`, it returns `None`, indicating the end of iteration. The `next_index` tracks the index values that have already been processed.
Inside the `next` function, the element pointed to by `self.next_index` in the `Available Ring` (which corresponds to a descriptor index) is retrieved. The `DescriptorChain::checked_new` function is called with this retrieved value, and its result is returned as the element of the iteration.
The `checked_new` function calculates the address of the element pointed to by the index value, accesses it, and extracts information such as the descriptor's `addr`, `len`, `flags`, and `next`. Finally, it constructs a `DescriptorChain` structure from this information.
fn checked_new(
mem: &GuestMemoryMmap,
desc_table: GuestAddress,
queue_size: u16,
index: u16,
) -> Option<DescriptorChain> {
if index >= queue_size {
return None;
}
// The size of each element of the descriptor table is 16 bytes
// - le64 addr = 8 bytes
// - le32 len = 4 bytes
// - le16 flags = 2 bytes
// - le16 next = 2 bytes
// So, the calculation of the offset of the address
// indicated by desc_index is 'index * 16'
let desc_head = match mem.checked_offset(desc_table, (index as usize) * 16) {
Some(a) => a,
None => return None,
};
// These reads can't fail unless Guest memory is hopelessly broken
let addr = GuestAddress(mem.read_obj(desc_head).unwrap());
mem.checked_offset(desc_head, 16)?;
let len = mem.read_obj(desc_head.unchecked_add(8)).unwrap();
let flags: u16 = mem.read_obj(desc_head.unchecked_add(12)).unwrap();
let next: u16 = mem.read_obj(desc_head.unchecked_add(14)).unwrap();
let chain = DescriptorChain {
mem,
desc_table,
queue_size,
ttl: queue_size,
index,
addr,
len,
flags,
next,
};
if chain.is_valid() {
Some(chain)
} else {
None
}
}
Since the `next` function returns a `DescriptorChain`, code inside the loop accesses the descriptor's information through the relevant members of the `DescriptorChain` structure.
Although I have not mentioned it much so far, the `Used Ring` also needs to be updated on the host side. However, this is not a difficult process and can be implemented by defining a function like the following and calling it as necessary:
/// Puts an available descriptor head into the used ring for use by the guest
pub fn add_used(&mut self, mem: &GuestMemoryMmap, desc_index: u16, len: u32) {
    if desc_index >= self.actual_size() {
        // TODO error
        return;
    }
    let used_ring = self.used_ring;
    let next_used = (self.next_used.0 % self.actual_size()) as u64;
    // virtq_used structure has 4 byte entry before `ring` fields, so skip 4 byte.
    // And each ring entry has 8 bytes, so skip 8 * index.
    let used_elem = used_ring.unchecked_add(4 + next_used * 8);
    // write the descriptor index to virtq_used_elem.id
    mem.write_obj(desc_index, used_elem).unwrap();
    // write the data length to the virtq_used_elem.len
    mem.write_obj(len, used_elem.unchecked_add(4)).unwrap();
    // increment the used index that is the last processed in host side.
    self.next_used += Wrapping(1);
    // This fence ensures all descriptor writes are visible before the index update is.
    fence(Ordering::Release);
    mem.write_obj(self.next_used.0, used_ring.unchecked_add(2))
        .unwrap();
}
Please remember this underlying mechanism as it forms the basis for the actual I/O implementation in the virtio-net and virtio-blk devices, which we will explain in the following sections.
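Putting the two pieces together, the typical consumption pattern looks roughly like the following. This is a hedged sketch, not ToyVMM's actual device code: it assumes the `Queue`/`DescriptorChain` types shown above and vm-memory's `GuestMemoryMmap`, and `signal_used_queue` stands in for writing to the device's irqfd (explained in the next section).

```rust
fn drain_queue(queue: &mut Queue, mem: &GuestMemoryMmap) {
    let mut used: Vec<(u16, u32)> = Vec::new();
    for desc_chain in queue.iter(mem) {
        // A real device would read `desc_chain.len` bytes from `desc_chain.addr` here
        // (following `next` while VIRTQ_DESC_F_NEXT is set) and emulate the I/O.
        used.push((desc_chain.index, desc_chain.len));
    }
    // Return the processed descriptors to the guest through the Used Ring ...
    for (index, len) in used {
        queue.add_used(mem, index, len);
    }
    // ... and then notify the guest, e.g. signal_used_queue() -> write to the irqfd.
}
```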
Implementation of Lightweight Communication between Guest and Host using irqfd and ioeventfd
So far, we've discussed the implementation of Virtqueues, but now let's delve into another crucial aspect related to Virtqueues: the "notification" mechanism required for communication between the host and guest when using Virtqueues. In Virtio, after filling Virtqueues with data, a mechanism for notifying the host from the guest or the guest from the host becomes necessary. Understanding how this notification is realized is essential.
In essence, notifications between the guest and host are achieved using the `ioeventfd` and `irqfd` mechanisms, both of which are provided through the KVM API.
First, for notifications from the guest to the host, we use `ioeventfd`. `ioeventfd` transforms memory writes caused by PIO/MMIO operations in the guest VM into eventfd notifications. `KVM_IOEVENTFD` is the KVM API used here: you provide the eventfd to be notified and the MMIO address, and writes to that MMIO address are converted into notifications to the specified eventfd. As a result, software on the host side (in this case, ToyVMM) can receive notifications from the guest via the eventfd. This mechanism enhances event notification efficiency, making it more lightweight than traditional polling or interrupt-handler-based methods.
Next, for notifications from the host to the guest, we use the `irqfd` mechanism. Although we've used `irqfd` in previous implementations as well, here we employ `KVM_IRQFD`. By passing the eventfd to be used for notifications and the IRQ number corresponding to the desired guest IRQ to `KVM_IRQFD`, writes to that eventfd on the ToyVMM side are converted into hardware interrupts for the specified guest IRQ.
Using the notification features based on the KVM API mentioned above, we achieve communication between the guest and host. Specific usage details will be discussed in the following section, "Implementation of MMIO Transport."
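To make this concrete, here is a minimal sketch of the two registrations using the rust-vmm crates (`kvm-ioctls` and `vmm-sys-util`). The MMIO address and IRQ number are example values, and the exact API surface may differ between crate versions:

```rust
use kvm_ioctls::{IoEventAddress, Kvm, NoDatamatch};
use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let kvm = Kvm::new()?;
    let vm = kvm.create_vm()?;
    // An in-kernel irqchip is required before KVM_IRQFD can be used (x86_64).
    vm.create_irq_chip()?;

    // Guest -> Host: writes to this MMIO address (e.g. QueueNotify at base + 0x50)
    // are turned into notifications on `queue_evt` instead of being handled in userspace.
    let queue_evt = EventFd::new(EFD_NONBLOCK)?;
    vm.register_ioevent(&queue_evt, &IoEventAddress::Mmio(0xd000_0050), NoDatamatch)?;

    // Host -> Guest: writing to `irq_evt` injects an interrupt on the guest's GSI 5.
    let irq_evt = EventFd::new(EFD_NONBLOCK)?;
    vm.register_irqfd(&irq_evt, 5)?;

    // A ToyVMM-like flow would now wait on `queue_evt` (via epoll) in the device
    // emulation thread and write to `irq_evt` after filling the Used Ring.
    Ok(())
}
```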
Implementation of MMIO Transport
Now, let's delve into the implementation of MMIO Transport.
Virtio Over MMIO provides the official specification for MMIO Transport, and you may want to refer to it as needed.
MMIO Transport is a method that can be easily used in virtual environments without PCI support, and it appears that Firecracker primarily supports MMIO Transport. MMIO Transport operates by performing device operations through Read/Write to specific memory regions.
MMIO Transport does not utilize a generic Device discovery mechanism like PCI. Therefore, Device discovery in MMIO involves providing information about the memory-mapped device's location and interrupt position to the guest OS, as described in MMIO Device Discovery. While the official documentation suggests documenting this in the Device Tree, an alternative method is to embed it in the kernel's command-line arguments during startup, as documented here. This latter method is used in this context because ToyVMM can dynamically adjust these command-line arguments at guest VM startup.
With this method, you can provide information to the guest VM to perform device discovery. The following format is used to describe MMIO device discovery:
(format)
virtio_mmio.device=<size>@<baseaddr>:<irq>
(example)
virtio_mmio.device=4K@0xd0000000:5
In this case, the guest VM uses the address `0xd0000000` as the base and performs Read/Write at predetermined offsets (register positions) to initialize and configure the device. The details are described in the MMIO Device Register Layout.
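As a small aside, the command-line fragment shown above is easy to generate programmatically. The helper below only illustrates the format; it is not ToyVMM's actual cmdline-building code:

```rust
// Builds the "virtio_mmio.device=<size>@<baseaddr>:<irq>" fragment for one device.
fn virtio_mmio_cmdline_fragment(size_kib: u64, base_addr: u64, irq: u32) -> String {
    format!("virtio_mmio.device={}K@0x{:x}:{}", size_kib, base_addr, irq)
}

fn main() {
    assert_eq!(
        virtio_mmio_cmdline_fragment(4, 0xd000_0000, 5),
        "virtio_mmio.device=4K@0xd0000000:5"
    );
}
```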
From ToyVMM's perspective, it's crucial to ensure that processing according to the specification is carried out for each register when Read/Write operations occur. This is the core of the MMIO Transport implementation. Typically, I/O to the MMIO region is handled as `KVM_EXIT_MMIO`, and handling this correctly allows initialization and configuration of the device through this flow.
On the other hand, notifications for I/O via the Virtqueue, which we've discussed so far, are managed using ioeventfd and irqfd. In MMIO Transport, writing to offset `0x050` from the base address corresponds to notifying the device that "data to be processed is present in the buffer." In other words, by associating this address with an eventfd using `KVM_IOEVENTFD`, and then writing code that handles the desired Virtio device's processing through this eventfd, Notify events (writes to MMIO) generated by the guest can be delivered directly as notifications on the eventfd.
Additionally, since IRQ information is provided to the guest via the command line, the guest sets itself up to invoke the corresponding device handler when an interrupt occurs on the specified IRQ. Conversely, when you want to trigger an interrupt in the Virtio device presented to the Guest VM (that is, when you want to delegate processing to the Guest VM), you do so through this IRQ. Concretely, by creating an eventfd and registering it with the IRQ using `KVM_IRQFD`, you can trigger interrupts by writing to this eventfd from the ToyVMM side.
The figure below summarizes the above discussion, and ToyVMM implements this scheme:
MMIO Transport - Implementation Corresponding to MMIO Device Register Layout
The implementation of MMIO Transport can be found in the `mmio.rs` file. The `MmioTransport` structure, like the `I/O Bus` we discussed in the Serial Console implementation, implements the `BusDevice` trait and is registered within the `Bus` structure. It allows handling MMIO I/O in response to `KVM_EXIT_MMIO`, similar to how `VcpuExit::IoIn` and `VcpuExit::IoOut` are processed.
Therefore, `MmioTransport` implements the `read` and `write` functions required by `BusDevice`. These functions contain the specific logic for handling register accesses, which is essentially the device emulation process. Naturally, this implementation follows the MMIO Device Register Layout specification. Here's a portion of the `read` function as an example:
impl BusDevice for MmioTransport {
// OASIS: MMIO Device Register Layout
#[allow(clippy::bool_to_int_with_if)]
fn read(&mut self, offset: u64, data: &mut [u8]) {
match offset {
0x00..=0xff if data.len() == 4 => {
let v = match offset {
0x0 => MMIO_MAGIC_VALUE,
0x04 => MMIO_VERSION,
0x08 => self.device.device_type(),
0x0c => VENDOR_ID,
0x10 => {
self.device.features(self.features_select)
| if self.features_select == 1 { 0x1 } else { 0x0 }
}
0x34 => self.with_queue(0, |q| q.get_max_size() as u32),
0x44 => self.with_queue(0, |q| q.ready as u32),
0x60 => self.interrupt_status.load(Ordering::SeqCst) as u32,
0x70 => self.driver_status,
0xfc => self.config_generation,
_ => {
println!("unknown virtio mmio register read: 0x{:x}", offset);
return;
}
};
LittleEndian::write_u32(data, v);
}
0x100..=0xfff => self.device.read_config(offset - 0x100, data),
_ => {
// WARN!
println!(
"invalid virtio mmio read: 0x{:x}:0x{:x}",
offset,
data.len()
);
}
}
}
A detailed explanation of the `read` and `write` functions would be quite extensive, so I'll skip it here. However, as you can see, the implementation is straightforward, and you can easily understand it by examining the source code alongside the specification.
The processing in this part handles the initialization and configuration of Virtio devices during the initialization sequence driven by the device driver in the Guest VM at startup. By adding debugging code here, you can observe the device initialization sequence initiated by the guest.
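The trace shown in the next section was produced with debug output of roughly the following shape. This is an illustrative sketch, not ToyVMM's exact code:

```rust
// Prints one line per trapped MMIO read, mirroring the "MmioRead: addr = ..., data = [...]"
// lines in the trace below; a matching helper would exist for writes.
fn debug_mmio_read(base: u64, offset: u64, data: &[u8]) {
    println!("MmioRead: addr = 0x{:x}, data = {:?}", base + offset, data);
}

fn main() {
    debug_mmio_read(0xd000_0000, 0x0, &[118, 105, 114, 116]);
    // -> MmioRead: addr = 0xd0000000, data = [118, 105, 114, 116]
}
```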
Observing MMIO Device Initialization during Guest VM Startup Sequence
The Guest OS includes the Virtio device driver (on the guest side), which is expected to perform Virtio device initialization according to the specification. In this MMIO-based implementation, the guest VM performs R/W operations during startup on the MMIO range of the Virtio device, based on the MMIO range information specified in the kernel command line. As the hypervisor, ToyVMM must trap these accesses (they are the points where VMExit occurs during the guest VM's boot) and handle them appropriately, and debugging code can easily be added there for observation.
Before we examine the specific processing flow, let's organize the initialization process according to the specification. In the following discussion, we will use the initialization of a Virtio network device (`virtio-net`) as an example. The device initialization specification is split across three parts: the MMIO-specific Initialization And Device Operation, the General Initialization And Device Operation, and the device-specific initialization. Combining these, the flow is generally as follows:
- Read the Magic Number. Read Device ID, Device Type, Vendor ID, and other information.
- Reset the device.
- Set the ACKNOWLEDGE status bit.
- Read the Device feature bits and set the device with feature bits that the OS and driver can interpret.
- Set the FEATURES_OK status bit.
- Perform device-specific settings (detecting and configuring Virtqueues, writing configuration).
- Set the DRIVER_OK status bit, and at this point, the device is in a live state.
Now, keeping this in mind, let's examine the actual processing during the guest VM's startup. Below is an example where I have added debugging code, and the output generated by the debugging code is annotated with comments for explanation.
# Read the magic number from offset 0x00
# Since it's Little Endian, the original values are 116, 114, 105, 118
# 116(10) = 74(16)
# 114(10) = 72(16)
# 105(10) = 69(16)
# 118(10) = 76(16)
# Therefore, it's 0x74726976 (magic number)
MmioRead: addr = 0xd0000000, data = [118, 105, 114, 116]
# Read device id (0x02) from offset 0x04
MmioRead: addr = 0xd0000004, data = [2, 0, 0, 0]
# Read device type (net = 0x01) from offset 0x08
MmioRead: addr = 0xd0000008, data = [1, 0, 0, 0]
# Read vendor id (virtio vendor id = 0x00) from offset 0x0c
MmioRead: addr = 0xd000000c, data = [0, 0, 0, 0]
# This part is Device Initialization Phase (3.1.1 Driver Requirements: Device Initialization)
# Write 0 to offset 0x70 (= Status) to reset the device status
MmioWrite: addr = 0xd0000070, data = [0, 0, 0, 0]
# Read from offset 0x70, and now the device is reset
MmioRead: addr = 0xd0000070, data = [0, 0, 0, 0]
# Write 0x01 to offset 0x70 (= Status) to set the ACKNOWLEDGE bit
MmioWrite: addr = 0xd0000070, data = [1, 0, 0, 0]
# Read from offset 0x70, perhaps for confirmation?
MmioRead: addr = 0xd0000070, data = [1, 0, 0, 0]
# Add 0x02 = DRIVER(2) to offset 0x70 (= Status), so the Status is now 0x03
MmioWrite: addr = 0xd0000070, data = [3, 0, 0, 0]
# Processing for Device/Driver Feature bits.
# The device provides its own feature set (feature bits),
# and the driver reads it and instructs the device which feature subset to accept.
#
# First, the Virtio device driver in the Guest OS reads the feature bits
# Write 0x01 to offset 0x14 (= DeviceFeatureSel) to select the next operation
MmioWrite: addr = 0xd0000014, data = [1, 0, 0, 0]
# Read from offset 0x10 (= DeviceFeatures).
# It reads DeviceFeatures bit, and it returns (DeviceFeatureSel * 32) + 31 bits.
# Now DeviceFeatureSel=1, so it returns the DeviceFeatures bits of 64~32 bits.
# For virtio-net, DeviceFeatures = 0x0000_0001_0000_4c83 (64-bit),
# so it returns 0x0000_0001 in Little Endian.
MmioRead: addr = 0xd0000010, data = [1, 0, 0, 0]
# Write 0x00 to offset 0x14 (= DeviceFeatureSel) for the next operation
MmioWrite: addr = 0xd0000014, data = [0, 0, 0, 0]
# Read from offset 0x10 (= DeviceFeatures).
# Now DeviceFeatureSel=0, so it returns the lower 32 bits of DeviceFeatures.
# For virtio-net, DeviceFeatures = 0x0000_0001_0000_4c83 (64-bit),
# so it returns 0x0000_4c83 in Little Endian.
# Now, Confirmation of Values of 0x0000_4c83
# Reversed Little Endian: 0,0,76,131
# 76(10) = 4c
# 131(10) = 83
# 0x00004c83 -> Ignoring the bit set by VIRTIO_F_VERSION_1 (0x100000000) in avail_features (0x100004c83)
# In other words,
# * virtio_net_sys::VIRTIO_NET_F_GUEST_CSUM
# * virtio_net_sys::VIRTIO_NET_F_CSUM
# * virtio_net_sys::VIRTIO_NET_F_GUEST_TSO4
# * virtio_net_sys::VIRTIO_NET_F_GUEST_UFO
# * virtio_net_sys::VIRTIO_NET_F_HOST_TSO4
# * virtio_net_sys::VIRTIO_NET_F_HOST_UFO
# The feature bits of this information are returned.
MmioRead: addr = 0xd0000010, data = [131, 76, 0, 0]
# The reading of feature bits is done here, and from here, it instructs the device about the feature subset to accept.
# The process is similar to reading, where you write 0x00/0x01 to DriverFeatureSel bit
# and then write the feature bits you want to set to DriverFeature.
# First, write 0x01 to offset 0x24 (DriverFeatureSel/activate guest feature) to set 'acked_features' to 0x01
MmioWrite: addr = 0xd0000024, data = [1, 0, 0, 0]
# Write 0x01 (= 0x0000_0001, one of the values read earlier) to offset 0x20 (DriverFeatures).
# Since DriverFeatureSel is set to 0x01, a 32-bit shift occurs, and 0x0000_0001_0000_0000 is actually set.
MmioWrite: addr = 0xd0000020, data = [1, 0, 0, 0]
# Write 0x00 to offset 0x24 (DriverFeatureSel/activate guest feature) to set 'acked_features' to 0x00
MmioWrite: addr = 0xd0000024, data = [0, 0, 0, 0]
# Write 0x0000_4c83 (the other value read earlier) to offset 0x20 (DriverFeatures).
MmioWrite: addr = 0xd0000020, data = [131, 76, 0, 0]
# The processing of Feature bits is completed here.
# Read offset 0x70(= Status) -> Since 0x03 was specified most recently, returning 0x03 is good.
MmioRead: addr = 0xd0000070, data = [3, 0, 0, 0]
# Write the value (3 + 8 = 11) obtained by 'adding' 0x08 = FEATURES_OK(8) to offset 0x70(= Status)
MmioWrite: addr = 0xd0000070, data = [11, 0, 0, 0]
# Read from offset 0x70(= Status). Naturally, 11 is returned.
MmioRead: addr = 0xd0000070, data = [11, 0, 0, 0]
# Device-specific setup starts from here (4.2.3.2 Virtqueue Configuration)
# Write 0x00 to offset 0x30 (= QueueSel) to select self.queue_select
MmioWrite: addr = 0xd0000030, data = [0, 0, 0, 0]
# Read from offset 0x44 (= QueueReady), and it's not ready yet, so it returns 0x0 as expected
MmioRead: addr = 0xd0000044, data = [0, 0, 0, 0]
# Read from offset 0x34 (= QueueNumMax) to check the queue size
MmioRead: addr = 0xd0000034, data = [0, 1, 0, 0]
# Write the previously read QueueNum to offset 0x38 (= QueueNum)
MmioWrite: addr = 0xd0000038, data = [0, 1, 0, 0]
# Virtual queue's 'descriptor' area 64-bit long physical address
# Write the location of the descriptor area of the selected queue (0)
# to offset 0x80 (= QueueDescLow = lo(q.desc_table) / lower 32 bits of the address)
MmioWrite: addr = 0xd0000080, data = [0, 64, 209, 122]
# Same as above, but set the remaining part of 0x84 (QueueDescHigh = hi(q.desc_table) / higher 32 bits of the address)
MmioWrite: addr = 0xd0000084, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_4000 (q.desc_table) as the base address
# Virtual queue's 'driver' area 64-bit long physical address
# Write the location of the driver area (avail_ring) of the selected queue (0)
# to offset 0x90 (= QueueDriverLow = lo(q.avail_ring) / lower 32 bits of the address)
MmioWrite: addr = 0xd0000090, data = [0, 80, 209, 122]
# Same as above, but set the remaining part of 0x94 (QueueDriverHigh = hi(q.avail_ring) / higher 32 bits of the address)
MmioWrite: addr = 0xd0000094, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_5000 (q.avail_ring)
# Address range of q.desc_table: q.avail_ring - q.desc_table = 0x1000 = 4096(10)
# Virtual queue's 'device' area 64-bit long physical address
# Write the location of the device area (used_ring) of the selected queue (0) to offset 0xa0 (= QueueDeviceLow = lo(q.used_ring) / lower 32bits of the address)
MmioWrite: addr = 0xd00000a0, data = [0, 96, 209, 122]
# Same as above, but set the remaining part of 0xa4 (QueueDeviceHigh = hi(q.used_ring) / higher 32 bits of the address)
MmioWrite: addr = 0xd00000a4, data = [0, 0, 0, 0]
# Combining the two, it's 0x0000_0000_7ad1_6000 (q.used_ring)
# Address range of q.avail_ring: q.used_ring - q.avail_ring = 0x1000 = 4096(10)
# Write 0x1 to offset 0x44 (QueueReady = q.ready) to make it Ready
MmioWrite: addr = 0xd0000044, data = [1, 0, 0, 0]
# The same process is performed for the other queue (1)
MmioWrite: addr = 0xd0000030, data = [1, 0, 0, 0]
MmioRead: addr = 0xd0000044, data = [0, 0, 0, 0]
MmioRead: addr = 0xd0000034, data = [0, 1, 0, 0]
MmioWrite: addr = 0xd0000038, data = [0, 1, 0, 0]
MmioWrite: addr = 0xd0000080, data = [0, 128, 196, 122]
MmioWrite: addr = 0xd0000084, data = [0, 0, 0, 0] # q.desc_table = 0x0000_0000_7ac4_8000
MmioWrite: addr = 0xd0000090, data = [0, 144, 196, 122]
MmioWrite: addr = 0xd0000094, data = [0, 0, 0, 0] # q.avail_ring = 0x0000_0000_7ac4_9000
MmioWrite: addr = 0xd00000a0, data = [0, 160, 196, 122]
MmioWrite: addr = 0xd00000a4, data = [0, 0, 0, 0] # q.used_ring = 0x0000_0000_7ac4_a000
MmioWrite: addr = 0xd0000044, data = [1, 0, 0, 0]
# Device-specific setup (setup of two queues for virtio-net) is completed here
# Read from offset 0x70 (= Status); it returns 11 (0x0b), which was written most recently
MmioRead: addr = 0xd0000070, data = [11, 0, 0, 0]
# Write 0x04 (DRIVER_OK(4)) to offset 0x70 (= Status) to 'add' it to the current value (11 + 4 = 15)
MmioWrite: addr = 0xd0000070, data = [15, 0, 0, 0]
# Read from offset 0x70 (= Status), and naturally, it returns 15
MmioRead: addr = 0xd0000070, data = [15, 0, 0, 0]
# Device Initialization Phase (3.1.1 Driver Requirements: Device Initialization) is completed here
When interpreted carefully, it becomes evident that the behavior aligns with the specification. For reads and writes of device-specific configuration, the appropriate function of the `VirtioDevice` associated with the `MmioTransport` at initialization is executed. In other words, the `VirtioDevice` trait requires implementations to provide the information needed for these operations.
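For example, the `DeviceFeatureSel`/`DeviceFeatures` accesses in the trace expose a 64-bit feature set 32 bits at a time. A minimal sketch of that selection logic, using the feature value from the trace as an assumed input:

```rust
// Returns the 32-bit half of the 64-bit feature set selected by DeviceFeatureSel (0 or 1).
fn device_features(avail_features: u64, features_select: u32) -> u32 {
    (avail_features >> (32 * (features_select as u64 & 1))) as u32
}

fn main() {
    let avail = 0x0000_0001_0000_4c83u64; // value seen in the trace above
    assert_eq!(device_features(avail, 1), 0x0000_0001); // upper half (VIRTIO_F_VERSION_1)
    assert_eq!(device_features(avail, 0), 0x0000_4c83); // lower half (virtio-net feature bits)
}
```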
Additionally, during the initialization process there are multiple MMIO writes to offset 0x70. These correspond to status updates as the initialization sequence progresses. ToyVMM tracks these updates and confirms the `ACKNOWLEDGE` -> `DRIVER` -> `FEATURES_OK` -> `DRIVER_OK` transition. After the `DRIVER_OK` status update, ToyVMM calls the `activate` function to perform device-specific activation procedures (e.g., setting up epoll and its handlers). The specifics of this activation process are delegated to the individual device implementations.
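The status values seen in the trace (1 -> 3 -> 11 -> 15) are simply combinations of the status bits defined in the Virtio specification. The following sketch shows the kind of check that gates activation; the constants match the spec, while the function itself is only illustrative:

```rust
// Virtio device status bits (values from the Virtio specification).
const ACKNOWLEDGE: u32 = 1;
const DRIVER: u32 = 2;
const DRIVER_OK: u32 = 4;
const FEATURES_OK: u32 = 8;

// The device goes live once the driver has set all four bits (0x0f = 15).
fn should_activate(driver_status: u32) -> bool {
    driver_status == ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK
}

fn main() {
    assert!(!should_activate(11)); // ACKNOWLEDGE | DRIVER | FEATURES_OK
    assert!(should_activate(15));  // ... | DRIVER_OK -> time to call activate()
}
```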
Summary
In this section, we provided a detailed explanation of the `Virtio` mechanisms within ToyVMM. In the following sections, we will introduce concrete implementations of actual devices that were not covered here, specifically Network Devices and Block Devices, and we will verify that specific I/O operations work using the code implemented according to the Virtio principles.
Reference
- OASIS
- Creating a Hypervisor
- The Definitive KVM (Kernel-based Virtual Machine) API Documentation
- QEMU
- Introduction to VirtIO
- Virtqueues and Virtio Ring: How the Data Travels
Implement virtio-net device
In this section, we will proceed with the implementation of a Network Device as a specific Virtio Device. While the specification can be found in the official OASIS documentation Network Device, please note that this implementation may not align perfectly with the specification. If you haven't read the previous sections, be sure to review them before continuing with this section.
virtio-net Mechanism
In `virtio-net`, three types of Virtqueues are typically used: the Transmit Queue, the Receive Queue, and the Control Queue.
The Transmit Queue is used for data transmission from the guest to the host, and the Receive Queue for data transmission from the host to the guest. The Control Queue is used for guest-to-host operations related to NIC settings, such as setting promiscuous mode and enabling/disabling broadcast or multicast reception.
For the sake of brevity, we omit the implementation of the Control Queue in this section. Also note that although the specification allows scaling the number of Virtqueues, we have not implemented that for simplicity.
In the following sections, we will provide detailed implementation-based explanations of the network device.
Network Device Implementation Details
The implementation of `virtio-net` can be found in `net.rs`.
We will break down and explain the initialization phase and the post-initialization phase.
The following diagram primarily focuses on the initialization process of the Network Device:
The `Net` struct implements the `VirtioDevice` trait and is associated with the `MmioTransport`. As mentioned earlier, the device-specific operations during MMIO Transport initialization depend on the implementation of this `Net` struct.
For example, during initialization, a query for the `Device Type` occurs. According to the specification, a `Net` device should return `0x01` for this, and the `Net` struct implements it as follows:
impl VirtioDevice for Net {
fn device_type(&self) -> u32 {
// types::NETWORK_CARD:u32 = 0x01
types::NETWORK_CARD
}
...
}
Similarly, queries about the `Device Features` should be implemented to return device-specific values. Additionally, during initialization, the guest OS sets up the `Descriptor Table`, `Available Ring`, and `Used Ring` of each Virtqueue and notifies the device of their addresses. These addresses are stored per queue so they can be referenced during actual processing.
Once the initialization steps are completed and the status is updated to the specific value, ToyVMM executes the `activate` function implemented by the device. In the case of the `Net` device, this `activate` function registers various file descriptors with `epoll` and sets up the handler (`NetEpollHandler`) that is triggered by `epoll` events. The `Net` device emulates I/O by creating a Tap device on the host side, writing data received from the guest via the Virtqueue to the Tap device for transmission (Tx), and writing incoming data from the Tap device to the Virtqueue to notify the guest (Rx). Four file descriptors are registered with `epoll`: the fd of the `tap` device, an `eventfd` for notification of the Tx Virtqueue, an `eventfd` for notification of the Rx Virtqueue, and an `eventfd` for halting in unexpected situations.
Next, we will provide a detailed diagram of the Network Device in its activated state:
When one of the file descriptors registered with `epoll` triggers an event, the `NetEpollHandler` is dispatched to process it. `NetEpollHandler` varies its actions based on the event that was triggered, but in every case it references the Virtqueue and performs I/O emulation.
One important point to note is that the device initialization process, driven by `KVM_EXIT_MMIO`, is called from the thread handling the vCPU, which in ToyVMM runs separately from the main thread. However, the thread responsible for executing I/O is also separate from the vCPU processing thread (currently the main thread). To facilitate communication between these threads, a channel is used to send the initialized `NetEpollHandler`. This allows I/O to be processed while the guest VM is running and CPU emulation is conducted in a separate thread.
As mentioned earlier, communication between the host and guest is primarily triggered by events related to Virtqueue Eventfds and Tap device file descriptors. In the following sections, we will provide more detailed explanations of how processing occurs in both the Tx and Rx cases.
Tx (Guest -> Host)
Let's start by examining the implementation of communication in the Guest -> Host direction (Tx) in detail. Once again, for Tx, the `Descriptor Table`, `Available Ring`, and `Used Ring` function as follows:

- `Descriptor Table`: Contains descriptors that point to the data the Guest is trying to transmit.
- `Available Ring`: Stores the indices of the descriptors pointing to the transmit data. The Host reads these indices and performs the Tx emulation.
- `Used Ring`: Stores the indices of descriptors that have been processed on the Host side. The Guest reads these indices to reclaim processed descriptors.
Tx is initiated when the guest (the guest device driver) prepares a packet, and control is transferred to ToyVMM when a write to `QueueNotify` occurs.
Specifically, the guest is expected to take the following steps:

1. The guest sets the data address and length in the first Descriptor's `addr` and `len` fields.
2. The guest stores the index of the `Descriptor` pointing to the transmit data in the `Available Ring` entry pointed to by the `Available Ring` index.
3. The guest increments the `Available Ring` index.
4. To notify the host of unprocessed data, the guest writes to the MMIO `QueueNotify`.
Now, let's shift our focus to the host side, which is handled by ToyVMM. The eventfd triggered by the write to the MMIO `QueueNotify` is picked up by the epoll monitoring, which invokes the `NetEpollHandler`'s handler processing, specifically the operation corresponding to `TX_QUEUE_EVENT`. The implementation calls the `process_tx` function.
In `process_tx`, the processing proceeds as follows:

1. Initialize the necessary variables, including:
   - `frame[0u8; 65562]`: A buffer into which the data prepared by the guest is copied.
   - `used_desc_heads[0u16; 256]`: Storage for the indices of processed Descriptors, used to update the Used Ring at the end.
   - `used_count`: A counter that keeps track of how much data has been read from the guest.
2. Iterate over the Tx Virtqueue until it is exhausted, repeating steps 3 to 5.
3. Read the data (located at `addr`) pointed to by the Descriptor and load it into the buffer. If the `next` field points to another Descriptor, follow it and continue reading.
4. Write the read data to the Tap device.
5. Store the index of the processed Descriptor (the Descriptor pointed to by the Available Ring) in `used_desc_heads`.
6. Update the `Used Ring` with the indices of the processed Descriptors and the total amount of data stored.
7. Write to the `eventfd` associated with the IRQ to trigger an interrupt and delegate processing to the guest.

On the guest side, the following steps are expected:

1. Check the index of the `Used Ring`; if it differs from the previously recorded index, check and process the Descriptor indices that fill this gap.
2. The Descriptors pointed to by these indices have been processed on the host side, so return them to the chain of free Descriptors and update the recorded Descriptor numbers.
3. Repeat steps 1 and 2 until there is no difference between the index of the `Used Ring` and the recorded index position.
This completes the Tx processing.
Rx (Host -> Guest)
Next, let's explain the communication from the host to the guest (Rx) while referring to the implementation.
In the case of Rx, the `Descriptor Table`, `Available Ring`, and `Used Ring` function as follows:

- `Descriptor Table`: Contains descriptors that point to received data, allowing the Guest to access the data received from the Tap.
- `Available Ring`: Used by the Guest to hand over emptied descriptors that the Host can fill with received data.
- `Used Ring`: Stores the indices of descriptors pointing to received data, which the Guest reads to process the relevant descriptors.

Comparing Rx to Tx, you can see that the roles of the `Available Ring` and `Used Ring` are reversed.
Unlike Tx, Rx requires handling two types of event triggers: incoming packets from the Tap device and completion notifications from the guest for the Rx Virtqueue. Handling Rx is more complex compared to Tx due to the need to manage these two types of event triggers.
First, let's discuss the basic Rx processing flow, followed by considerations for cooperative behavior.
Basic Rx Processing Flow
The host receives data from the Tap device and needs to notify the guest by filling the Rx Virtqueue with data. To do this, some basic setup is required for the Rx Virtqueue, such as knowing where to place the data. It's important to remember that, from the perspective of ToyVMM, each element of the Virtqueue consists only of guest memory addresses, and necessary operations are performed based on Virtqueue memory access.
Returning to the guest, the following steps are expected:

1. After initializing the Descriptor chains and other settings, the guest writes the index of the head of a free Descriptor chain into the empty entry pointed to by the `Available Ring` index.
2. The guest increments the `Available Ring` index.
3. To notify the host, the guest writes to the MMIO `QueueNotify`.
On the host side, when the Rx Virtqueue notification is received from the guest, it is interpreted as meaning that the Rx data buffers are ready to be accessed.
Suppose the Tap device receives a packet at this point. By detecting the trigger on the Tap's file descriptor, the `NetEpollHandler` is dispatched and performs the event processing for `RX_TAP_EVENT`. This processing mainly involves calling the `process_rx` function, although there are certain conditions under which this does not happen, which we will discuss later.
`process_rx` proceeds as follows:

1. `process_rx` processes as many frames received from the Tap as possible, looping until no more data can be read.
2. If a read from the Tap succeeds, the size of the read data is stored in `self.rx_count`, and the `rx_single_frame` function, which processes a single frame, is called.
3. In `rx_single_frame`, the first entry is taken from the Available Ring, and the head of the free Descriptor chain that this entry points to is extracted.
4. The received frame's data is stored in the Descriptor, tracking the size along the way. If the frame cannot fit into a single Descriptor, the `next` field is followed to continue storing data.
5. The `Used Ring` of the Rx Virtqueue is updated with the index of the Descriptor containing the Rx data and the total amount of data stored.
6. An interrupt is triggered by writing to the `eventfd` associated with the IRQ to delegate processing to the guest.
The following diagram illustrates the process of writing the received data into the Descriptor chain using the Available Ring:
Once the data from the Tap device has been written, the `Used Ring` is updated, and an interrupt is sent to the guest.
On the guest side, the guest checks the `Used Ring` index, references the Descriptors pointed to by the new entries, retrieves and processes the Rx data, and performs any necessary operations. It then updates the Available Ring, signaling to the host that it is ready to accept new data.
When Tap Trigger Occurs Without Rx Virtqueue Preparation
It is expected that there may be cases where Tap receives a packet when Rx Virtqueue is not ready. In such cases, even if data is extracted from Tap, it is impossible to obtain information about where to store it, preventing further processing.
To address this, a mechanism is required to delay Tap device processing until the Rx Virtqueue is prepared. In the ToyVMM code, this is controlled using a flag called `deferred_rx`.
When this flag is set, ToyVMM's Rx-related processing follows this strategy:

- When `RX_QUEUE_EVENT` is triggered, indicating that the guest has made the Rx Virtqueue ready to receive data, data is immediately retrieved from the Tap device and processing continues. If processing completes at this point, the flag is cleared.
- When `TAP_RX_EVENT` is triggered, processing first checks the status of the Rx Virtqueue. If processing can proceed, it continues, and if it completes, the flag is cleared. If processing cannot proceed, or the capacity of the Virtqueue is smaller than the received data, the flag remains set and processing waits for the Rx Virtqueue to become ready again.
When Tap Reception Exceeds the Prepared Rx Virtqueue
Another case to consider is when the data received by the Tap exceeds the capacity of the prepared Virtqueue, as briefly mentioned above. In this case, the strategy is essentially the same: processing is temporarily interrupted, controlled by the `deferred_rx` flag, until the next Virtqueue entries are prepared, and when the Virtqueue is ready, processing resumes.
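The control flow around this flag can be summarized with a small, self-contained sketch. All names here are illustrative stand-ins rather than ToyVMM's actual types:

```rust
struct RxPath {
    deferred_rx: bool,
}

impl RxPath {
    // Called when the tap device becomes readable; `deliver` returns true if the
    // frame fit into the guest's currently available Rx buffers.
    fn on_tap_readable(&mut self, deliver: impl FnOnce() -> bool) {
        self.deferred_rx = !deliver();
    }

    // Called when the guest signals the Rx Virtqueue (i.e. it refilled the ring).
    fn on_rx_queue_notify(&mut self, deliver: impl FnOnce() -> bool) {
        if self.deferred_rx && deliver() {
            self.deferred_rx = false;
        }
    }
}

fn main() {
    let mut rx = RxPath { deferred_rx: false };
    rx.on_tap_readable(|| false);   // no Rx buffers yet -> defer the frame
    assert!(rx.deferred_rx);
    rx.on_rx_queue_notify(|| true); // guest refilled the ring -> frame delivered
    assert!(!rx.deferred_rx);
}
```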
Verification of virtio-net Operation
Let's test whether communication between the host and guest is possible using the implemented `Virtio` mechanism and the Network Device. Below is the result of executing the `ip addr` command inside the guest; `eth0` is recognized as a virtual NIC.
localhost:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 02:32:22:01:57:33 brd ff:ff:ff:ff:ff:ff
inet6 fe80::32:22ff:fe01:5733/64 scope link
valid_lft forever preferred_lft forever
Let's also check on the host side. ToyVMM creates a Tap device on the host, so assign an IP address (`192.168.0.10/24`) to it.
140: vmtap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether c6:69:6d:65:05:cf brd ff:ff:ff:ff:ff:ff
inet 192.168.0.10/24 brd 192.168.0.255 scope global vmtap0
valid_lft forever preferred_lft forever
Additionally, assign an IP address to the guest side. Here, an address within the same subnet range as the host is assigned.
localhost:~# ip addr add 192.168.0.11/24 dev eth0
Now that everything is set up, let's ping the IP address of the host's Tap interface from within the guest. You should receive responses as follows:
localhost:~# ping -c 5 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: seq=0 ttl=64 time=0.394 ms
64 bytes from 192.168.0.10: seq=1 ttl=64 time=0.335 ms
64 bytes from 192.168.0.10: seq=2 ttl=64 time=0.334 ms
64 bytes from 192.168.0.10: seq=3 ttl=64 time=0.321 ms
64 bytes from 192.168.0.10: seq=4 ttl=64 time=0.330 ms
--- 192.168.0.10 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.321/0.342/0.394 ms
Conversely, if you ping the IP address of the guest's virtio-net interface from the host, you should also receive responses:
[mmichish@mmichish ~]$ ping -c 5 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.410 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.366 ms
64 bytes from 192.168.0.11: icmp_seq=3 ttl=64 time=0.385 ms
64 bytes from 192.168.0.11: icmp_seq=4 ttl=64 time=0.356 ms
64 bytes from 192.168.0.11: icmp_seq=5 ttl=64 time=0.376 ms
--- 192.168.0.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4114ms
rtt min/avg/max/mdev = 0.356/0.378/0.410/0.028 ms
Although this is only a simple check using ICMP, it confirms that communication between the host and the guest works properly!
Implement virtio-blk device
In this section, we will implement the Block device that the guest uses via virtio-blk. As with Virtio itself, the specification is officially published by OASIS in the Block Device section. However, please note that this implementation may not be fully compliant with that specification.
Before proceeding, make sure you have read the previous sections, as the concepts introduced earlier will be used here without further explanation. In particular, please read Implement virtio-net device from the previous section, as details that overlap with the virtio-net implementation may be omitted.
Mechanism of virtio-blk
In virtio-blk, a single Virtqueue is used to represent disk Read/Write requests from the guest. Unlike virtio-net, there are no external factors (such as receiving data from Tap); processing is purely driven by I/O requests from the guest, so it operates with a minimum of one Virtqueue. Although the specification allows scaling the number of Virtqueues, we have not implemented that for simplicity.
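For reference, this queue configuration can be written down as a few constants. The values below are assumptions for illustration (a queue size of 256 is a common choice); ToyVMM's actual constants may differ.

// Illustrative queue configuration for the Block device (values are assumptions).
pub const QUEUE_SIZE: u16 = 256;              // descriptors per Virtqueue
pub const NUM_QUEUES: usize = 1;              // virtio-blk needs only one request queue here
pub const QUEUE_SIZES: &[u16] = &[QUEUE_SIZE];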
In the following sections, we will explain the implementation details based on code examples.
Implementation Details of virtio-blk
The implementation of virtio-blk can be found in block.rs. The roles and relationships of the various structures are shown in the following diagram:
As mentioned earlier, the concrete implementation depends on the specific device, but it is abstracted by the VirtioDevice trait, so everything other than the device-specific details works the same as shown for virtio-net. Therefore, this diagram differs from the virtio-net one mainly in the internal details of the Block device.
During initialization, queries such as Device Type and Features are answered by the concrete implementation of the Block device. As with the Net device, the addresses of the Virtqueue in the guest's address space are set up and provided. Once the initialization steps are completed, the activate function is executed.
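As a rough sketch, answering those initialization queries might look like the following. The trait shape and field names here are assumptions modeled on the virtio-net discussion, not ToyVMM's exact code; the device ID of 2 for a block device does come from the virtio specification.

const TYPE_BLOCK: u32 = 2; // virtio device ID for a block device (per the virtio spec)

trait VirtioDevice {
    fn device_type(&self) -> u32;
    fn features(&self, page: u32) -> u32;
    fn ack_features(&mut self, page: u32, value: u32);
}

struct Block {
    avail_features: u64, // features the device offers
    acked_features: u64, // features the guest driver has acknowledged
}

impl VirtioDevice for Block {
    fn device_type(&self) -> u32 {
        TYPE_BLOCK
    }
    // Feature bits are reported 32 bits at a time, selected by page.
    fn features(&self, page: u32) -> u32 {
        match page {
            0 => self.avail_features as u32,
            1 => (self.avail_features >> 32) as u32,
            _ => 0,
        }
    }
    fn ack_features(&mut self, page: u32, value: u32) {
        let shift = if page == 1 { 32 } else { 0 };
        self.acked_features |= (value as u64) << shift;
    }
}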
For the Block device, like the Net device, various file descriptors are registered with epoll during initialization, and a handler (BlockEpollHandler) is set up to be executed when epoll triggers. In the Block device, to emulate I/O, a host-side file (operated as the BlockDevice) is opened, and read/write requests from the guest are performed against it. The file descriptors registered with epoll are an eventfd for the Virtqueue and another eventfd for stopping the system in case of unexpected situations, making a total of two file descriptors.
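The dispatch on those two descriptors might look roughly like this. The token values and helper methods are illustrative placeholders; only the overall shape (process the queue, then signal the guest) reflects the explanation given here.

const QUEUE_AVAIL_EVENT: u64 = 0; // the guest wrote to QueueNotify
const KILL_EVENT: u64 = 1;        // used to stop processing in unexpected situations

struct BlockEpollHandler;

impl BlockEpollHandler {
    fn handle_event(&mut self, token: u64) {
        match token {
            QUEUE_AVAIL_EVENT => {
                // New requests are queued; process them and notify the guest if needed.
                if self.process_queue() {
                    self.signal_used_queue();
                }
            }
            KILL_EVENT => {
                // Tear the device down.
            }
            _ => {}
        }
    }
    fn process_queue(&mut self) -> bool { true } // placeholder
    fn signal_used_queue(&self) {}               // placeholder
}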
In comparison to the Net device, the Tap device has been replaced by a file and the number of eventfds has changed; apart from that, there are no significant differences in behavior.
For the Block device, a single Virtqueue is associated with the firing of an eventfd, so we will focus on this process in the following sections.
I/O Requests in virtio-blk
Before delving into the implementation details, let's explain the I/O requests in virtio-blk.
As mentioned earlier, virtio-blk handles I/O requests from the guest through a single Virtqueue. However, guest-originated I/O requests fall broadly into two types, Read and Write, and the processing required for each is significantly different. The host must determine which kind of request it is dealing with and emulate the I/O correctly.
To explain this, we need to understand how the Descriptor Table is used in virtio-blk. The data sent by the guest to the Virtqueue follows the structure below:
struct virtio_blk_req {
    le32 type;
    le32 reserved;
    le64 sector;
    u8 data[];
    u8 status;
};
Source: Block Device: Device Operation
In practice, this is laid out as three entries in the Descriptor Table, with the entries linked by the next field.
- The first Descriptor entry points to the address containing the type, reserved, and sector data.
- The second Descriptor entry points to the beginning of the data area where the request's data is read or written.
- The third Descriptor entry points to the address where the status will be written by the host.
The type field indicates the type of I/O (e.g., read, write, or other I/O requests). By examining this value, the host can determine how to handle the request.
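For reference, the virtio-blk specification defines the following values for the type field (only the read and write requests are discussed here):

// Request types defined by the virtio-blk specification.
const VIRTIO_BLK_T_IN: u32 = 0;     // read: the device fills the data area
const VIRTIO_BLK_T_OUT: u32 = 1;    // write: the device writes the data area to disk
const VIRTIO_BLK_T_FLUSH: u32 = 4;  // flush request
const VIRTIO_BLK_T_GET_ID: u32 = 8; // request for the device identifier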
In the case of a read, the second Descriptor entry points to the area where the host should store the data it reads from the disk. The host determines which sector to read from based on the sector value and reads the necessary amount of data (the desc.len of the second Descriptor).
In the case of a write, the second Descriptor entry contains the data that should be written to the disk. The host reads this data and writes it to the sector specified by the sector value.
The third Descriptor entry is used to write status information indicating whether the I/O succeeded or failed.
In summary, the type of Disk I/O and the necessary data or buffers are provided through Virtqueue. It is the responsibility of the host to interpret this according to the specification, emulate the I/O correctly, and provide the appropriate status.
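In other words, the host ends up extracting the following pieces of information from the three-descriptor chain. The struct below is only an illustration of that mapping; the field names are assumptions and ToyVMM's actual Request structure may differ.

// Information the host gathers from the three-descriptor chain (illustrative).
struct Request {
    request_type: u32, // from descriptor 1: the type field (read / write / ...)
    sector: u64,       // from descriptor 1: the starting sector of the I/O
    data_addr: u64,    // from descriptor 2: guest address of the data buffer
    data_len: u32,     // from descriptor 2: length of the data buffer
    status_addr: u64,  // from descriptor 3: guest address where the status byte is written
}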
Implementation of Disk I/O in ToyVMM
Let's explain guest-originated disk I/O requests in the context of the implementation. Everything else is essentially the same as the Tx case of the Net device, so let's start from the point where processing is delegated to the host through QueueNotify.
Writing to MMIO's QueueNotify triggers an EventFd, which is picked up by epoll monitoring; specifically, the handler for QUEUE_AVAIL_EVENT is executed. In practice, the process_queue function is called, and if its return value is true, the signal_used_queue function is called.
The signal_used_queue function simply sends an interrupt to the guest, so the important part to examine is the process_queue function.
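"Sending an interrupt to the guest" here amounts to recording the reason in the interrupt status and signaling the interrupt eventfd. The sketch below only illustrates that idea; the field names are assumptions, and a closure stands in for the eventfd that would actually be wired to KVM. The VIRTIO_MMIO_INT_VRING bit (the "used buffer" notification) is defined by the virtio MMIO specification.

use std::sync::atomic::{AtomicU32, Ordering};

const VIRTIO_MMIO_INT_VRING: u32 = 0x01; // "used buffer" notification bit

struct Interrupt {
    interrupt_status: AtomicU32,
    notify: Box<dyn Fn()>, // stand-in for the irq eventfd
}

impl Interrupt {
    fn signal_used_queue(&self) {
        // Record why the interrupt was raised so the guest driver can read it
        // back from the MMIO InterruptStatus register...
        self.interrupt_status
            .fetch_or(VIRTIO_MMIO_INT_VRING, Ordering::SeqCst);
        // ...then inject the interrupt into the guest.
        (self.notify)();
    }
}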
In the process_queue function, the following steps are performed:
1. Initialize the necessary variables:
   - used_desc_heads[(u16, u32); 256]: stores the index and data length of processed Descriptors. This is used to populate the used_ring at the end of process_queue.
   - used_count: keeps track of how many I/O requests from the guest have been processed.
2. Iterate over the Virtqueue until it is exhausted, repeating steps 3 to 5.
3. Retrieve the Descriptor pointed to by the Available Ring, parse it according to the virtio-blk specification, and create a Request structure. The Request structure contains the parsed information: request type, sector, data address, data length, and status address.
4. Call the execute function, which performs the I/O request based on the content of the Request structure (see the sketch after this list). For successful I/O, it returns the length of the data read (for Read) or 0 (for Write and other types). This value is later written to the used_ring.
5. Write the status (success or failure of the I/O) to the status address and write the necessary information to the used_ring.
6. If one or more requests have been processed, return true as the function's return value.
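A minimal sketch of what the execute step might look like follows, assuming a 512-byte sector size and representing the guest data buffer as a plain byte slice instead of ToyVMM's guest memory abstraction; the function and type names are illustrative.

use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

const SECTOR_SIZE: u64 = 512;

enum RequestType {
    Read,
    Write,
}

// Performs the disk I/O for one request. Returns the value to report in the
// used ring: the number of bytes read for a Read, or 0 for a Write.
fn execute(
    disk: &mut File,
    request_type: RequestType,
    sector: u64,
    data: &mut [u8],
) -> std::io::Result<u32> {
    // Position the backing file at the requested sector.
    disk.seek(SeekFrom::Start(sector * SECTOR_SIZE))?;
    match request_type {
        RequestType::Read => {
            // Fill the guest-provided buffer from the disk image.
            disk.read_exact(data)?;
            Ok(data.len() as u32)
        }
        RequestType::Write => {
            // Write the guest-provided buffer out to the disk image.
            disk.write_all(data)?;
            Ok(0)
        }
    }
}

After execute returns, the status byte is written to the status address and the descriptor head index and returned length are pushed onto the used_ring, as described in step 5.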
The following diagrams illustrate the process when the guest-originated I/O request is a Read:
And here's the process when the guest-originated I/O request is a Write:
Verification of virtio-blk Operation
Now, let's perform a practical verification to demonstrate the functionality. Instead of using initrd.img, we will use an Ubuntu rootfs image, as Firecracker does, allowing us to boot the Ubuntu OS directly. With the virtio-blk BlockDevice implemented, the Ubuntu rootfs image is recognized as /dev/vda in the VM. To boot from this Ubuntu image, we need to specify root=/dev/vda in the VM's kernel cmdline.
# Run ToyVMM with kernel and rootfs (no initrd.img)
$ sudo -E cargo run -- boot_kernel -k vmlinux.bin -r ubuntu-18.04.ext4
...
# You can verify that the launched VM is ubuntu-based.
root@7e47bb8f2f0a:~# uname -r
4.14.174
root@7e47bb8f2f0a:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
# And you can also find that this VM mount /dev/vda as rootfs.
root@7e47bb8f2f0a:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 384M 0 disk /
root@7e47bb8f2f0a:~# ls -lat /
total 36
drwxr-xr-x 12 root root 360 Aug 14 13:47 run
drwxr-xr-x 11 root root 2460 Aug 14 13:46 dev
dr-xr-xr-x 12 root root 0 Aug 14 13:46 sys
drwxrwxrwt 7 root root 1024 Aug 14 13:46 tmp
dr-xr-xr-x 57 root root 0 Aug 14 13:46 proc
drwxr-xr-x 2 root root 3072 Jul 20 2021 sbin
drwxr-xr-x 2 root root 1024 Dec 16 2020 home
drwxr-xr-x 48 root root 4096 Dec 16 2020 etc
drwxr-xr-x 2 root root 1024 Dec 16 2020 lib64
drwxr-xr-x 2 root root 5120 May 28 2020 bin
drwxr-xr-x 20 root root 1024 May 13 2020 .
drwxr-xr-x 20 root root 1024 May 13 2020 ..
drwxr-xr-x 2 root root 1024 May 13 2020 mnt
drwx------ 4 root root 1024 Apr 7 2020 root
drwxr-xr-x 2 root root 1024 Apr 3 2019 srv
drwxr-xr-x 6 root root 1024 Apr 3 2019 var
drwxr-xr-x 10 root root 1024 Apr 3 2019 usr
drwxr-xr-x 9 root root 1024 Apr 3 2019 lib
drwx------ 2 root root 12288 Apr 3 2019 lost+found
drwxr-xr-x 2 root root 1024 Aug 21 2018 opt
As shown above, the VM boots from the Ubuntu-based image passed through as /dev/vda, and after logging in we can confirm that it is an Ubuntu-based OS with the rootfs mounted as intended. Furthermore, unlike the earlier initrd.img, whose rootfs was volatile, the rootfs here is persisted on disk, so files created within the VM are retained across VM reboots.
# Create a sample file (hello.txt) in the first VM boot and reboot.
root@7e47bb8f2f0a:~# echo "HELLO UBUNTU" > ./hello.txt
root@7e47bb8f2f0a:~# cat hello.txt
HELLO UBUNTU
root@7e47bb8f2f0a:~# reboot -f
Rebooting.
# After the second boot, you can also find 'hello.txt'.
Ubuntu 18.04.1 LTS 7e47bb8f2f0a ttyS0
7e47bb8f2f0a login: root
Password:
Last login: Mon Aug 14 13:57:27 UTC 2023 on ttyS0
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.14.174 x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
root@7e47bb8f2f0a:~# cat hello.txt
HELLO UBUNTU
With the implementation of both virtio-net and virtio-blk devices, you have successfully created a minimal VM with the necessary functionality.