# 01. Running Tiny Code in VM
Sorry, this content is currently available only in English.
Tiny code execution is no longer supported as of the latest commit.
You may be able to try it by checking out past commits, but please be aware that resolving package dependencies may be challenging.
This chapter is written so that you can get a sense of the behavior without actually running the code, so there is no need to run it yourself.
## Deep dive into ToyVMM and how to run tiny code in a VM
This `main` function is a program that starts a VM using the KVM mechanism and executes the following small piece of code inside the VM:
```rust
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, (%dx) */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, (%dx) */
    0xf4,             /* hlt */
];
```
This code performs several register operations; the initial state of the CPU registers for this VM is set as follows.
```rust
regs.rip = 0x1000;
regs.rax = 2;
regs.rbx = 2;
regs.rflags = 0x2;
vcpu.set_sregs(&sregs).unwrap();
vcpu.set_regs(&regs).unwrap();
```
This outputs the result of the calculation (2 + 2) performed inside the VM through the I/O port, followed by a newline character.
As you can see from the result of running ToyVMM, the hex values 0x34 (= '4') and 0xa (= newline) are captured from the I/O port.
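To trace the arithmetic: `%al` starts at 2 (the low byte of `rax`), `add %bl, %al` adds the 2 held in `rbx` to give 4, and `add $'0', %al` adds 0x30, producing 0x34, the ASCII code for '4'. With the exit handler shown later in this chapter, the two `out` instructions would therefore produce output along these lines (a sketch; the exact text depends on the handler):

```
Received I/O out exit. Address: 0x3f8, Data(hex): 0x34
Received I/O out exit. Address: 0x3f8, Data(hex): 0xa
```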
## How the above code works with rust-vmm libraries
The following crates provided by rust-vmm are used to run this code.
```
# Please see Cargo.toml
kvm-bindings
kvm-ioctls
vmm-sys-util
vm-memory
```
I omit the description of `vmm-sys-util` because it is only used to create temporary files at this point, so there is nothing special to mention about it.
I will go through the code in order and describe how each crate relates to it.
In this explanation, we will focus primarily on which ioctl is issued as a result of each function call (because the interface for manipulating KVM from user space relies on the ioctl system call).
Also, please note that explanations of unimportant variables may be omitted.
It should be noted that what is described here applies not only to the ToyVMM implementation but also to the firecracker implementation, which takes a similar form.
First, we need to open `/dev/kvm` and acquire a file descriptor. This can be done with `Kvm::new()` from the `kvm_ioctls` crate. Internally, the `Kvm::open_with_cloexec` function issues an `open` system call as follows and returns the file descriptor wrapped in a `Kvm` structure:
```rust
let ret = unsafe { open("/dev/kvm\0".as_ptr() as *const c_char, open_flags) };
```
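In user code this open is hidden behind the `kvm_ioctls` public API. A minimal sketch (the `get_api_version` check is an extra illustration, not part of the ToyVMM code):

```rust
use kvm_ioctls::Kvm;

// Kvm::new() opens /dev/kvm internally (via Kvm::open_with_cloexec).
let kvm = Kvm::new().expect("failed to open /dev/kvm");
// KVM_GET_API_VERSION is expected to return 12 on modern kernels.
assert_eq!(kvm.get_api_version(), 12);
```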
The `Kvm` structure obtained above is used to call the `create_vm` method, which results in the following ioctl being issued:
```
vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0)   // kvmfd: the file descriptor obtained from /dev/kvm
```
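From the caller's point of view this, too, is a single method call. Continuing the sketch from the `kvm` handle above:

```rust
// create_vm() issues ioctl(kvmfd, KVM_CREATE_VM, 0) and wraps the returned
// file descriptor in a kvm_ioctls::VmFd.
let vm = kvm.create_vm().expect("failed to create VM");
```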
Please keep in mind that the file descriptor returned from the above call will be used later when preparing the vCPU.
At this point we have created a VM, but it has no memory or CPU yet.
The next step is to prepare memory!
In the `kvm_ioctls` example, memory is prepared as follows:
```rust
// First, set up the guest memory using mmap.
let load_addr: *mut u8 = unsafe {
    libc::mmap(
        null_mut(),
        mem_size, // 0x4000
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
        -1,
        0,
    ) as *mut u8
};

// Second, set up the kvm_userspace_memory_region structure using the memory above.
// kvm_userspace_memory_region is defined in the kvm-bindings crate.
let mem_region = kvm_userspace_memory_region {
    slot,
    guest_phys_addr: guest_addr,  // 0x1000
    memory_size: mem_size as u64, // 0x4000
    userspace_addr: load_addr as u64,
    flags: KVM_MEM_LOG_DIRTY_PAGES,
};
unsafe { vm.set_user_memory_region(mem_region).unwrap() };

// Retrieve a slice from the pointer and length (slice::from_raw_parts_mut)
// > https://doc.rust-lang.org/beta/std/slice/fn.from_raw_parts_mut.html
// and write asm_code into this slice (&[u8], &mut [u8], Vec<u8> implement the Write trait!)
// > https://doc.rust-lang.org/std/primitive.slice.html#impl-Write
unsafe {
    let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
    slice.write(&asm_code).unwrap();
}
```
Check `set_user_memory_region`. This function issues the following ioctl, which attaches the memory to the VM:
```
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mem_region)
```
ToyVMM, on the other hand, provides utility functions for memory preparation.
The difference comes from the fact that ToyVMM's implementation follows firecracker's, but both are essentially doing the same thing.
Let's look at the whole implementation first:
```rust
// The following `create_region` function operates on a file descriptor,
// so first create a temporary file and write asm_code to it.
let mut file = TempFile::new().unwrap().into_file();
assert_eq!(unsafe { libc::ftruncate(file.as_raw_fd(), 4096 * 10) }, 0);
let code: &[u8] = &[
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, b'0',       /* add $'0', %al */
    0xee,             /* out %al, %dx */
    0xb0, b'\n',      /* mov $'\n', %al */
    0xee,             /* out %al, %dx */
    0xf4,             /* hlt */
];
file.write_all(code).expect("Failed to write code to tempfile");

// The create_region function creates a GuestRegion (details are described below).
let mut mmap_regions = Vec::with_capacity(1);
let region = create_region(
    Some(FileOffset::new(file, 0)),
    0x1000,
    libc::PROT_READ | libc::PROT_WRITE,
    libc::MAP_NORESERVE | libc::MAP_PRIVATE,
    false,
).unwrap();

// The Vec named 'mmap_regions' contains the GuestRegionMmap entries.
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());

// guest_memory is represented as a Vec of GuestRegions.
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();
let track_dirty_page = false;

// Set up the guest memory.
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
```
The `create_region` function ultimately performs an mmap in the following way and returns the structure representing part of the guest memory (`GuestMmapRegion`):
```rust
pub fn create_region(
    maybe_file_offset: Option<FileOffset>,
    size: usize,
    prot: i32,
    flags: i32,
    track_dirty_pages: bool,
) -> Result<GuestMmapRegion, MmapRegionError> {
    ...
    let region_addr = unsafe {
        libc::mmap(
            region_start_addr as *mut libc::c_void,
            size,
            prot,
            flags | libc::MAP_FIXED,
            fd,
            offset as libc::off_t,
        )
    };
    let bitmap = match track_dirty_pages {
        true => Some(AtomicBitmap::with_len(size)),
        false => None,
    };
    unsafe {
        MmapRegionBuilder::new_with_bitmap(size, bitmap)
            .with_raw_mmap_pointer(region_addr as *mut u8)
            .with_mmap_prot(prot)
            .with_mmap_flags(flags)
            .build()
    }
}
```
Let's look at the memory-related structures here. In `src/kvm/memory.rs`, the following types are defined based on the `vm-memory` crate:
```rust
pub type GuestMemoryMmap = vm_memory::GuestMemoryMmap<Option<AtomicBitmap>>;
pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
pub type GuestMmapRegion = vm_memory::MmapRegion<Option<AtomicBitmap>>;
```
The `MmapRegionBuilder` is also defined in the `vm-memory` crate, and its `build` method creates the `MmapRegion`.
This time, since we performed the mmap ourselves in advance and passed the resulting address via `with_raw_mmap_pointer`, that area is used for the initialization; otherwise, the mmap is performed inside the `build` method. In either case, the `build` method produces an `MmapRegion` structure, for which the synonym `GuestMmapRegion` is defined as shown above, and that is what gets returned. A single call to the `create_region` function allocates and returns one region of guest memory based on the information (`size`, `flags`, etc.) specified in the arguments.
The region allocated here has merely been mmapped into the VMM process's virtual address space; by itself it carries no further information. To use this area as guest memory, a `GuestRegionMmap` structure is created from it. This is simple: specify the `GuestAddress` corresponding to this region and initialize a `GuestRegionMmap` from the tuple of the mmapped area and that GuestAddress. In the following code, the initialized `GuestRegionMmap` is pushed onto a Vec for subsequent processing.
```rust
mmap_regions.push(GuestRegionMmap::new(region, GuestAddress(0x1000)).unwrap());
```
Now, the `mmap_regions: Vec<GuestRegionMmap>` created as above represents the entire memory of the guest VM; each region that makes up the guest memory holds both the area allocated by the VMM for it and the corresponding guest-side start address.
The `GuestMemoryMmap` structure representing the guest memory is initialized from this Vec and attached to the VM by the `memory_init` method.
```rust
let guest_memory = GuestMemoryMmap::from_regions(mmap_regions).unwrap();
vm.memory_init(&guest_memory, kvm.get_nr_memslots(), track_dirty_page).unwrap();
```
Next, let's check what this `memory_init` does. It calls `set_kvm_memory_regions`, where the actual processing lives:
```rust
pub fn set_kvm_memory_regions(
    &self,
    guest_mem: &GuestMemoryMmap,
    track_dirty_pages: bool,
) -> Result<()> {
    let mut flags = 0u32;
    if track_dirty_pages {
        flags |= KVM_MEM_LOG_DIRTY_PAGES;
    }
    guest_mem
        .iter()
        .enumerate()
        .try_for_each(|(index, region)| {
            let memory_region = kvm_userspace_memory_region {
                slot: index as u32,
                guest_phys_addr: region.start_addr().raw_value() as u64,
                memory_size: region.len() as u64,
                userspace_addr: guest_mem.get_host_address(region.start_addr()).unwrap() as u64,
                flags,
            };
            unsafe { self.fd.set_user_memory_region(memory_region) }
        })
        .map_err(Error::SetUserMemoryRegion)?;
    Ok(())
}
```
Here we can see that `set_user_memory_region` is called with the necessary information while iterating over the regions.
In other words, it does the same thing as the example code, except that there may be more than one region.
Now that we've gone through the explanation of memory preparation, let's take a look at the `vm-memory` crate!
The information presented here is only the minimum required, so please refer to Design or other sources for more details.
This also relates to the iteration above, where methods such as `start_addr()` and `len()` were used to construct the necessary information for `set_user_memory_region`.
```
GuestAddress (struct)     : Represents a guest physical address (GPA)
FileOffset (struct)       : Represents the start point within a 'File' that backs a 'GuestMemoryRegion'
GuestMemoryRegion (trait) : Represents a continuous region of guest physical memory
GuestMemory (trait)       : Represents a container for an immutable collection of GuestMemoryRegion objects
MmapRegion (struct)       : Helper structure for working with mmaped memory regions
GuestRegionMmap (struct, implements GuestMemoryRegion) :
    Represents a continuous region of the guest's physical memory that is backed
    by a mapping in the virtual address space of the calling process
GuestMemoryMmap (struct, implements GuestMemory) :
    Represents the entire physical memory of the guest by tracking all its memory regions
```
Since `GuestRegionMmap` implements the `GuestMemoryRegion` trait, functions such as `start_addr()` and `len()`, which were used in the above iteration, are available on it.
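As a small illustration of these trait methods, the following sketch (assuming the `guest_memory` built earlier in this chapter) walks the regions and prints the same fields that `set_kvm_memory_regions` packs into `kvm_userspace_memory_region`:

```rust
use vm_memory::{Address, GuestMemory, GuestMemoryRegion};

// Walk all regions of the guest memory and print the values that would be
// handed to KVM_SET_USER_MEMORY_REGION for each slot.
for (index, region) in guest_memory.iter().enumerate() {
    println!(
        "slot={} guest_phys_addr={:#x} memory_size={:#x}",
        index,
        region.start_addr().raw_value(),
        region.len(),
    );
}
```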
The following figure briefly summarizes the relationship between these structures.
As you can see, what is being done is essentially the same.
The final step is preparing the vCPU (a virtual CPU to be attached to the virtual machine).
Currently, a VM has been created and memory containing instructions has been attached to it, but there is no CPU yet, so the instructions cannot be executed. Therefore, let's create a vCPU, associate it with the VM, and execute the instructions by running the vCPU!
Using the file descriptor obtained during VM creation (`vmfd`), the following ioctl will be issued:
```
vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0)
```
The `create_vm` method that was used to obtain the `vmfd` returns a `kvm_ioctls::VmFd` structure, and executing its `create_vcpu` method issues the above ioctl and returns the result as a `kvm_ioctls::VcpuFd` structure.
`VcpuFd` provides utilities for getting and setting various CPU states.
For example, if you want to get or set a register set from the vCPU, you would normally issue the following ioctls:
```
ioctl(vcpufd, KVM_GET_SREGS, &sregs);
ioctl(vcpufd, KVM_SET_SREGS, &sregs);
```
For these, the following methods are available in `kvm_ioctls::VcpuFd`:
```rust
get_sregs(&self) -> Result<kvm_sregs>
set_sregs(&self, sregs: &kvm_sregs) -> Result<()>
```
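As a usage sketch (the vCPU creation and the segment-register tweaks below are illustrative, in the spirit of the LWN example, rather than code taken verbatim from ToyVMM):

```rust
// Create a vCPU; internally this issues ioctl(vmfd, KVM_CREATE_VCPU, 0).
let vcpu = vm.create_vcpu(0).expect("failed to create vCPU");

// get_sregs/set_sregs wrap KVM_GET_SREGS/KVM_SET_SREGS.
let mut sregs = vcpu.get_sregs().unwrap();
sregs.cs.base = 0; // place the code segment at guest physical address 0
sregs.cs.selector = 0;
vcpu.set_sregs(&sregs).unwrap();
```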
`VcpuFd` also provides a method called `run`, which issues the following ioctl to actually run the vCPU:
```
ioctl(vcpufd, KVM_RUN, NULL)
```
This method returns a value of type `Result<VcpuExit>`.
While the vCPU is running, exits occur for various reasons; such an exit represents something the CPU cannot handle by itself, and the OS usually deals with it by invoking the corresponding handler.
When this type of exit comes back from the VM's vCPU, as in this case, we need to write appropriate code to handle the situation.
`VcpuExit` is defined as an enum in `kvm_ioctls::VcpuExit`.
When an exit occurs for one of several reasons while running the vCPU, the exit reason defined in `kvm.h` in the Linux kernel is wrapped into a `VcpuExit` variant.
Therefore, it is sufficient to pattern match on this result and handle each case appropriately.
Now, the code above contains instructions that output values through an I/O port, and these cause `KVM_EXIT_IO_OUT` exits.
`VcpuExit` wraps this exit reason as `IoOut`.
Originally (in a C program, for example), we would have to calculate the appropriate offset into the shared `kvm_run` structure to get the output data for the I/O port, but this processing is now implemented in the `run` method, which returns a `VcpuExit` that already contains the necessary values.
So we don't have to write that unsafe code (pointer offset calculation) ourselves and can handle these exits however we like.
```rust
loop {
    match vcpu.run().expect("vcpu run failed") {
        kvm_ioctls::VcpuExit::IoOut(addr, data) => {
            println!(
                "Received I/O out exit. \
                 Address: {:#x}, Data(hex): {:#x}",
                addr, data[0],
            );
        }
        kvm_ioctls::VcpuExit::Hlt => {
            break;
        }
        exit => panic!("unexpected exit reason: {:?}", exit),
    }
}
```
In the above, only `KVM_EXIT_IO_OUT` and `KVM_EXIT_HLT` are handled, and anything else causes a panic. (Although all exits should be handled, I want to keep the focus on the KVM API example and keep it simple.)
Since we are here, let's take a look at the processing of the `run` method in some detail, and check how `KVM_EXIT_IO_OUT` is handled.
If you look at the LWN article, you will see that it calculates the offset and outputs the necessary information in the following way.
```c
case KVM_EXIT_IO:
    if (run->io.direction == KVM_EXIT_IO_OUT &&
        run->io.size == 1 &&
        run->io.port == 0x3f8 &&
        run->io.count == 1)
        putchar(*(((char *)run) + run->io.data_offset));
    else
        errx(1, "unhandled KVM_EXIT_IO");
    break;
```
On the other hand, the `run` method implemented in `kvm_ioctls::VcpuFd` looks like this:
```rust
...
let run = self.kvm_run_ptr.as_mut_ref();
match run.exit_reason {
    ...
    KVM_EXIT_IO => {
        let run_start = run as *mut kvm_run as *mut u8;
        // Safe because the exit_reason (which comes from the kernel) told us which
        // union field to use.
        let io = unsafe { run.__bindgen_anon_1.io };
        let port = io.port;
        let data_size = io.count as usize * io.size as usize;
        // The data_offset is defined by the kernel to be some number of bytes into the
        // kvm_run structure, which we have fully mmap'd.
        let data_ptr = unsafe { run_start.offset(io.data_offset as isize) };
        // The slice's lifetime is limited to the lifetime of this vCPU, which is equal
        // to the mmap of the `kvm_run` struct that this is slicing from.
        let data_slice = unsafe {
            std::slice::from_raw_parts_mut::<u8>(data_ptr as *mut u8, data_size)
        };
        match u32::from(io.direction) {
            KVM_EXIT_IO_IN => Ok(VcpuExit::IoIn(port, data_slice)),
            KVM_EXIT_IO_OUT => Ok(VcpuExit::IoOut(port, data_slice)),
            _ => Err(errno::Error::new(EINVAL)),
        }
    }
    ...
}
```
Let me explain a little. The `kvm_run` structure is provided by the `kvm-bindings` crate; it is automatically generated from the kernel header using `bindgen`, so it is essentially the Linux kernel's `kvm_run` structure converted directly to Rust.
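For illustration, the generated structure looks roughly like this (an abbreviated sketch of the `kvm-bindings` definition, with most fields omitted):

```rust
#[repr(C)]
pub struct kvm_run {
    // ... many fields omitted ...
    pub exit_reason: u32,
    // ... more fields omitted ...
    // The kernel's anonymous union of exit-specific data (io, mmio, ...)
    // receives an auto-generated name from bindgen:
    pub __bindgen_anon_1: kvm_run__bindgen_ty_1,
    // ... rest omitted ...
}
```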
First, `kvm_run` is obtained in the form of a raw pointer, a technique often used in Rust for this kind of access; this corresponds to the first address of the `kvm_run` structure and is bound to the `run_start` variable.
The information corresponding to `run->io(.member)` can then be obtained from `run.__bindgen_anon_1.io`, although it is a bit tricky: the field named `__bindgen_anon_1` is an artifact of the automatic generation by `bindgen`.
The data we want is located at the first address of `kvm_run` plus `io.data_offset`. This lookup is performed by `run_start.offset(io.data_offset as isize)`. The data size can be calculated from `io->size` and `io->count` (in the LWN example it is one byte, so it is read directly from the offset by `putchar`). This size is computed and stored in `data_size`, and `std::slice::from_raw_parts_mut` retrieves the data using it.
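To make the offset arithmetic concrete, here is a worked example with hypothetical numbers (the actual `data_offset` value is chosen by the kernel):

```rust
// Hypothetical values for a single 1-byte `out` to port 0x3f8:
//   io.port        = 0x3f8
//   io.size        = 1
//   io.count       = 1
//   io.data_offset = 4096   // kernel-chosen offset into the mmap'd kvm_run area
//
// Then:
//   data_size = io.count * io.size          // = 1 byte
//   data_ptr  = run_start + io.data_offset  // run_start.offset(4096)
//
// and data_slice covers exactly the one byte the guest wrote with `out`.
```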
Finally, checking `io.direction`, the exit is wrapped as `IoIn` or `IoOut` for `KVM_EXIT_IO_IN` and `KVM_EXIT_IO_OUT` respectively, and the desired information such as `port` and `data_slice` is returned together.
As can be seen from the above, what is being done is clear, but it involves many unsafe operations because of the pointer manipulation.
By using these libraries, we can build a VMM on top of a stable implementation of such details.
Well, it's been a long time coming, but let's look back once more at the rust-vmm crates we are using:
```
kvm-bindings : Library that includes structures automatically generated from kvm.h by bindgen.
kvm-ioctls   : Library that hides the ioctls and unsafe code related to KVM operations and
               provides user-friendly structures, functions, and methods.
vm-memory    : Library that provides structures and operations for working with guest memory.
```
This knowledge is fundamental and will come up again and again in future discussions.