Implement virtio-net device
In this section, we implement a Network Device as a concrete Virtio device. The specification can be found in the official OASIS documentation (Network Device), but note that this implementation does not necessarily align perfectly with it. If you haven't read the previous sections, be sure to review them before continuing.
virtio-net Mechanism
In virtio-net, three types of Virtqueues are typically used: the Transmit Queue, the Receive Queue, and the Control Queue.
The Transmit Queue is used for data transmission from the guest to the host, and the Receive Queue for data transmission from the host to the guest. The Control Queue handles guest-to-host operations related to NIC settings, such as enabling promiscuous mode and enabling/disabling broadcast or multicast reception.
For the sake of brevity, we are omitting the implementation of the Control Queue in this section. It is worth noting that the specification allows scaling the number of Virtqueues, but for simplicity, we are not implementing that.
In the following sections, we will provide detailed implementation-based explanations of the network device.
Network Device Implementation Details
The implementation of virtio-net can be found in net.rs.
We will break down and explain the initialization phase and the post-initialization phase.
The following diagram primarily focuses on the initialization process of the Network Device:
The Net struct implements the VirtioDevice trait and is associated with the MmioTransport. As mentioned earlier, the device-specific operations performed during MMIO Transport initialization depend on the implementation of this Net struct.
For example, during initialization a query about the Device Type occurs. According to the specification, a Net device should return 0x01, and the Net struct implements it as follows:
impl VirtioDevice for Net {
    fn device_type(&self) -> u32 {
        // types::NETWORK_CARD: u32 = 0x01
        types::NETWORK_CARD
    }
    ...
}
Similarly, queries about the Device Features should be implemented to return device-specific values. Additionally, during initialization the guest OS sets up the Descriptor Table, Available Ring, and Used Ring of each Virtqueue and notifies the device of their addresses. These addresses are stored per queue so that they can be referenced during actual processing.
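To make this concrete, here is a minimal sketch of how the addresses notified by the guest might be kept per queue. The GuestAddress and Queue types below are simplified stand-ins for illustration, not ToyVMM's actual definitions.

/// Hypothetical, simplified model of a Virtqueue as seen from the VMM side.
#[derive(Clone, Copy, Debug, Default)]
struct GuestAddress(u64);

#[derive(Debug, Default)]
struct Queue {
    size: u16,
    ready: bool,
    desc_table: GuestAddress, // set when the guest notifies the Descriptor Table address
    avail_ring: GuestAddress, // set when the guest notifies the Available Ring address
    used_ring: GuestAddress,  // set when the guest notifies the Used Ring address
}

impl Queue {
    /// Record the addresses the guest driver notified via the MMIO registers.
    fn set_addresses(&mut self, desc: u64, avail: u64, used: u64) {
        self.desc_table = GuestAddress(desc);
        self.avail_ring = GuestAddress(avail);
        self.used_ring = GuestAddress(used);
    }
}

fn main() {
    // A Net device would typically hold one RX and one TX queue.
    let mut rx = Queue { size: 256, ..Default::default() };
    rx.set_addresses(0x4000, 0x5000, 0x6000);
    rx.ready = true;
    println!("{:?}", rx);
}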
Once the initialization steps are completed and the status is updated to a specific value, ToyVMM executes the activate function implemented in the device. In the case of the Net device, this activate function registers various file descriptors with epoll and sets up the handler (e.g., NetEpollHandler) triggered by epoll events. The Net device emulates I/O by creating a Tap device on the host side, writing data received from the guest via the Virtqueue to the Tap device for transmission (Tx), and writing incoming data from the Tap device to the Virtqueue to notify the guest (Rx). Four file descriptors are registered with epoll: the fd of the tap device, an eventfd for notification of the Tx Virtqueue, an eventfd for notification of the Rx Virtqueue, and an eventfd for halting in unexpected situations.
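As an illustration of this registration step, the following is a minimal sketch using the libc crate's epoll bindings. The token constants and the helper function are hypothetical names for this example, and the tap fd and Virtqueue notification fds are stood in by plain eventfds so the snippet stays self-contained (error handling is mostly elided).

use std::io;
use std::os::unix::io::RawFd;

// Illustrative token values used to tell the events apart when epoll fires.
const TX_QUEUE_EVENT: u64 = 0;
const RX_QUEUE_EVENT: u64 = 1;
const RX_TAP_EVENT: u64 = 2;
const KILL_EVENT: u64 = 3;

// Register `fd` for read readiness on the given epoll instance.
fn register_readable(epoll_fd: RawFd, fd: RawFd, token: u64) -> io::Result<()> {
    let mut event = libc::epoll_event {
        events: libc::EPOLLIN as u32,
        u64: token,
    };
    let ret = unsafe { libc::epoll_ctl(epoll_fd, libc::EPOLL_CTL_ADD, fd, &mut event) };
    if ret < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let epoll_fd = unsafe { libc::epoll_create1(0) };
    // Stand-ins: a real device would use the tap fd and the queue eventfds here.
    let tap_fd = unsafe { libc::eventfd(0, 0) };
    let tx_evt = unsafe { libc::eventfd(0, 0) };
    let rx_evt = unsafe { libc::eventfd(0, 0) };
    let kill_evt = unsafe { libc::eventfd(0, 0) };

    register_readable(epoll_fd, tx_evt, TX_QUEUE_EVENT)?;
    register_readable(epoll_fd, rx_evt, RX_QUEUE_EVENT)?;
    register_readable(epoll_fd, tap_fd, RX_TAP_EVENT)?;
    register_readable(epoll_fd, kill_evt, KILL_EVENT)?;
    Ok(())
}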
Next, we will provide a detailed diagram of the Network Device in its activated state:
When one of the file descriptors registered with epoll triggers an event, the NetEpollHandler is dispatched for event processing. NetEpollHandler varies its actions based on which event was triggered. In any case, within the NetEpollHandler, it references the Virtqueue and performs I/O emulation.
One important point to note is that the device initialization process, driven by KVM_EXIT_MMIO, runs on the thread handling the vCPU, which in ToyVMM is separate from the main thread. The thread responsible for executing I/O (currently the main thread) is also separate from the vCPU processing thread. To facilitate communication between these threads, a channel is used to send the initialized NetEpollHandler from the vCPU thread to the I/O thread. This allows I/O to be processed while the guest VM is running and CPU emulation continues in a separate thread.
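The thread hand-off can be pictured with a standard mpsc channel, as in the sketch below. The NetEpollHandler here is a bare placeholder struct, not ToyVMM's actual handler.

use std::sync::mpsc;
use std::thread;

// Placeholder for the handler that is built once the device is activated.
struct NetEpollHandler {
    // ... virtqueues, tap fd, interrupt eventfd, etc. would live here
    name: String,
}

fn main() {
    let (sender, receiver) = mpsc::channel::<NetEpollHandler>();

    // vCPU thread: handles KVM_EXIT_MMIO and, on activation, sends the handler.
    let vcpu_thread = thread::spawn(move || {
        let handler = NetEpollHandler { name: "virtio-net".to_string() };
        sender.send(handler).expect("I/O thread has hung up");
    });

    // I/O thread (here: the main thread) receives the handler and would then
    // start dispatching epoll events to it.
    let handler = receiver.recv().expect("vCPU thread has hung up");
    println!("received handler for {}", handler.name);

    vcpu_thread.join().unwrap();
}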
As mentioned earlier, communication between the host and guest is primarily triggered by events related to Virtqueue Eventfds and Tap device file descriptors. In the following sections, we will provide more detailed explanations of how processing occurs in both the Tx and Rx cases.
Tx (Guest -> Host)
Let's start by examining the implementation of communication in the Guest -> Host direction (Tx). For Tx, the Descriptor Table, Available Ring, and Used Ring function as follows:
- Descriptor Table: contains descriptors that point to the data the Guest is trying to transmit.
- Available Ring: stores the index of the descriptor pointing to the transmit data. The Host reads this index and performs Tx emulation.
- Used Ring: stores the index of descriptors that have been processed on the Host side. The Guest reads this index to reclaim processed descriptors.
Tx initiates when the guest (guest device driver) prepares a packet, and control is transferred to ToyVMM when a write operation occurs on QueueNotify.
Specifically, in the guest, the following steps are expected:
- The guest sets the data address and length in the first Descriptor's addr and len fields.
- The guest stores the index of the Descriptor pointing to the transmit data in the Available Ring entry pointed to by the Available Ring index.
- The guest increments the Available Ring index.
- To notify the host of unprocessed data, the guest writes to the MMIO QueueNotify.
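To make the ring bookkeeping in these steps concrete, the following toy sketch models the driver-side updates with plain Rust types. The struct layout only loosely mirrors the virtio ring layout, and all names are illustrative assumptions, not driver code.

// Simplified virtio structures, laid out as plain Rust types for illustration.
#[derive(Clone, Copy, Default)]
struct Descriptor {
    addr: u64, // guest-physical address of the buffer
    len: u32,  // length of the buffer
    flags: u16,
    next: u16, // index of the next descriptor when chained
}

#[derive(Default)]
struct AvailRing {
    idx: u16,       // incremented by the guest after adding an entry
    ring: Vec<u16>, // indices of descriptor chain heads
}

fn main() {
    const QUEUE_SIZE: usize = 256;
    let mut desc_table = vec![Descriptor::default(); QUEUE_SIZE];
    let mut avail = AvailRing { idx: 0, ring: vec![0u16; QUEUE_SIZE] };

    // 1. Point the first free descriptor at the frame to transmit.
    desc_table[0].addr = 0x1000; // hypothetical guest address of the frame
    desc_table[0].len = 1514;

    // 2. Publish the chain head index in the entry the avail index points to.
    avail.ring[avail.idx as usize % QUEUE_SIZE] = 0;

    // 3. Increment the avail index (it wraps at the u16 range, not the queue size).
    avail.idx = avail.idx.wrapping_add(1);

    // 4. In a real driver, a write to the MMIO QueueNotify register follows,
    //    which on the VMM side surfaces as an eventfd trigger.
    println!("avail.idx = {}", avail.idx);
}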
Now, let's shift our focus to the host side, which is handled by ToyVMM. The eventfd triggered by the write to MMIO's QueueNotify is picked up by epoll monitoring, which triggers the NetEpollHandler's handler processing, specifically the operation corresponding to TX_QUEUE_EVENT.
The implementation calls the process_tx function.
In process_tx, the processing proceeds as follows (a simplified sketch follows the list):
1. Initialization of necessary variables, including:
   - frame ([0u8; 65562]): a buffer into which the data prepared by the guest is copied.
   - used_desc_heads ([0u16; 256]): storage for the indexes of processed Descriptors, used to update the Used Ring at the end.
   - used_count: a counter that keeps track of how much data has been read from the guest.
2. Iteration over the TX Virtqueue until it stops, repeating steps 3 to 5.
3. Reading the data (located at addr) pointed to by the Descriptor and loading it into the buffer. If the next field points to another Descriptor, it is followed and its data is read out as well.
4. Writing the read data to the Tap device.
5. Storing the index of the processed Descriptor (the Descriptor pointed to by the Available Ring) in used_desc_heads.
6. Updating the Used Ring with the indexes of the processed Descriptors and the total amount of data stored.
7. Writing to the eventfd associated with the irq to trigger an interrupt and delegate processing to the guest.
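The sketch below condenses this flow into a self-contained toy model: guest memory is a plain byte slice, the Tap device is any std::io::Write sink, and the ring structures are the same simplified stand-ins as in the earlier sketch. It illustrates the loop under those assumptions, not ToyVMM's actual process_tx.

use std::io::Write;

#[derive(Clone, Copy, Default)]
struct Descriptor { addr: u64, len: u32, flags: u16, next: u16 }
const VIRTQ_DESC_F_NEXT: u16 = 0x1;

fn process_tx<W: Write>(
    mem: &[u8],                      // stand-in for guest memory
    desc_table: &[Descriptor],
    avail_ring: &[u16],              // published chain heads
    avail_idx: u16,                  // guest's current avail index
    next_avail: &mut u16,            // last index the host has consumed
    used_ring: &mut Vec<(u16, u32)>, // (head index, bytes written back)
    tap: &mut W,
) {
    let mut frame = [0u8; 65562];
    while *next_avail != avail_idx {
        let head = avail_ring[*next_avail as usize % avail_ring.len()];
        let mut desc = desc_table[head as usize];
        let mut used_count = 0usize;
        // Follow the descriptor chain, copying each buffer into `frame`.
        loop {
            let start = desc.addr as usize;
            let end = start + desc.len as usize;
            frame[used_count..used_count + desc.len as usize]
                .copy_from_slice(&mem[start..end]);
            used_count += desc.len as usize;
            if desc.flags & VIRTQ_DESC_F_NEXT == 0 { break; }
            desc = desc_table[desc.next as usize];
        }
        // Emulate Tx by writing the assembled frame to the tap device.
        tap.write_all(&frame[..used_count]).unwrap();
        used_ring.push((head, 0)); // for Tx the device writes no data back
        *next_avail = next_avail.wrapping_add(1);
    }
    // After this, the device writes to the irq eventfd to interrupt the guest.
}

fn main() {
    let mem = vec![0xabu8; 4096];
    let desc_table = vec![Descriptor { addr: 0, len: 64, flags: 0, next: 0 }];
    let avail_ring = vec![0u16];
    let mut next_avail = 0u16;
    let mut used_ring = Vec::new();
    let mut tap = Vec::new(); // a Vec<u8> stands in for the tap fd
    process_tx(&mem, &desc_table, &avail_ring, 1, &mut next_avail, &mut used_ring, &mut tap);
    println!("wrote {} bytes to the fake tap", tap.len());
}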
On the guest side, the following steps are expected:
1. Check the index of the Used Ring; if there is a difference between the index of processed entries and the previously recorded index, check and process the Descriptor indexes that fill this gap.
2. The Descriptors pointed to by these indexes have been processed on the host side, so they are returned to the chain of free Descriptors, and the recorded Descriptor numbers are updated.
3. Repeat steps 1 and 2 until there is no difference between the index of the Used Ring and the recorded index position.
This completes the Tx processing.
Rx (Host -> Guest)
Next, let's explain the communication from the host to the guest (Rx) while referring to the implementation.
In the case of Rx, the Descriptor Table, Available Ring, and Used Ring function as follows:
- Descriptor Table: contains descriptors that point to received data, allowing the Guest to access the data received from the Tap.
- Available Ring: used by the Guest to hand over free descriptors that the Host can fill with received data.
- Used Ring: stores the index of descriptors pointing to received data, which the Guest reads to process the necessary descriptors.
Comparing Rx to Tx, you can see that the roles of the Available Ring and Used Ring are reversed.
Unlike Tx, Rx must handle two types of event triggers: incoming packets from the Tap device and completion notifications from the guest for the Rx Virtqueue. Managing these two triggers makes Rx handling more complex than Tx.
First, let's discuss the basic Rx processing flow, followed by considerations for cooperative behavior.
Basic Rx Processing Flow
The host receives data from the Tap device and needs to notify the guest by filling the Rx Virtqueue with data. To do this, some basic setup is required for the Rx Virtqueue, such as knowing where to place the data. It's important to remember that, from the perspective of ToyVMM, each element of the Virtqueue consists only of guest memory addresses, and necessary operations are performed based on Virtqueue memory access.
Returning to the guest, the following steps are expected:
- After initializing the Descriptor chains and other settings, the guest writes the index of the head of a free Descriptor chain into the empty entry pointed to by the Available Ring index.
- The guest increments the Available Ring index.
- To notify the host, the guest writes to the MMIO QueueNotify.
On the host side, when the Rx Virtqueue notification is received from the guest, it is interpreted as meaning that the buffers for Rx data are ready to be accessed.
Suppose the Tap device receives a packet at this point. By detecting the trigger on the Tap's file descriptor, the NetEpollHandler is dispatched and performs event processing for RX_TAP_EVENT. This processing mainly involves calling the process_rx function. However, there are certain conditions under which this may not happen, which we will discuss later.
process_rx proceeds as follows (a simplified sketch follows the list):
- process_rx loops until no more data can be read from the Tap, processing as many received frames as possible.
- If a read from the Tap succeeds, the size of the read data is stored in self.rx_count, and the rx_single_frame function, which processes a single frame, is called.
- In rx_single_frame, the first entry from the Available Ring is retrieved, and the head of the free Descriptor chain that this entry points to is extracted.
- The received frame's data is stored in the Descriptor, calculating the size along the way. If the received frame cannot fit into a single Descriptor, the next field of the Descriptor is followed to continue storing data.
- The Used Ring of the Rx Virtqueue is updated with the index of the Descriptor containing the Rx data and the total amount of data stored.
- An interrupt is triggered by writing to the eventfd associated with the irq to delegate processing to the guest.
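As with Tx, the Rx path can be reduced to a toy sketch: one frame read from the Tap is scattered into the buffers of the free Descriptor chain supplied by the guest, and the Used Ring records how many bytes were written. The types and names are simplified assumptions mirroring the earlier sketches, not ToyVMM's actual rx_single_frame.

#[derive(Clone, Copy)]
struct Descriptor { addr: u64, len: u32, flags: u16, next: u16 }
const VIRTQ_DESC_F_NEXT: u16 = 0x1;

// Write one received frame into the chain whose head the Available Ring provides.
// Returns false when the chain is too small to hold the frame (the deferred case).
fn rx_single_frame(
    mem: &mut [u8],                  // stand-in for guest memory
    desc_table: &[Descriptor],
    head: u16,
    frame: &[u8],
    used_ring: &mut Vec<(u16, u32)>, // (head index, total bytes written)
) -> bool {
    let mut desc = desc_table[head as usize];
    let mut written = 0usize;
    while written < frame.len() {
        let chunk = (frame.len() - written).min(desc.len as usize);
        let start = desc.addr as usize;
        mem[start..start + chunk].copy_from_slice(&frame[written..written + chunk]);
        written += chunk;
        if written == frame.len() { break; }
        if desc.flags & VIRTQ_DESC_F_NEXT == 0 {
            return false; // frame does not fit; wait for a bigger chain
        }
        desc = desc_table[desc.next as usize];
    }
    used_ring.push((head, written as u32));
    true // the device then writes the irq eventfd to notify the guest
}

fn main() {
    let mut mem = vec![0u8; 4096];
    // Two chained descriptors of 32 bytes each, provided by the guest.
    let desc_table = vec![
        Descriptor { addr: 0, len: 32, flags: VIRTQ_DESC_F_NEXT, next: 1 },
        Descriptor { addr: 32, len: 32, flags: 0, next: 0 },
    ];
    let frame = vec![0x55u8; 48]; // pretend this was read from the tap
    let mut used_ring = Vec::new();
    let ok = rx_single_frame(&mut mem, &desc_table, 0, &frame, &mut used_ring);
    println!("fit = {}, used = {:?}", ok, used_ring);
}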
The following diagram illustrates the process of writing the received data into the Descriptor chain using the Available Ring:
Once the data from the Tap device has been written, the Used Ring is updated, and an interrupt is sent to the guest.
On the guest side, the guest checks the Used Ring index, references the Descriptors pointed to by new entries, retrieves and processes the Rx data, and performs any necessary operations. It then updates the Available Ring, signaling to the host that it is ready to accept new data.
When Tap Trigger Occurs Without Rx Virtqueue Preparation
It is expected that there may be cases where Tap receives a packet when Rx Virtqueue is not ready. In such cases, even if data is extracted from Tap, it is impossible to obtain information about where to store it, preventing further processing.
To address this, a mechanism is required to delay Tap device processing until the Rx Virtqueue is prepared. In the ToyVMM code, this is controlled using a flag called deferred_rx.
When this flag is set, ToyVMM's Rx-related processing follows this strategy:
- When RX_QUEUE_EVENT is triggered, indicating that the guest has made the Rx Virtqueue ready to receive data, data is immediately retrieved from the Tap device and processing continues. If processing is completed at this point, the flag is cleared.
- When RX_TAP_EVENT is triggered, processing first checks the status of the Rx Virtqueue. If processing can proceed, it continues, and if it is completed, the flag is cleared. If processing cannot proceed, or the space available in the Virtqueue is smaller than the received data, the flag remains set and ToyVMM waits for the Rx Virtqueue to become ready again.
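The flag-driven control flow can be summarized with the sketch below. The event names and the process_rx stub are illustrative; only the branching on deferred_rx mirrors the strategy described above.

// Illustrative event identifiers, mirroring the earlier epoll tokens.
enum NetEvent { RxQueue, RxTap }

struct RxState {
    deferred_rx: bool,
    queue_has_room: bool, // stand-in for "the guest has provided Rx buffers"
}

impl RxState {
    // Stand-in for process_rx: returns true when all pending tap data
    // could be written into the Rx Virtqueue.
    fn process_rx(&mut self) -> bool {
        self.queue_has_room
    }

    fn handle(&mut self, event: NetEvent) {
        match event {
            NetEvent::RxQueue => {
                // The guest refilled the Rx Virtqueue: retrieve pending tap
                // data and clear the flag if everything could be processed.
                self.queue_has_room = true;
                if self.deferred_rx && self.process_rx() {
                    self.deferred_rx = false;
                }
            }
            NetEvent::RxTap => {
                if self.deferred_rx {
                    // Still waiting on the guest; only continue if the queue
                    // has become usable, otherwise keep the frame pending.
                    if self.process_rx() {
                        self.deferred_rx = false;
                    }
                } else if !self.process_rx() {
                    // The prepared queue could not hold the received data.
                    self.deferred_rx = true;
                }
            }
        }
    }
}

fn main() {
    let mut state = RxState { deferred_rx: false, queue_has_room: false };
    state.handle(NetEvent::RxTap);   // no buffers yet -> defer
    assert!(state.deferred_rx);
    state.handle(NetEvent::RxQueue); // guest provides buffers -> flush and clear
    assert!(!state.deferred_rx);
    println!("deferred_rx = {}", state.deferred_rx);
}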
When Tap Reception Exceeds the Prepared Rx Virtqueue
Another case to consider is when the data received by the Tap exceeds the capacity of the prepared Virtqueue, as briefly mentioned above. In this case, the strategy is essentially the same: processing is temporarily interrupted, controlled by the deferred_rx flag, until the next Virtqueue buffers are prepared, and it resumes once the Virtqueue is ready.
Verification of virtio-net Operation
Let's test whether communication between the host and guest is possible using the implemented Virtio mechanism and the Network Device. Below is the result of executing the ip addr command inside the guest; eth0 is recognized as a virtual NIC.
localhost:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 02:32:22:01:57:33 brd ff:ff:ff:ff:ff:ff
inet6 fe80::32:22ff:fe01:5733/64 scope link
valid_lft forever preferred_lft forever
Let's also check on the host side. ToyVMM creates a Tap device on the host, so we assign an IP address (192.168.0.10/24) to it.
140: vmtap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether c6:69:6d:65:05:cf brd ff:ff:ff:ff:ff:ff
inet 192.168.0.10/24 brd 192.168.0.255 scope global vmtap0
valid_lft forever preferred_lft forever
Additionally, assign an IP address to the guest side. Here, an address within the same subnet range as the host is assigned.
localhost:~# ip addr add 192.168.0.11/24 dev eth0
Now that everything is set up, let's ping the IP address of the host's Tap interface from within the guest. You should receive responses as follows:
localhost:~# ping -c 5 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: seq=0 ttl=64 time=0.394 ms
64 bytes from 192.168.0.10: seq=1 ttl=64 time=0.335 ms
64 bytes from 192.168.0.10: seq=2 ttl=64 time=0.334 ms
64 bytes from 192.168.0.10: seq=3 ttl=64 time=0.321 ms
64 bytes from 192.168.0.10: seq=4 ttl=64 time=0.330 ms
--- 192.168.0.10 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.321/0.342/0.394 ms
Conversely, if you ping the IP address of the guest's virtio-net interface from the host, you should also receive responses:
[mmichish@mmichish ~]$ ping -c 5 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.410 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.366 ms
64 bytes from 192.168.0.11: icmp_seq=3 ttl=64 time=0.385 ms
64 bytes from 192.168.0.11: icmp_seq=4 ttl=64 time=0.356 ms
64 bytes from 192.168.0.11: icmp_seq=5 ttl=64 time=0.376 ms
--- 192.168.0.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4114ms
rtt min/avg/max/mdev = 0.356/0.378/0.410/0.028 ms
Although this is a simple confirmation using ICMP, it confirms that communication is functioning properly!