Device Passthrough

A critical part of virtualization is virtualizing devices: exposing all aspects of a device, including its I/O, interrupts, DMA, and configuration. There are three typical device virtualization methods: emulation, para-virtualization, and passthrough. The ACRN project uses both emulation and passthrough. Device emulation is discussed in I/O Emulation high-level design; device passthrough is discussed here.

In the ACRN project, device emulation means emulating all existing hardware resources through a software component, the device model, running in the Service OS (SOS). Device emulation must maintain the same software interface as a native device, providing transparency to the VM software stack. Passthrough, implemented in the hypervisor, assigns a physical device to a VM so the VM can access the hardware device directly with minimal (if any) VMM involvement.

The difference between device emulation and passthrough is shown in Figure 81. Device emulation has a longer access path, which results in worse performance than passthrough. Passthrough can deliver near-native performance, but it cannot support device sharing.

../../_images/passthru-image30.png

Figure 81 Difference between Emulation and passthrough

Passthrough in the hypervisor provides the following functionalities to allow a VM to access PCI devices directly:

  • DMA Remapping by VT-d for PCI devices: the hypervisor sets up DMA remapping during the VM initialization phase.
  • MMIO Remapping between virtual and physical BAR
  • Device configuration Emulation
  • Remapping interrupts for PCI device
  • ACPI configuration Virtualization
  • GSI sharing violation check

The following diagram details passthrough initialization control flow in ACRN:

../../_images/passthru-image22.png

Figure 82 Passthrough devices initialization control flow

Passthrough Device Status

Most common devices on supported platforms are enabled for passthrough, as detailed here:

../../_images/passthru-image77.png

Figure 83 Passthrough Device Status

DMA Remapping

To enable passthrough, DMA access must be handled: a VM can only program devices with guest physical addresses (GPA), while physical DMA requires host physical addresses (HPA). One workaround is to build an identity mapping so that GPA equals HPA, but this is not recommended because some VMs don’t support relocation well. To address this issue, Intel introduced VT-d in the chipset, which adds a remapping engine that translates GPA to HPA for DMA operations.

Each VT-d engine (DMAR unit) maintains a remapping structure, similar to a page table, with the device BDF (Bus/Dev/Func) as input and the final page table for GPA/HPA translation as output. The GPA/HPA translation page table is similar to a normal multi-level page table.
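The BDF-indexed lookup can be sketched in C as follows. This is a simplified model, not ACRN code: all `toy_` names are hypothetical, and real VT-d root and context entries carry more fields than shown. The sketch relies only on the VT-d layout in which the bus number indexes a 256-entry root table and the device/function byte indexes a 256-entry context table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one DMAR unit's remapping structure. */
struct toy_context_entry {
    int present;         /* non-zero once a device is assigned */
    uint64_t slpt_root;  /* HPA of the root of the GPA->HPA page table */
};

struct toy_root_entry {
    struct toy_context_entry *ctx_table; /* 256 entries, or NULL if unused */
};

struct toy_dmar_unit {
    struct toy_root_entry root[256];     /* indexed by bus number */
};

/* Pack bus/device/function into a 16-bit BDF: bus[15:8] dev[7:3] func[2:0]. */
uint16_t toy_make_bdf(uint8_t bus, uint8_t dev, uint8_t func)
{
    return (uint16_t)(((uint32_t)bus << 8) |
                      (((uint32_t)dev & 0x1fU) << 3) |
                      ((uint32_t)func & 0x7U));
}

/* Return the translation table root for a BDF, or 0 if none is set up. */
uint64_t toy_lookup_slpt(const struct toy_dmar_unit *unit, uint16_t bdf)
{
    uint8_t bus = (uint8_t)(bdf >> 8);
    uint8_t devfn = (uint8_t)(bdf & 0xffU);
    const struct toy_context_entry *ctx;

    assert(unit != NULL);
    ctx = unit->root[bus].ctx_table;
    if (ctx == NULL || !ctx[devfn].present) {
        return 0;
    }
    return ctx[devfn].slpt_root;
}
```

In this model, assigning a device to a VM amounts to marking its context entry present and pointing `slpt_root` at that VM's translation table.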

VM DMA depends on Intel VT-d to translate GPA to HPA, so the VT-d IOMMU engine must be enabled in ACRN before any device can be passed through. The SOS in ACRN is a VM running in non-root mode, so it also depends on VT-d to access a device. In the SOS DMA remapping engine settings, GPA is equal to HPA.

The ACRN hypervisor checks the DMA-Remapping Hardware unit Definition (DRHD) in the host DMAR ACPI table to get basic information, then sets up each DMAR unit. For simplicity, ACRN reuses the EPT table as the translation table in the DMAR unit for each passthrough device. The control flow is shown in the following figures:

../../_images/passthru-image72.png

Figure 84 DMA Remapping control flow during HV init

../../_images/passthru-image86.png

Figure 85 ptdev assignment control flow

../../_images/passthru-image42.png

Figure 86 ptdev de-assignment control flow

MMIO Remapping

For a PCI MMIO BAR, the hypervisor builds an EPT mapping between the virtual BAR and the physical BAR, so the VM can then access the MMIO region directly.
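The effect of that mapping can be illustrated with a small address calculation. In a real system the translation is performed by hardware through EPT. The sketch below only models the resulting arithmetic, and the `toy_bar_map` type and its fields are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one passthrough MMIO BAR. */
struct toy_bar_map {
    uint64_t vbar_base; /* GPA the guest programmed into the virtual BAR */
    uint64_t pbar_base; /* HPA held by the physical BAR */
    uint64_t size;      /* BAR size in bytes */
};

/* Translate a guest MMIO address to the host address it would be mapped to.
 * Returns false if the address lies outside the BAR window. */
bool toy_bar_gpa_to_hpa(const struct toy_bar_map *map, uint64_t gpa,
                        uint64_t *hpa)
{
    assert(map != NULL && hpa != NULL);

    if (gpa < map->vbar_base || (gpa - map->vbar_base) >= map->size) {
        return false;
    }
    /* Same offset inside the BAR, different base address. */
    *hpa = map->pbar_base + (gpa - map->vbar_base);
    return true;
}
```

If the guest reprograms the virtual BAR, only `vbar_base` changes in this model; the physical BAR stays where the host placed it.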

Device configuration emulation

PCI configuration is based on access to I/O ports 0xCF8/0xCFC. ACRN implements PCI configuration emulation to handle 0xCF8/0xCFC accesses and control PCI devices. Two paths are possible: emulation in the hypervisor or in the SOS device model.

  • When configuration emulation is in the hypervisor, intercepting the 0xCF8/0xCFC ports and emulating PCI configuration space access are tricky and unclean. Therefore the final solution is to reuse the PCI emulation infrastructure of the SOS device model. The hypervisor routes UOS 0xCF8/0xCFC accesses to the device model and remains blind to the physical PCI devices. Upon receiving a UOS PCI configuration space access request, the device model needs to emulate some critical registers, for instance, the BARs, the MSI capability, and INTLINE/INTPIN.
  • For other accesses, the device model reads/writes the physical configuration space on behalf of the UOS. To do this, the device model is linked with libpciaccess to access the physical PCI device.
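As background for the 0xCF8/0xCFC mechanism, the value written to port 0xCF8 encodes which device and register the next access to port 0xCFC targets. The sketch below encodes and decodes that value using the standard PCI configuration mechanism #1 bit layout. The `toy_` type and function names are hypothetical and are not part of the ACRN device model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields carried by a 32-bit write to the CONFIG_ADDRESS port (0xCF8). */
struct toy_cfg_addr {
    bool enable;   /* bit 31: access to 0xCFC is a config cycle */
    uint8_t bus;   /* bits 23:16 */
    uint8_t dev;   /* bits 15:11 */
    uint8_t func;  /* bits 10:8 */
    uint8_t reg;   /* bits 7:2, kept as a dword-aligned byte offset */
};

/* Build the value a guest would write to 0xCF8 to select a register. */
uint32_t toy_cfg_addr_encode(uint8_t bus, uint8_t dev, uint8_t func,
                             uint8_t reg)
{
    return 0x80000000U | ((uint32_t)bus << 16) |
           (((uint32_t)dev & 0x1fU) << 11) |
           (((uint32_t)func & 0x7U) << 8) |
           ((uint32_t)reg & 0xfcU);
}

/* Split a trapped 0xCF8 value back into its fields. */
void toy_cfg_addr_decode(uint32_t value, struct toy_cfg_addr *out)
{
    assert(out != NULL);

    out->enable = (value & 0x80000000U) != 0U;
    out->bus = (uint8_t)((value >> 16) & 0xffU);
    out->dev = (uint8_t)((value >> 11) & 0x1fU);
    out->func = (uint8_t)((value >> 8) & 0x7U);
    out->reg = (uint8_t)(value & 0xfcU);
}
```

An emulator that traps these ports would typically decode the latched 0xCF8 value first, then decide whether the selected register is one it emulates (such as a BAR) or one it forwards to the physical device.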

Interrupt Remapping

When a physical interrupt of a passthrough device occurs, the hypervisor has to distribute it to the relevant VM according to the interrupt remapping relationships. The structure ptirq_remapping_info defines the relationship between a physical interrupt and its VM, the virtual destination, and so on. See the following figure for details:

../../_images/passthru-image91.png

Figure 87 Remapping of physical interrupts

There are two types of interrupt source: IOAPIC and MSI. The hypervisor records different information for each: the physical and virtual IOAPIC pins for an IOAPIC source, and the physical and virtual BDF plus other information for an MSI source.
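The two kinds of record can be pictured as a tagged union. The structure below is a hypothetical stand-in written for illustration. It is not the layout of ACRN's ptirq_remapping_info, and every `toy_` name is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_intr_type { TOY_INTR_INTX, TOY_INTR_MSI };

/* Hypothetical remapping record: which VM owns a physical interrupt source
 * and which virtual source it appears as inside that VM. */
struct toy_remap_entry {
    enum toy_intr_type type;
    int vm_id;
    union {
        struct { uint32_t phys_pin; uint32_t virt_pin; } intx;
        struct { uint16_t phys_bdf; uint16_t virt_bdf; uint32_t entry_nr; } msi;
    } src;
};

/* Find the record for a physical IOAPIC pin, or NULL if none exists. */
const struct toy_remap_entry *toy_find_intx(const struct toy_remap_entry *tbl,
                                            size_t n, uint32_t phys_pin)
{
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < n; i++) {
        if (tbl[i].type == TOY_INTR_INTX &&
            tbl[i].src.intx.phys_pin == phys_pin) {
            return &tbl[i];
        }
    }
    return NULL;
}

/* Find the record for a physical BDF and MSI entry index, or NULL. */
const struct toy_remap_entry *toy_find_msi(const struct toy_remap_entry *tbl,
                                           size_t n, uint16_t phys_bdf,
                                           uint32_t entry_nr)
{
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < n; i++) {
        if (tbl[i].type == TOY_INTR_MSI &&
            tbl[i].src.msi.phys_bdf == phys_bdf &&
            tbl[i].src.msi.entry_nr == entry_nr) {
            return &tbl[i];
        }
    }
    return NULL;
}
```

A linear search keeps the sketch short; a real implementation would more likely use a hashed or indexed lookup on the interrupt path.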

SOS passthrough is also in the scope of interrupt remapping, which is done on demand rather than at hypervisor initialization.

../../_images/passthru-image102.png

Figure 88 Initialization of remapping of virtual IOAPIC interrupts for SOS

Figure 88 above illustrates how (virtual) IOAPIC interrupts are remapped for the SOS. A VM exit occurs whenever the SOS tries to unmask an interrupt in the (virtual) IOAPIC by writing to the Redirection Table Entry (RTE). The hypervisor then invokes the IOAPIC emulation handler (refer to I/O Emulation high-level design for details on I/O emulation), which calls APIs to set up a remapping for the to-be-unmasked interrupt.
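The trigger condition can be sketched as follows. The sketch uses the architectural fact that bit 16 of an IOAPIC RTE is the mask bit and that RTEs are masked after reset. Everything else is assumed for illustration: the `toy_vioapic` type, the 24-pin table size, and the choice to act only on a masked-to-unmasked transition. A counter stands in for the call that would set up the remapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_PINS 24U                /* assumed number of vIOAPIC pins */
#define TOY_RTE_MASK_BIT (1ULL << 16)  /* RTE bit 16: 1 = interrupt masked */

struct toy_vioapic {
    uint64_t rte[TOY_NR_PINS]; /* virtual redirection table */
    uint32_t remap_requests;   /* how many times remapping was requested */
    uint32_t last_pin;         /* pin of the most recent request */
};

/* Put every pin in its reset state: masked, all other fields zero. */
void toy_vioapic_init(struct toy_vioapic *v)
{
    uint32_t i;

    assert(v != NULL);
    for (i = 0; i < TOY_NR_PINS; i++) {
        v->rte[i] = TOY_RTE_MASK_BIT;
    }
    v->remap_requests = 0;
    v->last_pin = 0;
}

/* Emulate a guest write to one RTE. Returns true if the write unmasked a
 * previously masked pin, which is where remapping would be set up. */
bool toy_vioapic_write_rte(struct toy_vioapic *v, uint32_t pin,
                           uint64_t new_rte)
{
    bool was_masked, now_masked;

    assert(v != NULL && pin < TOY_NR_PINS);
    was_masked = (v->rte[pin] & TOY_RTE_MASK_BIT) != 0ULL;
    now_masked = (new_rte & TOY_RTE_MASK_BIT) != 0ULL;
    v->rte[pin] = new_rte;

    if (was_masked && !now_masked) {
        /* A hypervisor would set up the pin remapping at this point. */
        v->remap_requests++;
        v->last_pin = pin;
        return true;
    }
    return false;
}
```

Deferring the work to the unmask write is what makes the setup on demand: pins the guest never unmasks never get a remapping entry.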

Remapping of (virtual) PIC interrupts is set up in a similar sequence.

../../_images/passthru-image98.png

Figure 89 Initialization of remapping of virtual MSI for SOS

Figure 89 illustrates how mappings of MSI or MSI-X are set up for the SOS. The SOS is responsible for issuing a hypercall to notify the hypervisor before it configures the PCI configuration space to enable an MSI. The hypervisor takes this opportunity to set up a remapping for the given MSI or MSI-X before it is actually enabled by the SOS.
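One plausible shape of such a remapping is to take the MSI message the guest programmed and substitute the physical vector and destination. The sketch below assumes the x86 compatibility-format MSI layout (address 0xFEExxxxx with the destination APIC ID in bits 19:12, vector in data bits 7:0) and does not model the VT-d remappable format. The `toy_` names are hypothetical, and this is not a description of how ACRN builds the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An MSI message as programmed into a device's MSI capability. */
struct toy_msi_msg {
    uint32_t addr; /* 0xFEExxxxx, destination APIC ID in bits 19:12 */
    uint32_t data; /* vector in bits 7:0, delivery attributes above */
};

/* Derive the physical message from the guest-programmed one by replacing
 * the destination and vector and preserving all other bits. */
struct toy_msi_msg toy_msi_remap(struct toy_msi_msg virt, uint8_t phys_vector,
                                 uint8_t phys_dest)
{
    struct toy_msi_msg phys = virt;

    /* The sketch only handles addresses in the MSI window. */
    assert((virt.addr & 0xfff00000U) == 0xfee00000U);

    phys.addr = (virt.addr & ~0x000ff000U) | ((uint32_t)phys_dest << 12);
    phys.data = (virt.data & ~0xffU) | (uint32_t)phys_vector;
    return phys;
}
```

The reverse association (physical vector back to virtual vector and target VM) is what a remapping entry has to remember so the interrupt can be injected into the right guest later.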

When the UOS accesses a physical device through passthrough, interrupts are handled in the following steps:

  • The UOS gets a virtual interrupt.
  • A VM exit happens, and the trapped vCPU is the target where the interrupt will be injected.
  • The hypervisor handles the interrupt and translates the vector according to ptirq_remapping_info.
  • The hypervisor delivers the interrupt to the UOS.

When the SOS needs to use the physical device, passthrough is also active because the SOS is the first VM. The detailed steps are:

  • The SOS gets all physical interrupts. It assigns different interrupts to different VMs during initialization and reassigns them when a VM is created or deleted.
  • When a physical interrupt is trapped, an exception will happen after the VMCS has been set.
  • The hypervisor handles the VM exit according to ptirq_remapping_info and translates the vector.
  • The interrupt is injected in the same way as a virtual interrupt.

ACPI Virtualization

ACPI virtualization is designed in ACRN with these assumptions:

  • HV has no knowledge of ACPI,
  • SOS owns all physical ACPI resources,
  • UOS sees virtual ACPI resources emulated by device model.

Some passthrough devices require a physical ACPI table entry for initialization. The device model creates such a device entry based on the physical one, according to vendor ID and device ID. This virtualization is implemented in the SOS device model and is not in the scope of the hypervisor.

GSI Sharing Violation Check

All PCI devices that share the same GSI should be assigned to the same VM to avoid physical GSI sharing between multiple VMs. For devices that don’t support MSI, the ACRN DM places devices that share the same GSI pin in a GSI sharing group. The devices in a group should all be assigned to the current VM together; otherwise, none of them should be assigned to the current VM. A device that violates this rule is rejected for passthrough. The checking logic is implemented in the Device Model and is not in the scope of the hypervisor.
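The all-or-nothing rule can be expressed as a small check. This is a minimal sketch under assumed data structures, not the Device Model's implementation: the `toy_pci_dev` type, its fields, and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one non-MSI PCI device known to the checker. */
struct toy_pci_dev {
    uint16_t bdf;
    uint32_t gsi;          /* physical GSI the device's INTx is routed to */
    bool assign_requested; /* true if requested for the current VM */
};

/* Return true if every GSI sharing group is either assigned to the current
 * VM as a whole or not assigned at all. */
bool toy_gsi_sharing_ok(const struct toy_pci_dev *devs, size_t n)
{
    size_t i, j;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            /* Two devices on one GSI must agree on assignment. */
            if (devs[i].gsi == devs[j].gsi &&
                devs[i].assign_requested != devs[j].assign_requested) {
                return false;
            }
        }
    }
    return true;
}
```

The pairwise comparison is quadratic in the number of devices, which is acceptable for the handful of PCI devices on a typical platform and keeps the sketch free of auxiliary tables.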

Data structures and interfaces

The following APIs are provided to initialize interrupt remapping for SOS:

int32_t ptirq_intx_pin_remap(struct acrn_vm *vm, uint32_t virt_pin, uint32_t vpin_src)

INTx remapping for passthrough device.

Set up the remapping of the given virtual pin for the given vm. This is the main entry for PCI/Legacy device assignment with INTx, called from vIOAPIC or vPIC.

Return
  • 0: on success
  • -ENODEV:
    • for SOS, the entry is already held by others
    • for UOS, no pre-hold mapping found.
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_pin: virtual pin number associated with the passthrough device
  • vpin_src: ioapic or pic

int32_t ptirq_msix_remap(struct acrn_vm *vm, uint16_t virt_bdf, uint16_t entry_nr, struct ptirq_msi_info *info)

MSI/MSI-x remapping for passthrough device.

Main entry for PCI device assignment with MSI and MSI-X. MSI can have up to 8 vectors and MSI-X can have up to 1024 vectors.

Return
  • 0: on success
  • -ENODEV:
    • for SOS, the entry is already held by others
    • for UOS, no pre-hold mapping found.
Pre
vm != NULL
Pre
info != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_bdf: virtual bdf associated with the passthrough device
  • entry_nr: index of the vector being remapped; entry_nr = 0 means the first vector
  • info: structure used for MSI/MSI-x remapping

The following APIs are provided to manipulate the interrupt remapping for UOS.

int32_t ptirq_add_intx_remapping(struct acrn_vm *vm, uint32_t virt_pin, uint32_t phys_pin, bool pic_pin)

Add an interrupt remapping entry for INTx as pre-hold mapping.

For every VM except sos_vm, the Device Model should call this function to pre-hold the ptdev INTx entry. The entry is identified by phys_pin, one entry per phys_pin. Currently, one phys_pin can only be held by one pin source (vPIC or vIOAPIC).

Return
  • 0: on success
  • -EINVAL: invalid virt_pin value
  • -ENODEV: failed to add the remapping entry
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_pin: virtual pin number associated with the passthrough device
  • phys_pin: physical pin number associated with the passthrough device
  • pic_pin: true for pic, false for ioapic

void ptirq_remove_intx_remapping(struct acrn_vm *vm, uint32_t virt_pin, bool pic_pin)

Remove an interrupt remapping entry for INTx.

Deactivate & remove mapping entry of the given virt_pin for given vm.

Return
None
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_pin: virtual pin number associated with the passthrough device
  • pic_pin: true for pic, false for ioapic
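The pre-hold semantics of the INTx add/remove pair (one entry per phys_pin, an error if the pin is already held) can be modeled with a toy table. The code below is not the ACRN implementation. The `toy_` functions only mirror the documented signatures and return codes, and the table size and the `toy_vm` type are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PINS 48U /* assumed pin limit, for illustration only */

struct toy_vm { int id; };

struct toy_intx_entry {
    bool used;
    const struct toy_vm *vm;
    uint32_t virt_pin;
    bool pic_pin;
};

/* One slot per physical pin: a pin can be held by at most one entry. */
static struct toy_intx_entry toy_entries[TOY_MAX_PINS];

/* Pre-hold phys_pin for vm. Mirrors the documented return codes. */
int32_t toy_add_intx_remapping(const struct toy_vm *vm, uint32_t virt_pin,
                               uint32_t phys_pin, bool pic_pin)
{
    assert(vm != NULL);
    if (virt_pin >= TOY_MAX_PINS) {
        return -EINVAL; /* invalid virt_pin value */
    }
    if (phys_pin >= TOY_MAX_PINS || toy_entries[phys_pin].used) {
        return -ENODEV; /* cannot add: bad pin or already held */
    }
    toy_entries[phys_pin].used = true;
    toy_entries[phys_pin].vm = vm;
    toy_entries[phys_pin].virt_pin = virt_pin;
    toy_entries[phys_pin].pic_pin = pic_pin;
    return 0;
}

/* Release the entry that maps virt_pin of the given source for vm. */
void toy_remove_intx_remapping(const struct toy_vm *vm, uint32_t virt_pin,
                               bool pic_pin)
{
    uint32_t i;

    assert(vm != NULL);
    for (i = 0; i < TOY_MAX_PINS; i++) {
        if (toy_entries[i].used && toy_entries[i].vm == vm &&
            toy_entries[i].virt_pin == virt_pin &&
            toy_entries[i].pic_pin == pic_pin) {
            toy_entries[i].used = false;
        }
    }
}
```

In this model a second VM can take over a physical pin only after the first holder has removed its entry, which matches the one-entry-per-phys_pin rule described above.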

int32_t ptirq_add_msix_remapping(struct acrn_vm *vm, uint16_t virt_bdf, uint16_t phys_bdf, uint32_t vector_count)

Add interrupt remapping entry/entries for MSI/MSI-x as pre-hold mapping.

Add a pre-hold mapping of the given number of vectors between the given physical and virtual BDF for the given vm. For every VM except sos_vm, the Device Model should call this function to pre-hold ptdev MSI/MSI-x. The entry is identified by phys_bdf:msi_idx, one entry per phys_bdf:msi_idx.

Return
  • 0: on success
  • -ENODEV: failed to add the remapping entry
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_bdf: virtual bdf associated with the passthrough device
  • phys_bdf: physical bdf associated with the passthrough device
  • vector_count: number of vectors

void ptirq_remove_msix_remapping(const struct acrn_vm *vm, uint16_t virt_bdf, uint32_t vector_count)

Remove interrupt remapping entry/entries for MSI/MSI-x.

Remove the mapping of given number of vectors of the given virtual BDF for the given vm.

Return
None
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_bdf: virtual bdf associated with the passthrough device
  • vector_count: number of vectors

The following APIs are provided to acknowledge a virtual interrupt.

void ptirq_intx_ack(struct acrn_vm *vm, uint32_t virt_pin, uint32_t vpin_src)

Acknowledge a virtual interrupt for passthrough device.

Acknowledge a virtual legacy interrupt for a passthrough device.

Return
None
Pre
vm != NULL
Parameters
  • vm: pointer to acrn_vm
  • virt_pin: virtual pin number associated with the passthrough device
  • vpin_src: ioapic or pic