VFIO Platform Passthrough: how to integrate a new device

This page explains how to integrate a new platform device for KVM passthrough. Enabling passthrough for a new platform device requires small adaptations in both QEMU and the VFIO driver.

The first integrated platform device is the Calxeda Midway xgmac. Use it as example code.

QEMU integration

The adaptations relate to the dynamic instantiation of the device in the QEMU machine. The VFIO platform device is meant to be instantiated dynamically with the "-device" option. The rationale is that the VFIO device should not always be instantiated in the machine file, unlike the common minimal set of devices (especially on ARM virt, which is meant to be a minimalist machine): the HW may not be present, or may not be a candidate for passthrough.

For dynamic instantiation to be possible, your machine must instantiate the platform bus (TYPE_PLATFORM_BUS_DEVICE). The VFIO platform device is then automatically plugged onto this platform bus. The end user does not have to care about the MMIO/IRQ ranges the VFIO platform device is mapped onto. The platform bus does the binding automatically, within the ranges granted to it by the machine. Binding happens in a machine init done notifier.
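The automatic binding can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not QEMU's platform bus code: PlatformBusModel, bus_map_mmio and bus_connect_irq are hypothetical names, and the "next free slot" policy with 4 KiB alignment is a simplification chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the MMIO/IRQ window a machine grants to its
 * platform bus. Not a QEMU type. */
typedef struct {
    uint64_t mmio_base;  /* first guest-physical address of the window */
    uint64_t mmio_size;  /* size of the window in bytes */
    uint64_t mmio_next;  /* offset of the next free byte in the window */
    int first_irq;       /* first IRQ number granted to the bus */
    int num_irqs;        /* number of IRQs granted to the bus */
    int next_irq;        /* index of the next free IRQ */
} PlatformBusModel;

/* Reserve a region of at least `size` bytes, rounded up to 4 KiB.
 * Returns false when the request is empty or the window is exhausted. */
static bool bus_map_mmio(PlatformBusModel *bus, uint64_t size, uint64_t *addr)
{
    const uint64_t page = 0x1000;
    uint64_t rounded;

    assert(bus != NULL && addr != NULL);
    rounded = (size + page - 1) & ~(page - 1);
    if (rounded == 0 || rounded > bus->mmio_size - bus->mmio_next) {
        return false;
    }
    *addr = bus->mmio_base + bus->mmio_next;
    bus->mmio_next += rounded;
    return true;
}

/* Hand out the next free IRQ of the granted range, if any is left. */
static bool bus_connect_irq(PlatformBusModel *bus, int *irq)
{
    assert(bus != NULL && irq != NULL);
    if (bus->next_irq >= bus->num_irqs) {
        return false;
    }
    *irq = bus->first_irq + bus->next_irq;
    bus->next_irq++;
    return true;
}
```

The point of the model is that the device never chooses its own addresses or IRQ numbers: it only asks the bus, and the bus answers from the window the machine granted.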

Along with the dynamic instantiation in the machine file, a device tree node must be created and presented to the guest. This dt node creation must happen after the above binding, because it needs to know which MMIO/IRQ ranges are used. It also happens in a machine init done notifier, executed after the "binding" one. As a result, the dynamic sysbus device nodes are added to the fdt at machine init done, whereas standard dt nodes are added at machine creation.

Creation of the derived device

This takes place in hw/vfio. The base VFIO platform device is abstract; it cannot be instantiated directly. Instead a derived device must be created. Look at calxeda-xgmac.c as a starting point. The name of this device will be used for its dynamic instantiation. Not much is done here besides setting the compat string this device matches.
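The shape of such a derived device can be sketched with a simplified model. This is not the QEMU object model: VfioPlatformModel, base_realize and the rule that the base step rejects a missing compat string are assumptions made for illustration. The type name and compat string shown are the ones the Calxeda device is believed to use; check calxeda-xgmac.c for the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the abstract base VFIO platform device.
 * Hypothetical type, not a QEMU structure. */
typedef struct VfioPlatformModel {
    const char *type_name;  /* name passed to "-device" */
    const char *compat;     /* compat string this device matches */
} VfioPlatformModel;

/* Stand-in for the base realize step. In this model it refuses a
 * device whose compat string was never set by a derived device. */
static int base_realize(VfioPlatformModel *dev)
{
    assert(dev != NULL);
    return dev->compat != NULL ? 0 : -1;
}

/* Derived realize: set the compat string, then chain to the base. */
static int calxeda_xgmac_realize(VfioPlatformModel *dev)
{
    assert(dev != NULL);
    dev->compat = "calxeda,hb-xgmac";
    return base_realize(dev);
}
```

The derived device adds almost nothing of its own: it names the device and supplies the compat string, and the base device does the rest.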

Device Tree Node Creation Specialization

Your machine must instantiate the platform bus to allow dynamic instantiation (see create_platform_bus in hw/arm/virt.c).

On ARM, platform bus device tree node creation specialization takes place in hw/arm/sysbus-fdt.c.

In the "Device Specific Section" of the file, add a function that builds the device tree node exposed to the guest. It must match the static int (*add_device_tree_node)(SysBusDevice *sbdev, void *opaque) signature. Then register the new function in the add_fdt_node_functions whitelist. With that addition, launching QEMU with -device <name> will eventually call that function.
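The whitelist mechanism can be sketched as a table that maps a device type name to its node creation function. The model below is self-contained and does not use QEMU headers: SysBusDeviceModel, NodeCreationEntry, add_fdt_node and add_calxeda_xgmac_node are hypothetical names, and the opaque pointer is a plain counter here rather than a PlatformBusFDTData handle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a sysbus device: only its type name. */
typedef struct SysBusDeviceModel {
    const char *type_name;
} SysBusDeviceModel;

/* Same shape as the signature described above. */
typedef int (*AddFdtNodeFn)(SysBusDeviceModel *sbdev, void *opaque);

typedef struct {
    const char *type_name;  /* device name given to "-device" */
    AddFdtNodeFn add_fn;    /* builds the guest device tree node */
} NodeCreationEntry;

/* Hypothetical node builder. A real one would write the reg,
 * interrupts and compatible properties into the fdt reached through
 * opaque; this model only counts the nodes it was asked to create. */
static int add_calxeda_xgmac_node(SysBusDeviceModel *sbdev, void *opaque)
{
    int *nodes_created = opaque;

    (void)sbdev;
    (*nodes_created)++;
    return 0;
}

/* Model of the whitelist, terminated by an empty entry. */
static const NodeCreationEntry add_fdt_node_functions[] = {
    { "vfio-calxeda-xgmac", add_calxeda_xgmac_node },
    { NULL, NULL },
};

/* Dispatch on the type name; fail for devices not in the whitelist. */
static int add_fdt_node(SysBusDeviceModel *sbdev, void *opaque)
{
    const NodeCreationEntry *e;

    assert(sbdev != NULL && sbdev->type_name != NULL);
    for (e = add_fdt_node_functions; e->type_name != NULL; e++) {
        if (strcmp(sbdev->type_name, e->type_name) == 0) {
            return e->add_fn(sbdev, opaque);
        }
    }
    return -1;
}
```

In this model a device that is not registered gets no node at all, which is why a new device needs its own table entry before it can be exposed to the guest.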

The opaque pointer corresponds to a PlatformBusFDTData handle. This struct may need to be extended to build more complex device tree nodes. Another solution (currently untested) is to pass the complete device tree (featuring the complex VFIO node) on the QEMU command line.

This adaptation isn't really specific to VFIO but rather linked to the dynamic instantiation of the QEMU platform device.

Kernel integration

Without any VFIO platform kernel adaptation, the assigned HW device is left running when the userspace driver stops abnormally and the VFIO platform device is released. As a consequence, the HW device might continue issuing interrupts and performing DMA accesses. This obviously depends on the device: some devices stop DMA/IRQs once commands are no longer sent, others continue.

The fact that the HW keeps running after it is no longer passed through is not an issue in itself: no physical IRQ handler is set up anymore, and the DMA buffers are unmapped, so DMA accesses lead to IOMMU aborts. It becomes an issue when that HW device is assigned again to another userspace driver: the new driver might face unexpected IRQs and DMA accesses left over from the previous assignment.

In the virtualization use case, a VM newly granted that HW device may be impacted by the device's earlier assignment to a previous VM:

  • IRQs may be injected very early when booting the new guest, even before the guest driver has initialized, leading to possible driver state inconsistency.
  • DMA accesses may hit the newly mapped VM address space at addresses that may jeopardize the integrity of the newly installed VM.

If you face such a situation, you need to add a VFIO platform reset module in /drivers/vfio/platform/reset. Use vfio_platform_calxedaxgmac.c as an example when implementing your own VFIO reset module. The reset module's only purpose is to stop interrupts and DMA transfers. That code can usually be borrowed from the native driver.
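What "stopping interrupts and DMA transfers" means can be sketched on a made-up register file. Everything device-specific here is hypothetical: the register indices, the bit names and example_device_reset do not describe the xgmac or any real device. A real reset module would map the device registers and use the kernel's MMIO accessors, with offsets taken from the native driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout, as indices into a 32-bit register
 * array. Real offsets and bit meanings come from the native driver. */
enum {
    REG_INT_ENABLE = 0,   /* one bit per interrupt source */
    REG_DMA_CONTROL = 1,  /* DMA engine control bits */
    REG_COUNT = 2
};

#define DMA_CTRL_TX_START (1u << 0)  /* hypothetical: TX DMA running */
#define DMA_CTRL_RX_START (1u << 1)  /* hypothetical: RX DMA running */

/* Quiesce the device: mask every interrupt source, then stop both
 * DMA directions while leaving unrelated control bits untouched. */
static int example_device_reset(volatile uint32_t *regs)
{
    assert(regs != NULL);
    regs[REG_INT_ENABLE] = 0;
    regs[REG_DMA_CONTROL] &= ~(DMA_CTRL_TX_START | DMA_CTRL_RX_START);
    return 0;
}
```

The function deliberately does nothing beyond quiescing the device. Full re-initialization remains the job of the next userspace or guest driver.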

Then register the reset module, its exported reset method and the compat string it matches in the VFIO platform driver's reset_lookup_table whitelist (drivers/vfio/platform/vfio_platform_common.c).
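The registration amounts to one table entry that ties the three pieces together. The sketch below models that idea in plain C rather than kernel code: ResetCombo and lookup_reset are hypothetical names, and the strings in the entry are illustrative values that should be checked against your kernel tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of one whitelist entry: which compat string is handled by
 * which exported reset function, provided by which module. */
typedef struct {
    const char *compat;
    const char *reset_function_name;
    const char *module_name;
} ResetCombo;

/* Illustrative table content; verify the names in your kernel tree. */
static const ResetCombo reset_lookup_table[] = {
    {
        "calxeda,hb-xgmac",
        "vfio_platform_calxedaxgmac_reset",
        "vfio-platform-calxedaxgmac",
    },
};

/* Return the entry matching a device's compat string, or NULL when
 * no reset module is registered for that device. */
static const ResetCombo *lookup_reset(const char *compat)
{
    size_t i;
    size_t n = sizeof(reset_lookup_table) / sizeof(reset_lookup_table[0]);

    assert(compat != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(compat, reset_lookup_table[i].compat) == 0) {
            return &reset_lookup_table[i];
        }
    }
    return NULL;
}
```

A device whose compat string has no entry simply gets no reset in this model, which is the situation described in the paragraphs above.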

LEG/Engineering/Virtualization/VFIOPlatformHowToIntegrateNewDevice (last modified 2015-06-26 08:47:50)