Setting up a virtual workstation in OpenShift with VFIO passthrough

Feb 27, 2023

25 min read

This article provides a detailed guide on how to configure OpenShift as a workstation with GPU PCI passthrough and Container Native Virtualization (CNV) on a Single Node OpenShift (SNO) deployment.

This setup allows you to leverage Kubernetes orchestration capabilities while still enjoying near-native performance for GPU-intensive applications.

Why this approach?

  • Run both containerized workloads and virtual machines on the same hardware

  • Use a single GPU for both Kubernetes pods and virtual machines by switching the driver binding

  • Achieve near-native performance for gaming and professional applications in VMs

  • Maintain the flexibility and power of Kubernetes/OpenShift for other workloads

In testing, this configuration successfully ran Microsoft Flight Simulator in a Windows VM with performance similar to a bare metal Windows installation.

The workstation used for this demo has the following hardware:

Component   Specification
CPU         AMD Ryzen 9 3950X 16-Core 32-Threads
Memory      64GB DDR4 3200MHz
GPU         Nvidia RTX 3080 FE 10GB
Storage     2x 2TB NVMe Disks (for virtual machine storage)
            1x 500GB SSD Disk (for OpenShift root system)
Network     10Gbase-CX4 Mellanox Ethernet

Similar configurations with equivalent Intel CPUs should work with minor adjustments noted throughout the guide.

Installing OpenShift SNO

Before proceeding with the installation, ensure you’ve completed the backup steps for any existing partitions.

Backup of existing system partitions

To avoid boot order conflicts, the OpenShift assisted installer overwrites the first 512 bytes of any disk that contains a bootable partition. It is therefore important to back up, and then remove, any existing partition table that you would like to preserve.
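A minimal sketch of that backup, assuming a GPT-partitioned data disk at /dev/nvme0n1 (adjust device names to your system): sgdisk can save the partition table to a file and restore it once the installation is done.

# Save the partition table of a data disk before installing
sgdisk --backup=nvme0n1-gpt.backup /dev/nvme0n1

# Remove the partition table so the installer no longer sees a bootable partition
sgdisk --zap-all /dev/nvme0n1

# Restore it later, once the installation is complete:
# sgdisk --load-backup=nvme0n1-gpt.backup /dev/nvme0n1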

OpenShift Installation

Once any existing file system is backed up and there are no more bootable partitions, we can proceed with the OpenShift Single Node installation.

It is important to note that CoreOS, the underlying operating system, requires an entire disk for installation. For this workstation setup:

  1. We’ll use the 500GB SSD disk for the OpenShift operating system

  2. The two 2TB NVMe disks will be reserved for persistent volumes as LVM Physical volumes belonging to the same Volume Group

  3. This configuration allows for flexible VM storage management while keeping the system installation separate

#!/bin/bash

OCP_VERSION=latest-4.10

curl -k https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$OCP_VERSION/openshift-client-linux.tar.gz > oc.tar.gz
tar zxf oc.tar.gz
chmod +x oc && mv oc ~/.local/bin/

curl -k https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$OCP_VERSION/openshift-install-linux.tar.gz > openshift-install-linux.tar.gz
tar zxvf openshift-install-linux.tar.gz
chmod +x openshift-install && mv openshift-install ~/.local/bin/

curl $(openshift-install coreos print-stream-json | grep location | grep x86_64 | grep iso | cut -d\" -f4) > rhcos-live.x86_64.iso
install-config.yaml
# This file contains the configuration for an OpenShift cluster installation.

apiVersion: v1

# The base domain for the cluster.
baseDomain: epheo.eu

# Configuration for the compute nodes.
compute:
- name: worker
  replicas: 0

# Configuration for the control plane nodes.
controlPlane:
  name: master
  replicas: 1

# Metadata for the cluster.
metadata:
  name: da2

# Networking configuration for the cluster.
networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16

# Platform configuration for the cluster.
platform:
  none: {}

# Configuration for bootstrapping the cluster.
bootstrapInPlace:
  installationDisk: /dev/sda

# Pull secret for accessing the OpenShift registry.
pullSecret: '{"auths":{"cloud.openshift.com":{"auth":"XXXXXXXX"}}}'

# SSH key for accessing the cluster nodes.
sshKey: |
  ssh-rsa AAAAB3XXXXXXXXXXXXXXXXXXXXXXXXX
Generate OpenShift Container Platform assets
mkdir ocp && cp install-config.yaml ocp
openshift-install --dir=ocp create single-node-ignition-config
Embed the ignition data into the RHCOS ISO:
alias coreos-installer='podman run --privileged --rm \
      -v /dev:/dev -v /run/udev:/run/udev -v $PWD:/data \
      -w /data quay.io/coreos/coreos-installer:release'
cp ocp/bootstrap-in-place-for-live-iso.ign iso.ign
coreos-installer iso ignition embed -fi iso.ign rhcos-live.x86_64.iso
dd if=rhcos-live.x86_64.iso of=/dev/usbkey status=progress

Once the ISO is copied to the USB drive, you can use the USB drive to boot your workstation node and install OpenShift Container Platform.
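Once the node boots from the USB drive, the installation runs unattended. A simple way to follow its progress from your desktop, assuming the assets were generated in the ocp directory as above:

openshift-install --dir=ocp wait-for install-complete

# When the installation completes, use the generated kubeconfig to reach the cluster
export KUBECONFIG=ocp/auth/kubeconfig
oc get nodes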

Install CNV Operator

Activate Intel VT or AMD-V hardware virtualization extensions in BIOS or UEFI.

cnv-resources.yaml
# This YAML file contains Kubernetes resources for installing the KubeVirt Hyperconverged Operator (HCO) on the OpenShift Container Platform.
# It creates a namespace named "openshift-cnv", an operator group named "kubevirt-hyperconverged-group" in the "openshift-cnv" namespace, and a subscription named "hco-operatorhub" in the "openshift-cnv" namespace.
# The subscription specifies the source, source namespace, name, starting CSV, and channel for the KubeVirt Hyperconverged Operator.

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cnv
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: kubevirt-hyperconverged-group
  namespace: openshift-cnv
spec:
  targetNamespaces:
    - openshift-cnv
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-operatorhub
  namespace: openshift-cnv
spec:
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  name: kubevirt-hyperconverged
  startingCSV: kubevirt-hyperconverged-operator.v4.10.0
  channel: "stable"
oc apply -f cnv-resources.yaml
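To confirm the operator deployment succeeded before moving on (a quick check, not part of the original manifests), wait for the CSV to reach the Succeeded phase:

oc get csv -n openshift-cnv
# kubevirt-hyperconverged-operator.v4.10.0 should eventually report PHASE: Succeeded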
Installing the virtctl client on your desktop:
subscription-manager repos --enable cnv-4.10-for-rhel-8-x86_64-rpms
dnf install kubevirt-virtctl
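A quick sanity check that the client is installed and can talk to the cluster (assuming you are already logged in with oc):

virtctl version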

Configure OpenShift for single GPU passthrough

As our GPU is the only one attached to the node, a few additional steps are required.

We will use MachineConfig to configure our node accordingly.

All MachineConfigs are applied with the master role because this is a Single Node OpenShift deployment. On a multi-node cluster, they would be applied to the workers instead.

Passing kernel arguments at boot time

Multiple kernel arguments have to be passed at boot time in order to configure our node for GPU passthrough. This can be done using the Machine Config Operator.

  • amd_iommu=on: Enables the IOMMU (Input/Output Memory Management Unit) on AMD platforms, providing the DMA remapping and device isolation required to assign PCI devices directly to virtual machines.

  • vga=off: Disables VGA (Video Graphics Array) console output during boot time.

  • rdblacklist=nouveau: Blacklists the Nouveau open-source NVIDIA driver so it does not claim the GPU.

  • video=efifb:off: Disables the EFI (Extensible Firmware Interface) framebuffer console output during boot time.

Setting Kernel Arguments at boot time.
variant: openshift
version: 4.10.0
metadata:
  name: 100-vfio
  labels:
    machineconfiguration.openshift.io/role: master
openshift:
  kernel_arguments:
    - amd_iommu=on
    - vga=off
    - rdblacklist=nouveau
    - 'video=efifb:off'
cd articles/openshift-workstation/machineconfig/build
butane -d . vfio-prepare.bu -o ../vfio-prepare.yaml
oc apply -f ../vfio-prepare.yaml

Note

If you’re using an Intel CPU you’ll have to set intel_iommu=on instead.
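After the node reboots with these kernel arguments, you can confirm the IOMMU is active by checking that IOMMU groups were populated (an illustrative check using oc debug; da2 is this article's node name):

oc debug node/da2 -- chroot /host bash -c 'ls /sys/kernel/iommu_groups | wc -l'
# A non-zero count means the IOMMU is enabled and grouping devices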

Installing and configuring the NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of NVIDIA GPUs in Kubernetes environments.

Step 1: Install the GPU Operator

  1. Navigate to the OpenShift web console

  2. Go to OperatorsOperatorHub

  3. Search for “NVIDIA GPU Operator”

  4. Select the operator and click Install

  5. Keep the default installation settings and click Install again

Alternatively, you can install it through the CLI using the following commands:

oc create -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/git/operator-namespace.yaml
oc create -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/git/operator-source.yaml

Step 2: Configure the ClusterPolicy

When deploying the operator’s ClusterPolicy, we need to set sandboxWorkloads.enabled to true to enable the sandbox-device-plugin and vfio-manager components, which are essential for GPU passthrough.

sandboxWorkloadsEnabled.yaml
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  sandboxWorkloads:
    defaultWorkload: container
    enabled: true
oc patch ClusterPolicy gpu-cluster-policy --type=merge --patch-file sandboxWorkloadsEnabled.yaml
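Once the ClusterPolicy is updated, the operator rolls out the additional components; a quick way to check they are running (pod names may vary slightly between operator versions):

oc get pods -n nvidia-gpu-operator | grep -E 'sandbox|vfio'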

Because the NVIDIA GPU Operator does not officially support consumer-grade GPUs, it does not take the GPU's audio function into consideration and therefore does not bind it to the vfio-pci driver. This has to be done manually, which can be achieved at boot time using the following MachineConfig.

vfio-prepare.bu
variant: openshift
version: 4.10.0
metadata:
  name: 100-vfio
  labels:
    machineconfiguration.openshift.io/role: master
storage:
  files:
  - path: /usr/local/bin/vfio-prepare
    mode: 0755
    overwrite: true
    contents:
      local: ./vfio-prepare.sh
  - path: /etc/modules-load.d/vfio-pci.conf
    mode: 0644
    overwrite: true
    contents:
      inline: vfio-pci
systemd:
  units:
    - name: vfioprepare.service
      enabled: true
      contents: |
       [Unit]
       Description=Prepare vfio devices
       After=ignition-firstboot-complete.service
       Before=kubelet.service crio.service

       [Service]
       Type=oneshot
       ExecStart=/usr/local/bin/vfio-prepare

       [Install]
       WantedBy=kubelet.service
vfio-prepare.sh
#!/bin/bash

# Unbind the device at $address from its current driver and bind it to vfio-pci
vfio_attach () {
  if [ -f "${path}/driver/unbind" ]; then
    echo $address > ${path}/driver/unbind
  fi
  echo vfio-pci > ${path}/driver_override
  echo $address > /sys/bus/pci/drivers/vfio-pci/bind || \
  echo $name > /sys/bus/pci/drivers/vfio-pci/new_id ||true
}

# 0a:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
address=0000:0a:00.1
path=/sys/bus/pci/devices/0000\:0a\:00.1
name="10de 1467"
vfio_attach
cd articles/openshift-workstation/machineconfig/build
butane -d . vfio-prepare.bu -o ../vfio-prepare.yaml
oc apply -f ../vfio-prepare.yaml
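After the node has rebooted with this MachineConfig, you can verify that the GPU's audio function is now owned by vfio-pci (the PCI address matches this article's hardware; adjust it to yours):

lspci -nnk -s 0a:00.1
# Look for "Kernel driver in use: vfio-pci" in the output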

Dynamically Switching GPU Drivers

One of the key advantages of this setup is the ability to use a single GPU for both container workloads and virtual machines without rebooting the system.

Use Case Scenario

  • Our workstation has a single NVIDIA GPU

  • Container workloads (such as AI/ML applications) require the NVIDIA kernel driver

  • Virtual machines with GPU passthrough require the VFIO-PCI driver

  • We need to switch between these modes without system reboots

Driver Switching Using Node Labels

The NVIDIA GPU Operator with sandbox workloads enabled provides a convenient way to switch driver bindings using node labels:

For container workloads (NVIDIA driver):

# Replace 'da2' with your node name
oc label node da2 --overwrite nvidia.com/gpu.workload.config=container

For VM passthrough (VFIO-PCI driver):

# Replace 'da2' with your node name
oc label node da2 --overwrite nvidia.com/gpu.workload.config=vm-passthrough

Notes on Driver Switching

  • The driver switching process takes a few minutes to complete

  • You can verify the current driver binding with lspci -nnk | grep -A3 NVIDIA

  • All GPU workloads must be stopped before switching drivers

  • No system reboot is usually required for the switch to take effect

  • This has proved to be a bit unreliable, however, and may occasionally require a reboot
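For example, to check which mode the node is currently in after relabeling (an illustrative check; da2 is this article's node name and 0a:00.0 its GPU address):

# Which driver currently owns the GPU
lspci -nnk -s 0a:00.0

# Which GPU-related extended resources the node advertises
oc get node da2 -o json | jq '.status.allocatable' | grep -i nvidia
# Typically nvidia.com/gpu in container mode, and the passthrough resource
# (e.g. nvidia.com/GA102_GEFORCE_RTX_3080) in vm-passthrough mode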

Add the GPU as a Hardware Device of your node

We identify the vendor and product ID of the GPU:

lspci -nnk |grep VGA

We then identify the device name provided by gpu-feature-discovery:

oc get nodes da2 -ojson |jq .status.capacity |grep nvidia
hyperconverged.yaml
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
    - externalResourceProvider: true
      pciDeviceSelector: 10DE:2206
      resourceName: nvidia.com/GA102_GEFORCE_RTX_3080
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge --patch-file hyperconverged.yaml

The pciDeviceSelector field specifies the vendor ID and device ID of the PCI device, while the resourceName field specifies the name of the resource that will be created in Kubernetes/OpenShift.

Passthrough USB Host Controllers to the VM

For a complete desktop experience, you’ll want to connect input devices (mouse, keyboard) and audio devices directly to your virtual machine. Instead of passing through individual USB devices, we’ll pass through an entire USB controller to the VM for better performance and flexibility.

Step 1: Identify a Suitable USB Controller

First, we need to identify an appropriate USB controller that we can dedicate to the virtual machine:

  1. List all PCI devices on your system:

    lspci -nnk | grep -i usb
    

    Example output: ` 0b:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c] `

  2. Note the PCI address (e.g., 0b:00.3) and the device ID (1022:149c in the example).

  3. Verify the IOMMU group of the controller to ensure it can be safely passed through:

    find /sys/kernel/iommu_groups/ -iname "*0b:00.3*"
    # Shows which IOMMU group contains this device
    
    ls /sys/kernel/iommu_groups/27/devices/
    # Lists all devices in the same IOMMU group
    
  4. Important: For clean passthrough, the USB controller should ideally be alone in its IOMMU group. If other devices are in the same group, you’ll need to pass those through as well.
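If you prefer an overview of every IOMMU group at once, a small helper loop over sysfs can be handy (a sketch; run it on the node directly or through oc debug node/<name>):

#!/bin/bash
# Print each IOMMU group and the PCI devices it contains
for group in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group $(basename "$group"):"
  for device in "$group"/devices/*; do
    lspci -nns "$(basename "$device")"
  done
done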

Add the USB Controller as a Hardware Device of your node

Once identified, we add its vendor and product IDs to the list of permitted host devices.

Currently, KubeVirt does not allow selecting a specific PCI address, so the pciDeviceSelector will match every identical USB host controller on the node. However, as we will only bind the one we are interested in to the VFIO-PCI driver, the others will not be available for PCI passthrough.

hyperconverged.yaml
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciDeviceSelector: 1022:149C
        resourceName: devices.kubevirt.io/USB3_Controller
      - pciDeviceSelector: 8086:2723
        resourceName: intel.com/WIFI_Controller
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge --patch-file hyperconverged.yaml
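The new resource names should then appear among the node's allocatable resources (an illustrative check; the reported count stays at zero until the controller is actually bound to vfio-pci in the next step):

oc get node da2 -o json | jq '.status.allocatable' | grep -E 'USB3_Controller|WIFI_Controller'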

Binding the USB Controller to the VFIO-PCI driver at boot time

vfio-prepare.bu
variant: openshift
version: 4.10.0
metadata:
  name: 100-vfio
  labels:
    machineconfiguration.openshift.io/role: master
storage:
  files:
  - path: /usr/local/bin/vfio-prepare
    mode: 0755
    overwrite: true
    contents:
      local: ./vfio-prepare.sh
  - path: /etc/modules-load.d/vfio-pci.conf
    mode: 0644
    overwrite: true
    contents:
      inline: vfio-pci
  - path: /etc/modprobe.d/vfio.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        options vfio-pci ids=8086:2723,1022:149c
systemd:
  units:
    - name: vfioprepare.service
      enabled: true
      contents: |
       [Unit]
       Description=Prepare vfio devices
       After=ignition-firstboot-complete.service
       Before=kubelet.service crio.service

       [Service]
       Type=oneshot
       ExecStart=/usr/local/bin/vfio-prepare

       [Install]
       WantedBy=kubelet.service
openshift:
  kernel_arguments:
    - amd_iommu=on
    - vga=off
    - rdblacklist=nouveau
    - 'video=efifb:off'

Create a bash script to unbind specific PCI devices and bind them to the VFIO-PCI driver.

vfio-prepare.sh
#!/bin/bash

# Unbind the device at $address from its current driver and bind it to vfio-pci
vfio_attach () {
  if [ -f "${path}/driver/unbind" ]; then
    echo $address > ${path}/driver/unbind
  fi
  echo vfio-pci > ${path}/driver_override
  echo $address > /sys/bus/pci/drivers/vfio-pci/bind || \
  echo $name > /sys/bus/pci/drivers/vfio-pci/new_id ||true
}

# 0a:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
address=0000:0a:00.1
path=/sys/bus/pci/devices/0000\:0a\:00.1
name="10de 1467"
vfio_attach

# Bind "useless" device to vfio-pci to satisfy IOMMU group
address=0000:07:00.0
path=/sys/bus/pci/devices/0000\:07\:00.0
name="1043 87c0"
vfio_attach

# Unbind USB switch and handle via vfio-pci kernel driver
address=0000:07:00.1
path=/sys/bus/pci/devices/0000\:07\:00.1
name="1043 87c0"
vfio_attach

# Unbind USB switch and handle via vfio-pci kernel driver
address=0000:07:00.3
path=/sys/bus/pci/devices/0000\:07\:00.3
name="1022 149c"
vfio_attach

# Unbind USB switch and handle via vfio-pci kernel driver
address=0000:0c:00.3
path=/sys/bus/pci/devices/0000\:0c\:00.3
name="1022 148c"
vfio_attach
cd articles/openshift-workstation/machineconfig/build
butane -d . vfio-prepare.bu -o ../vfio-prepare.yaml
oc apply -f ../vfio-prepare.yaml
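After a reboot you can verify that every device handled by vfio-prepare.sh is owned by vfio-pci (a small check; the addresses match this article's hardware):

for addr in 0000:0a:00.1 0000:07:00.0 0000:07:00.1 0000:07:00.3 0000:0c:00.3; do
  echo -n "$addr -> "
  basename "$(readlink /sys/bus/pci/devices/$addr/driver)"
done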

Creating a Virtual Machine with GPU Passthrough

This section guides you through creating virtual machines that can utilize the GPU via PCI passthrough. We’ll use existing LVM Logical Volumes where the operating system is already installed with UEFI boot.

Step 1: Create Persistent Volumes from LVM Disks

First, we need to make our LVM volumes available to OpenShift by creating Persistent Volume Claims (PVCs). This assumes you have the LVM Storage operator installed and running, providing the lvms-vg1 StorageClass.

  1. Create a YAML file for each VM disk. Here’s an example for a Fedora 35 VM:

fedora_pvc.yaml
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: fedora35
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-vg1
  2. Apply the YAML to create the PVC:

oc apply -f fedora_pvc.yaml
  3. Verify the PV and PVC are created and bound:

oc get pv
oc get pvc -n <your-namespace>

Step 2: Defining the Virtual Machine with GPU Passthrough

When creating virtual machines for desktop use with GPU passthrough, several important configurations need to be applied:

Key Configuration Elements

  1. GPU Passthrough: Pass the entire physical GPU to the VM

  2. Disable Virtual VGA: Remove the default emulated VGA device since we’re using the physical GPU

  3. USB Controller Passthrough: Include the USB controller for connecting peripherals directly

  4. UEFI Boot: Use UEFI boot mode for compatibility with modern operating systems and GPU drivers

  5. CPU/Memory Configuration: Allocate appropriate resources based on workload requirements

fedora.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: fedora
  namespace: epheo
spec:
  runStrategy: Halted
  template:
    metadata:
      labels:
        kubevirt.io/domain: fedora
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 8
          model: host-passthrough
          sockets: 2
          threads: 1
        features:
          acpi: {}
          smm:
            enabled: true
        firmware:
          bootloader:
            efi:
              secureBoot: false # For Nvidia Driver...
        devices:
          disks:
            - bootOrder: 1
              disk:
                bus: virtio
              name: pvdisk
            - disk:
                bus: virtio
              name: cloudinitdisk
          autoattachGraphicsDevice: false
          gpus:
          - deviceName: nvidia.com/GA102_GEFORCE_RTX_3080
            name: gpuvideo
          hostDevices:
          - deviceName: devices.kubevirt.io/USB3_Controller
            name: usbcontroller
          - deviceName: devices.kubevirt.io/USB3_Controller
            name: usbcontroller2
          - deviceName: intel.com/WIFI_Controller
            name: wificontroller
          interfaces:
          - masquerade: {}
            name: default
          - bridge: {}
            model: virtio
            name: nic-0
          networkInterfaceMultiqueue: true
          rng: {}
        machine:
          type: q35
        resources:
          requests:
            memory: 16G
      hostname: fedora
      networks:
      - name: default
        pod: {}
      - multus:
          networkName: br1
        name: nic-0
      terminationGracePeriodSeconds: 0
      volumes:
        - persistentVolumeClaim:
            claimName: 'fedora35'
          name: pvdisk
        - cloudInitNoCloud:
            userData: |-
              #cloud-config
              password: fedora
              chpasswd: { expire: False }
          name: cloudinitdisk
windows.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    vm.kubevirt.io/os: windows10
    vm.kubevirt.io/workload: desktop
  name: windows
spec:
  runStrategy: Manual
  template:
    metadata:
      labels:
        kubevirt.io/domain: windows
    spec:
      architecture: amd64
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 8
          dedicatedCpuPlacement: true
          sockets: 2
          threads: 1
        devices:
          autoattachGraphicsDevice: false
          disks:
          - cdrom:
              bus: sata
            name: windows-guest-tools
          - bootOrder: 1
            disk:
              bus: virtio
            name: pvdisk
          - disk:
              bus: virtio
            name: pvdisk1
          gpus:
          - deviceName: nvidia.com/GA102_GEFORCE_RTX_3080
            name: gpuvideo
          hostDevices:
          - deviceName: devices.kubevirt.io/USB3_Controller
            name: usbcontroller
          - deviceName: devices.kubevirt.io/USB3_Controller
            name: usbcontroller2
          - deviceName: intel.com/WIFI_Controller
            name: wificontroller
          interfaces:
          - bridge: {}
            model: virtio
            name: nic-0
          networkInterfaceMultiqueue: true
          rng: {}
          tpm: {}
        features:
          acpi: {}
          apic: {}
          hyperv:
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer:
              direct: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
          smm: {}
        firmware:
          bootloader:
            efi:
              secureBoot: true
        machine:
          type: q35
        memory:
          hugepages:
            pageSize: 1Gi
        resources:
          requests:
            memory: 32Gi
      evictionStrategy: None
      hostname: windows
      networks:
      - multus:
          networkName: br1
        name: nic-0
      terminationGracePeriodSeconds: 3600
      volumes:
      - containerDisk:
          image: registry.redhat.io/container-native-virtualization/virtio-win-rhel9@sha256:0c536c7aba76eb9c1e75a8f2dc2bbfa017e90314d55b242599ea41f42ba4434f
        name: windows-guest-tools
      - name: pvdisk
        persistentVolumeClaim:
          claimName: windows
      - name: pvdisk1
        persistentVolumeClaim:
          claimName: windowsdata
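Once these VirtualMachine definitions are applied with oc apply, the VMs can be started and inspected with virtctl and oc (a brief usage sketch; the namespace follows this article's examples):

oc apply -f fedora.yaml
virtctl start fedora -n epheo
oc get vmi -n epheo

# Serial console access; the graphical output goes to the physical GPU
virtctl console fedora -n epheo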

No longer used, kept for reference only

Binding GPU to VFIO Driver at boot time

We first gather the PCI vendor and product IDs using pciutils.

lspci -nn |grep VGA
100-sno-vfiopci.bu
variant: openshift
version: 4.10.0
metadata:
  name: 100-sno-vfiopci
  labels:
    machineconfiguration.openshift.io/role: master
storage:
  files:
  - path: /etc/modprobe.d/vfio.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        options vfio-pci ids=10de:2206,10de:1aef
  - path: /etc/modules-load.d/vfio-pci.conf
    mode: 0644
    overwrite: true
    contents:
      inline: vfio-pci
dnf install butane
butane 100-sno-vfiopci.bu -o 100-sno-vfiopci.yaml
oc apply -f 100-sno-vfiopci.yaml
98-sno-xhci-unbind.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-sno-xhci-unbind
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
      - contents: |
         [Unit]
         Description=Unbind USB Host Controller Driver
         After=ignition-firstboot-complete.service
         Before=kubelet.service crio.service

         [Service]
         Type=oneshot
         ExecStart=/bin/bash -c "/bin/echo 0000:0b:00.3 > /sys/bus/pci/devices/0000\\:0b\\:00.3/driver/unbind"
         ExecStart=/bin/bash -c "/bin/echo vfio-pci > /sys/bus/pci/devices/0000\\:0b\\:00.3/driver_override"
         ExecStart=/bin/bash -c "/bin/echo 1043 87c0 > /sys/bus/pci/drivers/vfio-pci/new_id"

         [Install]
         WantedBy=kubelet.service
        enabled: true
        name: unbindusbcontroller.service

Unbinding VTConsole at boot time

98-sno-vtconsole-unbind.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-sno-vtconsole-unbind
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
      - contents: |
         [Unit]
         Description=Detach GPU VT Console
         After=ignition-firstboot-complete.service
         Before=kubelet.service crio.service

         [Service]
         Type=oneshot
         ExecStart=/bin/bash -c "/bin/echo 0 > /sys/class/vtconsole/vtcon0/bind"

         [Install]
         WantedBy=kubelet.service
        enabled: true
        name: dettachvtconsole.service

What’s next

This chapter is kept as a reference for future possible improvements.

  • Reducing the control plane footprint by relying on MicroShift instead.

  • Using GPU from containers instead of virtual machines for Linux Desktop.

  • Replacing node preparation with QEMU hooks

  • Enabling dedicated resources for virtual machines

  • Using MicroShift and RHEL for Edge

Troubleshooting

This section covers common issues you might encounter when setting up GPU passthrough with OpenShift and their solutions.

IOMMU Group Viability Issues

Problem: Virtual machine fails to start with an error similar to:

{"component":"virt-launcher","level":"error","msg":"Failed to start VirtualMachineInstance",
"reason":"virError... vfio 0000:07:00.1: group 19 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver."}

Diagnosis: This error occurs when not all devices in the same IOMMU group are bound to the vfio-pci driver. To check the IOMMU group:

# Check which devices are in the same IOMMU group
ls /sys/kernel/iommu_groups/19/devices/
# Output shows multiple devices in the group:
# 0000:03:08.0  0000:07:00.0  0000:07:00.1  0000:07:00.3

# Check what one of these devices is
lspci -nnks 07:00.0
# Output: AMD Starship/Matisse Reserved SPP [1022:1485]

Solution: All devices in the same IOMMU group need to be bound to the vfio-pci driver. Modify your vfio-prepare.sh script to include all devices in the IOMMU group:

# Add these lines to your vfio-prepare.sh script
echo "vfio-pci" > /sys/bus/pci/devices/0000:03:08.0/driver_override
echo "vfio-pci" > /sys/bus/pci/devices/0000:07:00.0/driver_override
echo "vfio-pci" > /sys/bus/pci/devices/0000:07:00.1/driver_override
echo "vfio-pci" > /sys/bus/pci/devices/0000:07:00.3/driver_override

# Make sure to unbind from current drivers first and then bind to vfio-pci
# as shown in the vfio-prepare.sh script example

Other Common Issues

No display output after GPU passthrough:

  • Ensure you’ve disabled the virtual VGA device in the VM specification

  • Check that you’ve passed through both the GPU and its audio device

  • Install the appropriate GPU drivers inside the virtual machine

Performance issues in Windows VM:

  • Ensure CPU pinning is configured correctly

  • Consider enabling huge pages for memory performance

  • Install the latest NVIDIA drivers from within the VM

  • Disable the Windows Game Bar and other overlay software

GPU driver switching fails:

  • Verify all GPU workloads are stopped before switching

  • Check the GPU operator pod logs: oc logs -n nvidia-gpu-operator <pod-name>

  • Verify IOMMU is properly enabled in BIOS/UEFI settings

For more troubleshooting help, check the logs of the following components:

  • virt-handler: oc logs -n openshift-cnv virt-handler-<hash>

  • virt-launcher: oc logs -n <namespace> virt-launcher-<vm-name>-<hash>

  • nvidia-driver-daemonset: oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-<hash>