KVM Network Performance Tuning
Overview
This document records a complete solution for achieving near-bare-metal network performance on KVM virtual machines running on AWS c5n.metal instances, using a combination of host-level, VM XML, and Guest OS tuning techniques.
Key Constraint: AWS environments do not support SR-IOV / macvtap NIC passthrough. All optimizations are based on virtio-net + NAT bridge mode.
Tuning Results Summary
iperf3 Network Performance
| Test Scenario | Single-Stream Bandwidth | Multi-Stream P=16 Bandwidth |
|---|---|---|
| ec2A bare metal → ec2B | 4.94 Gbits/sec | 23.1 Gbits/sec |
| KVM Guest baseline → ec2B | 4.76 Gbits/sec (96.4%) | 9.24 Gbits/sec (40.0%) |
| KVM Guest optimized → ec2B | 4.94 Gbits/sec (100%) | 13.4 Gbits/sec (58.0%) |
- Single-stream performance: improved from 96.4% to 100%, completely eliminating virtualization overhead
- Multi-stream performance: improved from 40.0% to 58.0%, a 45% improvement
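For reference, a minimal sketch of iperf3 invocations that would produce measurements like those above; the server address and test duration are placeholders, not values taken from the original runs.

```bash
# On ec2B: run the receiver
iperf3 -s

# On the sender (bare metal or KVM Guest); 10.0.0.2 is a placeholder for ec2B
iperf3 -c 10.0.0.2 -t 30           # single-stream bandwidth
iperf3 -c 10.0.0.2 -t 30 -P 16     # 16 parallel streams
```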
fio Disk Performance
fio performance shows no regression before and after optimization (4k randread/randwrite both within margin of error), confirming that the tuning only affects the network path and does not impact disk I/O.
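A sketch of 4k randread/randwrite jobs of the kind compared above; the target device, queue depth, and runtime are assumptions, not the original job files.

```bash
# 4k random read against a scratch device (placeholder /dev/vdb); direct=1
# bypasses the Guest page cache so the virtio/host path is what gets measured
fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# 4k random write -- destructive, only run against a disposable device
fio --name=randwrite --filename=/dev/vdb --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
```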
Detailed Tuning Solutions
Part 1: Host Kernel Modules
1. Load vhost-net Module
```bash
sudo modprobe vhost_net
echo "vhost_net" | sudo tee /etc/modules-load.d/vhost-net.conf
```

Rationale: vhost-net moves virtio-net packet processing out of the QEMU userspace process and into kernel-space vhost worker threads, eliminating the context-switch overhead between user space and kernel space. This is the most fundamental and impactful step in KVM network performance optimization.
Effect: Enabling this alone can improve network throughput by 20–30%.
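A quick way to confirm vhost-net is actually in use; the commands below are standard, though the exact thread naming can vary by kernel version:

```bash
# Module loaded?
lsmod | grep vhost_net

# With a VM running, kernel vhost worker threads appear as [vhost-<qemu-pid>]
ps -ef | grep '\[vhost'
```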
2. Load and Disable Bridge Netfilter
```bash
sudo modprobe br_netfilter
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-iptables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-ip6tables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-arptables
```

Persistent configuration:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-bridge-nf.conf
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0
EOF
```

Rationale: By default, the Linux bridge sends all bridged packets through iptables/ip6tables/arptables for filtering, which is unnecessary overhead for KVM NAT scenarios. Disabling this allows packets to be forwarded directly, reducing CPU cycle consumption.
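To verify the setting took effect, all three sysctls should report 0:

```bash
sysctl net.bridge.bridge-nf-call-iptables \
       net.bridge.bridge-nf-call-ip6tables \
       net.bridge.bridge-nf-call-arptables
```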
Part 2: Host Memory Optimization
3. Hugepages
```bash
# Allocate 8192 × 2 MB = 16 GB of hugepages (must exceed the VM's 8 GB of memory)
echo 8192 | sudo tee /proc/sys/vm/nr_hugepages
```

Persistent configuration:

```bash
echo "vm.nr_hugepages = 8192" | sudo tee /etc/sysctl.d/99-hugepages.conf
```

Rationale: Using 2 MB hugepages instead of the default 4 KB pages means each TLB (Translation Lookaside Buffer) entry covers 512 times as much address space, sharply reducing TLB miss rates. For memory-intensive network packet processing, reducing page-table lookup overhead can significantly improve performance.
VM XML Configuration:

```xml
<memoryBacking>
  <hugepages/>
</memoryBacking>
```

Part 3: VM XML Configuration Optimization
4. CPU Pinning (vcpupin)
```xml
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <emulatorpin cpuset='8-9'/>
</cputune>
```

Rationale:
- vcpupin: Pins each virtual CPU to a specific physical core, preventing vCPU migration between physical cores, which would invalidate L1/L2 caches
- emulatorpin: Pins QEMU emulator threads to dedicated physical cores (8–9), preventing them from competing with the vCPUs for CPU time
- Has the greatest impact on multi-stream network tests, as multiple network processing threads can execute in parallel without interfering with each other (a verification sketch follows this list)
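To confirm the pinning took effect on a running domain (the domain name myguest is a placeholder):

```bash
# Show vCPU-to-physical-core pinning and emulator thread pinning
virsh vcpupin myguest
virsh emulatorpin myguest
```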
5. CPU host-passthrough Mode
```xml
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' dies='1' cores='8' threads='1'/>
</cpu>
```

Rationale:
- host-passthrough: Exposes all host CPU features (SSE4.2, AVX2, AES-NI, etc.) directly to the Guest, avoiding CPU instruction emulation overhead (a quick check follows this list)
- An explicit CPU topology (1 socket × 8 cores) lets the Guest kernel correctly identify the core layout, optimizing scheduling and interrupt distribution
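Inside the Guest, the host CPU model and feature flags should now be visible; a minimal check:

```bash
lscpu | grep 'Model name'
# Host instruction sets such as AVX2 and AES (AES-NI) should be listed
grep -o -w -E 'avx2|aes' /proc/cpuinfo | sort -u
```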
6. Q35 Machine Type
```xml
<os>
  <type arch='x86_64' machine='pc-q35-rhel9.4.0'>hvm</type>
</os>
```

Rationale: Q35 is a modern PCIe-based virtual chipset. Compared to the legacy i440FX:
- Supports native PCIe bus; virtio devices are presented as PCIe devices, reducing bus emulation overhead
- More efficient interrupt routing (MSI-X)
- Better IOMMU support
7. virtio-net Multi-Queue + vhost Driver
```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='8'/>
</interface>
```

Rationale:
- name='vhost': Specifies at the XML level that the vhost-net kernel driver handles network I/O
- queues='8': Enables multi-queue virtio-net, allowing multiple CPU cores to simultaneously process network send/receive queues
- Multi-queue is key to improving multi-stream network performance; with a single queue, all network interrupts can only be handled by one CPU (an interrupt-vector check follows this list)
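Once the queues are activated inside the Guest (see Part 5), each queue pair shows up as its own interrupt vector; a quick check, assuming the NIC is virtio0:

```bash
# Expect one input and one output vector per queue, e.g. virtio0-input.0 ... .7
grep virtio /proc/interrupts
```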
8. Disk I/O Optimization
```xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/nvme1n1'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Rationale:
- cache='none': Bypasses the host page cache; Guest I/O goes directly to the block device, avoiding double caching
- io='native': Uses Linux native asynchronous I/O (AIO), which is more efficient than QEMU's thread-pool mode
- type='raw': Raw device passthrough with no qcow2 metadata overhead (an XML check follows this list)
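To confirm libvirt accepted the driver attributes, the live domain XML can be inspected (myguest is a placeholder):

```bash
virsh dumpxml myguest | grep -A4 '<disk '
```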
Part 4: Host Network Parameter Tuning
9. Kernel Network Buffers
```bash
# Increase TCP/socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216     # 16 MB
sudo sysctl -w net.core.wmem_max=16777216     # 16 MB
sudo sysctl -w net.core.rmem_default=1048576  # 1 MB
sudo sysctl -w net.core.wmem_default=1048576  # 1 MB

# TCP auto-tuning range
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

# Network queue depth
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535
```

Persistent configuration:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-net-perf.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
EOF
sudo sysctl -p /etc/sysctl.d/99-net-perf.conf
```

Rationale: Default kernel network buffers (typically 128 KB–256 KB) become a bottleneck in high-bandwidth scenarios. Increasing the buffer sizes allows the TCP window to auto-scale to larger values, fully utilizing high-bandwidth links.
10. virbr0 MTU Adjustment
```bash
sudo ip link set virbr0 mtu 9000
```

Rationale: Increasing the MTU to 9000 (jumbo frames) reduces the number of packets that must be processed per GB of data by a factor of roughly 6 (1500 → 9000), directly lowering CPU interrupt handling and protocol-stack overhead.
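A path-MTU sanity check from host to Guest, once both sides are raised to 9000 (the Guest side is item 12 below); the address is a placeholder on libvirt's default NAT subnet. 8972 is the largest ICMP payload that fits in a 9000-byte MTU (9000 minus 20 bytes of IP header and 8 bytes of ICMP header), and -M do forbids fragmentation, so the ping fails if jumbo frames are not working end to end:

```bash
GUEST_IP=192.168.122.100   # placeholder Guest address
ping -c 3 -M do -s 8972 "$GUEST_IP"
```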
Part 5: Guest OS Internal Tuning
11. Enable Multi-Queue
```bash
sudo ethtool -L eth0 combined 8
```

Rationale: Although 8 queues are defined in the VM XML, they must be activated inside the Guest using ethtool to take effect. Once activated, network interrupts are distributed across multiple CPU cores, enabling true multi-core parallel network processing.
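To verify, the lowercase -l form prints the channel counts; "Combined: 8" should appear under the current hardware settings:

```bash
ethtool -l eth0
```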
12. Guest MTU Setting
```bash
sudo ip link set eth0 mtu 9000
```

Rationale: Must match the MTU of the host's virbr0; otherwise packets larger than 1500 bytes will be fragmented, which would actually decrease performance.
13. Guest Kernel Network Parameters
```bash
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=1048576
sudo sysctl -w net.core.wmem_default=1048576
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
```

Rationale: The Guest kernel has its own independent network parameter space, which must be tuned in sync with the host; otherwise, small buffers on the Guest side will become the bottleneck. (A persistence sketch follows.)
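These sysctl -w values do not survive a Guest reboot; a sketch of persisting them inside the Guest with the same sysctl.d pattern used on the host:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-net-perf.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
EOF
sudo sysctl -p /etc/sysctl.d/99-net-perf.conf
```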
Optimization Impact Breakdown
| Optimization | Primary Impact | Significance |
|---|---|---|
| vhost-net | Single/multi-stream throughput | ★★★★★ |
| Multi-queue virtio-net | Multi-stream throughput | ★★★★★ |
| CPU pinning | Multi-stream throughput, latency stability | ★★★★★ |
| Hugepages | Overall performance, latency | ★★★☆☆ |
| host-passthrough | CPU-intensive processing | ★★★☆☆ |
| Bridge NF disabled | Throughput | ★★☆☆☆ |
| Kernel network buffers | High-bandwidth scenarios | ★★☆☆☆ |
| MTU 9000 | Large-block transfer throughput | ★★☆☆☆ |
| Q35 machine type | Interrupt efficiency | ★☆☆☆☆ |
| Disk native I/O | Disk performance | ★☆☆☆☆ |
Non-Applied Optimizations and Reasons
| Item | Reason Not Used |
|---|---|
| SR-IOV NIC passthrough | The ENA NIC on AWS c5n.metal does not support SR-IOV passthrough within KVM |
| macvtap passthrough | AWS ENA driver does not support macvtap/macvlan mode |
| DPDK (Data Plane Development Kit) | Requires dedicated application-layer support; incompatible with iperf3 |
| AF_XDP | Requires application-layer adaptation |
| OVS-DPDK bridge | Architecturally heavy and requires application-layer cooperation |
Analysis of Why Multi-Stream Performance Does Not Reach 100%
After optimization, multi-stream performance reaches 58% of bare metal (13.4 vs 23.1 Gbits/sec). The reasons it cannot reach 100%:
- NAT overhead: All Guest traffic must pass through virbr0 NAT translation, adding CPU instructions per packet
- Virtual switch overhead: virbr0 is a Linux bridge; each packet requires table lookups, forwarding, and address translation
- Interrupt coalescing efficiency: virtio-net interrupt coalescing is less efficient than hardware interrupt coalescing on a physical ENA NIC
- vhost thread scheduling: vhost worker threads still require management by the kernel scheduler, introducing context switch overhead
- QEMU control plane: Although the data plane is handled by vhost, QEMU still participates in some control plane operations
To further close this gap, technologies such as SR-IOV passthrough or DPDK that bypass the kernel network stack would be needed, but these are not available in the current AWS environment.
Complete Optimization Checklist (One-Shot Deployment)
Host Configuration Script
```bash
#!/bin/bash
# host-optimize.sh - KVM host network performance optimization
# 1. Load kernel modules
modprobe vhost_net
echo "vhost_net" > /etc/modules-load.d/vhost-net.conf
# 2. Disable bridge netfilter
modprobe br_netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0
sysctl -w net.bridge.bridge-nf-call-arptables=0
# 3. Configure hugepages (adjust based on VM memory; must be > VM RAM)
echo 8192 > /proc/sys/vm/nr_hugepages
# 4. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
# 5. virbr0 MTU
ip link set virbr0 mtu 9000
echo "Host optimization complete."Guest Configuration Script
#!/bin/bash
# guest-optimize.sh - KVM Guest network performance optimization
# 1. Enable multi-queue (queue count must match VM XML definition)
ethtool -L eth0 combined 8
# 2. MTU (must match host virbr0)
ip link set eth0 mtu 9000
# 3. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
echo "Guest optimization complete."Conclusion
Through 13 tuning measures, without using SR-IOV/macvtap hardware passthrough:
- Single-stream network performance: improved from 96.4% to 100%, completely eliminating virtualization overhead
- Multi-stream network performance: improved from 40.0% to 58.0%, a 45% improvement
- Disk I/O performance: unchanged, no regression
The three most impactful optimizations are the vhost-net kernel module, multi-queue virtio-net, and CPU pinning. Together they address the core bottlenecks of KVM network virtualization: user/kernel context switching, single-queue serialization, and CPU cache invalidation.
The fundamental reason multi-stream performance cannot reach 100% is that, under the virtio-net + NAT bridge architecture, each packet must still traverse the complete software path of Guest virtio driver → vhost kernel thread → virbr0 Linux bridge → iptables NAT translation → physical ENA NIC. Compared to a bare-metal host sending directly through the ENA NIC, this adds at least three layers of software processing, and these overheads scale linearly with the number of concurrent streams. Completely eliminating the gap would require SR-IOV or macvtap passthrough of the physical NIC to the Guest, bypassing the host's entire network stack, but AWS ENA NICs do not support these passthrough modes. Therefore, 58% is already close to the theoretical ceiling for the virtio-net NAT architecture in the current environment.