
KVM Network Performance Tuning

Overview

This document records a complete solution for achieving near-bare-metal network performance on KVM virtual machines running on AWS c5n.metal instances, through a series of host-level, VM XML configuration, and Guest OS tuning techniques.

Key Constraint: On AWS c5n.metal, the ENA NIC cannot be handed to KVM guests through SR-IOV or macvtap passthrough. All optimizations are therefore based on virtio-net + NAT bridge mode.


Tuning Results Summary

iperf3 Network Performance

Test Scenario              | Single-Stream Bandwidth | Multi-Stream (P=16) Bandwidth
ec2A bare metal → ec2B     | 4.94 Gbits/sec          | 23.1 Gbits/sec
KVM Guest baseline → ec2B  | 4.76 Gbits/sec (96.4%)  | 9.24 Gbits/sec (40.0%)
KVM Guest optimized → ec2B | 4.94 Gbits/sec (100%)   | 13.4 Gbits/sec (58.0%)

Percentages are relative to the bare-metal result in the same column.

  • Single-stream performance: improved from 96.4% to 100% of bare metal, so no virtualization overhead is measurable in this test
  • Multi-stream performance: improved from 40.0% to 58.0% of bare metal (9.24 → 13.4 Gbits/sec), a 45% relative gain
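
The percentages above follow directly from the raw iperf3 figures. A minimal sketch for recomputing them is shown here; the helper names ratio_pct and gain_pct are illustrative and not part of the original test tooling.

```shell
#!/bin/bash
# Recompute the ratios in the summary table from the raw iperf3 figures.

# ratio_pct NUM DEN -> NUM as a percentage of DEN, one decimal place
ratio_pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# gain_pct OLD NEW -> relative improvement of NEW over OLD, in percent
gain_pct() {
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * (n - o) / o }'
}

ratio_pct 9.24 23.1   # baseline multi-stream vs bare metal
ratio_pct 13.4 23.1   # optimized multi-stream vs bare metal
gain_pct  9.24 13.4   # relative gain from tuning
```

The three calls print 40.0, 58.0 and 45.0, matching the figures quoted in the summary.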

fio Disk Performance

fio performance shows no regression before and after optimization (4k randread/randwrite both within margin of error), confirming that the tuning only affects the network path and does not impact disk I/O.


Detailed Tuning Solutions

Part 1: Host Kernel Modules

1. Load vhost-net Module

sudo modprobe vhost_net
echo "vhost_net" | sudo tee /etc/modules-load.d/vhost-net.conf

Rationale: vhost-net moves virtio-net packet processing from the QEMU userspace process into kernel-space vhost worker threads, eliminating the context-switch overhead between userspace and kernelspace. This is the most fundamental and impactful step in KVM network performance optimization.

Effect: Enabling this alone can improve network throughput by 20–30%.
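
Before moving on, it can help to confirm that the module is actually active. The following is a sketch only: module_loaded is a hypothetical helper, and it assumes vhost_net is built as a loadable module (a built-in driver would not appear in /proc/modules).

```shell
#!/bin/bash
# Check that vhost_net is loaded and that its device node exists.

# module_loaded NAME [MODULES_FILE]
# Succeeds if NAME is listed in MODULES_FILE (default: /proc/modules).
# Modules compiled into the kernel are not listed there.
module_loaded() {
    local name="$1" file="${2:-/proc/modules}"
    [ -r "$file" ] || return 1
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

if module_loaded vhost_net && [ -c /dev/vhost-net ]; then
    echo "vhost_net: loaded, /dev/vhost-net present"
else
    echo "vhost_net: not loaded (or built in / device node missing)"
fi
```

Once a VM with a vhost-backed interface is running, the host should also show kernel threads named vhost-<QEMU PID> in the process list.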

2. Load and Disable Bridge Netfilter

sudo modprobe br_netfilter
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-iptables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-ip6tables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-arptables

Persistent configuration:

cat << EOF | sudo tee /etc/sysctl.d/99-bridge-nf.conf
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0
EOF

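Rationale: Once br_netfilter is loaded, the Linux bridge passes every bridged packet through iptables/ip6tables/arptables, which is unnecessary overhead for KVM NAT scenarios. Setting the three bridge-nf-call-* switches to 0 lets the bridge forward frames directly, reducing CPU cycle consumption. The NAT rules for the libvirt network are applied on the routed path, so they are not affected by this change.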
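
The net.bridge.* keys only exist while br_netfilter is loaded, so the persistent sysctl file can fail to apply at boot if the module is loaded later. One way to cover this, assuming a systemd-style host, is to list the module in /etc/modules-load.d (the file name br_netfilter.conf is an arbitrary choice), as was done for vhost_net. The sketch below reports the current state of the three switches; bridge_nf_status is a hypothetical helper.

```shell
#!/bin/bash
# Report the bridge netfilter toggles. The net.bridge.* keys exist only
# while br_netfilter is loaded, so a missing file is reported as "absent".

# bridge_nf_status [DIR] -> one "name=value" line per toggle
bridge_nf_status() {
    local dir="${1:-/proc/sys/net/bridge}" t
    for t in iptables ip6tables arptables; do
        if [ -r "$dir/bridge-nf-call-$t" ]; then
            echo "bridge-nf-call-$t=$(cat "$dir/bridge-nf-call-$t")"
        else
            echo "bridge-nf-call-$t=absent"
        fi
    done
}

bridge_nf_status

# To load the module at boot (run once, as root):
#   echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
```

After the tuning above, all three lines should end in =0.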

Part 2: Host Memory Optimization

3. Hugepages

# Allocate 8192 × 2MB = 16GB of hugepages (the pool must cover at least the VM's 8GB of memory)
echo 8192 | sudo tee /proc/sys/vm/nr_hugepages

Persistent configuration:

echo "vm.nr_hugepages = 8192" | sudo tee /etc/sysctl.d/99-hugepages.conf

Rationale: Each 2MB hugepage covers the same memory as 512 default 4KB pages, so a single TLB (Translation Lookaside Buffer) entry maps 512 times as much memory and TLB misses drop sharply. For memory-intensive network packet processing, the reduced page-table lookup overhead can significantly improve performance.

VM XML Configuration:

<memoryBacking>
  <hugepages/>
</memoryBacking>
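
The pool size can be derived from the VM memory instead of being hard-coded. A minimal sketch, assuming 2MB pages (the x86_64 default hugepage size); hugepages_needed is a hypothetical helper name.

```shell
#!/bin/bash
# hugepages_needed MEM_MIB [PAGE_KIB]
# Pages required to back MEM_MIB of guest RAM, rounded up.
hugepages_needed() {
    local mem_mib="$1" page_kib="${2:-2048}"
    echo $(( (mem_mib * 1024 + page_kib - 1) / page_kib ))
}

hugepages_needed 8192          # 8GB guest on 2MB pages
echo $(( 2048 / 4 ))           # number of 4KB pages covered by one 2MB page

# Current pool state on the host (silent if /proc/meminfo is unavailable)
grep -E '^(HugePages_(Total|Free)|Hugepagesize)' /proc/meminfo 2>/dev/null || true
```

For the 8GB VM used here the minimum is 4096 pages, so the 8192 pages allocated above leave half of the pool free for growth or a second VM.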

Part 3: VM XML Configuration Optimization

4. CPU Pinning (vcpupin)

<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <emulatorpin cpuset='8-9'/>
</cputune>

Rationale:

  • vcpupin: Pins virtual CPUs to specific physical CPU cores, preventing vCPU migration between physical cores which would cause L1/L2 cache invalidations
  • emulatorpin: Pins QEMU emulator threads to dedicated physical cores (8–9), preventing emulator threads from competing with vCPUs for CPU resources
  • Has the greatest impact on multi-stream network tests, as multiple network processing threads can execute in parallel without interfering with each other
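
For a different vCPU count, or to pin onto a different range of host CPUs, the vcpupin lines can be generated. This is a sketch: gen_vcpupin is a hypothetical helper, and it assumes a simple contiguous one-to-one mapping like the one shown above.

```shell
#!/bin/bash
# gen_vcpupin COUNT [FIRST_HOST_CPU]
# Emits one <vcpupin> line per vCPU, mapping vCPU i to host CPU FIRST+i.
gen_vcpupin() {
    local count="$1" first="${2:-0}" i
    for (( i = 0; i < count; i++ )); do
        printf "  <vcpupin vcpu='%d' cpuset='%d'/>\n" "$i" "$(( first + i ))"
    done
}

gen_vcpupin 8 0   # reproduces the eight lines used in this document
```

On a multi-socket host with SMT, it is worth checking lscpu -e first so that the chosen host CPUs sit on one NUMA node and are not sibling threads of the same core. virsh vcpupin <domain> shows the affinity actually in effect.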

5. CPU host-passthrough Mode

<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' dies='1' cores='8' threads='1'/>
</cpu>

Rationale:

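  • host-passthrough: Exposes all host CPU features (SSE4.2, AVX2, AES-NI, etc.) directly to the Guest, so Guest software can use hardware-accelerated instructions instead of falling back to slower generic code paths
  • An explicit CPU topology definition (1 socket × 8 cores × 1 thread) lets the Guest kernel see the intended socket/core layout, which informs its scheduling and interrupt distribution. NUMA placement is not described by this element; it would need a separate <numa> definition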
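
Whether the features actually reach the Guest can be checked from inside it. A minimal sketch, run in the Guest; has_cpu_flag is a hypothetical helper and the flag names are the /proc/cpuinfo spellings of the features listed above.

```shell
#!/bin/bash
# has_cpu_flag FLAG [CPUINFO_FILE]
# Succeeds if FLAG appears in the first "flags" line of the cpuinfo file.
has_cpu_flag() {
    local flag="$1" file="${2:-/proc/cpuinfo}"
    [ -r "$file" ] || return 1
    awk -v f="$flag" '
        /^flags/ { for (i = 3; i <= NF; i++) if ($i == f) found = 1; exit }
        END { exit !found }' "$file"
}

for flag in sse4_2 avx2 aes; do
    if has_cpu_flag "$flag"; then echo "$flag: yes"; else echo "$flag: no"; fi
done
```

With host-passthrough, the output in the Guest should match the output of the same script on the host.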

6. Q35 Machine Type

<os>
  <type arch='x86_64' machine='pc-q35-rhel9.4.0'>hvm</type>
</os>

Rationale: Q35 is a modern PCIe-based virtual chipset. Compared to the legacy i440fx:

  • Supports native PCIe bus; virtio devices are presented as PCIe devices, reducing bus emulation overhead
  • More efficient interrupt routing (MSI-X)
  • Better IOMMU support

7. virtio-net Multi-Queue + vhost Driver

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='8'/>
</interface>

Rationale:

  • name='vhost': Specifies at the XML level that the vhost-net kernel driver should handle network I/O
  • queues='8': Enables multi-queue virtio-net, allowing multiple CPU cores to simultaneously process network send/receive queues
  • Multi-queue is key to improving multi-stream network performance; with a single queue, all network interrupts can only be handled by one CPU

8. Disk I/O Optimization

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/nvme1n1'/>
  <target dev='vda' bus='virtio'/>
</disk>

Rationale:

  • cache='none': Bypasses the host page cache; Guest I/O goes directly to the block device, avoiding double caching
  • io='native': Uses Linux native asynchronous I/O (AIO), which is more efficient than QEMU’s thread pool mode
  • type='raw': Raw device passthrough with no qcow2 metadata overhead

Part 4: Host Network Parameter Tuning

9. Kernel Network Buffers

# Increase TCP/Socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216      # 16MB
sudo sysctl -w net.core.wmem_max=16777216      # 16MB
sudo sysctl -w net.core.rmem_default=1048576   # 1MB
sudo sysctl -w net.core.wmem_default=1048576   # 1MB

# TCP auto-tuning range
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

# Network queue depth
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535

Persistent configuration:

cat << EOF | sudo tee /etc/sysctl.d/99-net-perf.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
EOF
sudo sysctl -p /etc/sysctl.d/99-net-perf.conf

Rationale: Default kernel network buffers (typically 128KB–256KB) become a bottleneck in high-bandwidth scenarios. Increasing the buffer sizes allows the TCP window to auto-scale to larger values, fully utilizing high-bandwidth links.
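
The buffer ceiling a single connection needs is roughly its bandwidth-delay product. The sketch below works through that arithmetic; the 25 Gbit/s rate and 1 ms round-trip time are illustrative assumptions, not measurements from these tests, and the helper names are hypothetical.

```shell
#!/bin/bash
# bdp_bytes GBITS RTT_MS
# Bytes that must be in flight to keep a link of GBITS Gbit/s full at RTT_MS.
bdp_bytes() {
    awk -v g="$1" -v r="$2" 'BEGIN { printf "%d\n", g * 1e9 / 8 * r / 1000 }'
}

# max_rtt_ms BUF_BYTES GBITS
# Largest round-trip time (ms) that a buffer of BUF_BYTES covers at GBITS Gbit/s.
max_rtt_ms() {
    awk -v b="$1" -v g="$2" 'BEGIN { printf "%.2f\n", b * 8 / (g * 1e9) * 1000 }'
}

bdp_bytes 25 1            # assumed 25 Gbit/s at an assumed 1 ms RTT
max_rtt_ms 16777216 25    # RTT covered by the 16MB ceiling at 25 Gbit/s
```

Under these assumptions one stream needs about 3MB in flight, well above the default buffers but comfortably inside the 16MB ceiling. Buffers are per socket, so in a multi-stream test each stream has its own allowance.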

10. virbr0 MTU Adjustment

sudo ip link set virbr0 mtu 9000

Rationale: Increasing MTU to 9000 (Jumbo Frame) reduces the number of packets that need to be processed per GB of data by approximately 6x (1500→9000), directly lowering CPU interrupt handling and protocol stack overhead.
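
The approximate 6x figure can be checked with a short calculation. This sketch assumes 40 bytes of IPv4 and TCP headers per packet with no TCP options; packets_per_gib is a hypothetical helper.

```shell
#!/bin/bash
# packets_per_gib MTU
# TCP segments needed to carry 1 GiB of payload, assuming 40 bytes of
# IPv4 + TCP headers per packet and no TCP options.
packets_per_gib() {
    local mss=$(( $1 - 40 ))
    echo $(( (1073741824 + mss - 1) / mss ))
}

packets_per_gib 1500
packets_per_gib 9000
```

This prints 735440 and 119838, a ratio of about 6.1, consistent with the estimate above.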

Part 5: Guest OS Internal Tuning

11. Enable Multi-Queue

sudo ethtool -L eth0 combined 8

Rationale: Although 8 queues are defined in the VM XML, they must be activated inside the Guest using ethtool to take effect. Once activated, network interrupts are distributed across multiple CPU cores, enabling true multi-core parallel network processing.
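
To confirm the change, the channel counts reported by ethtool -l can be compared before and after. A sketch, assuming the usual ethtool -l layout with a "Combined:" line under both the maximums and the current settings; combined_channels is a hypothetical helper.

```shell
#!/bin/bash
# combined_channels
# Reads `ethtool -l` output on stdin and prints "max=<n> current=<n>",
# taken from the first and second "Combined:" lines respectively.
combined_channels() {
    awk '/^Combined:/ { v[++n] = $2 }
         END { if (n < 2) exit 1; printf "max=%s current=%s\n", v[1], v[2] }'
}

# Inside the Guest (interface name eth0 as in the text):
#   ethtool -l eth0 | combined_channels
```

After activation the two numbers should both be 8. The per-queue interrupts can then be seen as separate virtio input/output entries in /proc/interrupts.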

12. Guest MTU Setting

sudo ip link set eth0 mtu 9000

Rationale: Must match the MTU of the host’s virbr0, otherwise packets larger than 1500 bytes will be fragmented, which would actually decrease performance.
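
A mismatch anywhere on the path can be detected by pinging with the Don't Fragment bit set and the largest payload that fits the MTU. A minimal sketch, assuming the iputils version of ping; the helper names are hypothetical and the address in the example is a placeholder (192.168.122.1 is only the usual default for libvirt's virbr0).

```shell
#!/bin/bash
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet
# (20-byte IP header + 8-byte ICMP header).
icmp_payload_for_mtu() { echo $(( $1 - 28 )); }

# probe_mtu HOST MTU
# Sends 3 pings of the maximum size with fragmentation prohibited.
probe_mtu() {
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1"
}

icmp_payload_for_mtu 9000
# Example, from inside the Guest:
#   probe_mtu 192.168.122.1 9000
```

If the probe fails at 9000 but succeeds at 1500, some hop between the Guest and the target is still using the smaller MTU.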

13. Guest Kernel Network Parameters

sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=1048576
sudo sysctl -w net.core.wmem_default=1048576
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

Rationale: The Guest kernel has its own independent network parameter space, which must be tuned in sync with the host. Otherwise, small buffers on the Guest side will become the bottleneck.
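
Because the same values must hold on both sides, a small check that can be run on the host and in the Guest is useful. This is a sketch under the assumption that none of the keys contain dots inside a path component; sysctl_path and check_sysctl are hypothetical helpers.

```shell
#!/bin/bash
# sysctl_path KEY [ROOT] -> file that backs a dotted sysctl key
sysctl_path() {
    local key="$1" root="${2:-/proc/sys}"
    printf '%s/%s\n' "$root" "${key//.//}"
}

# check_sysctl KEY EXPECTED [ROOT] -> prints OK / MISMATCH / MISSING for one key
check_sysctl() {
    local path actual
    path=$(sysctl_path "$1" "${3:-/proc/sys}")
    if [ ! -r "$path" ]; then echo "MISSING  $1"; return 1; fi
    # Collapse tabs/spaces so multi-value keys such as tcp_rmem compare cleanly
    actual=$(tr -s '[:space:]' ' ' < "$path" | sed 's/^ //; s/ $//')
    if [ "$actual" = "$2" ]; then
        echo "OK       $1"
    else
        echo "MISMATCH $1 (got: $actual)"
        return 1
    fi
}

check_sysctl net.core.rmem_max 16777216 || true
check_sysctl net.ipv4.tcp_rmem "4096 1048576 16777216" || true
```

The remaining keys from the list above can be added in the same way. The Guest commands shown are runtime-only, so a sysctl.d file like the one used on the host is needed in the Guest to keep them across reboots.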


Optimization Impact Breakdown

Optimization           | Primary Impact                             | Significance
vhost-net              | Single/multi-stream throughput             | ★★★★★
Multi-queue virtio-net | Multi-stream throughput                    | ★★★★★
CPU pinning            | Multi-stream throughput, latency stability | ★★★★☆
Hugepages              | Overall performance, latency               | ★★★☆☆
host-passthrough       | CPU-intensive processing                   | ★★★☆☆
Bridge NF disabled     | Throughput                                 | ★★☆☆☆
Kernel network buffers | High-bandwidth scenarios                   | ★★☆☆☆
MTU 9000               | Large-block transfer throughput            | ★★☆☆☆
Q35 machine type       | Interrupt efficiency                       | ★☆☆☆☆
Disk native I/O        | Disk performance                           | ★☆☆☆☆

Non-Applied Optimizations and Reasons

Item                              | Reason Not Used
SR-IOV NIC passthrough            | The ENA NIC on AWS c5n.metal does not support SR-IOV passthrough within KVM
macvtap passthrough               | AWS ENA driver does not support macvtap/macvlan mode
DPDK (Data Plane Development Kit) | Requires dedicated application-layer support; incompatible with iperf3
AF_XDP                            | Requires application-layer adaptation
OVS-DPDK bridge                   | Architecturally heavy and requires application-layer cooperation

Analysis of Why Multi-Stream Performance Does Not Reach 100%

After optimization, multi-stream performance reaches 58% of bare metal (13.4 vs 23.1 Gbits/sec). The reasons it cannot reach 100%:

  1. NAT overhead: All Guest traffic leaving virbr0 is masqueraded by the host's iptables NAT rules, adding connection-tracking and address-rewrite work per packet
  2. Virtual switch overhead: virbr0 is a Linux bridge; each packet requires a forwarding-table lookup and a software forwarding step before it reaches the host's routing stack
  3. Interrupt coalescing efficiency: virtio-net interrupt coalescing is less efficient than hardware interrupt coalescing on a physical ENA NIC
  4. vhost thread scheduling: vhost worker threads still require management by the kernel scheduler, introducing context switch overhead
  5. QEMU control plane: Although the data plane is handled by vhost, QEMU still participates in some control plane operations

To further close this gap, technologies such as SR-IOV passthrough or DPDK that bypass the kernel network stack would be needed, but these are not available in the current AWS environment.


Complete Optimization Checklist (One-Shot Deployment)

Host Configuration Script

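#!/bin/bash
# host-optimize.sh - KVM host network performance optimization
# Run as root. The sysctl, hugepage, and MTU settings applied here are
# runtime-only; use the persistent configuration files shown earlier to
# keep them across reboots.
set -euo pipefail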

# 1. Load kernel modules
modprobe vhost_net
echo "vhost_net" > /etc/modules-load.d/vhost-net.conf

# 2. Disable bridge netfilter
modprobe br_netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0
sysctl -w net.bridge.bridge-nf-call-arptables=0

# 3. Configure hugepages (adjust based on VM memory; pool must be >= VM RAM)
echo 8192 > /proc/sys/vm/nr_hugepages

# 4. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

# 5. virbr0 MTU
ip link set virbr0 mtu 9000

echo "Host optimization complete."

Guest Configuration Script

#!/bin/bash
# guest-optimize.sh - KVM Guest network performance optimization
# Run as root inside the Guest. All settings here are runtime-only.
set -euo pipefail

# 1. Enable multi-queue (queue count must match VM XML definition)
ethtool -L eth0 combined 8

# 2. MTU (must match host virbr0)
ip link set eth0 mtu 9000

# 3. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

echo "Guest optimization complete."

Conclusion

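Through 10 optimizations (applied as 13 configuration steps across the host, the VM XML, and the Guest), without using SR-IOV/macvtap hardware passthrough:

  • Single-stream network performance: improved from 96.4% to 100% of bare metal, so no virtualization overhead is measurable in this test
  • Multi-stream network performance: improved from 40.0% to 58.0% of bare metal (9.24 → 13.4 Gbits/sec), a 45% relative gain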
  • Disk I/O performance: unchanged, no regression

The three most impactful optimizations are: vhost-net kernel module, multi-queue virtio-net, and CPU pinning. These three combined address the core bottlenecks of KVM network virtualization — userspace/kernelspace context switching, single-queue bottlenecks, and CPU cache invalidations.

The fundamental reason multi-stream performance cannot reach 100% is that, under the virtio-net + NAT bridge architecture, each packet must still traverse the complete software path: Guest virtio driver → vhost kernel thread → virbr0 Linux bridge → iptables NAT → physical ENA NIC. That is at least 3 additional layers of software processing compared to a bare-metal machine sending directly through the ENA NIC. These costs are paid per packet, so they weigh more heavily as the aggregate packet rate rises, which is why the gap appears in the multi-stream test and not in the single-stream one.

Closing the gap would require taking the host bridge and NAT out of the data path, for example with SR-IOV passthrough, macvtap, or a kernel-bypass stack such as DPDK, and none of these are available in the current AWS environment. Within these tests, 58% is therefore the best result obtained for the virtio-net + NAT architecture; it should be read as a practical ceiling for this setup, not a proven theoretical limit.