KVM Network Performance Tuning
Overview
This document records a complete solution for achieving near-bare-metal network performance on KVM virtual machines running on AWS c5n.metal instances, using a combination of host-level, VM XML, and Guest OS tuning techniques.
Key Constraint: AWS environments do not support SR-IOV / macvtap NIC passthrough. All optimizations are based on virtio-net + NAT bridge mode.
Tuning Results Summary
iperf3 Network Performance
| Test Scenario | Single-Stream Bandwidth | Multi-Stream P=16 Bandwidth |
|---|---|---|
| ec2A bare metal → ec2B | 4.94 Gbits/sec | 23.1 Gbits/sec |
| KVM Guest baseline → ec2B | 4.76 Gbits/sec (96.4%) | 9.24 Gbits/sec (40.0%) |
| KVM Guest optimized → ec2B | 4.94 Gbits/sec (100%) | 13.4 Gbits/sec (58.0%) |
- Single-stream performance: improved from 96.4% to 100%, completely eliminating virtualization overhead
- Multi-stream performance: improved from 40.0% to 58.0%, a 45% improvement
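For reference, a minimal sketch of iperf3 invocations that would produce measurements like those above; the server address and test duration are placeholders, not values taken from the original runs.

```bash
# On ec2B: run the receiver
iperf3 -s

# On the sender (bare metal or KVM Guest); 10.0.0.2 is a placeholder for ec2B
iperf3 -c 10.0.0.2 -t 30           # single-stream bandwidth
iperf3 -c 10.0.0.2 -t 30 -P 16     # 16 parallel streams
```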
fio Disk Performance
fio performance shows no regression before and after optimization (4k randread/randwrite both within margin of error), confirming that the tuning only affects the network path and does not impact disk I/O.
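A sketch of 4k randread/randwrite jobs of the kind compared above; the target device, queue depth, and runtime are assumptions, not the original job files.

```bash
# 4k random read against a scratch device (placeholder /dev/vdb); direct=1
# bypasses the Guest page cache so the virtio/host path is what gets measured
fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# 4k random write -- destructive, only run against a disposable device
fio --name=randwrite --filename=/dev/vdb --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
```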
Detailed Tuning Solutions
Part 1: Host Kernel Modules
1. Load vhost-net Module
```bash
sudo modprobe vhost_net
echo "vhost_net" | sudo tee /etc/modules-load.d/vhost-net.conf
```

Rationale: vhost-net moves virtio-net packet processing out of the QEMU userspace process and into kernel-space vhost worker threads, eliminating the context-switch overhead between user space and kernel space. This is the most fundamental and impactful step in KVM network performance optimization.
Effect: Enabling this alone can improve network throughput by 20–30%.
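A quick way to confirm vhost-net is actually in use; the commands below are standard, though the exact thread naming can vary by kernel version:

```bash
# Module loaded?
lsmod | grep vhost_net

# With a VM running, kernel vhost worker threads appear as [vhost-<qemu-pid>]
ps -ef | grep '\[vhost'
```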
2. Load and Disable Bridge Netfilter
```bash
sudo modprobe br_netfilter
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-iptables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-ip6tables
echo 0 | sudo tee /proc/sys/net/bridge/bridge-nf-call-arptables
```

Persistent configuration:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-bridge-nf.conf
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0
EOF
```

Rationale: By default, the Linux bridge sends all bridged packets through iptables/ip6tables/arptables for filtering, which is unnecessary overhead for KVM NAT scenarios. Disabling this allows packets to be forwarded directly, reducing CPU cycle consumption.
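To verify the setting took effect, all three sysctls should report 0:

```bash
sysctl net.bridge.bridge-nf-call-iptables \
       net.bridge.bridge-nf-call-ip6tables \
       net.bridge.bridge-nf-call-arptables
```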
Part 2: Host Memory Optimization
3. Hugepages
```bash
# Allocate 8192 × 2 MB = 16 GB of hugepages (must exceed the VM's 8 GB of memory)
echo 8192 | sudo tee /proc/sys/vm/nr_hugepages
```

Persistent configuration:

```bash
echo "vm.nr_hugepages = 8192" | sudo tee /etc/sysctl.d/99-hugepages.conf
```

Rationale: Using 2 MB hugepages instead of the default 4 KB pages means each TLB (Translation Lookaside Buffer) entry covers 512 times as much address space, sharply reducing TLB miss rates. For memory-intensive network packet processing, reducing page-table lookup overhead can significantly improve performance.
VM XML Configuration:

```xml
<memoryBacking>
  <hugepages/>
</memoryBacking>
```

Part 3: VM XML Configuration Optimization
4. CPU Pinning (vcpupin)
```xml
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <emulatorpin cpuset='8-9'/>
</cputune>
```

Rationale:
- vcpupin: Pins each virtual CPU to a specific physical core, preventing vCPU migration between physical cores, which would invalidate L1/L2 caches
- emulatorpin: Pins QEMU emulator threads to dedicated physical cores (8–9), preventing them from competing with the vCPUs for CPU time
- Has the greatest impact on multi-stream network tests, as multiple network processing threads can execute in parallel without interfering with each other (a verification sketch follows this list)
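To confirm the pinning took effect on a running domain (the domain name myguest is a placeholder):

```bash
# Show vCPU-to-physical-core pinning and emulator thread pinning
virsh vcpupin myguest
virsh emulatorpin myguest
```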
5. CPU host-passthrough Mode
```xml
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' dies='1' cores='8' threads='1'/>
</cpu>
```

Rationale:
- host-passthrough: Exposes all host CPU features (SSE4.2, AVX2, AES-NI, etc.) directly to the Guest, avoiding CPU instruction emulation overhead (a quick check follows this list)
- An explicit CPU topology (1 socket × 8 cores) lets the Guest kernel correctly identify the core layout, optimizing scheduling and interrupt distribution
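Inside the Guest, the host CPU model and feature flags should now be visible; a minimal check:

```bash
lscpu | grep 'Model name'
# Host instruction sets such as AVX2 and AES (AES-NI) should be listed
grep -o -w -E 'avx2|aes' /proc/cpuinfo | sort -u
```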
6. Q35 Machine Type
```xml
<os>
  <type arch='x86_64' machine='pc-q35-rhel9.4.0'>hvm</type>
</os>
```

Rationale: Q35 is a modern PCIe-based virtual chipset. Compared to the legacy i440FX:
- Supports native PCIe bus; virtio devices are presented as PCIe devices, reducing bus emulation overhead
- More efficient interrupt routing (MSI-X)
- Better IOMMU support
7. virtio-net Multi-Queue + vhost Driver
```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='8'/>
</interface>
```

Rationale:
- name='vhost': Specifies at the XML level that the vhost-net kernel driver handles network I/O
- queues='8': Enables multi-queue virtio-net, allowing multiple CPU cores to simultaneously process network send/receive queues
- Multi-queue is key to improving multi-stream network performance; with a single queue, all network interrupts can only be handled by one CPU (an interrupt-vector check follows this list)
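Once the queues are activated inside the Guest (see Part 5), each queue pair shows up as its own interrupt vector; a quick check, assuming the NIC is virtio0:

```bash
# Expect one input and one output vector per queue, e.g. virtio0-input.0 ... .7
grep virtio /proc/interrupts
```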
8. Disk I/O Optimization
```xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/nvme1n1'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Rationale:
- cache='none': Bypasses the host page cache; Guest I/O goes directly to the block device, avoiding double caching
- io='native': Uses Linux native asynchronous I/O (AIO), which is more efficient than QEMU's thread-pool mode
- type='raw': Raw device passthrough with no qcow2 metadata overhead (an XML check follows this list)
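To confirm libvirt accepted the driver attributes, the live domain XML can be inspected (myguest is a placeholder):

```bash
virsh dumpxml myguest | grep -A4 '<disk '
```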
Part 4: Host Network Parameter Tuning
9. Kernel Network Buffers
```bash
# Increase TCP/socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216     # 16 MB
sudo sysctl -w net.core.wmem_max=16777216     # 16 MB
sudo sysctl -w net.core.rmem_default=1048576  # 1 MB
sudo sysctl -w net.core.wmem_default=1048576  # 1 MB

# TCP auto-tuning range
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"

# Network queue depth
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535
```

Persistent configuration:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-net-perf.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
EOF
sudo sysctl -p /etc/sysctl.d/99-net-perf.conf
```

Rationale: Default kernel network buffers (typically 128 KB–256 KB) become a bottleneck in high-bandwidth scenarios. Increasing the buffer sizes allows the TCP window to auto-scale to larger values, fully utilizing high-bandwidth links.
10. virbr0 MTU Adjustment
```bash
sudo ip link set virbr0 mtu 9000
```

Rationale: Increasing the MTU to 9000 (jumbo frames) reduces the number of packets that must be processed per GB of data by a factor of roughly 6 (1500 → 9000), directly lowering CPU interrupt handling and protocol-stack overhead.
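A path-MTU sanity check from host to Guest, once both sides are raised to 9000 (the Guest side is item 12 below); the address is a placeholder on libvirt's default NAT subnet. 8972 is the largest ICMP payload that fits in a 9000-byte MTU (9000 minus 20 bytes of IP header and 8 bytes of ICMP header), and -M do forbids fragmentation, so the ping fails if jumbo frames are not working end to end:

```bash
GUEST_IP=192.168.122.100   # placeholder Guest address
ping -c 3 -M do -s 8972 "$GUEST_IP"
```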
Part 5: Guest OS Internal Tuning
11. Enable Multi-Queue
```bash
sudo ethtool -L eth0 combined 8
```

Rationale: Although 8 queues are defined in the VM XML, they must be activated inside the Guest using ethtool to take effect. Once activated, network interrupts are distributed across multiple CPU cores, enabling true multi-core parallel network processing.
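To verify, the lowercase -l form prints the channel counts; "Combined: 8" should appear under the current hardware settings:

```bash
ethtool -l eth0
```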
12. Guest MTU Setting
```bash
sudo ip link set eth0 mtu 9000
```

Rationale: Must match the MTU of the host's virbr0; otherwise packets larger than 1500 bytes will be fragmented, which would actually decrease performance.
13. Guest Kernel Network Parameters
```bash
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=1048576
sudo sysctl -w net.core.wmem_default=1048576
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
```

Rationale: The Guest kernel has its own independent network parameter space, which must be tuned in sync with the host; otherwise, small buffers on the Guest side will become the bottleneck. (A persistence sketch follows.)
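These sysctl -w values do not survive a Guest reboot; a sketch of persisting them inside the Guest with the same sysctl.d pattern used on the host:

```bash
cat << EOF | sudo tee /etc/sysctl.d/99-net-perf.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
EOF
sudo sysctl -p /etc/sysctl.d/99-net-perf.conf
```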
Optimization Impact Breakdown
| Optimization | Primary Impact | Significance |
|---|---|---|
| vhost-net | Single/multi-stream throughput | ★★★★★ |
| Multi-queue virtio-net | Multi-stream throughput | ★★★★★ |
| CPU pinning | Multi-stream throughput, latency stability | ★★★★★ |
| Hugepages | Overall performance, latency | ★★★☆☆ |
| host-passthrough | CPU-intensive processing | ★★★☆☆ |
| Bridge NF disabled | Throughput | ★★☆☆☆ |
| Kernel network buffers | High-bandwidth scenarios | ★★☆☆☆ |
| MTU 9000 | Large-block transfer throughput | ★★☆☆☆ |
| Q35 machine type | Interrupt efficiency | ★☆☆☆☆ |
| Disk native I/O | Disk performance | ★☆☆☆☆ |
Non-Applied Optimizations and Reasons
| Item | Reason Not Used |
|---|---|
| SR-IOV NIC passthrough | The ENA NIC on AWS c5n.metal does not support SR-IOV passthrough within KVM |
| macvtap passthrough | AWS ENA driver does not support macvtap/macvlan mode |
| DPDK (Data Plane Development Kit) | Requires dedicated application-layer support; incompatible with iperf3 |
| AF_XDP | Requires application-layer adaptation |
| OVS-DPDK bridge | Architecturally heavy and requires application-layer cooperation |
Analysis of Why Multi-Stream Performance Does Not Reach 100%
After optimization, multi-stream performance reaches 58% of bare metal (13.4 vs 23.1 Gbits/sec). The reasons it cannot reach 100%:
- NAT overhead: All Guest traffic must pass through virbr0 NAT translation, adding CPU instructions per packet
- Virtual switch overhead: virbr0 is a Linux bridge; each packet requires table lookups, forwarding, and address translation
- Interrupt coalescing efficiency: virtio-net interrupt coalescing is less efficient than hardware interrupt coalescing on a physical ENA NIC
- vhost thread scheduling: vhost worker threads still require management by the kernel scheduler, introducing context switch overhead
- QEMU control plane: Although the data plane is handled by vhost, QEMU still participates in some control plane operations
To further close this gap, technologies such as SR-IOV passthrough or DPDK that bypass the kernel network stack would be needed, but these are not available in the current AWS environment.
Complete Optimization Checklist (One-Shot Deployment)
Host Configuration Script
```bash
#!/bin/bash
# host-optimize.sh - KVM host network performance optimization
# 1. Load kernel modules
modprobe vhost_net
echo "vhost_net" > /etc/modules-load.d/vhost-net.conf
# 2. Disable bridge netfilter
modprobe br_netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0
sysctl -w net.bridge.bridge-nf-call-arptables=0
# 3. Configure hugepages (adjust based on VM memory; must be > VM RAM)
echo 8192 > /proc/sys/vm/nr_hugepages
# 4. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
# 5. virbr0 MTU
ip link set virbr0 mtu 9000
echo "Host optimization complete."Guest Configuration Script
#!/bin/bash
# guest-optimize.sh - KVM Guest network performance optimization
# 1. Enable multi-queue (queue count must match VM XML definition)
ethtool -L eth0 combined 8
# 2. MTU (must match host virbr0)
ip link set eth0 mtu 9000
# 3. Network buffer tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
echo "Guest optimization complete."Conclusion
Through 13 tuning measures, without using SR-IOV/macvtap hardware passthrough:
- Single-stream network performance: improved from 96.4% to 100%, completely eliminating virtualization overhead
- Multi-stream network performance: improved from 40.0% to 58.0%, a 45% improvement
- Disk I/O performance: unchanged, no regression
The three most impactful optimizations are the vhost-net kernel module, multi-queue virtio-net, and CPU pinning. Together they address the core bottlenecks of KVM network virtualization: user/kernel context switching, single-queue serialization, and CPU cache invalidation.
The fundamental reason multi-stream performance cannot reach 100% is that, under the virtio-net + NAT bridge architecture, each packet must still traverse the complete software path of Guest virtio driver → vhost kernel thread → virbr0 Linux bridge → iptables NAT translation → physical ENA NIC. Compared to a bare-metal host sending directly through the ENA NIC, this adds at least three layers of software processing, and these overheads scale linearly with the number of concurrent streams. Completely eliminating the gap would require SR-IOV or macvtap passthrough of the physical NIC to the Guest, bypassing the host's entire network stack, but AWS ENA NICs do not support these passthrough modes. Therefore, 58% is already close to the theoretical ceiling for the virtio-net NAT architecture in the current environment.