Mellanox CX6 vDPA hardware offload, the ovs-kernel way

This article walks through implementing vDPA hardware offload with a Mellanox ConnectX-6 Dx NIC.

Introduction to vDPA hardware offload

Since we are talking about vDPA offload, let's first briefly introduce what it is.

vDPA (virtio data path acceleration) is a kernel framework formally merged into the kernel in 2020. NIC vendors build vDPA-capable cards: the datapath follows the virtio specification, while the control plane is provided by the vendor's driver.
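Because the datapath is plain virtio, the guest side needs nothing vendor-specific: it simply sees a standard virtio-net PCI device handled by the stock virtio_net driver. A minimal sketch to confirm this from inside a guest (the interface name enp0s6 is the one our test VM gets later in this article):

# inside the guest: the vDPA-backed NIC should appear as an ordinary virtio-net device
lspci -nn | grep -i virtio
# expect an "Ethernet controller ... Virtio network device" line

ethtool -i enp0s6 | head -n 3
# expect: driver: virtio_net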

Below is the architecture when vDPA is deployed on a virtual machine platform:

And below is the architecture when vDPA is deployed on a Kubernetes platform:

The architecture diagrams above are borrowed from a Red Hat article introducing the background of vDPA. Our experiment follows Mellanox's documentation, and from Mellanox's point of view there are two ways to do vDPA:

  1. Configure ovs-dpdk, add a vdpa port in OVS, and create the socket there. The VM attaches the vdpa device through this socket.
  2. Configure ovs-kernel, run the vdpa-dpdk application to create the socket. The VM attaches the vdpa device through this socket.

For the first method, Mellanox's official documentation says ovs-dpdk is only supported up to RHEL/CentOS 7. Our environment is RHEL/Rocky 8.4, so we use the second method.

The background here is deliberately brief; the reference link below is a good starting point for a deeper dive.

One DPDK-specific concept deserves a mention: the VF representor. The DPDK documentation describes it in detail; a simple way to think of it is as the VF's stand-in, created for the control plane:

  • https://doc.dpdk.org/guides-18.11/prog_guide/switch_representation.html
         .-------------.                 .-------------. .-------------.
         | hypervisor  |                 |    VM 1     | |    VM 2     |
         | application |                 | application | | application |
         `--+---+---+--'                 `----------+--' `--+----------'
            |   |   |                               |       |
            |   |   `-------------------.           |       |
            |   `---------.             |           |       |
            |             |             |           |       |
      .-----+-----. .-----+-----. .-----+-----.     |       |
      | port_id 3 | | port_id 4 | | port_id 5 |     |       |
      `-----+-----' `-----+-----' `-----+-----'     |       |
            |             |             |           |       |
          .-+--.    .-----+-----. .-----+-----. .---+--. .--+---.
          | PF |    | VF 1 rep. | | VF 2 rep. | | VF 1 | | VF 2 |
          `-+--'    `-----+-----' `-----+-----' `---+--' `--+---'
            |             |             |           |       |
            |             |   .---------'           |       |
            `-----.       |   |   .-----------------'       |
                  |       |   |   |   .---------------------'
                  |       |   |   |   |
               .--+-------+---+---+---+--.
               | managed interconnection |
               `------------+------------'
                            |
                       .----+-----.
                       | physical |
                       |  port 0  |
                       `----------'
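On the host, each VF representor is just another netdev, and its phys_port_name encodes which VF it fronts. Once the card is in switchdev mode (we configure that later in this article), a quick sketch to see the mapping (interface names are from the 105 host used below):

# list eswitch ports; VF representors show up with flavour pcivf and their VF index
devlink port show

# the representor's phys_port_name encodes pf/vf, e.g. pf0vf0
cat /sys/class/net/enp67s0f0_0/phys_port_name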

The architecture of this experiment is shown below:

System installation

export VAR_HOST='rl_panlab105'

# After installing the OS, add kernel parameters (mainly intel_iommu=on iommu=pt), then reboot.
cp /etc/default/grub /etc/default/grub.bak
sed -i "/GRUB_CMDLINE_LINUX/s/resume=[^[:space:]]*//" /etc/default/grub
sed -i "/GRUB_CMDLINE_LINUX/s/rd.lvm.lv=${VAR_HOST}\\/swap//" /etc/default/grub
# https://unix.stackexchange.com/questions/403706/sed-insert-text-after-nth-character-preceding-following-a-given-string
sed -i '/GRUB_CMDLINE_LINUX/s/"/ intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=16 rdblacklist=nouveau"/2' /etc/default/grub
grub2-mkconfig -o /boot/efi/EFI/rocky/grub.cfg
grub2-mkconfig -o /boot/grub2/grub.cfg

# Add support for kvm cpu host mode (nested virtualization); this step is optional.
cat << EOF > /etc/modprobe.d/kvm-nested.conf
options kvm_intel nested=1
options kvm-intel enable_shadow_vmcs=1
options kvm-intel enable_apicv=1
options kvm-intel ept=1
EOF

# The default OS install has swap and home volumes; this is a test box, so remove them all.
umount /home
swapoff /dev/$VAR_HOST/swap
cp /etc/fstab /etc/fstab.bak
sed -i 's/^[^#]*home/#&/' /etc/fstab
sed -i 's/^[^#]*swap/#&/' /etc/fstab
lvremove -f /dev/$VAR_HOST/home
lvremove -f /dev/$VAR_HOST/swap
lvextend -l +100%FREE /dev/$VAR_HOST/root
xfs_growfs /dev/$VAR_HOST/root

# Now install the NIC driver.
# 103 driver install
# https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
mkdir -p /data/down/
cd /data/down/
dnf groupinstall -y 'Development Tools'
dnf groupinstall -y "Server with GUI"
wget https://www.mellanox.com/downloads/ofed/MLNX_OFED-5.4-3.0.3.0/MLNX_OFED_LINUX-5.4-3.0.3.0-rhel8.4-x86_64.tgz
tar zvxf *.tgz
cd /data/down/MLNX_OFED_LINUX-5.4-3.0.3.0-rhel8.4-x86_64
dnf install -y tcl tk kernel-modules-extra python36 make gcc-gfortran tcsh unbound
./mlnxofedinstall --all --force --distro rhel8.4
# ./mlnxofedinstall --dpdk --ovs-dpdk --upstream-libs --add-kernel-support --force --distro rhel8.4
reboot

systemctl enable --now mst
systemctl enable --now openibd

cat << EOF > /etc/yum.repos.d/mlx.repo
[mlnx_ofed]
name=MLNX_OFED Repository
baseurl=file:///data/down/MLNX_OFED_LINUX-5.4-3.0.3.0-rhel8.4-x86_64/RPMS
enabled=1
gpgcheck=0
EOF
dnf makecache

# Install the dpdk-related software.
mkdir -p /data/soft
cd /data/soft
dnf config-manager --set-enabled powertools
dnf install -y ninja-build meson

# Install the Mellanox builds of the dpdk components and ovs.
# dnf group list
# dnf groupinstall -y 'Development Tools'
# install dpdk
dnf install -y mlnx-dpdk mlnx-dpdk-devel numactl-devel openvswitch openvswitch-selinux-policy libnl3-devel openssl-devel zlib-devel libpcap-devel elfutils-libelf-devel
# https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html#compilation-of-the-dpdk
pip3 install --user pyelftools
systemctl enable --now openvswitch
export PATH=$PATH:/opt/mellanox/dpdk/bin/
echo 'export PATH=$PATH:/opt/mellanox/dpdk/bin/' >> ~/.bash_profile

# Build the upstream dpdk tree, because we need the vdpa sample application in it.
cd /data/soft/
wget https://fast.dpdk.org/rel/dpdk-20.11.3.tar.xz
tar vxf dpdk-20.11.3.tar.xz
# https://core.dpdk.org/doc/quick-start/
cd /data/soft/dpdk-stable-20.11.3/
# meson -Dexamples=all build
meson --reconfigure -Dexamples=all build
ninja -C build
export PKG_CONFIG_PATH=/opt/mellanox/dpdk/lib64/pkgconfig/
cd /data/soft/dpdk-stable-20.11.3/examples/vdpa
make -j

# Install the kvm-related packages.
# install kvm with qemu
# dnf -y groupinstall "Server with GUI"
dnf -y install qemu-kvm libvirt libguestfs-tools virt-install virt-viewer virt-manager tigervnc-server
systemctl disable --now firewalld
systemctl enable --now libvirtd

# Finally, set the mlx NIC parameters to activate SR-IOV.
# Get the PCI addresses with: lspci -D | grep -i mell  or  lshw -c network -businfo
lspci -D | grep -i mell
# 0000:04:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
# 0000:04:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

lshw -c network -businfo
# Bus info          Device     Class       Description
# =======================================================
# pci@0000:02:00.0  eno3       network     NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
# pci@0000:02:00.1  eno4       network     NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
# pci@0000:01:00.0  eno1       network     NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
# pci@0000:01:00.1  eno2       network     NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
# pci@0000:04:00.0  enp4s0f0   network     MT2892 Family [ConnectX-6 Dx]
# pci@0000:04:00.1  enp4s0f1   network     MT2892 Family [ConnectX-6 Dx]

# UCTX_EN is for enable DevX
# DevX allows to access firmware objects
mlxconfig -y -d 0000:04:00.0 set SRIOV_EN=1 UCTX_EN=1 NUM_OF_VFS=8
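The mlxconfig settings only take effect after a firmware reset or a reboot. Before proceeding, a quick sanity check, sketched with standard tooling (the grep patterns are my own):

# confirm the firmware configuration we just set
mlxconfig -d 0000:04:00.0 q | grep -E 'SRIOV_EN|UCTX_EN|NUM_OF_VFS'

# confirm the kernel actually booted with iommu enabled and the 1G hugepages allocated
grep -o 'intel_iommu=[^ ]*' /proc/cmdline
grep Huge /proc/meminfo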

The ovs-kernel approach

NIC setup script

# The ovs that ships with mlnx is missing some selinux rules; add them here.
# In a real project, supplement the missing selinux rules as needed.
semodule -i wzh-mellanox-ovs-dpdk.pp

# A script to configure and start ovs: first wipe the ovs config, then set the NIC mode, then bring ovs up.
cat << 'EOF' > /data/ovs-offload-env.sh
#!/usr/bin/env bash
set -e
set -x

systemctl restart openvswitch

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=try
systemctl restart openvswitch

ip link set dev ${IFNAME} down || true
ip link set dev ${IFNAME}_0 down || true
ip link set dev ${IFNAME}_1 down || true
ip link set dev ${IFNAME}v0 down || true
ip link set dev ${IFNAME}v1 down || true

ovs-vsctl del-port ovs-sriov ${IFNAME} || true
ovs-vsctl del-port ovs-sriov ${IFNAME}_0 || true
ovs-vsctl del-port ovs-sriov ${IFNAME}_1 || true
ovs-vsctl del-br ovs-sriov || true
ovs-vsctl del-port br0-ovs pf0vf0 || true
ovs-vsctl del-port br0-ovs pf0vf1 || true
ovs-vsctl del-port br0-ovs pf0 || true
ovs-vsctl del-br br0-ovs || true

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=false
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra=" "
ovs-vsctl --no-wait set Open_vSwitch . other_config={}

# Turn off SR-IOV on the PF device.
echo 0 > /sys/class/net/$IFNAME/device/sriov_numvfs
cat /sys/class/net/$IFNAME/device/sriov_numvfs
# 0

systemctl restart openvswitch

# Turn ON SR-IOV on the PF device.
echo 2 > /sys/class/net/$IFNAME/device/sriov_numvfs
cat /sys/class/net/$IFNAME/device/sriov_numvfs
# 2

ip link set $IFNAME vf 0 mac ${VF1MAC}
ip link set $IFNAME vf 1 mac ${VF2MAC}

echo ${PCINUM%%.*}.2 > /sys/bus/pci/drivers/mlx5_core/unbind || true
echo ${PCINUM%%.*}.3 > /sys/bus/pci/drivers/mlx5_core/unbind || true

devlink dev eswitch set pci/$PCINUM mode switchdev
devlink dev eswitch show pci/$PCINUM
# pci/0000:43:00.0: mode switchdev inline-mode none encap-mode basic

echo ${PCINUM%%.*}.2 > /sys/bus/pci/drivers/mlx5_core/bind
echo ${PCINUM%%.*}.3 > /sys/bus/pci/drivers/mlx5_core/bind

# systemctl enable --now openvswitch
# systemctl restart openvswitch

# Create an OVS bridge (here it's named ovs-sriov).
ovs-vsctl add-br ovs-sriov
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch

ovs-vsctl add-port ovs-sriov ${IFNAME}
ovs-vsctl add-port ovs-sriov ${IFNAME}_0
ovs-vsctl add-port ovs-sriov ${IFNAME}_1

ip link set dev ${IFNAME} up
ip link set dev ${IFNAME}_0 up
ip link set dev ${IFNAME}_1 up
ip link set dev ${IFNAME}v0 up
ip link set dev ${IFNAME}v1 up

# systemctl restart openvswitch
# ip addr add ${VF1IP} dev ${IFNAME}v0
# ip addr add ${VF2IP} dev ${IFNAME}v1
EOF

# for 103
# export IFNAME=enp4s0f0
# export PCINUM=0000:04:00.0
# export VF1MAC=e4:11:22:33:44:50
# export VF2MAC=e4:11:22:33:44:51
# export VF1IP=192.168.55.21/24
# export VF2IP=192.168.55.22/24
# bash /data/ovs-offload-env.sh

# Set the environment variables, then run the script to bring ovs up.
# for 105
export IFNAME=enp67s0f0
export PCINUM=0000:43:00.0
export VF1MAC=e4:11:22:33:55:60
export VF2MAC=e4:11:22:33:55:61
# export VF1IP=192.168.55.31/24
# export VF2IP=192.168.55.32/24
bash /data/ovs-offload-env.sh

# We also need to start a DPDK application that provides the vdpa function, attached to the VF.
/data/soft/dpdk-stable-20.11.3/examples/vdpa/build/vdpa -w ${PCINUM%%.*}.2,class=vdpa --log-level=pmd,info -- -i
# EAL: Detected 24 lcore(s)
# EAL: Detected 2 NUMA nodes
# Option -w, --pci-whitelist is deprecated, use -a, --allow option instead
# EAL: Detected shared linkage of DPDK
# EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
# EAL: Selected IOVA mode 'VA'
# EAL: No available hugepages reported in hugepages-2048kB
# EAL: Probing VFIO support...
# EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:43:00.2 (socket 1)
# mlx5_vdpa: ROCE is disabled by Netlink successfully.
# EAL: No legacy callbacks, legacy socket not created
# Interactive-mode selected

# Then, at the interactive prompt, create the vhost-user socket on the VF:
# vdpa> create /tmp/sock-virtio0 0000:43:00.2
# VHOST_CONFIG: vhost-user server: socket created, fd: 112
# VHOST_CONFIG: bind to /tmp/sock-virtio0

# vdpa> list
# device name     queue num       supported features
# 0000:43:00.2    256             0x114c60180b

# vdpa> stats 0000:43:00.2 0
# Device 0000:43:00.2:
#   Virtq 0:
#     received_descriptors     1024
#     completed_descriptors    39
#     bad descriptor errors    0
#     exceed max chain         0
#     invalid buffer           0
#     completion errors        0
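Before wiring a VM to the socket, it is worth sanity-checking the offload plumbing. A short sketch, assuming the same IFNAME and PCINUM variables as above:

# the eswitch should be in switchdev mode
devlink dev eswitch show pci/$PCINUM

# ovs should have hardware offload turned on
ovs-vsctl get Open_vSwitch . other_config:hw-offload

# the PF and the representors should report tc hardware offload enabled
ethtool -k $IFNAME | grep hw-tc-offload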

kvm

Next, we create a kvm guest that uses our vDPA channel.

Since we created a socket and qemu needs permission to read it, we change the qemu user to root.

sed -i.bak 's/#user = "root"/user = "root"/' /etc/libvirt/qemu.conf

# We also need a bridge attached to the host's uplink so the kvm guest can reach the network,
# which makes access and management easier.
mkdir -p /data/kvm
cat << 'EOF' > /data/kvm/bridge.sh
#!/usr/bin/env bash
PUB_CONN='eno1'
PUB_IP='172.21.6.103/24'
PUB_GW='172.21.6.254'
PUB_DNS='172.21.1.1'
nmcli con down "$PUB_CONN"
nmcli con delete "$PUB_CONN"
nmcli con down baremetal
nmcli con delete baremetal
# RHEL 8.1 appends the word "System" in front of the connection, delete in case it exists
nmcli con down "System $PUB_CONN"
nmcli con delete "System $PUB_CONN"
nmcli connection add ifname baremetal type bridge con-name baremetal ipv4.method 'manual' \
    ipv4.address "$PUB_IP" \
    ipv4.gateway "$PUB_GW" \
    ipv4.dns "$PUB_DNS"
nmcli con add type bridge-slave ifname "$PUB_CONN" master baremetal
nmcli con down "$PUB_CONN";pkill dhclient;dhclient baremetal
nmcli con up baremetal
EOF
bash /data/kvm/bridge.sh

# First, create, start, and install a kvm guest the standard way.
cd /data/kvm
export DOMAIN=cx6.1
virt-install --name="${DOMAIN}" --vcpus=2 --ram=8192 \
  --cputune vcpupin0.vcpu=14,vcpupin1.vcpu=16 \
  --memorybacking hugepages.page0.size=1,hugepages.page0.unit=GiB \
  --cpu host-model \
  --disk path=/data/kvm/${DOMAIN}.qcow2,bus=virtio,size=30 \
  --os-variant rhel8.4 \
  --network bridge=baremetal,model=virtio \
  --graphics vnc,port=59000 \
  --boot menu=on --location /data/kvm/Rocky-8.4-x86_64-minimal.iso \
  --initrd-inject helper-ks-rocky.cfg --extra-args "inst.ks=file:/helper-ks-rocky.cfg"

# Next, configure this kvm guest and add the vdpa channel to it.
# https://unix.stackexchange.com/questions/235414/libvirt-how-to-pass-qemu-command-line-args
# virt-xml $DOMAIN --edit --confirm --qemu-commandline 'env=MY-ENV=1234'
virt-xml $DOMAIN --edit --qemu-commandline='-chardev socket,id=charnet1,path=/tmp/sock-virtio0'
virt-xml $DOMAIN --edit --qemu-commandline='-netdev vhost-user,chardev=charnet1,queues=16,id=hostnet1'
virt-xml $DOMAIN --edit --qemu-commandline='-device virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=e4:11:c6:d3:45:f2,bus=pcie.0,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'
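A quick way to confirm the permission story lines up before booting the guest (these checks are my own addition, not from the Mellanox docs):

# the vhost-user socket created by the dpdk vdpa app
ls -l /tmp/sock-virtio0

# qemu should now run as root; restart libvirtd so qemu.conf is re-read
grep '^user' /etc/libvirt/qemu.conf
systemctl restart libvirtd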

Next, manually edit the configuration as follows. Note that the pinned CPU cores should all sit on the same NUMA node.

virsh edit cx6.1
  <cputune>
    <vcpupin vcpu='0' cpuset='14'/>
    <vcpupin vcpu='1' cpuset='16'/>
  </cputune>
  <cpu mode='host-model' check='partial'>
    <numa>
      <cell id='0' cpus='0-1' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
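To pick cores that share a NUMA node with the NIC, check which node the card is attached to and which cores belong to each node. A sketch, using the 105 host's PCI address:

# NUMA node of the NIC (-1 means no NUMA affinity reported)
cat /sys/bus/pci/devices/0000:43:00.0/numa_node

# core-to-node layout
lscpu | grep NUMA
# or: numactl --hardware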

The final configuration looks like the sample below; in a project, you can use it as a reference for troubleshooting.

virsh dumpxml cx6.1
<domain type='kvm' id='11' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>cx6.1</name>
  <uuid>5cbb6f7c-7122-4fc4-9706-ff46aed3bf25</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://redhat.com/rhel/8.4"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='14'/>
    <vcpupin vcpu='1' cpuset='16'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>IvyBridge-IBRS</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='pdcm'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaveopt'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='ibrs'/>
    <feature policy='require' name='amd-stibp'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='pschange-mc-no'/>
    <numa>
      <cell id='0' cpus='0-1' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/data/kvm/cx6.1.qcow2' index='2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu'/>
      <target dev='sda' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:8d:b6:8e'/>
      <source bridge='baremetal'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/6'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/6'>
      <source path='/dev/pts/6'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-11-cx6.1/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='vnc' port='59000' autoport='no' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c46,c926</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c46,c926</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+0</label>
    <imagelabel>+0:+0</imagelabel>
  </seclabel>
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='socket,id=charnet1,path=/tmp/sock-virtio0'/>
    <qemu:arg value='-netdev'/>
    <qemu:arg value='vhost-user,chardev=charnet1,queues=16,id=hostnet1'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=e4:11:c6:d3:45:f2,bus=pcie.0,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'/>
  </qemu:commandline>
</domain>

Give it a try

Now we move on to testing and trying it out.

# in cx6.1 kvm
# nmcli dev connect enp0s6
nmcli con modify enp0s6 ipv4.method manual ipv4.addresses 192.168.99.11/24
# nmcli con modify enp0s6 ipv4.method manual ipv4.addresses 192.168.55.91/24
nmcli con up enp0s6

# on peer machine (102)
nmcli con modify enp66s0f0 ipv4.method manual ipv4.addresses 192.168.99.21/24
# nmcli con modify enp66s0f0 ipv4.method manual ipv4.addresses 192.168.55.92/24
# nmcli dev connect enp66s0f0
nmcli con up enp66s0f0

# run after the tcpdump is running
ping 192.168.99.21
# PING 192.168.99.21 (192.168.99.21) 56(84) bytes of data.
# 64 bytes from 192.168.99.21: icmp_seq=1 ttl=64 time=0.089 ms
# 64 bytes from 192.168.99.21: icmp_seq=2 ttl=64 time=0.044 ms
# 64 bytes from 192.168.99.21: icmp_seq=3 ttl=64 time=0.046 ms
# ....

# on 105
tcpdump -i enp67s0f0_0 -w dump.test
# dropped privs to tcpdump
# tcpdump: listening on enp67s0f0_0, link-type EN10MB (Ethernet), capture size 262144 bytes
# ^C2 packets captured
# 2 packets received by filter
# 0 packets dropped by kernel

tcpdump -i enp67s0f0 -w dump.test
# dropped privs to tcpdump
# tcpdump: listening on enp67s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
# ^C4 packets captured
# 4 packets received by filter
# 0 packets dropped by kernel

Opening the capture with Wireshark shows standard ICMP packets, which confirms that what we built is a data path, not some protocol encapsulation. Also note that although the ping sent many packets, the representor capture only contains the first packet of each flow: that first packet goes to the kernel for the flow-table lookup, after which the NIC offloads the flow and the remaining packets never reach the kernel, so tcpdump cannot see them.

The second capture above was taken on the PF; it caught 4 packets, each the first packet of a flow. Everything after that was offloaded.

# ovs-dpctl dump-flows
# on 105
# Look at the ovs flow table: there are two arp (0x0806) flows, forward and reverse,
# and two ip (0x0800) flows, forward and reverse.
ovs-appctl dpctl/dump-flows type=offloaded
# recirc_id(0),in_port(2),eth(src=0c:42:a1:fa:18:8e,dst=e4:11:c6:d3:45:f2),eth_type(0x0800),ipv4(frag=no), packets:149, bytes:15198, used:0.510s, actions:3
# recirc_id(0),in_port(2),eth(src=0c:42:a1:fa:18:8e,dst=e4:11:c6:d3:45:f2),eth_type(0x0806), packets:0, bytes:0, used:8.700s, actions:3
# recirc_id(0),in_port(3),eth(src=e4:11:c6:d3:45:f2,dst=0c:42:a1:fa:18:8e),eth_type(0x0800),ipv4(frag=no), packets:149, bytes:14602, used:0.510s, actions:2
# recirc_id(0),in_port(3),eth(src=e4:11:c6:d3:45:f2,dst=0c:42:a1:fa:18:8e),eth_type(0x0806), packets:0, bytes:0, used:8.701s, actions:2

# Now look at the tc configuration: ovs has pushed the flows down into tc.
# This is the VF's inbound traffic; the rule redirects it to the uplink (parent) port,
# and the rule is implemented in hardware.
tc -s filter show dev enp67s0f0_0 ingress
# filter protocol ip pref 2 flower chain 0
# filter protocol ip pref 2 flower chain 0 handle 0x1
#   dst_mac 0c:42:a1:fa:18:8e
#   src_mac e4:11:c6:d3:45:f2
#   eth_type ipv4
#   ip_flags nofrag
#   in_hw in_hw_count 1
#         action order 1: mirred (Egress Redirect to device enp67s0f0) stolen
#         index 4 ref 1 bind 1 installed 318 sec used 0 sec
#         Action statistics:
#         Sent 30380 bytes 310 pkt (dropped 0, overlimits 0 requeues 0)
#         Sent software 0 bytes 0 pkt
#         Sent hardware 30380 bytes 310 pkt
#         backlog 0b 0p requeues 0
#         cookie 8be6df4d7d4c33fce08f01a46fa10a4a
#         no_percpu
#         used_hw_stats delayed

# Now the VF's outbound traffic.
# There are two rules, one for arp and one for ip;
# both redirect the traffic to the uplink port, implemented in hardware.
tc -s filter show dev enp67s0f0_0 egress
# filter ingress protocol ip pref 2 flower chain 0
# filter ingress protocol ip pref 2 flower chain 0 handle 0x1
#   dst_mac 0c:42:a1:fa:18:8e
#   src_mac e4:11:c6:d3:45:f2
#   eth_type ipv4
#   ip_flags nofrag
#   in_hw in_hw_count 1
#         action order 1: mirred (Egress Redirect to device enp67s0f0) stolen
#         index 4 ref 1 bind 1 installed 379 sec used 0 sec
#         Action statistics:
#         Sent 36260 bytes 370 pkt (dropped 0, overlimits 0 requeues 0)
#         Sent software 0 bytes 0 pkt
#         Sent hardware 36260 bytes 370 pkt
#         backlog 0b 0p requeues 0
#         cookie 8be6df4d7d4c33fce08f01a46fa10a4a
#         no_percpu
#         used_hw_stats delayed
# filter ingress protocol arp pref 4 flower chain 0
# filter ingress protocol arp pref 4 flower chain 0 handle 0x1
#   dst_mac 0c:42:a1:fa:18:8e
#   src_mac e4:11:c6:d3:45:f2
#   eth_type arp
#   in_hw in_hw_count 1
#         action order 1: mirred (Egress Redirect to device enp67s0f0) stolen
#         index 3 ref 1 bind 1 installed 13 sec used 6 sec
#         Action statistics:
#         Sent 60 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
#         Sent software 0 bytes 0 pkt
#         Sent hardware 60 bytes 1 pkt
#         backlog 0b 0p requeues 0
#         cookie 1fbfd56eae42f9dbe71bf99bd800cd6d
#         no_percpu
#         used_hw_stats delayed

tc qdisc show dev enp67s0f0_0
# qdisc mq 0: root
# qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
# qdisc ingress ffff: parent ffff:fff1 ----------------

# Finally, record the system state, for later reference and for comparison in real projects.
# on 105
ip link
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master baremetal state UP mode DEFAULT group default qlen 1000
#     link/ether 90:b1:1c:40:59:27 brd ff:ff:ff:ff:ff:ff
# 3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
#     link/ether 90:b1:1c:40:59:28 brd ff:ff:ff:ff:ff:ff
# 4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
#     link/ether 90:b1:1c:40:59:29 brd ff:ff:ff:ff:ff:ff
# 5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
#     link/ether 90:b1:1c:40:59:2a brd ff:ff:ff:ff:ff:ff
# 6: enp67s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a2 brd ff:ff:ff:ff:ff:ff
#     vf 0     link/ether e4:11:22:33:55:60 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
#     vf 1     link/ether e4:11:22:33:55:61 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
# 7: enp67s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a3 brd ff:ff:ff:ff:ff:ff
# 8: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN mode DEFAULT group default qlen 256
#     link/infiniband 00:00:10:28:fe:80:00:00:00:00:00:00:98:03:9b:03:00:cc:71:2c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
# 9: baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
#     link/ether 90:b1:1c:40:59:27 brd ff:ff:ff:ff:ff:ff
# 10: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
#     link/ether 52:54:00:8f:4a:bc brd ff:ff:ff:ff:ff:ff
# 11: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master virbr0 state DOWN mode DEFAULT group default qlen 1000
#     link/ether 52:54:00:8f:4a:bc brd ff:ff:ff:ff:ff:ff
# 16: enp67s0f0_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
#     link/ether fa:cf:0f:6a:ec:45 brd ff:ff:ff:ff:ff:ff
# 17: enp67s0f0_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
#     link/ether 76:65:93:70:96:ac brd ff:ff:ff:ff:ff:ff
# 18: enp67s0f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
#     link/ether e4:11:22:33:55:60 brd ff:ff:ff:ff:ff:ff
# 19: enp67s0f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
#     link/ether e4:11:22:33:55:61 brd ff:ff:ff:ff:ff:ff
# 20: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
#     link/ether f6:e9:fd:16:8a:ea brd ff:ff:ff:ff:ff:ff
# 21: ovs-sriov: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a2 brd ff:ff:ff:ff:ff:ff
# 22: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master baremetal state UNKNOWN mode DEFAULT group default qlen 1000
#     link/ether fe:54:00:8d:b6:8e brd ff:ff:ff:ff:ff:ff

ip a
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#     inet 127.0.0.1/8 scope host lo
#        valid_lft forever preferred_lft forever
#     inet6 ::1/128 scope host
#        valid_lft forever preferred_lft forever
# 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master baremetal state UP group default qlen 1000
#     link/ether 90:b1:1c:40:59:27 brd ff:ff:ff:ff:ff:ff
# 3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
#     link/ether 90:b1:1c:40:59:28 brd ff:ff:ff:ff:ff:ff
# 4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
#     link/ether 90:b1:1c:40:59:29 brd ff:ff:ff:ff:ff:ff
# 5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
#     link/ether 90:b1:1c:40:59:2a brd ff:ff:ff:ff:ff:ff
# 6: enp67s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a2 brd ff:ff:ff:ff:ff:ff
# 7: enp67s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a3 brd ff:ff:ff:ff:ff:ff
# 8: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN group default qlen 256
#     link/infiniband 00:00:10:28:fe:80:00:00:00:00:00:00:98:03:9b:03:00:cc:71:2c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
# 9: baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
#     link/ether 90:b1:1c:40:59:27 brd ff:ff:ff:ff:ff:ff
#     inet 172.21.6.105/24 brd 172.21.6.255 scope global noprefixroute baremetal
#        valid_lft forever preferred_lft forever
#     inet6 fe80::12a7:202d:c70b:be14/64 scope link noprefixroute
#        valid_lft forever preferred_lft forever
# 10: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
#     link/ether 52:54:00:8f:4a:bc brd ff:ff:ff:ff:ff:ff
#     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
#        valid_lft forever preferred_lft forever
# 11: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
#     link/ether 52:54:00:8f:4a:bc brd ff:ff:ff:ff:ff:ff
# 16: enp67s0f0_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
#     link/ether fa:cf:0f:6a:ec:45 brd ff:ff:ff:ff:ff:ff
# 17: enp67s0f0_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
#     link/ether 76:65:93:70:96:ac brd ff:ff:ff:ff:ff:ff
# 18: enp67s0f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
#     link/ether e4:11:22:33:55:60 brd ff:ff:ff:ff:ff:ff
#     inet 192.168.55.31/24 scope global enp67s0f0v0
#        valid_lft forever preferred_lft forever
# 19: enp67s0f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
#     link/ether e4:11:22:33:55:61 brd ff:ff:ff:ff:ff:ff
#     inet 192.168.55.32/24 scope global enp67s0f0v1
#        valid_lft forever preferred_lft forever
# 20: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
#     link/ether f6:e9:fd:16:8a:ea brd ff:ff:ff:ff:ff:ff
# 21: ovs-sriov: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
#     link/ether 0c:42:a1:fa:18:a2 brd ff:ff:ff:ff:ff:ff
# 22: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master baremetal state UNKNOWN group default qlen 1000
#     link/ether fe:54:00:8d:b6:8e brd ff:ff:ff:ff:ff:ff
#     inet6 fe80::fc54:ff:fe8d:b68e/64 scope link
#        valid_lft forever preferred_lft forever

ovs-vsctl show
# 8f3eddeb-c42c-4af4-9dc8-a46169d91a7c
#     Bridge ovs-sriov
#         Port enp67s0f0_1
#             Interface enp67s0f0_1
#         Port ovs-sriov
#             Interface ovs-sriov
#                 type: internal
#         Port enp67s0f0
#             Interface enp67s0f0
#         Port enp67s0f0_0
#             Interface enp67s0f0_0
#     ovs_version: "2.14.1"

# on kvm
ip link
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# 2: enp0s6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
#     link/ether e4:11:c6:d3:45:f2 brd ff:ff:ff:ff:ff:ff
# 3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
#     link/ether 52:54:00:8d:b6:8e brd ff:ff:ff:ff:ff:ff

ip a
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#     inet 127.0.0.1/8 scope host lo
#        valid_lft forever preferred_lft forever
#     inet6 ::1/128 scope host
#        valid_lft forever preferred_lft forever
# 2: enp0s6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
#     link/ether e4:11:c6:d3:45:f2 brd ff:ff:ff:ff:ff:ff
#     inet 192.168.99.11/24 brd 192.168.99.255 scope global noprefixroute enp0s6
#        valid_lft forever preferred_lft forever
#     inet6 fe80::f3c:b686:1739:a748/64 scope link noprefixroute
#        valid_lft forever preferred_lft forever
# 3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
#     link/ether 52:54:00:8d:b6:8e brd ff:ff:ff:ff:ff:ff
#     inet 172.21.6.11/24 brd 172.21.6.255 scope global noprefixroute enp1s0
#        valid_lft forever preferred_lft forever
#     inet6 fe80::5054:ff:fe8d:b68e/64 scope link noprefixroute
#        valid_lft forever preferred_lft forever

Performance testing

# on 102
dnf install -y iperf3
systemctl disable --now firewalld
iperf3 -s -p 6666

# on 11 (the cx6.1 kvm guest)
dnf install -y iperf3
iperf3 -t 20 -p 6666 -c 192.168.99.21
Connecting to host 192.168.99.21, port 6666
[  5] local 192.168.99.11 port 50960 connected to 192.168.99.21 port 6666
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.40 GBytes  12.0 Gbits/sec    0    594 KBytes
[  5]   1.00-2.00   sec  1.39 GBytes  12.0 Gbits/sec    0    594 KBytes
[  5]   2.00-3.00   sec  1.39 GBytes  12.0 Gbits/sec    0    594 KBytes
[  5]   3.00-4.00   sec  1.40 GBytes  12.0 Gbits/sec    0    624 KBytes
[  5]   4.00-5.00   sec  1.40 GBytes  12.0 Gbits/sec    0    659 KBytes
[  5]   5.00-6.00   sec  1.40 GBytes  12.0 Gbits/sec    0    659 KBytes
[  5]   6.00-7.00   sec  1.40 GBytes  12.0 Gbits/sec    0    659 KBytes
[  5]   7.00-8.00   sec  1.40 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]   8.00-9.00   sec  1.40 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]   9.00-10.00  sec  1.40 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]  10.00-11.00  sec  1.39 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]  11.00-12.00  sec  1.39 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]  12.00-13.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  13.00-14.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  14.00-15.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  15.00-16.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  16.00-17.00  sec  1.39 GBytes  12.0 Gbits/sec    0   1.03 MBytes
[  5]  17.00-18.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  18.00-19.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
[  5]  19.00-20.00  sec  1.39 GBytes  11.9 Gbits/sec    0   1.03 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-20.00  sec  27.9 GBytes  12.0 Gbits/sec    0             sender
[  5]   0.00-20.04  sec  27.9 GBytes  11.9 Gbits/sec                  receiver

iperf Done.

# on 105
systemctl disable --now irqbalance.service
mlnx_affinity start

# on 102
systemctl disable --now irqbalance.service
mlnx_affinity start

# on 102
dnf install -y qperf
qperf

# on 105
qperf 192.168.88.21 tcp_bw
tcp_bw:
    bw  =  2.8 GB/sec

# on 101
qperf 192.168.99.21 tcp_bw
tcp_bw:
    bw  =  1.48 GB/sec
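A single TCP stream rarely saturates the link; if you want to push the offloaded path harder, a parallel-stream run is the usual next step (a sketch using a standard iperf3 flag):

# on the kvm guest: four parallel streams against the same iperf3 server
iperf3 -t 20 -p 6666 -P 4 -c 192.168.99.21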