Onboarding Metax GPU on OpenShift 4.18

Introduction

This document provides a comprehensive guide for integrating Metax GPUs with an OpenShift Container Platform 4.18 cluster. Metax, a GPU vendor similar to NVIDIA, offers hardware that performs well on standard RHEL environments. However, enabling Metax GPUs on OpenShift requires specific configurations and a custom Red Hat Enterprise Linux CoreOS (RHCOS) image.

The primary objective is to build a custom RHCOS image that includes the necessary Metax drivers, deploy an OpenShift cluster using this image, and finally, install the Metax device plugin/operator to expose the GPU resources to containerized workloads. This guide covers the entire end-to-end process, from bare-metal helper node preparation to deploying a sample GPU-accelerated application.

1. DNS Configuration

A reliable DNS infrastructure is a critical prerequisite for any OpenShift installation. For this test environment, we will configure public DNS records. In a disconnected or air-gapped environment, you must provide your own internal DNS server to resolve these records.

Public DNS Records:

mirror.infra.wzhlab.top -> 192.168.99.1 (Local Registry)
api.demo-01-rhsys.wzhlab.top -> 192.168.99.21 (Cluster API)
api-int.demo-01-rhsys.wzhlab.top -> 192.168.99.21 (Internal Cluster API)
*.apps.demo-01-rhsys.wzhlab.top -> 192.168.99.22 (Wildcard for Applications)

2. Helper Node Preparation

We will utilize a bare-metal server as a “helper node.” This server will host the necessary services (DNS, registry, etc.) and run the virtual machines (VMs) that will form the OpenShift cluster.

2.1. Kernel Boot Parameters for GPU Passthrough

To pass the physical GPU cards from the bare-metal host to the guest VMs, we need to enable IOMMU (Input-Output Memory Management Unit) in the host’s kernel. This is achieved by modifying the GRUB boot parameters.

# This command enables Intel IOMMU and passthrough mode for all kernels
sudo grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt"

A reboot is required for these changes to take effect.

2.2. Kernel Parameters for Overcommit and NAT

We need to adjust kernel parameters on the helper node to enable IP forwarding, which is essential for the NAT networking used by the VMs. Additionally, if the host has limited physical memory, configuring memory overcommit can provide more flexibility.

# Enable IP forwarding
cat << EOF >> /etc/sysctl.d/99-wzh-sysctl.conf
net.ipv4.ip_forward = 1
EOF

# Apply the new settings
sysctl --system 

# Verify the setting
sysctl -a | grep ip_forward
# net.ipv4.ip_forward = 1
# net.ipv4.ip_forward_update_priority = 1
# net.ipv4.ip_forward_use_pmtu = 0

2.3. LVM Configuration for VM Storage

To efficiently manage storage for our VMs, we will create an LVM (Logical Volume Manager) thin pool using the available NVMe disks. This allows for flexible and space-efficient provisioning of logical volumes, which will serve as the virtual disks for the VMs.

# --- Configurable Variables ---
VG_NAME="vgdata"
POOL_NAME="poolA"
STRIPE_SIZE_KB=256

# Discover all NVMe disks
ALL_NVME_DISKS=$(lsblk -d -n -o NAME,TYPE | grep "nvme" | grep "disk" | awk '{print "/dev/"$1}')
echo "Discovered NVMe disks: $ALL_NVME_DISKS"

PV_DEVICES=$ALL_NVME_DISKS
# Trim whitespace
PV_DEVICES=$(echo "$PV_DEVICES" | xargs)

NUM_DISKS=$(echo "$PV_DEVICES" | wc -w)
echo "Found $NUM_DISKS NVMe disks to be used: $PV_DEVICES"

echo -e "\n--- Creating Physical Volumes (PVs) and a Volume Group (VG) ---"
echo "Initializing PVs..."
sudo pvcreate -y $PV_DEVICES

echo "Creating Volume Group '$VG_NAME'..."
sudo vgcreate "$VG_NAME" $PV_DEVICES

echo "Creating LVM thin pool '$POOL_NAME'..."
sudo lvcreate --type thin-pool -i "$NUM_DISKS" -I "${STRIPE_SIZE_KB}k" -c "${STRIPE_SIZE_KB}k" -Zn -l 99%FREE -n "$POOL_NAME" "$VG_NAME"

echo "Extending the pool to use all available space..."
lvextend -l +100%FREE $VG_NAME/$POOL_NAME

2.4. KVM and Network Setup

Next, we will install KVM/libvirt packages and configure a virtual network bridge. This bridge will provide network connectivity for the OpenShift VMs and will be configured with NAT to allow outbound access.

# Install required tools and development packages
dnf -y install byobu htop jq ipmitool nmstate /usr/bin/htpasswd
dnf groupinstall -y "development" "Server with GUI"

# Install KVM and virtualization management tools
dnf -y install qemu-kvm libvirt libguestfs-tools virt-install virt-viewer virt-manager tigervnc-server

# Enable and start the libvirt daemon
systemctl enable --now libvirtd

# Prepare directory for KVM assets
mkdir -p /data/kvm
cd /data/kvm

# Define the virtual network bridge configuration
cat << EOF >  /data/kvm/virt-net.xml
<network>
  <name>br-ocp</name>
  <bridge name='br-ocp' stp='on' delay='0'/>
  <domain name='br-ocp'/>
  <ip address='192.168.99.1' netmask='255.255.255.0'>
  </ip>
  <ip address='192.168.77.1' netmask='255.255.255.0'>
  </ip>
  <ip address='192.168.88.1' netmask='255.255.255.0'>
  </ip>
</network>
EOF

# Create and start the virtual network
virsh net-define --file /data/kvm/virt-net.xml
virsh net-autostart br-ocp
virsh net-start br-ocp

# Verify the network status
virsh net-list --all
#  Name      State    Autostart   Persistent
# --------------------------------------------
#  br-ocp    active   yes         yes
#  default   active   yes         yes

# Configure services to start on boot
cat << EOF >> /etc/rc.d/rc.local
# Ensure the virtual network is started
virsh net-start br-ocp || true

# Start all defined VMs
virsh list --all --name | grep -v '^$' | xargs -I {} virsh start {} || true
EOF

chmod +x /etc/rc.d/rc.local
systemctl enable --now rc-local

# To disable autostart:
# chmod -x /etc/rc.d/rc.local
# systemctl disable --now rc-local

2.5. VNC Server Setup

To facilitate remote management and UI-based operations on the helper node, we will set up a VNC server.

# Install GUI and VNC server packages
dnf groupinstall -y "Server with GUI"
dnf -y install tigervnc-server

# Disable the firewall for simplicity in a lab environment
systemctl disable --now firewalld.service

# Set a VNC password for the root user
mkdir -p ~/.vnc/
echo "redhat" | vncpasswd -f > ~/.vnc/passwd
chmod 600 ~/.vnc/passwd

# Configure the VNC session
cat << EOF > ~/.vnc/config
session=gnome
securitytypes=vncauth,tlsvnc
geometry=1440x855
alwaysshared
EOF

# Map a display port to the root user
cat << EOF >> /etc/tigervnc/vncserver.users
:2=root
EOF

# Enable and start the VNC server service
systemctl enable --now vncserver@:2
# To manage the service:
# systemctl start vncserver@:2
# systemctl stop vncserver@:2

2.6. Certificate Generation for Local Registry

We will set up an internal container registry to cache images for the disconnected OpenShift installation. Before deploying the registry, we must generate self-signed TLS certificates to secure communication.

mkdir -p /etc/crts/ && cd /etc/crts

# Generate a self-signed Certificate Authority (CA)
# Reference: https://access.redhat.com/documentation/en-us/red_hat_codeready_workspaces/2.1/html/installation_guide/installing-codeready-workspaces-in-tls-mode-with-self-signed-certificates_crw
openssl genrsa -out /etc/crts/wzhlab.top.ca.key 4096

openssl req -x509 \
  -new -nodes \
  -key /etc/crts/wzhlab.top.ca.key \
  -sha256 \
  -days 36500 \
  -out /etc/crts/wzhlab.top.ca.crt \
  -subj /CN="Local wzh lab Signer" \
  -reqexts SAN \
  -extensions SAN \
  -config <(cat /etc/pki/tls/openssl.cnf \
      <(printf '[SAN]\nbasicConstraints=critical, CA:TRUE\nkeyUsage=keyCertSign, cRLSign, digitalSignature'))

# Generate a server key and CSR for the registry
openssl genrsa -out /etc/crts/wzhlab.top.key 2048

openssl req -new -sha256 \
    -key /etc/crts/wzhlab.top.key \
    -subj "/O=Local wzh lab /CN=*.infra.wzhlab.top" \
    -reqexts SAN \
    -config <(cat /etc/pki/tls/openssl.cnf \
        <(printf "\n[SAN]\nsubjectAltName=DNS:*.infra.wzhlab.top,DNS:*.wzhlab.top\nbasicConstraints=critical, CA:FALSE\nkeyUsage=digitalSignature, keyEncipherment, keyAgreement, dataEncipherment\nextendedKeyUsage=serverAuth")) \
    -out /etc/crts/wzhlab.top.csr

# Sign the server certificate with our CA
openssl x509 \
    -req \
    -sha256 \
    -extfile <(printf "subjectAltName=DNS:*.infra.wzhlab.top,DNS:*.wzhlab.top\nbasicConstraints=critical, CA:FALSE\nkeyUsage=digitalSignature, keyEncipherment, keyAgreement, dataEncipherment\nextendedKeyUsage=serverAuth") \
    -days 36500 \
    -in /etc/crts/wzhlab.top.csr \
    -CA /etc/crts/wzhlab.top.ca.crt \
    -CAkey /etc/crts/wzhlab.top.ca.key \
    -CAcreateserial -out /etc/crts/wzhlab.top.crt

# Verify the certificate
openssl x509 -in /etc/crts/wzhlab.top.crt -text

# Add the CA certificate to the system's trust store
/bin/cp -f /etc/crts/wzhlab.top.ca.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust extract

2.7. Deploying the Mirror Registry (Quay)

With the TLS certificates in place, we can now deploy the mirror registry. We will use the mirror-registry tool provided by Red Hat, which deploys a simplified instance of Quay.

# Reference: https://docs.openshift.com/container-platform/4.10/installing/disconnected_install/installing-mirroring-creating-registry.html

mkdir -p /data/quay 
# Navigate to the directory where you downloaded the tool
# tar zvxf mirror-registry-amd64.tar.gz -C /data/quay

cd /data/quay
# Install the registry, providing the hostname and SSL certificates
./mirror-registry install -v \
-k ~/.ssh/id_ed25519 \
--initPassword redhat.. --initUser admin \
--quayHostname mirror.infra.wzhlab.top --quayRoot /data/quay \
--targetHostname mirror.infra.wzhlab.top \
--sslKey /etc/crts/wzhlab.top.key --sslCert /etc/crts/wzhlab.top.crt

# Expected output on success:
# PLAY RECAP **********************************************************************************************************************************************************************************
# root@mirror.infra.wzhlab.top : ok=46   changed=23   unreachable=0    failed=0    skipped=18   rescued=0    ignored=0
# INFO[2025-09-28 21:07:25] Quay installed successfully, config data is stored in /data/quay
# INFO[2025-09-28 21:07:25] Quay is available at https://mirror.infra.wzhlab.top:8443 with credentials (admin, redhat..)

# To uninstall the registry:
# cd /data/quay
# ./mirror-registry uninstall -v \
# -k ~/.ssh/id_ed25519 \
# --autoApprove true --quayRoot /data/quay \
# --targetHostname mirror.infra.wzhlab.top

3. OpenShift Cluster Installation

With the helper node fully configured, we can proceed with the OpenShift installation.

3.1. Initial Setup and Tooling

First, we’ll create a dedicated user for the installation process and download the required OpenShift client binaries.

# Create a dedicated user for the installation
useradd -m sno
su - sno

# Configure passwordless SSH for convenience
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q
cat << EOF > ~/.ssh/config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
EOF
chmod 600 ~/.ssh/config

# Set environment variables for the installation
cat << 'EOF' >> ~/.bashrc
export BASE_DIR='/home/sno'
export BUILDNUMBER=4.18.27
# export BUILDNUMBER=4.19.17
EOF
source ~/.bashrc

# Set the specific OpenShift version for this installation
# export BUILDNUMBER=4.18.24

# Create directories for installation files
mkdir -p ${BASE_DIR}/data/ocp-${BUILDNUMBER}
mkdir -p $HOME/.local/bin

cd ${BASE_DIR}/data/ocp-${BUILDNUMBER}

# Download OpenShift client, installer, and mirror tools
wget -O openshift-client-linux-${BUILDNUMBER}.tar.gz https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/${BUILDNUMBER}/openshift-client-linux-${BUILDNUMBER}.tar.gz
wget -O openshift-install-linux-${BUILDNUMBER}.tar.gz https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/${BUILDNUMBER}/openshift-install-linux-${BUILDNUMBER}.tar.gz
wget -O oc-mirror.tar.gz https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/${BUILDNUMBER}/oc-mirror.tar.gz

# Extract and install the binaries
tar -xzf openshift-client-linux-${BUILDNUMBER}.tar.gz -C $HOME/.local/bin/
tar -xzf openshift-install-linux-${BUILDNUMBER}.tar.gz -C $HOME/.local/bin/
tar -xzf oc-mirror.tar.gz -C $HOME/.local/bin/
chmod +x $HOME/.local/bin/oc-mirror

# Download butane and coreos-installer
wget  -nd -np -e robots=off --reject="index.html*" -P ./ --recursive -A "butane-amd64" https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/butane/latest/
wget  -nd -np -e robots=off --reject="index.html*" -P ./ -r -A "coreos-installer_amd64" https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/coreos-installer/latest/
install -m 755 ./butane-amd64 $HOME/.local/bin/butane
install -m 755 ./coreos-installer_amd64 $HOME/.local/bin/coreos-installer

# Download mirror-registry tool
wget -O mirror-registry-amd64.tar.gz https://mirror.openshift.com/pub/cgw/mirror-registry/latest/mirror-registry-amd64.tar.gz

# Download Helm client
wget -O helm-linux-amd64.tar.gz https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/helm/3.17.1/helm-linux-amd64.tar.gz
tar xzf helm-linux-amd64.tar.gz -C $HOME/.local/bin/
mv ~/.local/bin/helm-linux-amd64 ~/.local/bin/helm

3.2. Mirroring Container Images

Before starting the installation, we must mirror all required OpenShift container images to our local Quay registry. This is a crucial step for a disconnected installation.

# Reference: https://github.com/openshift/oc-mirror

mkdir -p ${BASE_DIR}/data/mirror/

# Define the ImageSetConfiguration for oc-mirror
# This specifies the OpenShift version, operators, and any additional images to mirror.
tee ${BASE_DIR}/data/mirror/mirror.yaml << EOF
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    architectures:
      - amd64
    channels:
      - name: stable-4.18
        type: ocp
        minVersion: 4.18.27
        maxVersion: 4.18.27
        shortestPath: true
    graph: false
  # operators:
  #   - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.18
  #     packages:
  #      - name: nfd
  #      - name: kubevirt-hyperconverged
  # additionalImages:
    # This is the custom RHCOS image we will build later
    # - name: quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64
    # - name: quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64-sdk-v01
EOF

cd ${BASE_DIR}/data/mirror/

INSTALL_IMAGE_REGISTRY=mirror.infra.wzhlab.top:8443

# Log in to the local registry
podman login $INSTALL_IMAGE_REGISTRY -u admin -p redhat..

# Run oc-mirror to start the mirroring process
# Note: This will take a significant amount of time and disk space.
oc-mirror --v2 --config ${BASE_DIR}/data/mirror/mirror.yaml --authfile ${BASE_DIR}/data/pull-secret.json --workspace file://${BASE_DIR}/data/mirror/  docker://$INSTALL_IMAGE_REGISTRY

# After mirroring, oc-mirror generates several YAML files in the working-dir/cluster-resources directory.
# These files must be applied to the cluster post-installation to configure it to use the local registry.

3.3. Configuring and Launching the OpenShift Installation

Now we create the configuration files (install-config.yaml, agent-config.yaml) that define our cluster topology and then generate the agent-based installation ISO.

# export BUILDNUMBER=4.18.24
mkdir -p ${BASE_DIR}/data/{sno/disconnected,install}

# Define cluster network and node parameters
INSTALL_IMAGE_REGISTRY=mirror.infra.wzhlab.top:8443
PULL_SECRET=$(cat ${BASE_DIR}/data/pull-secret.json)

# Create a file with default environment variables for the cluster nodes
cat << 'EOF' > ${BASE_DIR}/data/ocp-default.env
CIDR_PREFIX=192.168.99
CIDR_PREFIX_02=192.168.77
NTP_SERVER=time.nju.edu.cn
HELP_SERVER=$CIDR_PREFIX.11
API_VIP=$CIDR_PREFIX.21
INGRESS_VIP=$CIDR_PREFIX.22
MACHINE_NETWORK="$CIDR_PREFIX.0/24"

SNO_CLUSTER_NAME=demo-01-rhsys
SNO_BASE_DOMAIN=wzhlab.top

MASTER_01_IP=$CIDR_PREFIX.23
MASTER_02_IP=$CIDR_PREFIX.24
MASTER_03_IP=$CIDR_PREFIX.25
WORKER_01_IP=$CIDR_PREFIX.26
WORKER_02_IP=$CIDR_PREFIX.27

MASTER_01_IP_02=$CIDR_PREFIX_02.23
MASTER_02_IP_02=$CIDR_PREFIX_02.24
MASTER_03_IP_02=$CIDR_PREFIX_02.25
WORKER_01_IP_02=$CIDR_PREFIX_02.26
WORKER_02_IP_02=$CIDR_PREFIX_02.27

MASTER_01_HOSTNAME=master-01-demo
MASTER_02_HOSTNAME=master-02-demo
MASTER_03_HOSTNAME=master-03-demo
WORKER_01_HOSTNAME=worker-01-demo
WORKER_02_HOSTNAME=worker-02-demo

MASTER_01_INTERFACE=enp1s0
MASTER_02_INTERFACE=enp1s0
MASTER_03_INTERFACE=enp1s0
WORKER_01_INTERFACE=enp1s0
WORKER_02_INTERFACE=enp1s0

MASTER_01_INTERFACE_02=enp2s0
MASTER_02_INTERFACE_02=enp2s0
MASTER_03_INTERFACE_02=enp2s0
WORKER_01_INTERFACE_02=enp2s0
WORKER_02_INTERFACE_02=enp2s0

MASTER_01_INTERFACE_MAC=00:50:56:8e:2a:31
MASTER_02_INTERFACE_MAC=00:50:56:8e:2a:32
MASTER_03_INTERFACE_MAC=00:50:56:8e:2a:33
WORKER_01_INTERFACE_MAC=00:50:56:8e:2a:51
WORKER_02_INTERFACE_MAC=00:50:56:8e:2a:52

MASTER_01_INTERFACE_02_MAC=00:50:56:8e:2b:31
MASTER_02_INTERFACE_02_MAC=00:50:56:8e:2b:32
MASTER_03_INTERFACE_02_MAC=00:50:56:8e:2b:33
WORKER_01_INTERFACE_02_MAC=00:50:56:8e:2b:51
WORKER_02_INTERFACE_02_MAC=00:50:56:8e:2b:52

MASTER_01_DISK=/dev/vda
MASTER_02_DISK=/dev/vda
MASTER_03_DISK=/dev/vda
WORKER_01_DISK=/dev/vda
WORKER_02_DISK=/dev/vda

OCP_GW=$CIDR_PREFIX.1
OCP_NETMASK=255.255.255.0
OCP_NETMASK_S=24
OCP_DNS=223.5.5.5

OCP_GW_v6=fd03::11
OCP_NETMASK_v6=64
EOF

source ${BASE_DIR}/data/ocp-default.env

mkdir -p ${BASE_DIR}/data/install
cd ${BASE_DIR}/data/install

# Clean up previous installation attempts
/bin/rm -rf *.ign .openshift_install_state.json auth bootstrap manifests master*[0-9] worker*[0-9] *

# Create the main install-config.yaml
cat << EOF > ${BASE_DIR}/data/install/install-config.yaml 
apiVersion: v1
baseDomain: $SNO_BASE_DOMAIN
compute:
- name: worker
  replicas: 2
controlPlane:
  name: master
  replicas: 3
metadata:
  name: $SNO_CLUSTER_NAME
networking:
  networkType: OVNKubernetes 
  clusterNetwork:
    - cidr: 10.132.0.0/14 
      hostPrefix: 23
  machineNetwork:
    - cidr: $MACHINE_NETWORK
  serviceNetwork:
    - 172.22.0.0/16
platform:
  baremetal:
    apiVIPs:
    - $API_VIP
    ingressVIPs:
    - $INGRESS_VIP
pullSecret: '${PULL_SECRET}'
sshKey: |
$( cat ${BASE_DIR}/.ssh/id_ed25519.pub | sed 's/^/   /g' )
additionalTrustBundle: |
$( cat /etc/crts/wzhlab.top.ca.crt | sed 's/^/   /g' )
ImageDigestSources:
- mirrors:
  - ${INSTALL_IMAGE_REGISTRY}/openshift/release-images
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - ${INSTALL_IMAGE_REGISTRY}/openshift/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
EOF

# Create the agent-config.yaml for static IP configuration
cat << EOF > ${BASE_DIR}/data/install/agent-config.yaml
apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: $SNO_CLUSTER_NAME
rendezvousIP: $MASTER_01_IP
additionalNTPSources:
- $NTP_SERVER
hosts:
  - hostname: $MASTER_01_HOSTNAME
    role: master
    rootDeviceHints:
      deviceName: "$MASTER_01_DISK"
    interfaces:
      - name: $MASTER_01_INTERFACE
        macAddress: $MASTER_01_INTERFACE_MAC
      - name: $MASTER_01_INTERFACE_02
        macAddress: $MASTER_01_INTERFACE_02_MAC
    networkConfig:
      interfaces:
        - name: $MASTER_01_INTERFACE
          type: ethernet
          state: up
          mac-address: $MASTER_01_INTERFACE_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_01_IP
                prefix-length: $OCP_NETMASK_S
            dhcp: false
        - name: $MASTER_01_INTERFACE_02
          type: ethernet
          state: up
          mac-address: $MASTER_01_INTERFACE_02_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_01_IP_02
                prefix-length: $OCP_NETMASK_S
            dhcp: false
      dns-resolver:
        config:
          server:
            - $OCP_DNS
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: $OCP_GW
            next-hop-interface: $MASTER_01_INTERFACE
            table-id: 254
  - hostname: $MASTER_02_HOSTNAME
    role: master
    rootDeviceHints:
      deviceName: "$MASTER_02_DISK"
    interfaces:
      - name: $MASTER_02_INTERFACE
        macAddress: $MASTER_02_INTERFACE_MAC
      - name: $MASTER_02_INTERFACE_02
        macAddress: $MASTER_02_INTERFACE_02_MAC
    networkConfig:
      interfaces:
        - name: $MASTER_02_INTERFACE
          type: ethernet
          state: up
          mac-address: $MASTER_02_INTERFACE_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_02_IP
                prefix-length: $OCP_NETMASK_S
            dhcp: false
        - name: $MASTER_02_INTERFACE_02
          type: ethernet
          state: up
          mac-address: $MASTER_02_INTERFACE_02_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_02_IP_02
                prefix-length: $OCP_NETMASK_S
            dhcp: false
      dns-resolver:
        config:
          server:
            - $OCP_DNS
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: $OCP_GW
            next-hop-interface: $MASTER_02_INTERFACE
            table-id: 254
  - hostname: $MASTER_03_HOSTNAME
    role: master
    rootDeviceHints:
      deviceName: "$MASTER_03_DISK"
    interfaces:
      - name: $MASTER_03_INTERFACE
        macAddress: $MASTER_03_INTERFACE_MAC
      - name: $MASTER_03_INTERFACE_02
        macAddress: $MASTER_03_INTERFACE_02_MAC
    networkConfig:
      interfaces:
        - name: $MASTER_03_INTERFACE
          type: ethernet
          state: up
          mac-address: $MASTER_03_INTERFACE_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_03_IP
                prefix-length: $OCP_NETMASK_S
            dhcp: false
        - name: $MASTER_03_INTERFACE_02
          type: ethernet
          state: up
          mac-address: $MASTER_03_INTERFACE_02_MAC
          ipv4:
            enabled: true
            address:
              - ip: $MASTER_03_IP_02
                prefix-length: $OCP_NETMASK_S
            dhcp: false
      dns-resolver:
        config:
          server:
            - $OCP_DNS
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: $OCP_GW
            next-hop-interface: $MASTER_03_INTERFACE
            table-id: 254
  - hostname: $WORKER_01_HOSTNAME
    role: worker
    rootDeviceHints:
      deviceName: "$WORKER_01_DISK"
    interfaces:
      - name: $WORKER_01_INTERFACE
        macAddress: $WORKER_01_INTERFACE_MAC
      - name: $WORKER_01_INTERFACE_02
        macAddress: $WORKER_01_INTERFACE_02_MAC
    networkConfig:
      interfaces:
        - name: $WORKER_01_INTERFACE
          type: ethernet
          state: up
          mac-address: $WORKER_01_INTERFACE_MAC
          ipv4:
            enabled: true
            address:
              - ip: $WORKER_01_IP
                prefix-length: $OCP_NETMASK_S
            dhcp: false
        - name: $WORKER_01_INTERFACE_02
          type: ethernet
          state: up
          mac-address: $WORKER_01_INTERFACE_02_MAC
          ipv4:
            enabled: true
            address:
              - ip: $WORKER_01_IP_02
                prefix-length: $OCP_NETMASK_S
            dhcp: false
      dns-resolver:
        config:
          server:
            - $OCP_DNS
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: $OCP_GW
            next-hop-interface: $WORKER_01_INTERFACE
            table-id: 254
  - hostname: $WORKER_02_HOSTNAME
    role: worker
    rootDeviceHints:
      deviceName: "$WORKER_02_DISK"
    interfaces:
      - name: $WORKER_02_INTERFACE
        macAddress: $WORKER_02_INTERFACE_MAC
      - name: $WORKER_02_INTERFACE_02
        macAddress: $WORKER_02_INTERFACE_02_MAC
    networkConfig:
      interfaces:
        - name: $WORKER_02_INTERFACE
          type: ethernet
          state: up
          mac-address: $WORKER_02_INTERFACE_MAC
          ipv4:
            enabled: true
            address:
              - ip: $WORKER_02_IP
                prefix-length: $OCP_NETMASK_S
            dhcp: false
        - name: $WORKER_02_INTERFACE_02
          type: ethernet
          state: up
          mac-address: $WORKER_02_INTERFACE_02_MAC
          ipv4:
            enabled: true
            address:
              - ip: $WORKER_02_IP_02
                prefix-length: $OCP_NETMASK_S
            dhcp: false
      dns-resolver:
        config:
          server:
            - $OCP_DNS
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: $OCP_GW
            next-hop-interface: $WORKER_02_INTERFACE
            table-id: 254
EOF

/bin/cp -f ${BASE_DIR}/data/install/install-config.yaml ${BASE_DIR}/data/install/install-config.yaml.bak

# Generate the cluster manifests from the config files
openshift-install --dir=${BASE_DIR}/data/install agent create cluster-manifests

cd ${BASE_DIR}/data/install/

# Create the agent installation ISO
# The installer will automatically cache downloaded files in ~/.cache/agent/
mkdir -p ${HOME}/.cache/agent/{files_cache,image_cache}
openshift-install --dir=${BASE_DIR}/data/install agent create image --log-level=debug

3.4. Provisioning VMs for OpenShift Installation

With the installation ISO created, we can now define and launch the KVM virtual machines. These VMs will boot from the ISO, which will trigger the automated, unattended installation of OpenShift.

# Verify the host CPU model for KVM configuration
virsh capabilities | grep model
#  <model>Icelake-Server-v2</model>

# (Optional) Clean up any existing logical volumes from previous attempts
# lv_list=$(lvdisplay -c | cut -d: -f1 | grep '.*/lv')
# for lv in $lv_list; do
#   echo "Deleting $lv..."
#   lvremove -f $lv
# done

# Identify PCI addresses of the Metax GPU devices for passthrough
lspci -nn | grep -i 9999
# 0f:00.0 Display controller [0380]: Device [9999:4000] (rev 01)
# ...
virsh nodedev-list --cap pci
# pci_0000_0f_00_0

# Copy the generated ISO to the KVM directory
/bin/cp -f /home/sno/data/install/agent.x86_64.iso /data/kvm/

# LV creation helper function
create_lv() {
    var_vg=$1
    var_pool=$2
    var_lv=$3
    var_size=$4
    var_action=$5
    lvremove -f $var_vg/$var_lv || true
    if [ "$var_action" == "recreate" ]; then
      lvcreate --type thin -n $var_lv -V $var_size --thinpool $var_vg/$var_pool
      wipefs --all --force /dev/$var_vg/$var_lv
    fi
}

# Source the cluster environment variables
source /home/sno/data/ocp-default.env

SNO_CPU=host-model

# --- Provision Master Node 1 ---
SNO_CPU_CORE=8
SNO_MEM=32
virsh destroy demo-01-master-01
virsh undefine demo-01-master-01
create_lv vgdata poolA lv-demo-01-master-01 120G recreate
virt-install --name=demo-01-master-01 --vcpus=$SNO_CPU_CORE --ram=$(($SNO_MEM*1024)) \
--cpu=$SNO_CPU \
--disk path=/dev/vgdata/lv-demo-01-master-01,device=disk,bus=virtio,format=raw,discard=unmap \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio,mac=$MASTER_01_INTERFACE_MAC \
--network bridge=br-ocp,model=virtio,mac=$MASTER_01_INTERFACE_02_MAC \
--graphics vnc,listen=127.0.0.1,port=59001 --noautoconsole \
--boot menu=on --cdrom /data/kvm/agent.x86_64.iso

# --- Provision Master Node 2 ---
SNO_CPU_CORE=8
SNO_MEM=32
virsh destroy demo-01-master-02
virsh undefine demo-01-master-02
create_lv vgdata poolA lv-demo-01-master-02 120G recreate
virt-install --name=demo-01-master-02 --vcpus=$SNO_CPU_CORE --ram=$(($SNO_MEM*1024)) \
--cpu=$SNO_CPU \
--disk path=/dev/vgdata/lv-demo-01-master-02,device=disk,bus=virtio,format=raw,discard=unmap \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio,mac=$MASTER_02_INTERFACE_MAC \
--network bridge=br-ocp,model=virtio,mac=$MASTER_02_INTERFACE_02_MAC \
--graphics vnc,listen=127.0.0.1,port=59002 --noautoconsole \
--boot menu=on --cdrom /data/kvm/agent.x86_64.iso

# --- Provision Master Node 3 ---
SNO_CPU_CORE=8
SNO_MEM=32
virsh destroy demo-01-master-03
virsh undefine demo-01-master-03
create_lv vgdata poolA lv-demo-01-master-03 120G recreate
virt-install --name=demo-01-master-03 --vcpus=$SNO_CPU_CORE --ram=$(($SNO_MEM*1024)) \
--cpu=$SNO_CPU \
--disk path=/dev/vgdata/lv-demo-01-master-03,device=disk,bus=virtio,format=raw,discard=unmap \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio,mac=$MASTER_03_INTERFACE_MAC \
--network bridge=br-ocp,model=virtio,mac=$MASTER_03_INTERFACE_02_MAC \
--graphics vnc,listen=127.0.0.1,port=59003 --noautoconsole \
--boot menu=on --cdrom /data/kvm/agent.x86_64.iso

# Detach GPUs from host before assigning to worker VMs
virsh nodedev-detach pci_0000_0f_00_0
virsh nodedev-detach pci_0000_34_00_0

# --- Provision Worker Node 1 (with GPU Passthrough) ---
SNO_CPU_CORE=50
SNO_MEM=450
virsh destroy demo-01-worker-01
virsh undefine demo-01-worker-01
create_lv vgdata poolA lv-demo-01-worker-01 1200G recreate
virt-install --name=demo-01-worker-01 --vcpus=$SNO_CPU_CORE --ram=$(($SNO_MEM*1024)) \
--cpu=$SNO_CPU \
--disk path=/dev/vgdata/lv-demo-01-worker-01,device=disk,bus=virtio,format=raw,discard=unmap \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio,mac=$WORKER_01_INTERFACE_MAC \
--network bridge=br-ocp,model=virtio,mac=$WORKER_01_INTERFACE_02_MAC \
--host-device pci_0000_0f_00_0 \
--host-device pci_0000_34_00_0 \
--graphics vnc,listen=127.0.0.1,port=59004 --noautoconsole \
--boot menu=on --cdrom /data/kvm/agent.x86_64.iso

# Detach GPUs from host before assigning to worker VMs
virsh nodedev-detach pci_0000_87_00_0
virsh nodedev-detach pci_0000_ae_00_0

# --- Provision Worker Node 2 (with GPU Passthrough) ---
SNO_CPU_CORE=50
SNO_MEM=450
virsh destroy demo-01-worker-02
virsh undefine demo-01-worker-02
create_lv vgdata poolA lv-demo-01-worker-02 1200G recreate
virt-install --name=demo-01-worker-02 --vcpus=$SNO_CPU_CORE --ram=$(($SNO_MEM*1024)) \
--cpu=$SNO_CPU \
--disk path=/dev/vgdata/lv-demo-01-worker-02,device=disk,bus=virtio,format=raw,discard=unmap \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio,mac=$WORKER_02_INTERFACE_MAC \
--network bridge=br-ocp,model=virtio,mac=$WORKER_02_INTERFACE_02_MAC \
--host-device pci_0000_87_00_0 \
--host-device pci_0000_ae_00_0 \
--graphics vnc,listen=127.0.0.1,port=59005 --noautoconsole \
--boot menu=on --cdrom /data/kvm/agent.x86_64.iso

3.5. Monitoring the Installation Progress

During the installation, the VMs may shut down instead of rebooting. A simple monitoring loop can ensure they are restarted promptly.

# This loop checks for shutdown VMs and starts them every 10 seconds.
while true; do virsh list --state-shutoff --name | xargs -r -I {} virsh start {}; sleep 10; done

You can monitor the installation progress from the helper node using the openshift-install command.

# Set the KUBECONFIG environment variable
cd ${BASE_DIR}/data/install
export KUBECONFIG=${BASE_DIR}/data/install/auth/kubeconfig
echo "export KUBECONFIG=${BASE_DIR}/data/install/auth/kubeconfig" >> ~/.bashrc

# Wait for the bootstrap process to complete
cd ${BASE_DIR}/data/install
openshift-install --dir=${BASE_DIR}/data/install agent wait-for bootstrap-complete --log-level=debug
# Expected output:
# INFO Bootstrap Kube API Initialized
# INFO Bootstrap configMap status is complete
# INFO cluster bootstrap is complete

# Wait for the final installation to complete
cd ${BASE_DIR}/data/install
openshift-install --dir=${BASE_DIR}/data/install agent wait-for install-complete --log-level=debug
# Expected output:
# INFO Cluster is installed
# INFO Install complete!
# INFO To access the cluster as the system:admin user when using 'oc', run
# INFO     export KUBECONFIG=/home/lab-user/data/install/auth/kubeconfig
# INFO Access the OpenShift web-console here: https://console-openshift-console.apps.demo-rhsys.wzhlab.top
# INFO Login to the console with user: "kubeadmin", and password: "..."

4. Post-Installation Configuration

4.1. Node Access Configuration (Optional)

For development and debugging purposes, it can be useful to enable password-based root login on the cluster nodes. This is not recommended for production environments.

# On the helper node, as the 'sno' user
# Create a script to enable root login and set password
cat > ${BASE_DIR}/data/install/crack.txt << 'EOF'
echo redhat | sudo passwd --stdin root
sudo sh -c 'echo "PasswordAuthentication yes" > /etc/ssh/sshd_config.d/20-wzh.conf '
sudo sh -c 'echo "PermitRootLogin yes" >> /etc/ssh/sshd_config.d/20-wzh.conf '
sudo sh -c 'echo "ClientAliveInterval 1800" >> /etc/ssh/sshd_config.d/20-wzh.conf '
sudo systemctl restart sshd
sudo sh -c 'echo "export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig" >> /root/.bashrc'
sudo sh -c 'echo "RET=\`oc config use-context system:admin\`" >> /root/.bashrc'
EOF

# Apply the script to all master nodes
for i in 23 24 25
do
  ssh core@192.168.99.$i < ${BASE_DIR}/data/install/crack.txt
done

# Create a similar script for worker nodes (without kubeconfig setup)
cat > ${BASE_DIR}/data/install/crack.worker.txt << 'EOF'
echo redhat | sudo passwd --stdin root
sudo sh -c 'echo "PasswordAuthentication yes" > /etc/ssh/sshd_config.d/20-wzh.conf '
sudo sh -c 'echo "PermitRootLogin yes" >> /etc/ssh/sshd_config.d/20-wzh.conf '
sudo sh -c 'echo "ClientAliveInterval 1800" >> /etc/ssh/sshd_config.d/20-wzh.conf '
sudo systemctl restart sshd
EOF

# Apply the script to all worker nodes
for i in 26 27
do
  ssh core@192.168.99.$i < ${BASE_DIR}/data/install/crack.worker.txt
done

4.2. Configuring HTPasswd Identity Provider

To provide a simple authentication mechanism, you can configure the HTPasswd identity provider. This allows you to create users with standard usernames and passwords.

Reference: https://docs.openshift.com/container-platform/4.13/authentication/identity_providers/configuring-htpasswd-identity-provider.html

# Install htpasswd utility on the helper node
sudo dnf install -y /usr/bin/htpasswd

# Create an htpasswd file with an initial admin user
htpasswd -c -B -b ${BASE_DIR}/data/install/users.htpasswd admin redhat

# Add an additional user
htpasswd -B -b ${BASE_DIR}/data/install/users.htpasswd user01 redhat

# Create a secret in the cluster from the htpasswd file
oc create secret generic htpass-secret \
  --from-file=htpasswd=${BASE_DIR}/data/install/users.htpasswd \
  -n openshift-config 

# Patch the cluster OAuth configuration to use the HTPasswd provider
cat << EOF > ${BASE_DIR}/data/install/oauth.yaml
spec:
  identityProviders:
  - name: htpasswd
    mappingMethod: claim 
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
EOF
oc patch oauth/cluster --type merge --patch-file=${BASE_DIR}/data/install/oauth.yaml

# Grant cluster-admin privileges to the new admin user
oc adm policy add-cluster-role-to-user cluster-admin admin

# Grant admin role in a specific project to the new user
oc adm policy add-role-to-user admin user01 -n llm-demo

5. Building a Custom RHCOS Image with Metax Drivers

The standard RHCOS image does not contain the necessary drivers for Metax GPUs, and metax gpu driver depends on specific old kernel version. Therefore, we must build a custom image. This process involves setting up a build environment, preparing RPM repositories, and using the coreos-assembler (cosa) tool.

5.1. Setting Up a RHEL 9 Build Environment

First, provision a RHEL 9 VM that will serve as our build machine.

cd /data/kvm

# LV creation helper function
create_lv() {
    var_vg=$1
    var_pool=$2
    var_lv=$3
    var_size=$4
    var_action=$5
    lvremove -f $var_vg/$var_lv || true
    if [ "$var_action" == "recreate" ]; then
      lvcreate --type thin -n $var_lv -V $var_size --thinpool $var_pool
      wipefs --all --force /dev/$var_vg/$var_lv
    fi
}

SNO_MEM=64
virsh destroy demo-01-test
virsh undefine demo-01-test
create_lv vgdata poolA lv-demo-01-test 200G recreate

# Install RHEL 9 using a kickstart file
virt-install --name=demo-01-test --vcpus=32 --ram=$(($SNO_MEM*1024)) \
--cpu=host-model \
--disk path=/dev/vgdata/lv-demo-01-test,device=disk,bus=virtio,format=raw \
--os-variant rhel9.6 \
--network bridge=br-ocp,model=virtio \
--graphics vnc,listen=127.0.0.1,port=58001 --noautoconsole \
--boot menu=on --location /data/kvm/rhel94.iso \
--initrd-inject helper-ks-rhel9.cfg --extra-args "inst.ks=file:/helper-ks-rhel9.cfg" 

# Inside the RHEL 9 VM, lock the release to 9.4 to match OCP 4.18 dependencies
sudo subscription-manager release --set=9.4

# Enable the correct EUS repositories for RHEL 9.4
sudo subscription-manager repos --disable="rhel-9-for-x86_64-baseos-rpms" --disable="rhel-9-for-x86_64-appstream-rpms"
sudo subscription-manager repos --enable="rhel-9-for-x86_64-baseos-eus-rpms" --enable="rhel-9-for-x86_64-appstream-eus-rpms"

5.2. Preparing RPM Repositories

The RHCOS build process requires access to all necessary RPMs. We will sync the required repositories to the build machine and host them locally via a simple web server.

# On the RHEL 9 build VM
sudo dnf install -y createrepo_c

mkdir -p /data/dnf/
cd /data/dnf/

# Sync all required RHEL and OpenShift repositories
dnf reposync --repoid=rhel-9-for-x86_64-baseos-eus-rpms -m --download-metadata --delete -n
dnf reposync --repoid=rhel-9-for-x86_64-appstream-eus-rpms -m --download-metadata --delete -n
dnf reposync --repoid=rhel-9-for-x86_64-nfv-rpms -m --download-metadata --delete -n
dnf reposync --repoid=fast-datapath-for-rhel-9-x86_64-rpms -m --download-metadata --delete -n
dnf reposync --repoid=rhocp-4.18-for-rhel-9-x86_64-rpms -m --download-metadata --delete -n
dnf reposync --repoid=rhocp-ironic-4.18-for-rhel-9-x86_64-rpms -m --download-metadata --delete -n
dnf reposync --repoid=ocp-tools-4.18-for-rhel-9-x86_64-rpms -m --download-metadata --delete -n
dnf reposync --repoid=cnv-4.18-for-rhel-9-x86_64-rpms -m --download-metadata --delete -n

# Create a custom repository for kernel packages and Metax drivers
mkdir -p /data/dnf/wzh-fix-repo
cd /data/dnf/wzh-fix-repo

# Download the specific kernel version required by metax gpu and its dependencies
PACKAGES_TO_QUERY=(
    "kernel-core-5.14.0-427.13.1.el9_4"
    "kernel-devel-5.14.0-427.13.1.el9_4"
    "kernel-headers-5.14.0-427.13.1.el9_4"
    "kernel-modules-5.14.0-427.13.1.el9_4"
    "kernel-modules-extra-5.14.0-427.13.1.el9_4"
    "kernel-5.14.0-427.13.1.el9_4"
)
COMBINED_DEPS_FILE=$(mktemp)
for pkg in "${PACKAGES_TO_QUERY[@]}"; do
    dnf repoquery -q --requires --resolve --recursive --queryformat '%{name}-%{version}-%{release}.%{arch}' "$pkg" >> "$COMBINED_DEPS_FILE"
    echo "$pkg" >> "$COMBINED_DEPS_FILE"
done
sort -u "$COMBINED_DEPS_FILE" | xargs sudo dnf download --destdir=.
rm -f "$COMBINED_DEPS_FILE"

# Download and extract Metax driver RPMs
mkdir -p /data/build/metax.base/tmp.driver
cd /data/build/metax.base/tmp.driver
wget -O metax-driver-3.1.0.11-rpm-x86_64.run "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.1.0.x/binary/x86_64/driver/metax-driver-3.1.0.11-rpm-x86_64.run"
bash metax-driver-3.1.0.11-rpm-x86_64.run --noexec --keep
# The mxgvm driver is not needed for this setup
/bin/rm -f ./metax-driver-3.1.0.11/mxgvm-3.0.11-1.x86_64.rpm
/bin/cp -f ./*.rpm /data/dnf/wzh-fix-repo/

# Create the repository index
cd /data/dnf/wzh-fix-repo/
createrepo_c ./

# Serve the repositories over HTTP
cd /data/dnf
python3 -m http.server

5.3. Building the RHCOS Image

With the repositories in place, we can now use coreos-assembler to build the custom image.

# On the RHEL 9 build VM
# Reference: https://github.com/wangzheng422/machine-os-content/tree/metax-ocp-4.18

# Set the container image for the build tool
export COREOS_ASSEMBLER_CONTAINER=quay.io/coreos-assembler/coreos-assembler:rhcos-4.18
podman pull $COREOS_ASSEMBLER_CONTAINER

mkdir -p /data/rhcos
cd /data/rhcos
rm -rf *

# Define the cosa helper function
cosa() {
   env | grep COREOS_ASSEMBLER
   local -r COREOS_ASSEMBLER_CONTAINER_LATEST="quay.io/coreos-assembler/coreos-assembler:latest"
   if [[ -z ${COREOS_ASSEMBLER_CONTAINER} ]] && $(podman image exists ${COREOS_ASSEMBLER_CONTAINER_LATEST}); then
       local -r cosa_build_date_str="$(podman inspect -f "{{.Created}}" ${COREOS_ASSEMBLER_CONTAINER_LATEST} | awk '{print $1}')"
       local -r cosa_build_date="$(date -d ${cosa_build_date_str} +%s)"
       if [[ $(date +%s) -ge $((cosa_build_date + 60*60*24*7)) ]] ; then
         echo -e "\e[0;33m----" >&2
         echo "The COSA container image is more that a week old and likely outdated." >&2
         echo "You should pull the latest version with:" >&2
         echo "podman pull ${COREOS_ASSEMBLER_CONTAINER_LATEST}" >&2
         echo -e "----\e[0m" >&2
         sleep 10
       fi
   fi
   set -x
   podman run --rm -ti --security-opt=label=disable --privileged                                    \
              --userns=keep-id:uid=1000,gid=1000                                                    \
              -v=${PWD}:/srv/ --device=/dev/kvm --device=/dev/fuse                                  \
              --tmpfs=/tmp -v=/var/tmp:/var/tmp --name=cosa                                         \
              ${COREOS_ASSEMBLER_CONFIG_GIT:+-v=$COREOS_ASSEMBLER_CONFIG_GIT:/srv/src/config/:ro}   \
              ${COREOS_ASSEMBLER_GIT:+-v=$COREOS_ASSEMBLER_GIT/src/:/usr/lib/coreos-assembler/:ro}  \
              ${COREOS_ASSEMBLER_ADD_CERTS:+-v=/etc/pki/ca-trust:/etc/pki/ca-trust:ro}              \
              ${COREOS_ASSEMBLER_CONTAINER_RUNTIME_ARGS}                                            \
              ${COREOS_ASSEMBLER_CONTAINER:-$COREOS_ASSEMBLER_CONTAINER_LATEST} "$@"
   rc=$?; set +x; return $rc
}

# Initialize the build environment from a forked machine-os-content repository
# This fork contains the necessary modifications to include the Metax drivers.
cosa init --branch metax-ocp-4.18 https://github.com/wangzheng422/machine-os-content --force

# Fetch all source RPMs
cosa fetch

# Build the RHCOS image. This will install the Metax driver RPMs into the image.
cosa build

# List the build artifacts
cosa list
# 418.94.202510070750-metax-0
#    Timestamp: 2025-10-07T07:57:37Z
#    Artifacts: ostree oci-manifest qemu

# Push the resulting container image to a registry
/bin/cp -f /run//user/0/containers/auth.json ./
chmod +r auth.json
cosa push-container --authfile ./auth.json "quay.io/wangzheng422/ocp"
# quay.io/wangzheng422/ocp:418.94.202509291625-metax-0-x86_64. -> the real based ok one
# quay.io/wangzheng422/ocp:418.94.202509300851-metax-0-x86_64 -> with driver
# quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64 -> driver without mxgvm
# quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64-sdk-v01 -> driver with sdk
# round 2
# quay.io/wangzheng422/ocp:418.94.202510070750-metax-0-x86_64 -> kernel downgrade only

6. Deploying and Using the Custom RHCOS Image

6.1. Applying the Custom Image to Worker Nodes

Once the custom RHCOS image is pushed to a registry accessible by the OpenShift cluster, we can apply a MachineConfig object to instruct the worker nodes to re-provision themselves using the new image.

# Create a MachineConfig to specify the custom osImageURL
tee $BASE_DIR/data/install/machine-config.yaml << 'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker 
  name: os-layer-custom-worker
spec:
  config:
    ignition:
      version: 3.4.0
  # The osImageURL should point to the image we just built and pushed
  osImageURL: mirror.infra.wzhlab.top:8443/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64-sdk-v01
EOF

# Apply the MachineConfig to the cluster
oc apply -f $BASE_DIR/data/install/machine-config.yaml

# oc delete -f $BASE_DIR/data/install/machine-config.yaml

# The Machine Config Operator (MCO) will now perform a rolling update of the worker nodes.
# You can monitor the progress with 'oc get mcp'

6.2. Verifying the Driver on a Node

After a worker node has been updated, you can SSH into it and verify that the Metax driver is functioning correctly using the mx-smi tool.

# On a worker node
mx-smi

The result should look like this:

mx-smi  version: 2.2.8

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Fri Oct 10 07:23:07 2025

Attached GPUs                                     : 2
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.8                        Kernel Mode Driver Version: 3.0.11          |
| MACA Version: unknown               BIOS Version: 1.27.5.0                      |
|------------------------------------+---------------------+----------------------+
| GPU     NAME         Persistence-M | Bus-id              | GPU-Util      sGPU-M |
| Temp    Pwr:Usage/Cap         Perf | Memory-Usage        | GPU-State            |
|====================================+=====================+======================|
| 0       MetaX C550             Off | 0000:06:00.0        | 0%            Native |
| 31C     55W / 450W              P0 | 858/65536 MiB       | Available            |
+------------------------------------+---------------------+----------------------+
| 1       MetaX C550             Off | 0000:07:00.0        | 0%            Native |
| 33C     55W / 450W              P0 | 858/65536 MiB       | Available            |
+------------------------------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  no process found                                                               |
+---------------------------------------------------------------------------------+

End of Log

6.3. Layering the Metax SDK onto RHCOS (Optional)

As an alternative to including the SDK in every application container, you can layer it directly into the RHCOS image. This simplifies application containers but requires the GPU vendor to support the CRI-O container runtime.

# On the RHEL 9 build VM
mkdir -p /data/build/metax.base/tmp
cd /data/build/metax.base/tmp

# Download and extract the SDK RPMs
wget -O maca-sdk-3.1.0.14-rpm-x86_64.tar.xz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.1.0.x/binary/x86_64/sdk/maca-sdk-3.1.0.14-rpm-x86_64.tar.xz"
tar vxf maca-sdk-3.1.0.14-rpm-x86_64.tar.xz

cd /data/build/metax.base

# Create a Dockerfile to layer the SDK on top of our custom RHCOS image
tee dockerfile << EOF 
FROM quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64

RUN --mount=type=bind,source=tmp/maca-sdk-3.1.0.14/rpm,target=/wzh/ \
    cd /wzh/ && \
    dnf install -y *.rpm && \
    dnf clean all
EOF

# Build and push the new SDK-enabled image
podman build --security-opt label=disable -t quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64-sdk-v01 -f dockerfile .
podman push quay.io/wangzheng422/ocp:418.94.202509301252-metax-0-x86_64-sdk-v01

7. Deploying the Metax GPU Operator

With the nodes running the correct RHCOS image, the final step is to deploy the Metax GPU Operator. This operator will manage the device plugins, which expose the GPUs as a schedulable resource in Kubernetes.

# On the helper node
# Download the Metax Kubernetes package
wget -O metax-gpu-k8s-package.0.12.0.tar.gz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/2.33.1.x/binary/x86_64/cloud/metax-gpu-k8s-package.0.12.0.tar.gz"
tar zvxf metax-gpu-k8s-package.0.12.0.tar.gz

# Push the operator container images to our internal registry
podman login mirror.infra.wzhlab.top:8443 -u admin -p redhat..
./metax-k8s-images.0.12.0.run push mirror.infra.wzhlab.top:8443/metax

# Create an ImageTagMirrorSet to redirect image pulls from the public Metax registry to our internal one
tee $BASE_DIR/data/install/image-tag-mirror-metax.yaml << EOF
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: itms-generic-metax
spec:
  imageTagMirrors:
  - mirrors:
    - mirror.infra.wzhlab.top:8443/metax
    source: cr.metax-tech.com/cloud
  - mirrors:
    - mirror.infra.wzhlab.top:8443/public-library
    source: cr.metax-tech.com/public-library
  - mirrors:
    - mirror.infra.wzhlab.top:8443/public-ai-release/maca
    source: cr.metax-tech.com/public-ai-release/maca
  - mirrors:
    - mirror.infra.wzhlab.top:8443/public-cloud-release
    source: cr.metax-tech.com/public-cloud-release
EOF
oc apply -f $BASE_DIR/data/install/image-tag-mirror-metax.yaml

# Use a patched version of the Helm chart for OpenShift compatibility
git clone https://github.com/wangzheng422/metax-operator

# Install the operator using Helm
helm install ./metax-operator \
--create-namespace -n metax-operator \
--generate-name \
--wait \
--set registry=mirror.infra.wzhlab.top:8443/metax \
--set minimalMode=true

# Patch the daemonset to use the correct service account
oc patch daemonset metax-gpu-label -n metax-operator --type='merge' -p '{"spec":{"template":{"spec":{"serviceAccountName":"metax-operator"}}}}'

# To uninstall the operator:
# chart=$(helm list -q -f "metax" -n metax-operator)
# if [[ -n $chart ]]; then
#     helm uninstall $chart -n metax-operator --wait
# fi

8. Running GPU Workloads

8.1. Verifying GPU Access with a Demo Pod

Now that the operator is running, we can deploy a simple pod that requests a GPU resource to verify that the entire stack is working.

# Save as gpu-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  serviceAccountName: metax-operator
  containers:
  - name: vector-add
    image: cr.metax-tech.com/public-library/maca-native:3.1.0.3-centos9-amd64
    command: [
        "bash",
        "-c",
        "cp -r /opt/maca/samples/0_Introduction/vectorAdd ./;
        cd ./vectorAdd;
        mxcc -x maca vectorAdd.cpp -o vectorAdd --maca-path=/opt/maca;
        ./vectorAdd > vectoradd_exec_output.log;
        tail -f /dev/null",
    ]
    resources:
      limits:
        metax-tech.com/gpu: 1  # Request 1 GPU

Apply this YAML with oc apply -f gpu-demo.yaml. If the pod starts successfully, the GPU is correctly configured.

8.2. Running a VLLM Benchmark

As a more advanced test, we can deploy a VLLM (vLLM is a fast and easy-to-use library for LLM inference and serving) benchmark pod.

Reference: https://developer.metax-tech.com/doc/242

# Save as vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      serviceAccountName: metax-maca
      containers:
      - name: vllm
        image: cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:maca.ai2.33.1.12-torch2.6-py310-centos9-amd64
        command: [ "bash", "-c", "tail -f /dev/null" ]
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        resources:
          limits:
            metax-tech.com/gpu: 1
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 100Gi

After deploying, exec into the pod terminal and run the benchmark:

# In the pod's terminal
dnf install -y python-pip git

# Download the model
mkdir /model
cd /model
pip install modelscope
cat > d.py <<EOF
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct',cache_dir='./')
EOF
python d.py

# Run the benchmark
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
vllm bench throughput \
    --input-len 512 \
    --output-len 512 \
    --model /model/Qwen/Qwen2___5-7B-Instruct/  \
    --dtype bfloat16

You will see the output, the performance bench runs smoothly.

9. Alternative: Dynamic Kernel Module Injection with GPU Operator

9.1. Overview

In addition to building a custom RHCOS image, Metax provides an alternative, more dynamic method for enabling GPUs on RHCOS nodes: a Helm-based GPU operator. This operator is designed to automatically detect the host’s kernel version and inject a compatible, pre-compiled Metax kernel driver module at runtime. This approach eliminates the need for manual RHCOS image builds when the underlying kernel version changes, offering greater flexibility and simplifying maintenance.

This section details the process of deploying the Metax GPU Operator on an OpenShift cluster.

9.2. Prerequisites and Compatibility

The dynamic driver injection relies on pre-compiled driver images that are compatible with specific OpenShift and kernel versions. Before proceeding, ensure your environment aligns with the supported versions.

Supported OCP and Kernel Versions:

OpenShift Version	RHCOS Kernel Version
4.18.27	5.14.0-427.96.1.el9_4
4.19.17	5.14.0-570.54.1.el9_6

9.3. Step 1: Mirroring Operator and Driver Images

First, the necessary container images for the operator and the specific driver version must be downloaded and pushed to your internal container registry.

# Push the main operator images to the local registry
./metax-k8s-images.1.0.0-20251113-626.run push mirror.infra.wzhlab.top:8443/metax

# Push the specific pre-compiled driver image to the local registry
bash ./metax-k8s-driver-image.20251113-631-x86_64.run push mirror.infra.wzhlab.top:8443/metax
# The original image location is cr.metax-tech.com/cloud/driver-image:20251113-631-amd64
# It will be mirrored to mirror.infra.wzhlab.top:8443/metax/driver-image:20251113-631-amd64

9.4. Step 2: Installing the GPU Operator via Helm

With the images available in the local registry, use Helm to install the Metax GPU Operator. The installation is configured to point to the internal registry and specifies the exact driver image to be used.

# Install the operator using the provided Helm chart
helm install metax-operator ./metax-operator \
--create-namespace -n metax-operator \
--wait \
--set openshift.enabled=true \
--set registry=mirror.infra.wzhlab.top:8443/metax \
--set maca.payload.registry=cr.metax-tech.com/public-library \
--set maca.payload.images[0]=maca-native:3.2.1.4-centos9-amd64 \
--set runtime.deploy=false \
--set driver.payload.name=driver-image \
--set driver.payload.version=20251113-631-amd64

To uninstall the Helm chart:

# Find the chart name and uninstall it
chart=$(helm list -q -f "metax" -n metax-operator)
if [[ -n $chart ]]; then
    helm uninstall $chart -n metax-operator --wait
fi

9.5. Step 3: Post-Installation Adjustments

After the Helm chart is installed, the operator pod will start and attempt to deploy several DaemonSets. In a setup where a dedicated runtime is not yet supported or required, the node selectors on these DaemonSets must be patched to allow them to run on all targeted worker nodes.

This is necessary because the default configuration expects nodes to be labeled with metax-tech.com/runtime.ready and metax-tech.com/maca.ready, which may not be present. Removing these selectors ensures the driver and device plugin pods are scheduled correctly.

# Patch the metax-driver DaemonSet to remove the runtime.ready node selector
oc patch daemonset metax-driver -n metax-operator \
  --type='json' \
  -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/metax-tech.com~1runtime.ready"}]'

# Patch the metax-gpu-device DaemonSet to remove both maca.ready and runtime.ready selectors
oc patch daemonset metax-gpu-device -n metax-operator \
  --type='json' \
  -p='[
    {"op": "remove", "path": "/spec/template/spec/nodeSelector/metax-tech.com~1maca.ready"},
    {"op": "remove", "path": "/spec/template/spec/nodeSelector/metax-tech.com~1runtime.ready"}
  ]'

9.6. Installing with GPU Metrics Export

Monitoring GPU metrics is essential for AI workloads. The mx-exporter component provides Prometheus-compatible metrics for Metax GPUs.

# Download the container image and push it to the internal registry
wget -O mx-exporter.0.13.1.tgz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.2.1.x/binary/x86_64/cloud/mx-exporter.0.13.1.tgz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1763414110&Signature=2%2BovY20wsBaYuHrw%2FMhdY%2B%2FkdaE%3D"

# Extract and load the image
xz -d mx-exporter-0.13.1-amd64.xz 
podman image load -i mx-exporter-0.13.1-amd64

# Tag and push to the local registry
podman tag cr.metax-tech.com/cloud/mx-exporter:0.13.1 mirror.infra.wzhlab.top:8443/metax/mx-exporter:0.13.1-amd64
podman push mirror.infra.wzhlab.top:8443/metax/mx-exporter:0.13.1-amd64

# Install the operator with dataExporter enabled
helm install ./metax-operator \
--create-namespace -n metax-operator \
--generate-name \
--wait \
--set openshift.enabled=true \
--set dataExporter.deploy=true \
--set dataExporter.image.name=mx-exporter \
--set dataExporter.image.version=0.13.1-amd64 \
--set registry=mirror.infra.wzhlab.top:8443/metax \
--set maca.payload.registry=cr.metax-tech.com/public-library \
--set maca.payload.images[0]=maca-native:3.2.1.4-centos9-amd64 \
--set runtime.deploy=false \
--set driver.payload.name=driver-image \
--set driver.payload.version=20251113-631-amd64

# Apply security context constraints (SCC) patch for the metrics exporter
oc patch daemonset metax-data-exporter -n metax-operator --type='merge' -p '{"spec":{"template":{"spec":{"serviceAccount":"metax-driver"}}}}'

Enable user workload monitoring in OpenShift:

# Enable monitoring for non-platform namespaces
cat << EOF > ${BASE_DIR}/data/install/enable-monitor.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 
EOF

oc apply -f ${BASE_DIR}/data/install/enable-monitor.yaml

# Verify monitoring pods
oc -n openshift-user-workload-monitoring get pod
# NAME                                   READY   STATUS    RESTARTS   AGE
# prometheus-operator-6f766b4885-r92hx   2/2     Running   0          10s
# prometheus-user-workload-0             5/6     Running   0          9s
# prometheus-user-workload-1             5/6     Running   0          9s
# thanos-ruler-user-workload-0           4/4     Running   0          8s
# thanos-ruler-user-workload-1           4/4     Running   0          8s

oc label namespace metax-operator "openshift.io/cluster-monitoring=true"

Configure a ServiceMonitor to scrape metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metax-data-exporter
  namespace: metax-operator
spec:
  selector:
    matchLabels:
      # Match the application label of the exporter service
      app: metax-data-exporter
  endpoints:
  - port: metrics # Match the port name in the Service definition
    path: /metrics
    interval: 30s

you can check the result by access the api endpoint

curl http://metax-data-exporter:8000/metrics

and snipper of the result should looks like

# HELP mx_device_type Device type
# TYPE mx_device_type gauge
mx_device_type{deviceId="0",deviceType="MXC550",dieId="0",uuid="GPU-f31ed7f5-a28c-7df9-cd52-817776f4af03"} 1.0
mx_device_type{deviceId="1",deviceType="MXC550",dieId="0",uuid="GPU-11a13239-7d0a-7dd4-7b2b-4643afceafdc"} 1.0
# HELP mx_bios_ver Bios version
# TYPE mx_bios_ver gauge
mx_bios_ver{bios="1.27.5.0",deviceId="0",dieId="0"} 1.0
mx_bios_ver{bios="1.27.5.0",deviceId="1",dieId="0"} 1.0

10. Metax Kubernetes Operator v0.13.2 (December 2025 Update)

In December 2025, a new version of the Metax Kubernetes Operator (v0.13.2) was released. This version introduces updated drivers and improved support for containerized environments.

10.1. Downloading and Mirroring Components

The update involves downloading the driver image, the K8s package, and the latest mx-exporter.

# Download Driver Image runfile
wget -O metax-k8s-driver-image.3.3.0.4-x86_64.run "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.3.0.x/binary/x86_64/driver/metax-k8s-driver-image.3.3.0.4-x86_64.run?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1765732067&Signature=W3jPYwbZ7lwOfeJsz5qx%2BY8UYjI%3D"

# Download GPU K8s Package
wget -O metax-gpu-k8s-package.0.13.2.tar.gz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.0.0.x/binary/x86_64/cloud/metax-gpu-k8s-package.0.13.2.tar.gz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1765732096&Signature=FLNBs0R%2FLuFHSl2AuA9UebjWJio%3D"

# Download Latest mx-exporter
wget -O mx-exporter.0.13.2.tgz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.0.0.x/binary/x86_64/cloud/mx-exporter.0.13.2.tgz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1765733381&Signature=w8G17CfpCEiDCyR2H2wuFosNA00%3D"

# Download MACA Native Container Image
wget -O maca-native-3.3.0.4-centos9-amd64.container.xz "https://metax-pub.oss-cn-shanghai.aliyuncs.com/mxmaca2.0/3.3.0.x/binary/x86_64/container/maca-native-3.3.0.4-centos9-amd64.container.xz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1765733547&Signature=CQ9dtRzmvsY12rJD0xetvuRWCtg%3D"

# Log in and pull vLLM image
docker login --username=cr_temp_user --password=eyJpbnN0YW5jZUlkIjoiY3JpLXpxYTIzejI2YTU5M3R3M2QiLCJ0aW1lIjoiMTc2NTg1NzIzMTAwMCIsInR5cGUiOiJzdWIiLCJ1c2VySWQiOiIyMDcwOTQwMTA1NjYzNDE3OTIifQ:2e9ca368d0ac590ab9a6e1841bb9ca8064c1ce41 cr.metax-tech.com && docker pull cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-centos9-amd64

Mirroring the components to the local registry:

# Push driver images
bash ./metax-k8s-driver-image.3.3.0.4-x86_64.run push mirror.infra.wzhlab.top:8443/metax

# Extract K8s package
tar vxf metax-gpu-k8s-package.0.13.2.tar.gz

# Push operator images
bash metax-k8s-images.0.13.2.run push mirror.infra.wzhlab.top:8443/metax

# Load and push MACA native image
podman image load -i maca-native-3.3.0.4-centos9-amd64.container.xz
podman tag docker.io/library/maca-native:3.3.0.4-centos9-amd64 mirror.infra.wzhlab.top:8443/metax/maca-native:3.3.0.4-centos9-amd64
podman push mirror.infra.wzhlab.top:8443/metax/maca-native:3.3.0.4-centos9-amd64

# Load and push mx-exporter
tar vxf mx-exporter.0.13.2.tgz
podman image load -i mx-exporter/mx-exporter-0.13.2-amd64.xz
podman tag cr.metax-tech.com/cloud/mx-exporter:0.13.2 mirror.infra.wzhlab.top:8443/metax/mx-exporter:0.13.2
podman push mirror.infra.wzhlab.top:8443/metax/mx-exporter:0.13.2

# Mirror vLLM to local registry
podman tag cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-centos9-amd64 mirror.infra.wzhlab.top:8443/metax/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-centos9-amd64
podman push mirror.infra.wzhlab.top:8443/metax/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-centos9-amd64

10.2. Building a Unified Driver Image for Multiple Kernels

Due to limitations in official driver images regarding specific kernel support on OpenShift, we create a unified driver image by merging modules from different versions.

# driver-image.dockerfile
FROM mirror.infra.wzhlab.top:8443/metax/driver-image:20251113-631-amd64 as driver

FROM mirror.infra.wzhlab.top:8443/metax/driver-image:3.3.0.4-amd64

# Copy kernel modules for targeted RHCOS versions
COPY --from=driver /metax/kernel_module/5.14.0-570.54.1.el9_6.x86_64 /metax/kernel_module/5.14.0-570.54.1.el9_6.x86_64
COPY --from=driver /metax/kernel_module/5.14.0-427.96.1.el9_4.x86_64 /metax/kernel_module/5.14.0-427.96.1.el9_4.x86_64

Build and push the customized driver image:

# Build and push to internal registry
podman build -t mirror.infra.wzhlab.top:8443/metax/driver-image:3.3.0.4-amd64-ocp-v02 -f driver-image.dockerfile ./
podman push mirror.infra.wzhlab.top:8443/metax/driver-image:3.3.0.4-amd64-ocp-v02

# Optionally push to external registry
podman tag mirror.infra.wzhlab.top:8443/metax/driver-image:3.3.0.4-amd64-ocp-v02 quay.io/wangzheng422/metax/driver-image:3.3.0.4-amd64-ocp-v02
podman push quay.io/wangzheng422/metax/driver-image:3.3.0.4-amd64-ocp-v02

10.3. Deploying the Updated Operator

Deploying the v0.13.2 operator using the customized driver image.

# Extract the Helm chart
tar vxf metax-operator-0.13.2.tgz

# Install via Helm
helm install metax-operator ./metax-operator \
--create-namespace -n metax-operator \
--wait \
--set openshift.enabled=true \
--set dataExporter.deploy=false \
--set runtime.deploy=false \
--set maca.deploy=false \
--set dataExporter.image.name=mx-exporter \
--set dataExporter.image.version=0.13.2 \
--set registry=mirror.infra.wzhlab.top:8443/metax \
--set maca.payload.images[0]=maca-native:3.3.0.4-centos9-amd64 \
--set driver.payload.name=driver-image \
--set driver.payload.version=3.3.0.4-amd64-ocp-v02

Post-installation permissions setup for OpenShift SCC:

# Add privileged SCC to necessary service accounts
oc adm policy add-scc-to-user privileged -z metax-gpu-label -n metax-operator
oc adm policy add-scc-to-user privileged -z metax-container-runtime -n metax-operator
oc adm policy add-scc-to-user privileged -z metax-driver -n metax-operator
oc adm policy add-scc-to-user privileged -z metax-gpu-device -n metax-operator

Manual node labeling workaround (required if operator auto-labeling fails):

# Manually set ready labels for components
oc label node worker-01-demo metax-tech.com/maca.ready=true metax-tech.com/runtime.ready=true
oc label node worker-02-demo metax-tech.com/maca.ready=true metax-tech.com/runtime.ready=true

11. Custom Driver Loader for RHCOS / RHEL 9.6

For OpenShift clusters based on RHCOS / RHEL 9.6, the standard Metax kernel driver loader may fail to initialize properly without specific parameters. To resolve this, we implement a custom driver loader that manually handles ACS (Access Control Services) configuration and kernel module insertion with proper memory registration function addresses.

Note on ACS (Access Control Services): While ACS is a PCIe capability that is typically managed within the hardware BIOS settings (often found under PCIe configuration or virtualization/IOMMU settings), it is not always accessible in cloud or restricted environments. The script below provides a software-based alternative using setpci to disable ACS isolation at the OS level, which is necessary for efficient Peer-to-Peer (P2P) communication between GPUs.

11.1. Building the Driver Init Image

We use a UBI9-based image with essential tools like pciutils and insmod.

# driver-init.dockerfile
FROM registry.access.redhat.com/ubi9/ubi:latest

RUN dnf install -y pciutils /usr/sbin/insmod && \
    dnf clean all

# Build and push the init image
podman build -t quay.io/wangzheng422/qimgs:driver-init-2025.12.15-v01 -f driver-init.dockerfile ./
podman push quay.io/wangzheng422/qimgs:driver-init-2025.12.15-v01

11.2. Configuring Driver Loader Scripts

These scripts handle hardware preparation and driver loading.

apiVersion: v1
kind: ConfigMap
metadata:
  name: metax-driver-scripts
  namespace: metax-operator
data:
  # Script 1: Disable ACS (Software-based alternative to BIOS setting)
  disable-acs.sh: |
    #!/bin/bash
    set -e
    echo ">>> Starting ACS disable script..."
    for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
      # Check for ACS support
      setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
      if [ $? -ne 0 ]; then continue; fi
      # Disable ACS for the device via pci register manipulation
      setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
      echo "Disabled ACS on device $BDF"
    done
    echo ">>> ACS disable script finished."

  # Script 2: Load Kernel Module with Address Resolution
  load-metax.sh: |
    #!/bin/bash
    set -e
    KERNEL_VER=$(uname -r)
    # Find the appropriate metax.ko
    KO_PATH=$(find /metax -name "metax.ko" | head -n 1)
    if [ -z "$KO_PATH" ]; then
        echo "ERROR: metax.ko not found for kernel $KERNEL_VER"; exit 1
    fi
    echo "Found driver at: $KO_PATH"

    # Remove existing module if present
    if lsmod | grep -q "metax"; then
        echo "Module metax is already loaded. Attempting to remove..."
        rmmod metax || true
    fi

    # Retrieve memory registration function addresses from host kallsyms
    PROC_FILE="/host_proc/kallsyms"
    if [ ! -f "$PROC_FILE" ]; then PROC_FILE="/proc/kallsyms"; fi

    REG_FUNC="0x$(grep "T ib_register_peer_memory_client" $PROC_FILE | awk '{print $1}' | head -n 1)"
    UNREG_FUNC="0x$(grep "T ib_unregister_peer_memory_client" $PROC_FILE | awk '{print $1}' | head -n 1)"

    if [ -z "$REG_FUNC" ] || [ "$REG_FUNC" == "0x" ]; then
        echo "Error: Could not find ib_register_peer_memory_client symbol."; exit 1
    fi

    echo "ib_register_peer_memory_client addr:   $REG_FUNC"
    echo "ib_unregister_peer_memory_client addr: $UNREG_FUNC"

    # Insert module with resolved addresses
    echo "Installing metax driver..."
    insmod $KO_PATH ib_reg_addr=$REG_FUNC ib_unreg_addr=$UNREG_FUNC
    
    if lsmod | grep -q "metax"; then
        echo "SUCCESS: metax module loaded."
    else
        echo "FAIL: metax module failed to load."; exit 1
    fi

  # Entrypoint Script
  entrypoint.sh: |
    #!/bin/bash
    bash /scripts/disable-acs.sh
    bash /scripts/load-metax.sh
    echo ">>> All tasks done. Keeping pod alive..."
    while true; do sleep 3600; done

11.3. Deploying the Driver Loader DaemonSet

The DaemonSet uses an initContainer to extract modules and a main container to execute the loading logic.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metax-driver-loader
  namespace: metax-operator
  labels:
    app: metax-driver-loader
spec:
  selector:
    matchLabels:
      app: metax-driver-loader
  template:
    metadata:
      labels:
        app: metax-driver-loader
    spec:
      hostPID: true
      hostNetwork: true
      serviceAccountName: metax-driver
      nodeSelector:
        metax-tech.com/gpu.installed: 'true'
        
      initContainers:
      - name: driver-extractor
        image: mirror.infra.wzhlab.top:8443/metax/driver-image:3.3.0.4-amd64-ocp-v02
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Extracting driver modules..."
            mkdir -p /shared-driver/metax
            cp -r /metax/kernel_module/$(uname -r)/* /shared-driver/metax/
            echo "Extraction complete."
        volumeMounts:
        - name: driver-share
          mountPath: /shared-driver

      containers:
      - name: loader
        image: quay.io/wangzheng422/qimgs:driver-init-2025.12.15-v01
        imagePullPolicy: IfNotPresent
        command: ["/bin/bash", "/scripts/entrypoint.sh"]
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_MODULE", "SYS_ADMIN"]
        volumeMounts:
        - name: driver-share
          mountPath: /metax
        - name: scripts
          mountPath: /scripts
        - name: host-proc
          mountPath: /host_proc
          readOnly: true
        - name: sys
          mountPath: /sys
        - name: modules
          mountPath: /lib/modules
        - name: host-dev
          mountPath: /dev

      volumes:
      - name: driver-share
        emptyDir: {}
      - name: scripts
        configMap:
          name: metax-driver-scripts
          defaultMode: 0755
      - name: host-proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: modules
        hostPath:
          path: /lib/modules
      - name: host-dev
        hostPath:
          path: /dev

12. Optimizing OpenShift for Large Language Models (LLMs)

Running massive models like Qwen3-235B using vLLM requires high resource limits (PIDs and memory). vLLM often uses Ray for distributed execution, which creates numerous processes.

12.1. CRI-O Configuration for High Process Limits

We increase the PID limit and set essential ulimits via a MachineConfig.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-crio-ulimits
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            # URL-encoded TOML content for /etc/crio/crio.conf.d/99-custom-ulimits.conf
            source: data:,%5Bcrio.runtime%5D%0Apids_limit%20%3D%200%0Adefault_ulimits%20%3D%20%5B%0A%20%20%20%20%22nproc%3D1048576%3A1048576%22%2C%0A%20%20%20%20%22memlock%3D-1%3A-1%22%2C%0A%20%20%20%20%22nofile%3D1048576%3A1048576%22%2C%0A%5D%0A
          mode: 420
          overwrite: true
          path: /etc/crio/crio.conf.d/99-custom-ulimits.conf

The content of the machine config come with the output of this command

python3 -c 'import urllib.parse; print("data:," + urllib.parse.quote("""[crio.runtime]
pids_limit = 0
default_ulimits = [
    "nproc=1048576:1048576",
    "memlock=-1:-1",
    "nofile=1048576:1048576",
]
"""))'

12.2. Kubelet Configuration for PID Management

Ensuring the Kubelet does not restrict the number of processes in pods.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-pids-limit-unlimited
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" 
  kubeletConfig:
    # Set to -1 to utilize host-level pid_max
    podPidsLimit: -1

13. Deploying AI Inference Services

13.1. NVIDIA Network Operator

For multi-node inference workloads, efficient network layer communication is critical. This is typically achieved using collective communication libraries like NCCL (NVIDIA Collective Communications Library) or MCCL (MetaX Collective Communications Library).

In MetaX environments, MCCL is designed to leverage high-performance network adapters, such as the NVIDIA ConnectX-5 and ConnectX-7 series. To enable and manage these hardware capabilities within a Kubernetes/OpenShift cluster, the NVIDIA Network Operator is required. This operator automates the deployment and management of the necessary networking components, including drivers, device plugins, and secondary network interfaces.

The following YAML snippet provides a reference configuration for the NicClusterPolicy, which is the primary custom resource used to configure the operator:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  creationTimestamp: '2025-11-29T12:46:53Z'
  generation: 3
  name: nic-cluster-policy
  resourceVersion: '4804802'
  uid: d573a4e9-d1bf-4441-9a09-8372522260a2
spec:
  ofedDriver:
    imagePullSecrets: []
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    repository: nvcr.io/nvidia/mellanox
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    env:
      - name: UNLOAD_STORAGE_MODULES
        value: 'true'
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ''
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      safeLoad: false
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    version: doca3.1.0-25.07-0.9.7.0-0
    image: doca-driver
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "ib",
            "rdmaHcaMax": 63,
            "devices": ["ibs2", "ibs3"]
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    imagePullSecrets: []
    repository: nvcr.io/nvidia/mellanox
    version: 'sha256:a87096761d155eeb6f470e042d2d167bb466d57e63b4aba957f57d745e15a9b2'
status:
  appliedStates:
    - name: state-multus-cni
      state: ignore
    - name: state-container-networking-plugins
      state: ignore
    - name: state-ipoib-cni
      state: ignore
    - name: state-whereabouts-cni
      state: ignore
    - name: state-OFED
      state: ready
    - name: state-SRIOV-device-plugin
      state: ignore
    - name: state-RDMA-device-plugin
      state: ready
    - name: state-ib-kubernetes
      state: ignore
    - name: state-nv-ipam-cni
      state: ignore
    - name: state-nic-feature-discovery
      state: ignore
    - name: state-doca-telemetry-service
      state: ignore
    - name: state-nic-configuration-operator
      state: ignore
    - name: state-spectrum-x-operator
      state: ignore
  state: ready

13.2. Standard Inference with KServe

Example of a ServingRuntime and InferenceService for models like Qwen-32B.

# ServingRuntime for vLLM on MetaX
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["metax-tech.com/gpu"]'
    openshift.io/display-name: vLLM MetaX Runtime
  name: vllm-maca-runtime
spec:
  annotations:
    opendatahub.io/kserve-runtime: vllm
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - serve
        - --port=8080
      command:
        - /opt/conda/bin/vllm
      image: quay.ocp.usuanova.com:8443/metax/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-centos9-amd64
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
      env:
        - name: HOME
          value: /tmp
        - name: LD_LIBRARY_PATH
          value: $LD_LIBRARY_PATH:/opt/maca/lib64
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM

# InferenceService for Qwen-32B
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-32b-service
  namespace: models
  annotations:
    openshift.io/display-name: wzh-test
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/stop: 'true'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    automountServiceAccountToken: false
    imagePullSecrets:
      - name: wzh-oci
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
        - '--model=/mnt/models/snapshots/9216db5781bf21249d130ec9da846c4624c16137'
        - '--served-model-name=Qwen'
        - '--tensor-parallel-size'
        - '4'
        - '--max-model-len=10240'
        - '--trust-remote-code'
        - '--gpu-memory-utilization'
        - '0.95'
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          metax-tech.com/gpu: '4'
        requests:
          cpu: '64'
          memory: 640Gi
          metax-tech.com/gpu: '4'
      runtime: wzh-test
      storageUri: 'oci://quay.io/jonkey/models/qwen3:32b'

13.3. Distributed Inference with LeaderWorkerSet

For extremely large models (e.g., Qwen-235B) requiring multiple nodes, use the LeaderWorkerSet operator.

Critical Tip: Enabling hostNetwork: true is vital for multi-node Ray communication. Bypassing OVN can reduce First Token To Time (FTTT) latency by up to 500ms.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - name: vllm-leader
            image: quay.ocp.usuanova.com:8443/metax/maca/vllm:maca.ai3.1.0.7-torch2.6-py310-centos9-amd64
            securityContext:
              allowPrivilegeEscalation: true
              capabilities:
                add: ["NET_ADMIN", "IPC_LOCK"]
              privileged: true
            command:
              - bash
              - -lc
              - |
                bash /scripts/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE)
                
                vllm serve /model/Qwen3-235B-A22B-Instruct-2507 \
                  --port 18080 \
                  --served-model-name wzh-model \
                  --trust-remote-code \
                  --max-model-len=32768 \
                  --max-num-seqs=128 \
                  --gpu-memory-utilization 0.90 \
                  --tensor-parallel-size 8 \
                  --pipeline_parallel_size 2
            env:
              - name: HOME
                value: /tmp
              # - name: VLLM_LOGGING_LEVEL
              #   value: "DEBUG"   
              - name: GLOO_SOCKET_IFNAME
                value: br-ex
              - name: MCCL_IB_HCA
                value: "mlx5_0,mlx5_1"
              # - name: MCCL_DEBUG
              #   value: "TRACE"
              # - name: MCCL_DEBUG_SUBSYS
              #   value: "INIT,NET"
              # --- 新增环境变量 Start ---
              - name: FORCE_ACTIVE_WAIT
                value: "2"
              - name: MCCL_CROSS_NIC
                value: "1"
              # --- 新增环境变量 End ---
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            resources:
              limits:
                # cpu: "180"
                # memory: "640Gi"
                rdma/ib: 2
                metax-tech.com/gpu: '8'
              requests:
                cpu: "120"
                memory: "640Gi"
                rdma/ib: 2
                metax-tech.com/gpu: '8'
            ports:
              - containerPort: 18080
            readinessProbe:
              tcpSocket:
                port: 18080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - name: cache-volume
                mountPath: /model
              - name: script-volume
                mountPath: /scripts
              - name: dshm
                mountPath: /dev/shm
        volumes:
        - name: cache-volume
          hostPath:
            path: /mnt/wzh
            type: Directory
        - name: script-volume
          configMap:
            name: scripts
            defaultMode: 0755 
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 100Gi
    workerTemplate:
      spec:
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - name: vllm-worker
            image: quay.ocp.usuanova.com:8443/metax/maca/vllm:maca.ai3.1.0.7-torch2.6-py310-centos9-amd64
            securityContext:
              allowPrivilegeEscalation: true
              capabilities:
                add: ["NET_ADMIN", "IPC_LOCK"]
              privileged: true
            command:
              - bash
              - -lc
              - "bash /scripts/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            env:
              - name: HOME
                value: /tmp    
              # - name: VLLM_LOGGING_LEVEL
              #   value: "DEBUG"              
              - name: GLOO_SOCKET_IFNAME
                value: br-ex
              - name: MCCL_IB_HCA
                value: "mlx5_0,mlx5_1"
              # - name: MCCL_DEBUG
              #   value: "TRACE"
              # - name: MCCL_DEBUG_SUBSYS
              #   value: "INIT,NET"
              # --- 新增环境变量 Start ---
              - name: FORCE_ACTIVE_WAIT
                value: "2"
              - name: MCCL_CROSS_NIC
                value: "1"
              # --- 新增环境变量 End ---
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            resources:
              limits:
                # cpu: "180"
                # memory: "640Gi"
                rdma/ib: 2
                metax-tech.com/gpu: '8'
              requests:
                cpu: "120"
                memory: "640Gi"
                rdma/ib: 2
                metax-tech.com/gpu: '8'
            volumeMounts:
              - name: cache-volume
                mountPath: /model
              - name: script-volume
                mountPath: /scripts
              - name: dshm
                mountPath: /dev/shm
        volumes:
        - name: cache-volume
          hostPath:
            path: /mnt/wzh
            type: Directory
        - name: script-volume
          configMap:
            name: scripts
            defaultMode: 0755 
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 100Gi