← Back to Index

ROSA ENA Network Metrics Monitoring Deployment Guide

Architecture Overview

flowchart TB
            subgraph ROSA HCP OCP 4.20 / AWS us-east-2
                subgraph Worker Node 1 m6a.xlarge
                    LG[load-generator Pod<br/>curl x50 per batch<br/>Generates network traffic]
                    ENA1[ENA NIC ens5<br/>bw_in_allowance_exceeded<br/>bw_out_allowance_exceeded<br/>pps_allowance_exceeded<br/>conntrack_allowance_exceeded<br/>linklocal_allowance_exceeded<br/>conntrack_allowance_available]
                    EXP1[ENA Exporter Pod<br/>DaemonSet<br/>Image ubi9/python-311<br/>port 9101<br/>hostPID true + privileged]
                end
                subgraph Worker Node 2 m6a.xlarge
                    ENA2[ENA NIC ens5<br/>bw_in_allowance_exceeded<br/>bw_out_allowance_exceeded<br/>pps_allowance_exceeded<br/>conntrack_allowance_exceeded<br/>linklocal_allowance_exceeded<br/>conntrack_allowance_available]
                    EXP2[ENA Exporter Pod<br/>DaemonSet<br/>Image ubi9/python-311<br/>port 9101<br/>hostPID true + privileged]
                end
                subgraph openshift-user-workload-monitoring
                    SVC[Headless Service<br/>port 9101 / clusterIP None]
                    SM[ServiceMonitor<br/>interval 30s / path /metrics]
                    PROM[Prometheus user-workload<br/>node_ena_ time series]
                end
                subgraph openshift-monitoring
                    THANOS[Thanos Querier<br/>Federates all metrics]
                    CONSOLE[OCP Console<br/>Observe / Metrics<br/>Query Browser]
                    AM[AlertManager<br/>PrometheusRule ena-alerts<br/>ENABandwidthInThrottled warning<br/>ENAPPSThrottled warning<br/>ENAConntrackThrottled critical<br/>ENALinklocalThrottled critical]
                end
            end

            LG -- "network traffic" --> ENA1
            ENA1 -- "nsenter ethtool -S" --> EXP1
            ENA2 -- "nsenter ethtool -S" --> EXP2
            EXP1 -- "HTTP /metrics" --> SVC
            EXP2 -- "HTTP /metrics" --> SVC
            SM -- "discovers endpoints" --> SVC
            PROM -- "scrapes every 30s" --> SM
            THANOS -- "federated query" --> PROM
            CONSOLE -- "PromQL" --> THANOS
            AM -- "alert rules" --> PROM

            style LG fill:#FFE6CC,stroke:#D79B00
            style ENA1 fill:#F8CECC,stroke:#B85450
            style ENA2 fill:#F8CECC,stroke:#B85450
            style EXP1 fill:#D5E8D4,stroke:#82B366
            style EXP2 fill:#D5E8D4,stroke:#82B366
            style SVC fill:#E1D5E7,stroke:#9673A6
            style SM fill:#E1D5E7,stroke:#9673A6
            style PROM fill:#E1D5E7,stroke:#9673A6
            style THANOS fill:#F0E6FF,stroke:#7B4FA6
            style CONSOLE fill:#F0E6FF,stroke:#7B4FA6
            style AM fill:#F0E6FF,stroke:#7B4FA6

Background and Objectives

ROSA (Red Hat OpenShift Service on AWS) runs on AWS EC2 instances that use the ENA (Elastic Network Adapter) network driver. The AWS ENA driver exposes a set of critical network performance throttling metrics that are essential for diagnosing network bottlenecks:

Metric Description
bw_in_allowance_exceeded Number of times inbound bandwidth exceeded the EC2 instance limit (m6a.xlarge baseline: 1.562 Gbps)
bw_out_allowance_exceeded Number of times outbound bandwidth exceeded the instance limit
pps_allowance_exceeded Number of times packets-per-second exceeded the limit (easily triggered by high-frequency API calls or service mesh)
conntrack_allowance_exceeded Number of times the connection tracking table exceeded the limit (equivalent to “connection exhaustion” at the AWS layer)
linklocal_allowance_exceeded Number of times access to 169.254.x.x exceeded the limit (affects DNS and EC2 metadata service)
conntrack_allowance_available Current remaining available connection tracking entries

Metrics Gap Summary Table

After a thorough investigation of the ROSA cluster Prometheus (2,100 metrics total), below is a complete comparison of customer requirements vs. existing ROSA capabilities:

Customer Requirement Available in Prometheus? Alternative Collection Method Priority
bw_in_allowance_exceeded ethtool -S / custom DaemonSet High
bw_out_allowance_exceeded ethtool -S / custom DaemonSet High
pps_allowance_exceeded ethtool -S / custom DaemonSet High
conntrack_allowance_exceeded ethtool -S / custom DaemonSet High
linklocal_allowance_exceeded ethtool -S / custom DaemonSet High
conntrack_allowance_available ethtool -S / custom DaemonSet High
NAT Gateway ErrorPortAllocation CloudWatch Exporter Medium
NAT Gateway ActiveConnectionCount CloudWatch Exporter Medium
EBSIOBalance% CloudWatch Exporter Medium
EBSByteBalance% CloudWatch Exporter Medium
Cross-AZ Latency Requires custom measurement Low
ECR Pull Rate CloudWatch Exporter Low
node_filefd_allocated Already available -
conntrack entries/limit Already available -
ListenOverflows/ListenDrops Already available -
TCP TIME_WAIT Already available -
CoreDNS metrics Already available -
somaxconn value ⚠️ Obtained via sysctl, not Prometheus Low
ip_local_port_range ⚠️ Obtained via sysctl, not Prometheus Low

Problem this document solves: The 6 high-priority ENA network throttling metrics in the table above. OpenShift’s built-in node_exporter includes an ethtool collector, but it is disabled by default, and OpenShift does not provide a configuration option to enable it (RFE-7342 will be supported in OCP 4.22). This means the ENA metrics listed above are not available in ROSA’s Prometheus.

Solution: Deploy a custom ENA Metrics Exporter DaemonSet that uses nsenter to enter the host network namespace, calls the host’s ethtool command to read ENA statistics, and exposes them in Prometheus format.


Environment Information

Item Value
Cluster Type ROSA HCP (Hosted Control Plane)
OCP Version 4.20.19
Worker Nodes 2 × m6a.xlarge (4 vCPU, 16GB RAM)
Network Baseline 1.562 Gbps, burst up to 12.5 Gbps
ENA Driver Kernel built-in (5.14.0-570.107.1.el9_6.x86_64)
ENA Interface Name ens5
Region us-east-2

Deployment Steps

Step 1: Enable User Workload Monitoring

OpenShift’s User Workload Monitoring must be explicitly enabled for custom exporter metrics to be ingested by Prometheus.

cat <<'EOF' | oc apply -f -
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: cluster-monitoring-config
          namespace: openshift-monitoring
        data:
          config.yaml: |
            enableUserWorkload: true
        EOF

Verification:

oc get pods -n openshift-user-workload-monitoring
        
        # Expected output:
        
        # NAME                                   READY   STATUS    RESTARTS   AGE
        
        # prometheus-operator-6cd8958974-nwqrf   2/2     Running   0          8s
        
        # prometheus-user-workload-0             5/6     Running   0          6s
        
        # prometheus-user-workload-1             5/6     Running   0          6s
        
        # thanos-ruler-user-workload-0           4/4     Running   0          5s
        
        # thanos-ruler-user-workload-1           4/4     Running   0          5s

Step 2: Create Namespace and ServiceAccount


        # Create dedicated namespace
        
        oc new-project ena-monitoring
        
        # Create ServiceAccount
        
        oc create serviceaccount ena-exporter -n ena-monitoring
        
        # Grant privileged SCC (requires hostPID and nsenter permissions)
        
        oc adm policy add-scc-to-user privileged -z ena-exporter -n ena-monitoring

Verification:

oc get serviceaccount ena-exporter -n ena-monitoring
        
        # Expected output:
        
        # NAME           SECRETS   AGE
        
        # ena-exporter   0         5s

Step 3: Create ENA Exporter Python Script ConfigMap

The core logic of the ENA Exporter: use nsenter -t 1 --mount --net to enter the host’s mount and network namespace, execute the host’s ethtool -S ens5 command, parse the output, and expose it in Prometheus format.

cat <<'YAML' | oc apply -f -
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: ena-exporter-script
          namespace: ena-monitoring
        data:
          ena_exporter.py: |
            #!/usr/bin/env python3
            """
            ENA Metrics Exporter for ROSA/OpenShift
            Uses nsenter to read ethtool stats from the host network namespace.
            No extra packages needed - uses only Python stdlib.
            """
            import subprocess
            import http.server
            import socketserver
            import os
        
            PORT = 9101
            INTERFACE = os.getenv('ENA_INTERFACE', 'ens5')
        
            # ENA throttling metrics (counters - monotonically increasing)
            ENA_THROTTLE_METRICS = [
                'bw_in_allowance_exceeded',
                'bw_out_allowance_exceeded',
                'pps_allowance_exceeded',
                'conntrack_allowance_exceeded',
                'linklocal_allowance_exceeded',
            ]
        
            # ENA capacity metrics (gauges - current value)
            ENA_GAUGE_METRICS = [
                'conntrack_allowance_available',
            ]
        
            # ENA SRD metrics
            ENA_SRD_METRICS = [
                'ena_srd_mode',
                'ena_srd_tx_pkts',
                'ena_srd_rx_pkts',
                'ena_srd_resource_utilization',
            ]
        
            ALL_ENA_METRICS = ENA_THROTTLE_METRICS + ENA_GAUGE_METRICS + ENA_SRD_METRICS
        
            def get_ena_stats():
                """Run ethtool via nsenter into the host's mount+net namespace."""
                try:
                    result = subprocess.run(
                        ['nsenter', '-t', '1', '--mount', '--net', '--',
                         'ethtool', '-S', INTERFACE],
                        capture_output=True, text=True, timeout=10
                    )
                    if result.returncode != 0:
                        print(f"[WARN] ethtool failed: {result.stderr.strip()}", flush=True)
                        return {}
                    metrics = {}
                    for line in result.stdout.split('\n'):
                        line = line.strip()
                        if ':' not in line:
                            continue
                        key, _, val = line.partition(':')
                        key = key.strip()
                        val = val.strip()
                        if key in ALL_ENA_METRICS:
                            try:
                                metrics[key] = float(val)
                            except ValueError:
                                pass
                    return metrics
                except Exception as e:
                    print(f"[ERROR] {e}", flush=True)
                    return {}
        
            class MetricsHandler(http.server.BaseHTTPRequestHandler):
                def do_GET(self):
                    if self.path != '/metrics':
                        self.send_response(404)
                        self.end_headers()
                        self.wfile.write(b'Not Found\n')
                        return
        
                    stats = get_ena_stats()
                    lines = []
                    for key in ALL_ENA_METRICS:
                        metric_name = f'node_ena_{key}'
                        val = stats.get(key, 0)
                        lines.append(f'# HELP {metric_name} AWS ENA driver statistic: {key}')
                        if key in ENA_GAUGE_METRICS:
                            lines.append(f'# TYPE {metric_name} gauge')
                        else:
                            lines.append(f'# TYPE {metric_name} counter')
                        lines.append(f'{metric_name}{{interface="{INTERFACE}"}} {val}')
        
                    body = '\n'.join(lines) + '\n'
                    self.send_response(200)
                    self.send_header('Content-Type', 'text/plain; version=0.0.4; charset=utf-8')
                    self.end_headers()
                    self.wfile.write(body.encode())
        
                def log_message(self, fmt, *args):
                    pass  # suppress per-request logs
        
            if __name__ == '__main__':
                print(f"[INFO] ENA Exporter starting on port {PORT}, interface={INTERFACE}", flush=True)
                with socketserver.TCPServer(('', PORT), MetricsHandler) as httpd:
                    httpd.serve_forever()
        YAML

Step 4: Deploy DaemonSet, Service, and ServiceMonitor

cat <<'YAML' | oc apply -f -
        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: ena-exporter
          namespace: ena-monitoring
          labels:
            app: ena-exporter
        spec:
          selector:
            matchLabels:
              app: ena-exporter
          template:
            metadata:
              labels:
                app: ena-exporter
            spec:
              serviceAccountName: ena-exporter
              hostPID: true        # Required to access host PID 1 for nsenter
              nodeSelector:
                kubernetes.io/os: linux
              tolerations:
                - operator: Exists
              containers:
                - name: ena-exporter
                  image: registry.access.redhat.com/ubi9/python-311:latest
                  command: ["python3", "/scripts/ena_exporter.py"]
                  ports:
                    - containerPort: 9101
                      name: metrics
                      protocol: TCP
                  env:
                    - name: ENA_INTERFACE
                      value: "ens5"
                  securityContext:
                    privileged: true   # Required for nsenter to enter host namespace
                    runAsUser: 0
                  volumeMounts:
                    - name: script
                      mountPath: /scripts
                  resources:
                    requests:
                      cpu: 10m
                      memory: 32Mi
                    limits:
                      cpu: 100m
                      memory: 64Mi
              volumes:
                - name: script
                  configMap:
                    name: ena-exporter-script
                    defaultMode: 0755
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: ena-exporter
          namespace: ena-monitoring
          labels:
            app: ena-exporter
        spec:
          ports:
            - name: metrics
              port: 9101
              targetPort: 9101
              protocol: TCP
          selector:
            app: ena-exporter
          clusterIP: None   # Headless service, allows Prometheus to discover all endpoints
        ---
        apiVersion: monitoring.coreos.com/v1
        kind: ServiceMonitor
        metadata:
          name: ena-exporter
          namespace: ena-monitoring
          labels:
            app: ena-exporter
        spec:
          selector:
            matchLabels:
              app: ena-exporter
          endpoints:
            - port: metrics
              interval: 30s
              path: /metrics
        YAML

Verify pod status:

oc get pods -n ena-monitoring -o wide
        
        # Actual output:
        
        # NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE
        
        # ena-exporter-m6zx4   1/1     Running   0          57s   10.128.0.43   ip-10-0-0-69.us-east-2.compute.internal
        
        # ena-exporter-ww5c9   1/1     Running   0          57s   10.129.0.58   ip-10-0-0-181.us-east-2.compute.internal

Step 5: Verify the Metrics Endpoint


        # View pod startup logs
        
        oc logs -n ena-monitoring ena-exporter-m6zx4
        
        # Actual output:
        
        # [INFO] ENA Exporter starting on port 9101, interface=ens5
        
        # Directly curl the metrics endpoint
        
        oc exec -n ena-monitoring ena-exporter-m6zx4 -- curl -s http://localhost:9101/metrics
        
        # Actual output:
        
        # # HELP node_ena_bw_in_allowance_exceeded AWS ENA driver statistic: bw_in_allowance_exceeded
        
        # # TYPE node_ena_bw_in_allowance_exceeded counter
        
        # node_ena_bw_in_allowance_exceeded{interface="ens5"} 0.0
        
        # # HELP node_ena_bw_out_allowance_exceeded AWS ENA driver statistic: bw_out_allowance_exceeded
        
        # # TYPE node_ena_bw_out_allowance_exceeded counter
        
        # node_ena_bw_out_allowance_exceeded{interface="ens5"} 0.0
        
        # # HELP node_ena_pps_allowance_exceeded AWS ENA driver statistic: pps_allowance_exceeded
        
        # # TYPE node_ena_pps_allowance_exceeded counter
        
        # node_ena_pps_allowance_exceeded{interface="ens5"} 0.0
        
        # # HELP node_ena_conntrack_allowance_exceeded AWS ENA driver statistic: conntrack_allowance_exceeded
        
        # # TYPE node_ena_conntrack_allowance_exceeded counter
        
        # node_ena_conntrack_allowance_exceeded{interface="ens5"} 0.0
        
        # # HELP node_ena_linklocal_allowance_exceeded AWS ENA driver statistic: linklocal_allowance_exceeded
        
        # # TYPE node_ena_linklocal_allowance_exceeded counter
        
        # node_ena_linklocal_allowance_exceeded{interface="ens5"} 0.0
        
        # # HELP node_ena_conntrack_allowance_available AWS ENA driver statistic: conntrack_allowance_available
        
        # # TYPE node_ena_conntrack_allowance_available gauge
        
        # node_ena_conntrack_allowance_available{interface="ens5"} 153423.0

Step 6: Verify Metrics are Being Scraped via Prometheus API

THANOS_URL=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
        TOKEN=$(oc whoami -t)
        
        # Query the conntrack_allowance_available metric
        
        curl -sk -H "Authorization: Bearer $TOKEN" \
          "https://$THANOS_URL/api/v1/query?query=node_ena_conntrack_allowance_available&namespace=ena-monitoring" \
          | python3 -c "
        import json,sys
        d=json.load(sys.stdin)
        print('Status:', d.get('status'))
        for r in d['data']['result']:
            print(f'  pod={r[\"metric\"][\"pod\"]}  node={r[\"metric\"][\"instance\"]}  value={r[\"value\"][1]}')
        "
        
        # Actual output:
        
        # Status: success
        
        #   pod=ena-exporter-m6zx4  node=10.128.0.43:9101  value=153421
        
        #   pod=ena-exporter-ww5c9  node=10.129.0.58:9101  value=153437

Step 7: Deploy Demo Application (Load Generator)

Deploy a load generator that continuously generates network traffic to simulate real-world network load scenarios:

cat <<'YAML' | oc apply -f -
        apiVersion: v1
        kind: Namespace
        metadata:
          name: ena-demo
        ---
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: load-generator
          namespace: ena-demo
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: load-generator
          template:
            metadata:
              labels:
                app: load-generator
            spec:
              containers:
                - name: load-gen
                  image: registry.access.redhat.com/ubi9/ubi-minimal:latest
                  command:
                    - /bin/sh
                    - -c
                    - |
                      echo "Starting load generator..."
                      while true; do
                        for i in $(seq 1 50); do
                          curl -s --connect-timeout 1 -o /dev/null \
                            https://api.rosa-bvjtr.d71w.p3.openshiftapps.com/healthz &
                        done
                        wait
                        echo "$(date): Sent 50 requests"
                      done
                  resources:
                    requests:
                      cpu: 100m
                      memory: 64Mi
                    limits:
                      cpu: 500m
                      memory: 128Mi
        YAML
        
        # Wait for pod to be ready
        
        oc get pods -n ena-demo -w

View load generator logs:

oc logs -n ena-demo deployment/load-generator --tail=10
        
        # Actual output:
        
        # Wed May  6 06:24:08 UTC 2026: Sent 50 requests
        
        # Wed May  6 06:24:09 UTC 2026: Sent 50 requests
        
        # Wed May  6 06:24:10 UTC 2026: Sent 50 requests
        
        # Wed May  6 06:24:11 UTC 2026: Sent 50 requests
        
        # Wed May  6 06:24:12 UTC 2026: Sent 50 requests

Monitoring Display

View ENA Metrics in OpenShift Console

Access the OpenShift Metrics page via the following URL:

https://console-openshift-console.apps.rosa.rosa-bvjtr.d71w.p3.openshiftapps.com/monitoring/query-browser?query0=node_ena_conntrack_allowance_available

After logging in, you can see the ENA conntrack_allowance_available metric chart from both worker nodes (approximately 153,000 available connection tracking entries), which remains stable over time, indicating that the current network load has not reached the limit.

Screenshot note: The Console Observe → Metrics page shows the node_ena_conntrack_allowance_available metric, with the green line holding steady at approximately 140k.

Query All ENA Metrics via Thanos Querier API

THANOS_URL=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
        TOKEN=$(oc whoami -t)
        
        # Query all ENA throttling metrics
        
        for metric in \
          node_ena_bw_in_allowance_exceeded \
          node_ena_bw_out_allowance_exceeded \
          node_ena_pps_allowance_exceeded \
          node_ena_conntrack_allowance_exceeded \
          node_ena_linklocal_allowance_exceeded \
          node_ena_conntrack_allowance_available; do
          echo "--- $metric ---"
          curl -sk -H "Authorization: Bearer $TOKEN" \
            "https://$THANOS_URL/api/v1/query?query=$metric&namespace=ena-monitoring" | \
            python3 -c "import json,sys; d=json.load(sys.stdin); [print(f'  pod={r[\"metric\"][\"pod\"]} => {r[\"value\"][1]}') for r in d['data']['result']]"
        done
        
        # Actual output (2026-05-06 14:24 UTC+8):
        
        # --- node_ena_bw_in_allowance_exceeded ---
        
        #   pod=ena-exporter-m6zx4 => 0
        
        #   pod=ena-exporter-ww5c9 => 0
        
        # --- node_ena_bw_out_allowance_exceeded ---
        
        #   pod=ena-exporter-m6zx4 => 0
        
        #   pod=ena-exporter-ww5c9 => 0
        
        # --- node_ena_pps_allowance_exceeded ---
        
        #   pod=ena-exporter-m6zx4 => 0
        
        #   pod=ena-exporter-ww5c9 => 0
        
        # --- node_ena_conntrack_allowance_exceeded ---
        
        #   pod=ena-exporter-m6zx4 => 0
        
        #   pod=ena-exporter-ww5c9 => 0
        
        # --- node_ena_linklocal_allowance_exceeded ---
        
        #   pod=ena-exporter-m6zx4 => 0
        
        #   pod=ena-exporter-ww5c9 => 0
        
        # --- node_ena_conntrack_allowance_available ---
        
        #   pod=ena-exporter-m6zx4 => 153362
        
        #   pod=ena-exporter-ww5c9 => 153289

Result Interpretation:


Optional: Configure Alert Rules

ENA Network Throttling Alerts

cat <<'YAML' | oc apply -f -
        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: ena-alerts
          namespace: ena-monitoring
        spec:
          groups:
            - name: ena.rules
              rules:
                # Inbound bandwidth throttled (potential packet loss)
                - alert: ENABandwidthInThrottled
                  expr: rate(node_ena_bw_in_allowance_exceeded[5m]) > 0
                  for: 2m
                  labels:
                    severity: warning
                  annotations:
                    summary: "Node {{ $labels.pod }} inbound bandwidth throttled by AWS ENA"
                    description: |
                      Node {{ $labels.pod }} inbound bandwidth has exceeded the EC2 instance limit.
                      m6a.xlarge baseline: 1.562 Gbps, burst: 12.5 Gbps.
                      Sustained throttling will cause packets to be dropped.
                      Recommendation: Check for high-traffic applications; consider upgrading the instance type.
        
                # PPS limit exceeded (high-frequency small packets, e.g., mTLS/API gateway)
                - alert: ENAPPSThrottled
                  expr: rate(node_ena_pps_allowance_exceeded[5m]) > 0
                  for: 2m
                  labels:
                    severity: warning
                  annotations:
                    summary: "Node {{ $labels.pod }} packet rate throttled by AWS ENA"
                    description: |
                      Node {{ $labels.pod }} packet rate has exceeded the limit.
                      Service Mesh (mTLS) and API gateways (e.g., Kong) are prone to triggering this limit.
                      Recommendation: Check for high volumes of small-packet traffic; consider using larger frame sizes.
        
                # AWS connection tracking exceeded (distinct from Linux conntrack)
                - alert: ENAConntrackThrottled
                  expr: rate(node_ena_conntrack_allowance_exceeded[5m]) > 0
                  for: 2m
                  labels:
                    severity: critical
                  annotations:
                    summary: "Node {{ $labels.pod }} AWS ENA connection tracking throttled"
                    description: |
                      This is a connection tracking limit enforced at the AWS security group level (not Linux netfilter conntrack).
                      When exceeded, new connections are rejected by the AWS network layer, causing intermittent connection failures.
        
                # Link-local exceeded (DNS/metadata service affected)
                - alert: ENALinklocalThrottled
                  expr: rate(node_ena_linklocal_allowance_exceeded[5m]) > 0
                  for: 1m
                  labels:
                    severity: critical
                  annotations:
                    summary: "Node {{ $labels.pod }} link-local traffic throttled (DNS affected!)"
                    description: |
                      AWS limits 169.254.x.x traffic to 1024 pps.
                      This limit affects the VPC DNS resolver (.2 address) and EC2 metadata (IMDSv2).
                      Throttling will cause DNS query timeouts and IAM role retrieval failures!
                      Recommendation: Deploy NodeLocal DNSCache immediately to reduce link-local DNS requests.
        
                # Connection tracking nearly exhausted (early warning)
                - alert: ENAConntrackLow
                  expr: node_ena_conntrack_allowance_available < 10000
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "Node {{ $labels.pod }} AWS ENA connection tracking entries below 10000"
        YAML

Known Limitations and Considerations

  1. ROSA Classic vs. HCP: This solution works on both ROSA HCP and Classic. ROSA Classic may have an SRE-deployed sre-ebs-iops-reporter, but there is no ENA-related monitoring.

  2. Interface Name: This example uses ens5 (the default ENA interface name for RHCOS on AWS). If multiple NICs are present, you can specify a different interface (e.g., eth0) via the ENA_INTERFACE environment variable in the DaemonSet.

  3. ethtool Version Requirement: The ENA driver requires version 2.2.10+ to support statistics. In this environment, the ENA driver is the kernel 5.14 built-in version, which is fully supported.

  4. RFE-7342 Tracking: Red Hat is developing the feature to add the ethtool collector to NodeExporterCollectorConfig. Once merged, it can be configured directly via ConfigMap without deploying this DaemonSet. It is recommended to open a Support Case to help prioritize this RFE.

  5. Permissions Note: privileged: true and hostPID: true are required for nsenter to enter the host namespace. In production environments, it is recommended to use NetworkPolicy to isolate inbound and outbound traffic for the ena-monitoring namespace.


Cleanup (Optional)

To remove all resources:

oc delete namespace ena-monitoring
        oc delete namespace ena-demo
        oc delete configmap cluster-monitoring-config -n openshift-monitoring

References