RHCL Three-Cluster / Two-Domain GLB E2E Report

Field	Value
Date	2026-06-18
Author	George Zheng Wang (SAA), zhengwan@redhat.com
Audience	Customer senior technical review
Scope	Red Hat Connectivity Link / Kuadrant DNS-based global load balancing validation
Environment	Three single-node OpenShift clusters on an AWS EC2 helper host
Helper users	`sno`, `sno2`, `sno3`
Cluster groups	`demo-01`, `demo-02`, `demo-03`
Tested domains	`glb-a.kuadrant.wzhlab.top`, `glb-b.kuadrant.wzhlab.top`
Baseline domain	`glb.kuadrant.wzhlab.top`

1. Executive Conclusion

This E2E validation shows that RHCL can publish two independent application hostnames from three OpenShift clusters and can remove a failed cluster from the DNS answers after an application failure.

The validated design is:

Per hostname:
  Gateway + HTTPRoute + DNSPolicy
  -> DNSRecord
  -> DNSHealthCheckProbe
  -> Kuadrant CoreDNS answer
  -> Gateway VIP
  -> application Service and Pod

For failover:
  App failure
  -> DNSHealthCheckProbe becomes unhealthy
  -> controller/test logic removes the cluster group from that hostname's active-groups TXT
  -> CoreDNS/RHCL converges
  -> bad Gateway VIP disappears from DNS answers

The most important conclusion is:

DNSPolicy.weight is useful for weighted distribution inside a bucket.
active-groups is the reliable include/exclude switch for failover.
DNSPolicy.weight=0 alone was not reliable enough as the only failover switch in this lab.

From an Avi ALB/GSLB replacement perspective:

Avi capability area	RHCL PoC result	Replacement conclusion
Active/active DNS publication across clusters	Validated	Can replace for DNS-based multi-cluster active/active publication.
Weighted distribution among clusters in one region/bucket	Validated	Can replace this part.
Health-based removal of failed app endpoint	Validated with active-groups convergence	Can replace with controller automation and explicit convergence SLA.
Multiple independent FQDNs with different policy	Validated	Can replace with per-hostname Gateway/DNSPolicy/active-groups.
Multi-region survival when one site fails	Validated with caveat	Can provide survival/failover, but not full Avi GSLB policy equivalence.
Closest data center / best-performing data center	Not validated / not native in this PoC	Not a direct replacement without additional geo, ECS, latency, or upstream DNS/GSLB layer.
Avi-style GSLB leader/site federation	Not provided by this PoC	Requires separate controller/control-plane design.

2. Three-Cluster Topology

Cluster	Helper user	RHCL group	CoreDNS LB VIP	Baseline Gateway VIP	Domain A VIP	Domain B VIP
demo-01	`sno`	`demo-01`	`192.168.99.210`	`192.168.99.211`	`192.168.99.212`	`192.168.99.213`
demo-02	`sno2`	`demo-02`	`192.168.99.230`	`192.168.99.221`	`192.168.99.220`	`192.168.99.222`
demo-03	`sno3`	`demo-03`	`192.168.99.240`	`192.168.99.241`	`192.168.99.242`	`192.168.99.243`

The upstream DNS model simulated an Infoblox-style delegated subdomain:

Parent DNS delegates kuadrant.wzhlab.top to the OpenShift-hosted Kuadrant CoreDNS VIPs.
Kuadrant CoreDNS serves RHCL-generated records for application hostnames.
Each OpenShift Gateway VIP is the actual HTTP entry point for the application.

3. Operator Installation

The same operator installation pattern is applied to all three clusters.

3.1 cert-manager, RHCL, and MetalLB operators

apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cert-manager-operator
  namespace: cert-manager-operator
spec:
  targetNamespaces:
    - cert-manager-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-cert-manager-operator
  namespace: cert-manager-operator
spec:
  channel: stable-v1.18
  installPlanApproval: Automatic
  name: openshift-cert-manager-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: v1
kind: Namespace
metadata:
  name: kuadrant-system
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: kuadrant-system
  namespace: kuadrant-system
spec:
  targetNamespaces:
    - kuadrant-system
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhcl-operator
  namespace: kuadrant-system
spec:
  channel: stable
  installPlanApproval: Automatic
  name: rhcl-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: metallb-system
  namespace: metallb-system
spec:
  targetNamespaces:
    - metallb-system
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: metallb-operator
  namespace: metallb-system
spec:
  channel: stable
  installPlanApproval: Automatic
  name: metallb-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

Apply to all clusters:

su - sno  -c 'oc apply -f /tmp/rhcl-operators.yaml'
su - sno2 -c 'oc apply -f /tmp/rhcl-operators.yaml'
su - sno3 -c 'oc apply -f /tmp/rhcl-operators.yaml'

Lab-specific OLM correction that was required in this environment:

for user in sno sno2 sno3; do
  su - "${user}" -c 'oc patch operatorgroup kuadrant-system -n kuadrant-system --type=json -p='\''[{"op":"remove","path":"/spec/targetNamespaces"}]'\'''
  su - "${user}" -c 'oc patch operatorgroup metallb-system -n metallb-system --type=json -p='\''[{"op":"remove","path":"/spec/targetNamespaces"}]'\'''
done

This correction is not a GLB feature. Removing spec.targetNamespaces widens each OperatorGroup to all namespaces; it is an operator install-mode compatibility correction for this lab.

4. RHCL Core Configuration

4.1 GatewayClass and Kuadrant CR

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: openshift-default
spec:
  controllerName: openshift.io/gateway-controller/v1
---
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system

Apply and wait:

for user in sno sno2 sno3; do
  su - "${user}" -c 'oc apply -f /tmp/rhcl-core.yaml'
  su - "${user}" -c 'oc wait kuadrant/kuadrant -n kuadrant-system --for=condition=Ready=true --timeout=600s'
done

4.2 RHCL DNS Operator identity per cluster

Each DNS Operator must have a stable GROUP. This group is later used in DNSRecord ownership and active-groups membership.

demo-01:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-operator-controller-env
  namespace: kuadrant-system
data:
  DELEGATION_ROLE: primary
  GROUP: demo-01
  MAX_REQUEUE_TIME: 30s

demo-02 and demo-03 use the same ConfigMap with:

demo-02: GROUP=demo-02
demo-03: GROUP=demo-03
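
Because only GROUP differs between clusters, the ConfigMap can be rendered from one template. The following is a minimal sketch; `render_dns_operator_env` is a hypothetical helper name and is not part of RHCL.

```shell
# Render the dns-operator-controller-env ConfigMap for one cluster group.
# render_dns_operator_env is a hypothetical helper, not an RHCL command.
render_dns_operator_env() {
  group="$1"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-operator-controller-env
  namespace: kuadrant-system
data:
  DELEGATION_ROLE: primary
  GROUP: ${group}
  MAX_REQUEUE_TIME: 30s
EOF
}

# Example (not executed here; the /tmp path is an assumption):
#   render_dns_operator_env demo-02 > /tmp/dns-operator-env-demo-02.yaml
#   su - sno2 -c 'oc apply -f /tmp/dns-operator-env-demo-02.yaml'
```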

After changing this ConfigMap, restart DNS Operator pods:

su - sno  -c 'oc delete pod -n kuadrant-system -l app.kubernetes.io/name=dns-operator --ignore-not-found'
su - sno2 -c 'oc delete pod -n kuadrant-system -l app.kubernetes.io/name=dns-operator --ignore-not-found'
su - sno3 -c 'oc delete pod -n kuadrant-system -l app.kubernetes.io/name=dns-operator --ignore-not-found'

5. MetalLB and Fixed VIPs

MetalLB was used to make CoreDNS and Gateway VIPs deterministic.

demo-01:

apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: demo-01-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.99.210-192.168.99.219
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: demo-01-l2
  namespace: metallb-system
spec:
  interfaces:
    - br-ex
  ipAddressPools:
    - demo-01-pool

Per-cluster pool ranges:

demo-01: 192.168.99.210-192.168.99.219
demo-02: 192.168.99.220-192.168.99.239
demo-03: 192.168.99.240-192.168.99.249

Patch CoreDNS Service fixed VIPs:

su - sno  -c 'oc patch svc kuadrant-coredns -n kuadrant-coredns -p '\''{"spec":{"loadBalancerIP":"192.168.99.210"}}'\'''
su - sno2 -c 'oc patch svc kuadrant-coredns -n kuadrant-coredns -p '\''{"spec":{"loadBalancerIP":"192.168.99.230"}}'\'''
su - sno3 -c 'oc patch svc kuadrant-coredns -n kuadrant-coredns -p '\''{"spec":{"loadBalancerIP":"192.168.99.240"}}'\'''

Patch baseline Gateway fixed VIPs:

su - sno  -c 'oc patch svc ingress-gateway-openshift-default -n api-gateway -p '\''{"spec":{"loadBalancerIP":"192.168.99.211"}}'\'''
su - sno2 -c 'oc patch svc ingress-gateway-openshift-default -n api-gateway -p '\''{"spec":{"loadBalancerIP":"192.168.99.221"}}'\'''
su - sno3 -c 'oc patch svc ingress-gateway-openshift-default -n api-gateway -p '\''{"spec":{"loadBalancerIP":"192.168.99.241"}}'\'''

For the two-domain test, the extra Gateway VIPs were allocated automatically from the same pools:

Domain A: .212, .220, .242
Domain B: .213, .222, .243
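
Since the extra VIPs are allocated automatically, it can help to map an observed VIP back to its cluster when reading DNS samples. A small sketch, assuming the three lab pool ranges listed above; `cluster_for_vip` is a hypothetical helper.

```shell
# Map a lab VIP to the cluster whose MetalLB pool contains it.
# cluster_for_vip is a hypothetical helper; the ranges are this lab's pools.
cluster_for_vip() {
  case "$1" in
    192.168.99.*) last="${1##*.}" ;;
    *) echo "unknown"; return 1 ;;
  esac
  # Reject an empty or non-numeric last octet before comparing.
  case "$last" in
    ''|*[!0-9]*) echo "unknown"; return 1 ;;
  esac
  if [ "$last" -ge 210 ] && [ "$last" -le 219 ]; then
    echo "demo-01"
  elif [ "$last" -ge 220 ] && [ "$last" -le 239 ]; then
    echo "demo-02"
  elif [ "$last" -ge 240 ] && [ "$last" -le 249 ]; then
    echo "demo-03"
  else
    echo "unknown"; return 1
  fi
}

# cluster_for_vip 192.168.99.220   -> demo-02 (Domain A VIP on demo-02)
```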

6. RHCL DNS Provider Secret

The coredns-credentials Secret tells RHCL DNS Operator to manage DNS records through the local Kuadrant CoreDNS provider.

apiVersion: v1
kind: Secret
metadata:
  name: coredns-credentials
  namespace: api-gateway
  labels:
    kuadrant.io/default-provider: "true"
type: kuadrant.io/coredns
stringData:
  ZONES: kuadrant.wzhlab.top
  NAMESERVERS: <local-kuadrant-coredns-cluster-ip>

Patch the local Service clusterIP into the Secret on each cluster:

for user in sno sno2 sno3; do
  su - "${user}" -c '
    cluster_ip="$(oc get svc kuadrant-coredns -n kuadrant-coredns -o jsonpath="{.spec.clusterIP}")"
    oc patch secret coredns-credentials -n api-gateway --type merge \
      -p "{\"stringData\":{\"NAMESERVERS\":\"${cluster_ip}\"}}"
  '
done

This Secret is not the cross-cluster kubeconfig. It is a provider configuration for CoreDNS.

7. RHCL Cross-Cluster Secrets and RBAC

RHCL DNS Operator uses cross-cluster Secrets to read peer DNS state. This is separate from the external test controller.

Representative commands:

# On demo-01 context, add demo-02 and demo-03 as remote clusters.
kubectl-kuadrant_dns add-cluster-secret \
  --context demo-02 \
  --namespace kuadrant-system \
  --name demo-02 \
  --service-account dns-operator-remote-cluster

kubectl-kuadrant_dns add-cluster-secret \
  --context demo-03 \
  --namespace kuadrant-system \
  --name demo-03 \
  --service-account dns-operator-remote-cluster

# Grant the service account the DNS operator remote-cluster role.
oc adm policy add-cluster-role-to-user dns-operator-remote-cluster-role \
  -z dns-operator-remote-cluster \
  -n kuadrant-system

Repeat the same steps on each source cluster so that the three clusters form the required peer mesh.
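
The full mesh for three clusters is six source/peer pairs. The sketch below only prints the commands for review and runs nothing. It reuses the flags shown above and assumes, as in the comment above, that the source cluster is the current context when each command is run. `print_peer_mesh_commands` is a hypothetical helper.

```shell
# Print (do not run) one add-cluster-secret command per ordered cluster pair.
# print_peer_mesh_commands is a hypothetical helper for review purposes.
print_peer_mesh_commands() {
  for src in "$@"; do
    for peer in "$@"; do
      # A cluster does not register itself as a peer.
      if [ "$src" = "$peer" ]; then
        continue
      fi
      echo "# with ${src} as the current context, register peer ${peer}"
      echo "kubectl-kuadrant_dns add-cluster-secret --context ${peer} --namespace kuadrant-system --name ${peer} --service-account dns-operator-remote-cluster"
    done
  done
}

# print_peer_mesh_commands demo-01 demo-02 demo-03
```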

Security note:

kubectl-kuadrant_dns add-cluster-secret creates a Kubernetes Secret containing a kubeconfig.
That kubeconfig contains remote API server information, CA data, and a ServiceAccount credential.
The credential is sensitive and must not be printed in a customer report.

8. Active-Groups CoreDNS Configuration

The active-groups TXT zone is the hard site membership switch.

For this two-domain validation, each cluster’s kuadrant-coredns ConfigMap included three active-groups zones:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kuadrant-coredns
  namespace: kuadrant-coredns
data:
  Corefile: |
    kuadrant-active-groups.glb.kuadrant.wzhlab.top:53 {
        errors
        log
        file /etc/coredns/active-groups.db {
            reload 2s
        }
    }
    kuadrant-active-groups.glb-a.kuadrant.wzhlab.top:53 {
        errors
        log
        file /etc/coredns/active-groups-a.db {
            reload 2s
        }
    }
    kuadrant-active-groups.glb-b.kuadrant.wzhlab.top:53 {
        errors
        log
        file /etc/coredns/active-groups-b.db {
            reload 2s
        }
    }
    kuadrant.wzhlab.top:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        log
        metadata
        kuadrant
    }
  active-groups.db: |
    kuadrant-active-groups.glb.kuadrant.wzhlab.top. 10 IN SOA ns1. hostmaster. 1781693568 7200 3600 1209600 10
    kuadrant-active-groups.glb.kuadrant.wzhlab.top. 10 IN NS ns1.
    kuadrant-active-groups.glb.kuadrant.wzhlab.top. 10 IN TXT "version=1;groups=demo-01&&demo-02&&demo-03"
  active-groups-a.db: |
    kuadrant-active-groups.glb-a.kuadrant.wzhlab.top. 10 IN SOA ns1. hostmaster. 1781693568 7200 3600 1209600 10
    kuadrant-active-groups.glb-a.kuadrant.wzhlab.top. 10 IN NS ns1.
    kuadrant-active-groups.glb-a.kuadrant.wzhlab.top. 10 IN TXT "version=1;groups=demo-01&&demo-02&&demo-03"
  active-groups-b.db: |
    kuadrant-active-groups.glb-b.kuadrant.wzhlab.top. 10 IN SOA ns1. hostmaster. 1781693568 7200 3600 1209600 10
    kuadrant-active-groups.glb-b.kuadrant.wzhlab.top. 10 IN NS ns1.
    kuadrant-active-groups.glb-b.kuadrant.wzhlab.top. 10 IN TXT "version=1;groups=demo-01&&demo-02&&demo-03"
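
Changing membership means rewriting one of these zone files. A minimal sketch of a renderer follows, assuming the record layout and TTL shown above; `active_groups_txt` and `render_active_groups_zone` are hypothetical helper names.

```shell
# Join group names into the active-groups TXT value used in this report.
active_groups_txt() {
  joined=""
  for g in "$@"; do
    if [ -z "$joined" ]; then joined="$g"; else joined="${joined}&&${g}"; fi
  done
  printf 'version=1;groups=%s\n' "$joined"
}

# render_active_groups_zone ZONE SERIAL GROUP...
# Prints the SOA, NS, and TXT records in the same layout as the ConfigMap.
render_active_groups_zone() {
  zone="$1"; serial="$2"; shift 2
  txt="$(active_groups_txt "$@")"
  printf '%s. 10 IN SOA ns1. hostmaster. %s 7200 3600 1209600 10\n' "$zone" "$serial"
  printf '%s. 10 IN NS ns1.\n' "$zone"
  printf '%s. 10 IN TXT "%s"\n' "$zone" "$txt"
}

# Example: drop demo-03 from Domain A, using the current epoch as the serial.
#   render_active_groups_zone kuadrant-active-groups.glb-a.kuadrant.wzhlab.top \
#     "$(date +%s)" demo-01 demo-02
```

The CoreDNS file plugin reloads a zone only when its SOA serial changes, so every membership edit should also change the serial.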

The Deployment must mount all three files:

oc patch deployment kuadrant-coredns -n kuadrant-coredns --type json -p '[
  {
    "op":"replace",
    "path":"/spec/template/spec/volumes/0/configMap/items",
    "value":[
      {"key":"Corefile","path":"Corefile"},
      {"key":"active-groups.db","path":"active-groups.db"},
      {"key":"active-groups-a.db","path":"active-groups-a.db"},
      {"key":"active-groups-b.db","path":"active-groups-b.db"}
    ]
  }
]'

OpenShift DNS forwarding must know the active-groups zones:

oc patch dns.operator/default --type json -p '[
  {
    "op": "replace",
    "path": "/spec/servers/0/zones",
    "value": [
      "kuadrant-active-groups.glb.kuadrant.wzhlab.top",
      "kuadrant-active-groups.glb-a.kuadrant.wzhlab.top",
      "kuadrant-active-groups.glb-b.kuadrant.wzhlab.top"
    ]
  }
]'

Validation command:

dig +tcp +short @192.168.99.210 kuadrant-active-groups.glb-a.kuadrant.wzhlab.top TXT
dig +tcp +short @192.168.99.230 kuadrant-active-groups.glb-b.kuadrant.wzhlab.top TXT
dig +tcp +short @192.168.99.240 kuadrant-active-groups.glb-b.kuadrant.wzhlab.top TXT

Expected output:

"version=1;groups=demo-01&&demo-02&&demo-03"

9. Demo Application

The same app was deployed on all clusters. Only the text content differs by cluster.

apiVersion: v1
kind: Namespace
metadata:
  name: connectlink-demo
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: echo-content
  namespace: connectlink-demo
data:
  index.html: |
    demo-01 via Connectivity Link GLB
  health: |
    ok
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: connectlink-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: registry.access.redhat.com/ubi9/python-311:latest
          command: ["/bin/bash", "-c"]
          args: ["cd /opt/app-root/src && python -m http.server 8080"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: content
              mountPath: /opt/app-root/src
      volumes:
        - name: content
          configMap:
            name: echo-content
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: connectlink-demo
spec:
  selector:
    app: echo
  ports:
    - name: http
      port: 8080
      targetPort: 8080

Per-cluster content:

demo-01 via Connectivity Link GLB
demo-02 via Connectivity Link GLB
demo-03 via Connectivity Link GLB

10. Two-Domain Gateway / HTTPRoute / DNSPolicy Configuration

Each domain has independent Gateway, HTTPRoute, and DNSPolicy objects.

Domain A:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ingress-gateway-a
  namespace: api-gateway
spec:
  gatewayClassName: openshift-default
  listeners:
    - name: http
      hostname: glb-a.kuadrant.wzhlab.top
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo-a
  namespace: connectlink-demo
spec:
  parentRefs:
    - name: ingress-gateway-a
      namespace: api-gateway
  hostnames:
    - glb-a.kuadrant.wzhlab.top
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: echo
          port: 8080
---
apiVersion: kuadrant.io/v1
kind: DNSPolicy
metadata:
  name: ingress-gateway-a-dns
  namespace: api-gateway
spec:
  delegate: true
  healthCheck:
    failureThreshold: 2
    interval: 30s
    path: /health
    port: 80
    protocol: HTTP
  loadBalancing:
    defaultGeo: true
    geo: GEO-NA
    weight: 60
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: ingress-gateway-a

Domain B uses the same structure with:

Gateway: ingress-gateway-b
HTTPRoute: echo-b
DNSPolicy: ingress-gateway-b-dns
Hostname: glb-b.kuadrant.wzhlab.top

Important namespace rule:

Gateway and DNSPolicy are in api-gateway.
HTTPRoute is in connectlink-demo.
The backend Service is connectlink-demo/echo.

Putting the HTTPRoute in api-gateway without a ReferenceGrant caused an earlier failed attempt. The working pattern is to keep the route in the application namespace.
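
For reference, the alternative that was not used here would be a ReferenceGrant in the Service namespace. The sketch below was not validated in this E2E. The object name is hypothetical, and the HTTPRoute backendRef would also need an explicit `namespace: connectlink-demo`.

```shell
# render_reference_grant ROUTE_NAMESPACE SERVICE_NAMESPACE
# Prints a ReferenceGrant that lets HTTPRoutes in ROUTE_NAMESPACE reference
# Services in SERVICE_NAMESPACE. Not applied or validated in this PoC.
render_reference_grant() {
  route_ns="$1"; service_ns="$2"
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-httproute-from-${route_ns}
  namespace: ${service_ns}
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: ${route_ns}
  to:
    - group: ""
      kind: Service
EOF
}

# render_reference_grant api-gateway connectlink-demo
```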

11. Test Method

DNS sample command:

for ns in 192.168.99.210 192.168.99.230 192.168.99.240; do
  echo "===== nameserver=${ns} ====="
  for i in $(seq 1 120); do
    dig +tcp +short @"${ns}" glb-a.kuadrant.wzhlab.top A | tail -1
  done | sort | uniq -c
done

Bad VIP absent check:

for ns in 192.168.99.210 192.168.99.230 192.168.99.240; do
  echo "===== nameserver=${ns} ====="
  for i in $(seq 1 60); do
    dig +tcp +short @"${ns}" glb-a.kuadrant.wzhlab.top A | tail -1
  done | grep -cF '192.168.99.242' || true
done

Pass condition:

For a failed endpoint, every nameserver must return bad VIP count = 0.

This is stricter than “one query does not show the bad VIP” because weighted DNS is probabilistic.
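
The arithmetic behind this can be sketched as follows. It assumes each answer is independent and that the bad VIP holds a flat share of weight/total, which is only an approximation of real resolver behavior. `miss_probability` is a hypothetical helper.

```shell
# miss_probability WEIGHT TOTAL_WEIGHT QUERIES
# Probability that a VIP with share WEIGHT/TOTAL_WEIGHT is absent from all
# QUERIES answers, assuming independent answers: (1 - w/t)^n.
miss_probability() {
  awk -v w="$1" -v t="$2" -v n="$3" 'BEGIN { printf "%.4f\n", (1 - w / t) ^ n }'
}

# One query against a weight-10 member in a 100-weight pool:
#   miss_probability 10 100 1    -> 0.9000
# Sixty queries against the same member:
#   miss_probability 10 100 60   -> 0.0018
```

Under these assumptions a single clean query says almost nothing about a low-weight member, while 60 clean queries per nameserver make a still-published bad VIP very unlikely to go unnoticed.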

12. Scenario 1: Three Clusters In One Region

All clusters were configured in GEO-NA.

Domain A policy:

Cluster	Group	Geo	VIP	Weight
demo-01	demo-01	GEO-NA	`192.168.99.212`	60
demo-02	demo-02	GEO-NA	`192.168.99.220`	30
demo-03	demo-03	GEO-NA	`192.168.99.242`	10

Domain B policy:

Cluster	Group	Geo	VIP	Weight
demo-01	demo-01	GEO-NA	`192.168.99.213`	20
demo-02	demo-02	GEO-NA	`192.168.99.222`	30
demo-03	demo-03	GEO-NA	`192.168.99.243`	50

Configuration commands:

# Domain A: 60 / 30 / 10
su - sno  -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":true,"weight":60}}}'\'''
su - sno2 -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":30}}}'\'''
su - sno3 -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":10}}}'\'''

# Domain B: 20 / 30 / 50
su - sno  -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":true,"weight":20}}}'\'''
su - sno2 -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":30}}}'\'''
su - sno3 -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":50}}}'\'''

Healthy DNS result, 120 queries per nameserver:

Domain	Nameserver	demo-01	demo-02	demo-03
A	`192.168.99.210`	76	35	9
A	`192.168.99.230`	71	36	13
A	`192.168.99.240`	81	28	11
B	`192.168.99.210`	21	38	61
B	`192.168.99.230`	31	33	56
B	`192.168.99.240`	18	39	63

This confirms that same-bucket weighting works and that two domains can have different weight distributions.
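
For comparison with the table, the expected counts for a flat weighted pool can be computed directly. A small sketch, where `expected_counts` is a hypothetical helper:

```shell
# expected_counts TOTAL_QUERIES WEIGHT...
# Prints the expected number of answers per member for a flat weighted pool.
expected_counts() {
  total="$1"; shift
  echo "$@" | awk -v total="$total" '{
    sum = 0
    for (i = 1; i <= NF; i++) sum += $i
    if (sum == 0) exit 1   # no usable weights
    for (i = 1; i <= NF; i++) printf "%s%.0f", (i > 1 ? " " : ""), total * $i / sum
    printf "\n"
  }'
}

# Domain A (60/30/10): expected_counts 120 60 30 10   -> 72 36 12
# Domain B (20/30/50): expected_counts 120 20 30 50   -> 24 36 60
```

The observed counts above scatter around these values, as expected for 120 random samples.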

12.1 Scenario 1 Failure: demo-03 App Down

Failure action:

su - sno3 -c 'oc scale deployment/echo -n connectlink-demo --replicas=0'

Failover configuration:

Domain A active-groups: demo-01&&demo-02
Domain B active-groups: demo-01&&demo-02
Domain A demo-03 DNSPolicy weight: 0
Domain B demo-03 DNSPolicy weight: 0

Result:

Domain	Bad VIP	Check
A	`192.168.99.242`	`0/60` bad answers on all three nameservers
B	`192.168.99.243`	`0/60` bad answers on all three nameservers

Post-failover DNS samples:

Domain	Nameserver	Remaining answers
A	`192.168.99.210`	`.212`: 79, `.220`: 41
A	`192.168.99.230`	`.212`: 80, `.220`: 40
A	`192.168.99.240`	`.212`: 76, `.220`: 44
B	`192.168.99.210`	`.222`: 77, `.213`: 43
B	`192.168.99.230`	`.222`: 68, `.213`: 52
B	`192.168.99.240`	`.222`: 72, `.213`: 48

Restore:

su - sno3 -c 'oc scale deployment/echo -n connectlink-demo --replicas=1'

13. Scenario 2: Two Clusters In Region A, One Cluster In Region B

In this test:

demo-01 -> GEO-NA
demo-02 -> GEO-NA
demo-03 -> GEO-EU

Domain A:

Cluster	Geo	VIP	Weight
demo-01	GEO-NA	`192.168.99.212`	70
demo-02	GEO-NA	`192.168.99.220`	20
demo-03	GEO-EU	`192.168.99.242`	10

Domain B:

Cluster	Geo	VIP	Weight
demo-01	GEO-NA	`192.168.99.213`	20
demo-02	GEO-NA	`192.168.99.222`	50
demo-03	GEO-EU	`192.168.99.243`	30

Configuration commands:

# Domain A: region A 70/20, region B 10
su - sno  -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":true,"weight":70}}}'\'''
su - sno2 -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":20}}}'\'''
su - sno3 -c 'oc patch dnspolicy ingress-gateway-a-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-EU","defaultGeo":false,"weight":10}}}'\'''

# Domain B: region A 20/50, region B 30
su - sno  -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":true,"weight":20}}}'\'''
su - sno2 -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-NA","defaultGeo":false,"weight":50}}}'\'''
su - sno3 -c 'oc patch dnspolicy ingress-gateway-b-dns -n api-gateway --type merge -p '\''{"spec":{"loadBalancing":{"geo":"GEO-EU","defaultGeo":false,"weight":30}}}'\'''

Healthy DNS result, 120 queries per nameserver:

Domain	Nameserver	demo-01	demo-02	demo-03
A	`192.168.99.210`	68	13	39
A	`192.168.99.230`	66	17	37
A	`192.168.99.240`	70	16	34
B	`192.168.99.210`	15	56	49
B	`192.168.99.230`	27	50	43
B	`192.168.99.240`	22	60	38

Interpretation:

RHCL returned answers from both buckets.
The result is not a single flat global 70/20/10 or 20/50/30 pool.
Same-bucket weights are reliable; cross-bucket behavior must be explained as geo-layered DNS behavior, not Avi-style global pool weighting.

13.1 Scenario 2 Failure: Region A Member demo-02 Down

Failure action:

su - sno2 -c 'oc scale deployment/echo -n connectlink-demo --replicas=0'

Failover configuration:

Domain A active-groups: demo-01&&demo-03
Domain B active-groups: demo-01&&demo-03
Domain A demo-02 DNSPolicy weight: 0
Domain B demo-02 DNSPolicy weight: 0

Result:

Domain	Bad VIP	Check
A	`192.168.99.220`	`0/60` bad answers on all three nameservers
B	`192.168.99.222`	`0/60` bad answers on all three nameservers

Post-failover DNS samples:

Domain	Nameserver	Remaining answers
A	`192.168.99.210`	`.212`: 79, `.242`: 41
A	`192.168.99.230`	`.212`: 80, `.242`: 40
A	`192.168.99.240`	`.212`: 76, `.242`: 44
B	`192.168.99.210`	`.213`: 74, `.243`: 46
B	`192.168.99.230`	`.213`: 82, `.243`: 38
B	`192.168.99.240`	`.213`: 91, `.243`: 29

Restore:

su - sno2 -c 'oc scale deployment/echo -n connectlink-demo --replicas=1'

13.2 Scenario 2 Failure: Region B Member demo-03 Down

Failure action:

su - sno3 -c 'oc scale deployment/echo -n connectlink-demo --replicas=0'

Failover configuration:

Domain A active-groups: demo-01&&demo-02
Domain B active-groups: demo-01&&demo-02
Domain A demo-03 DNSPolicy weight: 0
Domain B demo-03 DNSPolicy weight: 0

Result:

Domain	Bad VIP	Check
A	`192.168.99.242`	`0/60` bad answers on all three nameservers on first check
B	`192.168.99.243`	First check saw `9/60` on `192.168.99.240`; second check was `0/60` on all three nameservers

Post-failover DNS samples:

Domain	Nameserver	Remaining answers
A	`192.168.99.210`	`.212`: 94, `.220`: 26
A	`192.168.99.230`	`.212`: 83, `.220`: 37
A	`192.168.99.240`	`.212`: 94, `.220`: 26
B	`192.168.99.210`	`.222`: 83, `.213`: 37
B	`192.168.99.230`	`.222`: 89, `.213`: 31
B	`192.168.99.240`	`.222`: 88, `.213`: 32

Restore:

su - sno3 -c 'oc scale deployment/echo -n connectlink-demo --replicas=1'

14. Why `DNSPolicy.weight=0` Alone Is Not Enough

One failed attempt intentionally tested weight-only failover.

Observed state:

DNSPolicy spec.loadBalancing.weight: 0
DNSPolicy condition: HealthChecksFailed
DNSHealthCheckProbe healthy: false
dig glb-b.kuadrant.wzhlab.top: still returned 192.168.99.243

Conclusion:

For production failover, do not treat weight=0 as the only exclusion mechanism.
Use active-groups, or an equivalent pool membership mechanism, as the hard include/exclude switch.
Still patch weight=0 for intent, observability, and alignment with policy.
Always verify the final DNS answer.
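
Verification usually has to tolerate a short convergence window, as in section 13.2 where the first check saw 9/60 and the second saw 0/60. A minimal retry sketch follows. `wait_until_zero` is a hypothetical helper, and the command passed to it is assumed to print a single bad-VIP count.

```shell
# wait_until_zero ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it prints 0 or ATTEMPTS is exhausted.
# Returns 0 on convergence and 1 otherwise.
wait_until_zero() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    count="$("$@")"
    if [ "$count" = "0" ]; then
      echo "converged after ${i} check(s)"
      return 0
    fi
    echo "check ${i}: bad VIP count=${count}" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (count_bad_vip is a hypothetical wrapper around the dig loop):
#   wait_until_zero 5 10 count_bad_vip 192.168.99.240 glb-b.kuadrant.wzhlab.top 192.168.99.243
```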

15. External Controller Recommendation

The PoC used scripts/test logic to patch per-domain active-groups and DNSPolicy. A production controller should implement the same logic continuously.

Recommended controller behavior:

Input:
  hostname
  group
  helperUser or kubeconfig reference
  baseWeight
  geo
  defaultGeo
  health probe owner
  active-groups FQDN

Loop:
  read DNSHealthCheckProbe per hostname/site
  compute healthy groups per hostname
  patch that hostname's active-groups TXT on all authoritative CoreDNS instances
  patch DNSPolicy weight to baseWeight when healthy
  patch DNSPolicy weight to 0 when unhealthy
  verify final DNS answer until bad VIP is absent
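
The decision step of this loop can be sketched as a pure function that turns probe state into a plan. This is an illustration only, not the PoC controller. `plan_reconcile` and its "group healthy baseWeight" input format are assumptions.

```shell
# plan_reconcile reads "group healthy baseWeight" lines on stdin and prints
# the desired active-groups TXT value, then one desired weight per group.
plan_reconcile() {
  awk '
    NF == 3 {
      healthy = ($2 == "true")
      if (healthy) active = (active == "" ? $1 : active "&&" $1)
      plan = plan sprintf("weight %s %d\n", $1, healthy ? $3 : 0)
    }
    END {
      printf "txt version=1;groups=%s\n", active
      printf "%s", plan
    }'
}

# Example input for Domain A in Scenario 1 with demo-03 down:
#   printf 'demo-01 true 60\ndemo-02 true 30\ndemo-03 false 10\n' | plan_reconcile
# prints:
#   txt version=1;groups=demo-01&&demo-02
#   weight demo-01 60
#   weight demo-02 30
#   weight demo-03 0
```

If every group is unhealthy this sketch emits an empty groups list. What to publish in that case is a design decision that this PoC did not settle.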

Example custom resource (CR) shape used in the single-domain controller test:

apiVersion: rhcl-lab.wzhlab.top/v1alpha1
kind: GlobalTrafficPolicy
metadata:
  name: glb-kuadrant
  namespace: rhcl-glb-controller
spec:
  activeGroupsFQDN: kuadrant-active-groups.glb.kuadrant.wzhlab.top.
  ttl: 10
  strategies:
    - active-groups
    - dnspolicy-weight
  sites:
    - group: demo-01
      helperUser: sno
      baseWeight: 60
      geo: GEO-NA
      defaultGeo: true
      corednsNamespace: kuadrant-coredns
      corednsConfigMap: kuadrant-coredns
      dnsPolicyNamespace: api-gateway
      dnsPolicyName: ingress-gateway-dns
      probeNamespace: api-gateway
      probeOwner: ingress-gateway-http
    - group: demo-02
      helperUser: sno2
      baseWeight: 30
      geo: GEO-NA
      defaultGeo: false
      corednsNamespace: kuadrant-coredns
      corednsConfigMap: kuadrant-coredns
      dnsPolicyNamespace: api-gateway
      dnsPolicyName: ingress-gateway-dns
      probeNamespace: api-gateway
      probeOwner: ingress-gateway-http
    - group: demo-03
      helperUser: sno3
      baseWeight: 10
      geo: GEO-NA
      defaultGeo: false
      corednsNamespace: kuadrant-coredns
      corednsConfigMap: kuadrant-coredns
      dnsPolicyNamespace: api-gateway
      dnsPolicyName: ingress-gateway-dns
      probeNamespace: api-gateway
      probeOwner: ingress-gateway-http

For two domains, the controller should make the active-groups state hostname-scoped:

glb-a.kuadrant.wzhlab.top -> kuadrant-active-groups.glb-a.kuadrant.wzhlab.top
glb-b.kuadrant.wzhlab.top -> kuadrant-active-groups.glb-b.kuadrant.wzhlab.top
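
The mapping follows the zone naming used in this lab's Corefile. A one-line sketch, where `active_groups_fqdn` is a hypothetical helper and the prefix convention is taken from the zones in this report:

```shell
# active_groups_fqdn HOSTNAME
# Derives the per-hostname active-groups zone name; a trailing dot is dropped.
active_groups_fqdn() {
  printf 'kuadrant-active-groups.%s\n' "${1%.}"
}

# active_groups_fqdn glb-a.kuadrant.wzhlab.top
#   -> kuadrant-active-groups.glb-a.kuadrant.wzhlab.top
```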

16. Avi Replacement Analysis

The customer requirement was:

We are looking for an alternative to Avi ALB that performs multi-geo, multi-cluster load balancing.

16.1 What RHCL can replace based on this E2E

Requirement	Result	Notes
Publish one app hostname from multiple OpenShift clusters	Validated	Baseline single-domain and two-domain tests both worked.
Publish multiple hostnames with independent policy	Validated	`glb-a` and `glb-b` used different weights and different Gateway VIPs.
Weighted traffic distribution among healthy clusters	Validated inside same bucket	Scenario 1 proved per-domain same-bucket weights.
Health-based cluster removal	Validated	Requires active-groups membership update and convergence wait.
Recover failed cluster into pool	Validated operationally	The tests restored app replicas and active-groups to full membership.
Use OpenShift-native Gateway API/RHCL objects	Validated	Gateway, HTTPRoute, DNSPolicy, DNSRecord, DNSHealthCheckProbe all participated.

16.2 What RHCL partially replaces

Requirement	Status	Gap
Multi-region DNS survival	Partially validated	Failed region member can be removed, but empty-bucket behavior and fallback policy must be designed.
Global weighted pool across regions	Not equivalent	RHCL geo buckets are not one flat global weight pool.
Centralized GSLB policy	PoC controller only	Needs production controller packaging, HA, RBAC, observability, and reconciliation semantics.
SLA-grade failover	Needs measurement	Health check interval plus CoreDNS/RHCL convergence must be part of the failover SLA.
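
As a starting point for the SLA discussion, a rough failover budget can be added up from the values configured in this lab (health check interval 30s, failureThreshold 2, zone reload 2s, record TTL 10s). The simple additive model is an assumption. It ignores controller loop latency, DNS Operator requeue time, and resolver caching beyond the TTL, so it is not a substitute for measurement. `failover_budget_seconds` is a hypothetical helper.

```shell
# failover_budget_seconds INTERVAL FAILURE_THRESHOLD ZONE_RELOAD TTL
# Rough additive estimate in seconds: detection + zone reload + record TTL.
failover_budget_seconds() {
  interval="$1"; threshold="$2"; reload="$3"; ttl="$4"
  echo $(( interval * threshold + reload + ttl ))
}

# Lab values: failover_budget_seconds 30 2 2 10   -> 72
```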

16.3 What RHCL does not replace in this PoC

Avi GSLB feature	RHCL PoC status
Closest data center selection based on client location	Not validated as native capability.
Best-performing data center based on latency/performance	Not implemented.
EDNS Client Subnet based answer selection	Not supported in this PoC.
Avi GSLB leader/site federation	Not implemented.
Full global service object model with pool/member health, algorithm, persistence, and site federation	Requires external controller or upstream product capability.

16.4 Recommended customer positioning

Use this RHCL design when the customer needs:

OpenShift-native DNS-based multi-cluster publication
weighted distribution among healthy clusters
health-based DNS failover
multiple hostnames with independent routing policy

Do not position this PoC as a full Avi GSLB replacement when the customer requires:

closest-site routing
best-performing-site routing
native geo/ECS decisioning
Avi-like GSLB site federation
central GUI/API for global application service lifecycle

The credible migration strategy is:

Start with RHCL for OpenShift-native Gateway + DNSPolicy + health-based failover.
Add a production external controller for hostname-scoped active-groups and DNSPolicy reconciliation.
If the customer requires location or performance routing, integrate with upstream DNS/GSLB or extend the controller with geo/ECS/latency data sources.
Treat Avi migration as capability-by-capability replacement, not as a one-shot product swap.

17. Final Conclusion

This validation reached the following technical conclusion:

RHCL can support a practical DNS-based GLB pattern for three OpenShift clusters and two independent domains.
Same-region weighted distribution works.
A failed application endpoint can be removed from DNS answers.
The robust failover mechanism is per-domain active-groups plus DNSPolicy intent.

This validation did not prove full Avi GSLB equivalence:

No native closest-site algorithm was validated.
No native best-performing-site algorithm was validated.
No native ECS/client-location answer selection was validated.
No Avi-style GSLB site federation was implemented.

Therefore, RHCL is a strong candidate for OpenShift-native multi-cluster DNS failover and weighted DNS distribution, but a complete Avi replacement requires additional geo/performance routing design and a production-grade controller/control-plane layer.
