Mitigating etcd Pressure with API Server Throttling
Background
In large-scale or high-activity OpenShift clusters, the etcd database can become a performance bottleneck. Excessive API requests, often generated by applications, controllers, or automation jobs, can overwhelm etcd, leading to increased latency, performance degradation, and in severe cases, cluster instability.
A straightforward solution is to scale up the master nodes by increasing their CPU and memory resources. However, this approach may not always be feasible due to budget constraints, hardware limitations, or other operational considerations.
A more nuanced and cost-effective strategy is to leverage the built-in API Priority and Fairness feature in Kubernetes. This allows administrators to identify and selectively throttle less critical workloads, thereby protecting the API server and etcd from being overloaded by request storms.
This document outlines the process of diagnosing and implementing custom throttling rules for a specific workload that is causing excessive API traffic.
Analyzing the Existing API Priority and Fairness Configuration
Before implementing any changes, it’s crucial to understand the current configuration. API Priority and Fairness is controlled by two main resource types:
PriorityLevelConfiguration: Defines distinct priority “lanes” or levels for API requests. Each level is allocated a share of the server’s concurrency budget.
FlowSchema: Defines rules that classify incoming requests based on their properties (e.g., source user, verb, target resource) and assign them to a specific PriorityLevelConfiguration.
We can inspect the default configuration by running the following commands:
oc get PriorityLevelConfiguration
# NAME TYPE NOMINALCONCURRENCYSHARES QUEUES HANDSIZE QUEUELENGTHLIMIT AGE
# catch-all Limited 5 <none> <none> <none> 21d
# exempt Exempt <none> <none> <none> <none> 21d
# global-default Limited 20 128 6 50 21d
# leader-election Limited 10 16 4 50 21d
# node-high Limited 40 64 6 50 21d
# openshift-control-plane-operators Limited 10 128 6 50 21d
# system Limited 30 64 6 50 21d
# workload-high Limited 40 128 6 50 21d
# workload-low Limited 100 128 6 50 21d
oc get FlowSchema
# NAME PRIORITYLEVEL MATCHINGPRECEDENCE DISTINGUISHERMETHOD AGE MISSINGPL
# exempt exempt 1 <none> 21d False
# openshift-apiserver-sar exempt 2 ByUser 21d False
# openshift-oauth-apiserver-sar exempt 2 ByUser 21d False
# probes exempt 2 <none> 21d False
# system-leader-election leader-election 100 ByUser 21d False
# endpoint-controller workload-high 150 ByUser 21d False
# workload-leader-election leader-election 200 ByUser 21d False
# system-node-high node-high 400 ByUser 21d False
# openshift-ovn-kubernetes system 500 ByUser 21d False
# system-nodes system 500 ByUser 21d False
# kube-controller-manager workload-high 800 ByNamespace 21d False
# kube-scheduler workload-high 800 ByNamespace 21d False
# kube-system-service-accounts workload-high 900 ByNamespace 21d False
# openshift-apiserver workload-high 1000 ByUser 21d False
# openshift-controller-manager workload-high 1000 ByUser 21d False
# openshift-oauth-apiserver workload-high 1000 ByUser 21d False
# openshift-oauth-server workload-high 1000 ByUser 21d False
# openshift-apiserver-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-authentication-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-etcd-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-kube-apiserver-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-monitoring-metrics exempt 2000 ByUser 21d False
# service-accounts workload-low 9000 ByUser 21d False
# global-default global-default 9900 ByUser 21d False
# catch-all catch-all 10000 ByUser 21d False
From the output, we can observe that the workload-low priority level has a NOMINALCONCURRENCYSHARES of 100, which is significantly higher than most other levels. If a high-volume, non-critical application falls into this category, it can consume a disproportionate amount of API server resources, potentially impacting etcd.
Our goal is to isolate the problematic workload (in this example, a series of dagster jobs) and assign it to a new, more restrictive priority level.
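To get a feel for how restrictive a small share is, we can compute the nominal concurrency a level receives: NominalCL(i) = ceil(ServerConcurrency * shares(i) / sum of all shares). The arithmetic below is a sketch, not cluster output: the total server concurrency of 600 is an assumed value (the sum of the typical --max-requests-inflight=400 and --max-mutating-requests-inflight=200 defaults), and the shares are taken from the table above plus a hypothetical new level with a share of 1.

```shell
#!/bin/bash
# Sketch: how nominalConcurrencyShares translate into a per-level concurrency limit.
SERVER_LIMIT=600   # assumed: 400 non-mutating + 200 mutating in-flight requests
# Shares of the Limited levels from the table above, plus a new level with share 1
SHARES=(5 20 10 40 10 30 40 100 1)
TOTAL=0
for s in "${SHARES[@]}"; do TOTAL=$((TOTAL + s)); done
echo "total shares: $TOTAL"
# Ceiling division for the new level's share of 1
NEW_LEVEL_CL=$(( (SERVER_LIMIT * 1 + TOTAL - 1) / TOTAL ))
echo "nominal concurrency for a share of 1: $NEW_LEVEL_CL"
```

With a share of 1 out of 256, the new level is limited to roughly 3 concurrent requests, which is why a small share throttles a request storm so aggressively.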
Implementing Custom Throttling for a Specific Workload
To achieve this, we will create a dedicated PriorityLevelConfiguration with a low concurrency share and a corresponding FlowSchema to direct the target workload’s traffic to this new level.
The following YAML manifest defines these resources:
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  # 1. Define a new PriorityLevelConfiguration named 'dagster-limit'.
  #    This acts as a dedicated, restricted lane for our target workload's API requests.
  name: dagster-limit
spec:
  type: Limited
  limited:
    # 2. Set a low nominal concurrency share.
    #    This value represents a share of the server's total concurrency limit.
    #    A lower value restricts the number of concurrent requests for this level.
    #    The server defaults this value if not specified (e.g., to 30),
    #    so we explicitly set it to a restrictive value.
    nominalConcurrencyShares: 1
    # 'lendablePercent' is the portion of this level's unused concurrency that
    # other levels may borrow. Setting it to 0 prevents this priority level
    # from lending its unused quota.
    lendablePercent: 0
    limitResponse:
      # To queue requests instead of rejecting them outright, use:
      # type: Queue
      # queuing:
      #   # 'queues' is the number of queues for this priority level. More queues
      #   # (e.g., 64) reduce head-of-line blocking, where one slow request flow
      #   # stalls other independent flows.
      #   queues: 1
      #   # 'queueLengthLimit' is the maximum number of requests held in each queue.
      #   queueLengthLimit: 2
      #   # 'handSize' is the number of candidate queues a flow may be hashed to
      #   # (shuffle sharding). It must not be greater than 'queues'.
      #   handSize: 1
      type: Reject
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: dagster-jobs-flow
spec:
  distinguisherMethod:
    type: ByUser
  # 3. Set the matching precedence (a lower number is evaluated first).
  #    A value of 8000 ensures this schema is evaluated before the more
  #    general 'service-accounts' schema (which has a precedence of 9000).
  matchingPrecedence: 8000
  # 4. Associate this FlowSchema with the restrictive PriorityLevelConfiguration created above.
  priorityLevelConfiguration:
    name: dagster-limit
  # 5. Define the rules to identify the requests that should be throttled.
  rules:
  - resourceRules:
    - apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"]
      # Replace "dagster-ns" with the actual namespace of the Dagster jobs.
      namespaces: ["dagster-ns"]
    subjects:
    # When a client authenticates with a ServiceAccount token, the request is
    # associated with a User whose name is formatted as
    # 'system:serviceaccount:<namespace>:<serviceaccount-name>'.
    # We can therefore match on 'kind: User' with the full name; the commented
    # 'kind: ServiceAccount' form below matches the same requests.
    - kind: User
      user:
        # Replace "dagster-sa" with the ServiceAccount used by the jobs.
        name: system:serviceaccount:dagster-ns:dagster-sa
    # - kind: ServiceAccount
    #   serviceAccount:
    #     name: dagster-sa
    #     namespace: dagster-ns

Summary of the Solution
By applying this configuration, any API request originating from the dagster-sa ServiceAccount within the dagster-ns namespace will be matched by the dagster-jobs-flow FlowSchema. These requests will then be directed to the dagster-limit priority level, which enforces a strict concurrency limit.
This ensures that even if the Dagster jobs generate a storm of API requests, they will be throttled: with the Reject limit response, requests beyond the concurrency limit are rejected with HTTP 429 (with a Queue-type response, they would be queued instead). This prevents the workload from overwhelming the API server and etcd, thereby safeguarding the overall health and stability of the OpenShift cluster.
Verifying the Throttling in Action
To demonstrate that the throttling is working as expected, we can set up the environment described in the example, apply the custom rules, and then simulate a high volume of API requests.
1. Prepare the Test Environment
First, create the namespace and ServiceAccount that our throttling rules are designed to target.
# Create the namespace
oc new-project dagster-ns
# Create the ServiceAccount
oc create sa dagster-sa -n dagster-ns
# Grant the ServiceAccount view permissions in its own namespace
oc policy add-role-to-user view -z dagster-sa -n dagster-ns
# Grant the ServiceAccount edit permissions in its own namespace to allow creating/deleting secrets
oc policy add-role-to-user edit -z dagster-sa -n dagster-ns

2. Apply the Custom Throttling Rules
Now, apply the PriorityLevelConfiguration and FlowSchema defined earlier in this document. Save the YAML content into a file named dagster-throttling.yaml and apply it.
# The content of dagster-throttling.yaml is the same as the manifest provided in the section above.
oc apply -f dagster-throttling.yaml

3. Simulate High API Traffic
With the rules in place, we can now simulate a request storm using the dagster-sa ServiceAccount. The following script will launch 500 concurrent requests to create and delete large ConfigMaps, putting significant pressure on the API server and etcd.
#!/bin/bash
# Get the API server URL
APISERVER=$(oc config view --minify -o jsonpath='{.clusters[0].cluster.server}')
# Get the token for the dagster ServiceAccount
TOKEN=$(oc create token dagster-sa -n dagster-ns)
# Create a temporary directory for kubeconfig files
mkdir -p /tmp/throttle-test-configs
echo "Starting API request storm..."
# Disable job control notifications to prevent "Done" messages.
set +m
# Launch 500 concurrent requests in the background
for i in $(seq 1 500); do
# Run the entire process, including kubeconfig setup, in the background
# to ensure true concurrency from the start.
(
# Create a unique kubeconfig for each request to avoid conflicts
KUBECONFIG_PATH="/tmp/throttle-test-configs/config-${i}"
oc config set-cluster temp-cluster --server="$APISERVER" --insecure-skip-tls-verify=true --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config set-credentials test-user --token="$TOKEN" --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config set-context temp-context --cluster=temp-cluster --user=test-user --namespace=dagster-ns --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config use-context temp-context --kubeconfig="$KUBECONFIG_PATH" > /dev/null
echo "Request $i started..."
# To avoid the "Argument list too long" error, we create a temporary file for the
# large payload and use --from-file. This avoids passing the data as a command-line argument.
TMP_PAYLOAD_FILE=$(mktemp "/tmp/payload-${i}.XXXXXX")
head -c 512000 /dev/urandom > "$TMP_PAYLOAD_FILE"
# Record start time
START_TIME=$(date +%s.%N)
# Execute the create and delete with a short timeout to make throttling failures more visible.
# If a request is not served within 1 second, the client will time out and report an error.
oc --request-timeout=1s --kubeconfig="$KUBECONFIG_PATH" create configmap "heavy-cm-${i}" --from-file="payload=${TMP_PAYLOAD_FILE}" > /dev/null && oc --request-timeout=1s --kubeconfig="$KUBECONFIG_PATH" delete configmap "heavy-cm-${i}" > /dev/null
OC_EXIT_CODE=$?
# Record end time and calculate duration
END_TIME=$(date +%s.%N)
DURATION=$(echo "$END_TIME - $START_TIME" | bc)
# Clean up the temporary file
rm -f "$TMP_PAYLOAD_FILE"
if [ $OC_EXIT_CODE -ne 0 ]; then
echo "Request $i failed, likely due to throttling (HTTP 429). Duration: ${DURATION}s"
else
echo "Request $i succeeded. Duration: ${DURATION}s"
fi
) &
done
# Wait for all background jobs to finish
wait
# Re-enable job control notifications
set -m
echo "All requests completed."
# Clean up
rm -rf /tmp/throttle-test-configs

4. Observe the Results
When you run the script, you may not immediately see failed requests. With a Queue-type limit response, API Priority and Fairness queues requests before rejecting them; and even with the Reject response used here, the client's built-in retries (described below) can mask HTTP 429 errors. If all your requests succeeded, it likely means they were queued or retried and eventually processed.
A Note on Client-Side Retries
It’s important to understand that oc and kubectl clients have built-in retry logic. When they receive a throttling response from the server (HTTP 429 “Too Many Requests”), they don’t fail immediately. Instead, they will wait and retry the request several times. This is why your script might report all requests as successful, even when throttling is actively happening. The requests are simply waiting longer on the client side before eventually succeeding.
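This retry behavior can be sketched as a simple loop. The snippet below is purely illustrative and needs no cluster: simulate_request is a hypothetical stand-in for an API call that is throttled twice before succeeding, and a real client would additionally honor the server's Retry-After header when sleeping between attempts.

```shell
#!/bin/bash
# Sketch of client-side retry on HTTP 429 (illustrative only).
attempt=0
max_attempts=5
simulate_request() {
  # A real client would issue the API call here and check for a 429 response.
  # This stand-in "fails" until two retries have happened.
  [ "$attempt" -ge 2 ]
}
until simulate_request; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "giving up after $attempt retries"
    exit 1
  fi
  sleep 0  # a real client sleeps here, honoring the server's Retry-After header
done
echo "succeeded after $attempt retries"
```

From the caller's point of view the request simply "succeeded", just later than usual, which is exactly why throttling shows up as latency rather than as errors.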
Measuring the request duration reveals this delay, but if you want to see explicit failures, you can add a request timeout (the test script does this with --request-timeout=1s). This forces the client to give up if its request isn’t served within the specified timeframe.
Here are the most reliable ways to verify that your throttling rule is being triggered:
Method 1: Observing API Server Metrics (Recommended)
The most definitive way to check is by querying the API server’s metrics while your load script is running. These metrics provide direct insight into the Priority and Fairness controller.
Run this command in a separate terminal while the test script is active:
# Watch the metrics related to your custom priority level in real-time
watch "oc get --raw /metrics | grep 'apiserver_flowcontrol_.*dagster-limit'"
watch "oc get --raw /metrics | grep 'apiserver_flowcontrol_rejected_requests_total.*dagster-limit'"

Look for these key metrics:
apiserver_flowcontrol_current_inqueue_requests{priority_level="dagster-limit"}: If this value is greater than 0, it confirms that requests are being queued by your rule.
apiserver_flowcontrol_rejected_requests_total{priority_level="dagster-limit"}: If this counter increases, it’s definitive proof that requests are being rejected due to throttling.
apiserver_flowcontrol_request_wait_duration_seconds_sum{priority_level="dagster-limit"}: An increasing value here indicates the total time requests are spending in the queue, confirming that your rule is introducing latency as intended.
If you see any activity in these metrics for the dagster-limit priority level, your FlowSchema is correctly matching the traffic.
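If you would rather script this check than eyeball the watch output, a counter can be extracted from the Prometheus-format metrics text with standard tools. In the sketch below, METRICS would normally come from `oc get --raw /metrics`; the sample lines and their values are illustrative, not captured from a real cluster.

```shell
#!/bin/bash
# Sketch: extract the rejection counter for our priority level from
# Prometheus-format metrics text (sample data below is illustrative).
METRICS='apiserver_flowcontrol_current_inqueue_requests{priority_level="dagster-limit"} 4
apiserver_flowcontrol_rejected_requests_total{flow_schema="dagster-jobs-flow",priority_level="dagster-limit",reason="concurrency-limit"} 12'
# Each metric line is "<name>{<labels>} <value>"; the value is the second field.
REJECTED=$(echo "$METRICS" | awk '/apiserver_flowcontrol_rejected_requests_total.*dagster-limit/ {print $2}')
echo "rejected requests so far: $REJECTED"
if [ "${REJECTED:-0}" -gt 0 ]; then
  echo "throttling confirmed"
fi
```

In a live check you would replace the METRICS assignment with `METRICS=$(oc get --raw /metrics)` and run the same extraction.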
Method 2: Measuring Request Duration
Another way to see the effect of queuing is to measure how long each request takes. A throttled request will take longer to complete because it spends time in a queue.
We can modify the test script to record and display the duration of each API call; the test script shown earlier already includes this logic. When you run it, you will likely see the duration of each request increase as the system comes under load and queuing begins.
Method 3: Checking API Server Logs
You can also check the API server logs for messages related to throttling.
# Check logs on all kube-apiserver pods
oc logs -n openshift-kube-apiserver -l app=openshift-kube-apiserver -c kube-apiserver --tail=-1 | grep -i "throttling"

Look for messages indicating that requests are being throttled for the dagster-limit priority level.
By using these methods, you can be confident that your custom throttling rule is successfully isolating and limiting the impact of the targeted workload, protecting the overall stability of the cluster, even if you don’t see explicit “HTTP 429” errors.