Mitigating etcd Pressure with API Server Throttling
Background
In large-scale or high-activity OpenShift clusters, the etcd database can become a performance bottleneck. Excessive API requests, often generated by applications, controllers, or automation jobs, can overwhelm etcd, leading to increased latency, performance degradation, and in severe cases, cluster instability.
A straightforward solution is to scale up the master nodes by increasing their CPU and memory resources. However, this approach may not always be feasible due to budget constraints, hardware limitations, or other operational considerations.
A more nuanced and cost-effective strategy is to leverage the built-in API Priority and Fairness feature in Kubernetes. This allows administrators to identify and selectively throttle less critical workloads, thereby protecting the API server and etcd from being overloaded by request storms.
This document outlines the process of diagnosing and implementing custom throttling rules for a specific workload that is causing excessive API traffic.
Analyzing the Existing API Priority and Fairness Configuration
Before implementing any changes, it’s crucial to understand the current configuration. API Priority and Fairness is controlled by two main resource types:
PriorityLevelConfiguration: Defines distinct priority “lanes” or levels for API requests. Each level is allocated a share of the server’s concurrency budget.
FlowSchema: Defines rules that classify incoming requests based on their properties (e.g., source user, verb, target resource) and assign them to a specific PriorityLevelConfiguration.
We can inspect the default configuration by running the following commands:
oc get PriorityLevelConfiguration
# NAME TYPE NOMINALCONCURRENCYSHARES QUEUES HANDSIZE QUEUELENGTHLIMIT AGE
# catch-all Limited 5 <none> <none> <none> 21d
# exempt Exempt <none> <none> <none> <none> 21d
# global-default Limited 20 128 6 50 21d
# leader-election Limited 10 16 4 50 21d
# node-high Limited 40 64 6 50 21d
# openshift-control-plane-operators Limited 10 128 6 50 21d
# system Limited 30 64 6 50 21d
# workload-high Limited 40 128 6 50 21d
# workload-low Limited 100 128 6 50 21d
oc get FlowSchema
# NAME PRIORITYLEVEL MATCHINGPRECEDENCE DISTINGUISHERMETHOD AGE MISSINGPL
# exempt exempt 1 <none> 21d False
# openshift-apiserver-sar exempt 2 ByUser 21d False
# openshift-oauth-apiserver-sar exempt 2 ByUser 21d False
# probes exempt 2 <none> 21d False
# system-leader-election leader-election 100 ByUser 21d False
# endpoint-controller workload-high 150 ByUser 21d False
# workload-leader-election leader-election 200 ByUser 21d False
# system-node-high node-high 400 ByUser 21d False
# openshift-ovn-kubernetes system 500 ByUser 21d False
# system-nodes system 500 ByUser 21d False
# kube-controller-manager workload-high 800 ByNamespace 21d False
# kube-scheduler workload-high 800 ByNamespace 21d False
# kube-system-service-accounts workload-high 900 ByNamespace 21d False
# openshift-apiserver workload-high 1000 ByUser 21d False
# openshift-controller-manager workload-high 1000 ByUser 21d False
# openshift-oauth-apiserver workload-high 1000 ByUser 21d False
# openshift-oauth-server workload-high 1000 ByUser 21d False
# openshift-apiserver-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-authentication-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-etcd-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-kube-apiserver-operator openshift-control-plane-operators 2000 ByUser 21d False
# openshift-monitoring-metrics exempt 2000 ByUser 21d False
# service-accounts workload-low 9000 ByUser 21d False
# global-default global-default 9900 ByUser 21d False
# catch-all catch-all 10000 ByUser 21d False
From the output, we can observe that the workload-low priority level has a NOMINALCONCURRENCYSHARES of 100, which is significantly higher than most other levels. If a high-volume, non-critical application falls into this category, it can consume a disproportionate amount of API server resources, potentially impacting etcd.
Our goal is to isolate the problematic workload (in this example, a series of dagster jobs) and assign it to a new, more restrictive priority level.
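To get a feel for how restrictive a small share is, we can compute the nominal concurrency a level receives: NominalCL(i) = ceil(ServerConcurrency * shares(i) / sum of all shares). The arithmetic below is a sketch, not cluster output: the total server concurrency of 600 is an assumed value (the sum of the typical --max-requests-inflight=400 and --max-mutating-requests-inflight=200 defaults), and the shares are taken from the table above plus a hypothetical new level with a share of 1.

```shell
#!/bin/bash
# Sketch: how nominalConcurrencyShares translate into a per-level concurrency limit.
SERVER_LIMIT=600   # assumed: 400 non-mutating + 200 mutating in-flight requests
# Shares of the Limited levels from the table above, plus a new level with share 1
SHARES=(5 20 10 40 10 30 40 100 1)
TOTAL=0
for s in "${SHARES[@]}"; do TOTAL=$((TOTAL + s)); done
echo "total shares: $TOTAL"
# Ceiling division for the new level's share of 1
NEW_LEVEL_CL=$(( (SERVER_LIMIT * 1 + TOTAL - 1) / TOTAL ))
echo "nominal concurrency for a share of 1: $NEW_LEVEL_CL"
```

With a share of 1 out of 256, the new level is limited to roughly 3 concurrent requests, which is why a small share throttles a request storm so aggressively.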
Implementing Custom Throttling for a Specific Workload
To achieve this, we will create a dedicated PriorityLevelConfiguration with a low concurrency share and a corresponding FlowSchema to direct the target workload’s traffic to this new level.
The following YAML manifest defines these resources:
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  # 1. Define a new PriorityLevelConfiguration named 'dagster-limit'.
  #    This acts as a dedicated, restricted lane for our target workload's API requests.
  name: dagster-limit
spec:
  type: Limited
  limited:
    # 2. Set a low nominal concurrency share.
    #    This value represents a share of the server's total concurrency limit.
    #    A lower value restricts the number of concurrent requests for this level.
    #    The server defaults this value if not specified (e.g., to 30),
    #    so we explicitly set it to a restrictive value.
    nominalConcurrencyShares: 1
    # 'lendablePercent' is the portion of this level's unused concurrency that
    # other levels may borrow. Setting it to 0 prevents this priority level
    # from lending its unused quota.
    lendablePercent: 0
    limitResponse:
      # To queue requests instead of rejecting them outright, use:
      # type: Queue
      # queuing:
      #   # 'queues' is the number of queues for this priority level. More queues
      #   # (e.g., 64) reduce head-of-line blocking, where one slow request flow
      #   # stalls other independent flows.
      #   queues: 1
      #   # 'queueLengthLimit' is the maximum number of requests held in each queue.
      #   queueLengthLimit: 2
      #   # 'handSize' is the number of candidate queues a flow may be hashed to
      #   # (shuffle sharding). It must not be greater than 'queues'.
      #   handSize: 1
      type: Reject
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: dagster-jobs-flow
spec:
  distinguisherMethod:
    type: ByUser
  # 3. Set the matching precedence (a lower number is evaluated first).
  #    A value of 8000 ensures this schema is evaluated before the more
  #    general 'service-accounts' schema (which has a precedence of 9000).
  matchingPrecedence: 8000
  # 4. Associate this FlowSchema with the restrictive PriorityLevelConfiguration created above.
  priorityLevelConfiguration:
    name: dagster-limit
  # 5. Define the rules to identify the requests that should be throttled.
  rules:
  - resourceRules:
    - apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"]
      # Replace "dagster-ns" with the actual namespace of the Dagster jobs.
      namespaces: ["dagster-ns"]
    subjects:
    # When a client authenticates with a ServiceAccount token, the request is
    # associated with a User whose name is formatted as
    # 'system:serviceaccount:<namespace>:<serviceaccount-name>'.
    # We can therefore match on 'kind: User' with the full name; the commented
    # 'kind: ServiceAccount' form below matches the same requests.
    - kind: User
      user:
        # Replace "dagster-sa" with the ServiceAccount used by the jobs.
        name: system:serviceaccount:dagster-ns:dagster-sa
    # - kind: ServiceAccount
    #   serviceAccount:
    #     name: dagster-sa
    #     namespace: dagster-ns

Summary of the Solution
By applying this configuration, any API request originating from the dagster-sa ServiceAccount within the dagster-ns namespace will be matched by the dagster-jobs-flow FlowSchema. These requests will then be directed to the dagster-limit priority level, which enforces a strict concurrency limit.
This ensures that even if the Dagster jobs generate a storm of API requests, they will be throttled: with the Reject limit response, requests beyond the concurrency limit are rejected with HTTP 429 (with a Queue-type response, they would be queued instead). This prevents the workload from overwhelming the API server and etcd, thereby safeguarding the overall health and stability of the OpenShift cluster.
Verifying the Throttling in Action
To demonstrate that the throttling is working as expected, we can set up the environment described in the example, apply the custom rules, and then simulate a high volume of API requests.
1. Prepare the Test Environment
First, create the namespace and ServiceAccount that our throttling rules are designed to target.
# Create the namespace
oc new-project dagster-ns
# Create the ServiceAccount
oc create sa dagster-sa -n dagster-ns
# Grant the ServiceAccount view permissions in its own namespace
oc policy add-role-to-user view -z dagster-sa -n dagster-ns
# Grant the ServiceAccount edit permissions in its own namespace to allow creating/deleting secrets
oc policy add-role-to-user edit -z dagster-sa -n dagster-ns

2. Apply the Custom Throttling Rules
Now, apply the PriorityLevelConfiguration and FlowSchema defined earlier in this document. Save the YAML content into a file named dagster-throttling.yaml and apply it.
# The content of dagster-throttling.yaml is the same as the manifest provided in the section above.
oc apply -f dagster-throttling.yaml

3. Simulate High API Traffic
With the rules in place, we can now simulate a request storm using the dagster-sa ServiceAccount. The following script will launch 500 concurrent requests to create and delete large ConfigMaps, putting significant pressure on the API server and etcd.
#!/bin/bash
# Get the API server URL
APISERVER=$(oc config view --minify -o jsonpath='{.clusters[0].cluster.server}')
# Get the token for the dagster ServiceAccount
TOKEN=$(oc create token dagster-sa -n dagster-ns)
# Create a temporary directory for kubeconfig files
mkdir -p /tmp/throttle-test-configs
echo "Starting API request storm..."
# Disable job control notifications to prevent "Done" messages.
set +m
# Launch 500 concurrent requests in the background
for i in $(seq 1 500); do
# Run the entire process, including kubeconfig setup, in the background
# to ensure true concurrency from the start.
(
# Create a unique kubeconfig for each request to avoid conflicts
KUBECONFIG_PATH="/tmp/throttle-test-configs/config-${i}"
oc config set-cluster temp-cluster --server="$APISERVER" --insecure-skip-tls-verify=true --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config set-credentials test-user --token="$TOKEN" --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config set-context temp-context --cluster=temp-cluster --user=test-user --namespace=dagster-ns --kubeconfig="$KUBECONFIG_PATH" > /dev/null
oc config use-context temp-context --kubeconfig="$KUBECONFIG_PATH" > /dev/null
echo "Request $i started..."
# To avoid the "Argument list too long" error, we create a temporary file for the
# large payload and use --from-file. This avoids passing the data as a command-line argument.
TMP_PAYLOAD_FILE=$(mktemp "/tmp/payload-${i}.XXXXXX")
head -c 512000 /dev/urandom > "$TMP_PAYLOAD_FILE"
# Record start time
START_TIME=$(date +%s.%N)
# Execute the create and delete with a short timeout to make throttling failures more visible.
# If a request is not served within 1 second, the client will time out and report an error.
oc --request-timeout=1s --kubeconfig="$KUBECONFIG_PATH" create configmap "heavy-cm-${i}" --from-file="payload=${TMP_PAYLOAD_FILE}" > /dev/null && oc --request-timeout=1s --kubeconfig="$KUBECONFIG_PATH" delete configmap "heavy-cm-${i}" > /dev/null
OC_EXIT_CODE=$?
# Record end time and calculate duration
END_TIME=$(date +%s.%N)
DURATION=$(echo "$END_TIME - $START_TIME" | bc)
# Clean up the temporary file
rm -f "$TMP_PAYLOAD_FILE"
if [ $OC_EXIT_CODE -ne 0 ]; then
echo "Request $i failed, likely due to throttling (HTTP 429). Duration: ${DURATION}s"
else
echo "Request $i succeeded. Duration: ${DURATION}s"
fi
) &
done
# Wait for all background jobs to finish
wait
# Re-enable job control notifications
set -m
echo "All requests completed."
# Clean up
rm -rf /tmp/throttle-test-configs

4. Observe the Results
When you run the script, you may not immediately see failed requests. With a Queue-type limit response, API Priority and Fairness queues requests before rejecting them; and even with the Reject response used here, the client's built-in retries (described below) can mask HTTP 429 errors. If all your requests succeeded, it likely means they were queued or retried and eventually processed.
A Note on Client-Side Retries
It’s important to understand that oc and kubectl clients have built-in retry logic. When they receive a throttling response from the server (HTTP 429 “Too Many Requests”), they don’t fail immediately. Instead, they will wait and retry the request several times. This is why your script might report all requests as successful, even when throttling is actively happening. The requests are simply waiting longer on the client side before eventually succeeding.
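This retry behavior can be sketched as a simple loop. The snippet below is purely illustrative and needs no cluster: simulate_request is a hypothetical stand-in for an API call that is throttled twice before succeeding, and a real client would additionally honor the server's Retry-After header when sleeping between attempts.

```shell
#!/bin/bash
# Sketch of client-side retry on HTTP 429 (illustrative only).
attempt=0
max_attempts=5
simulate_request() {
  # A real client would issue the API call here and check for a 429 response.
  # This stand-in "fails" until two retries have happened.
  [ "$attempt" -ge 2 ]
}
until simulate_request; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "giving up after $attempt retries"
    exit 1
  fi
  sleep 0  # a real client sleeps here, honoring the server's Retry-After header
done
echo "succeeded after $attempt retries"
```

From the caller's point of view the request simply "succeeded", just later than usual, which is exactly why throttling shows up as latency rather than as errors.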
Measuring the request duration reveals this delay, but if you want to see explicit failures, you can add a request timeout (the test script does this with --request-timeout=1s). This forces the client to give up if its request isn’t served within the specified timeframe.
Here are the most reliable ways to verify that your throttling rule is being triggered:
Method 1: Observing API Server Metrics (Recommended)
The most definitive way to check is by querying the API server’s metrics while your load script is running. These metrics provide direct insight into the Priority and Fairness controller.
Run this command in a separate terminal while the test script is active:
# Watch the metrics related to your custom priority level in real-time
watch "oc get --raw /metrics | grep 'apiserver_flowcontrol_.*dagster-limit'"
watch "oc get --raw /metrics | grep 'apiserver_flowcontrol_rejected_requests_total.*dagster-limit'"

Look for these key metrics:
apiserver_flowcontrol_current_inqueue_requests{priority_level="dagster-limit"}: If this value is greater than 0, it confirms that requests are being queued by your rule.
apiserver_flowcontrol_rejected_requests_total{priority_level="dagster-limit"}: If this counter increases, it’s definitive proof that requests are being rejected due to throttling.
apiserver_flowcontrol_request_wait_duration_seconds_sum{priority_level="dagster-limit"}: An increasing value here indicates the total time requests are spending in the queue, confirming that your rule is introducing latency as intended.
If you see any activity in these metrics for the dagster-limit priority level, your FlowSchema is correctly matching the traffic.
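If you would rather script this check than eyeball the watch output, a counter can be extracted from the Prometheus-format metrics text with standard tools. In the sketch below, METRICS would normally come from `oc get --raw /metrics`; the sample lines and their values are illustrative, not captured from a real cluster.

```shell
#!/bin/bash
# Sketch: extract the rejection counter for our priority level from
# Prometheus-format metrics text (sample data below is illustrative).
METRICS='apiserver_flowcontrol_current_inqueue_requests{priority_level="dagster-limit"} 4
apiserver_flowcontrol_rejected_requests_total{flow_schema="dagster-jobs-flow",priority_level="dagster-limit",reason="concurrency-limit"} 12'
# Each metric line is "<name>{<labels>} <value>"; the value is the second field.
REJECTED=$(echo "$METRICS" | awk '/apiserver_flowcontrol_rejected_requests_total.*dagster-limit/ {print $2}')
echo "rejected requests so far: $REJECTED"
if [ "${REJECTED:-0}" -gt 0 ]; then
  echo "throttling confirmed"
fi
```

In a live check you would replace the METRICS assignment with `METRICS=$(oc get --raw /metrics)` and run the same extraction.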
Method 2: Measuring Request Duration
Another way to see the effect of queuing is to measure how long each request takes. A throttled request will take longer to complete because it spends time in a queue.
We can modify the test script to record and display the duration of each API call; the test script shown earlier already includes this logic. When you run it, you will likely see the duration of each request increase as the system comes under load and queuing begins.
Method 3: Checking API Server Logs
You can also check the API server logs for messages related to throttling.
# Check logs on all kube-apiserver pods
oc logs -n openshift-kube-apiserver -l app=openshift-kube-apiserver -c kube-apiserver --tail=-1 | grep -i "throttling"

Look for messages indicating that requests are being throttled for the dagster-limit priority level.
By using these methods, you can be confident that your custom throttling rule is successfully isolating and limiting the impact of the targeted workload, protecting the overall stability of the cluster, even if you don’t see explicit “HTTP 429” errors.