RHAIIS 3.4 and Qwen3.5 122B Long-Context Validation Report
| Field | Value |
|---|---|
| Date | 2026-06-04 |
| Audience | Customer discussion and internal Red Hat technical review |
| Platform | OpenShift cluster-ml6gl |
| Runtime | Red Hat AI Inference Server 3.4 (vLLM) |
| Runtime image | registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0-1777444689 |
| Model | RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic |
| Main hardware | AWS g6e.12xlarge, 4 x NVIDIA L40S |
| Current driver validated | NVIDIA 580.126.20 |
| Report package | This directory contains this Markdown file and all referenced PNG trend charts under images/ |
Executive Summary
The model and runtime can start and serve successfully on the tested 4 x L40S OpenShift node. This is not a simple “the pod cannot run” issue. The important failure mode is long-prefill queueing and service-quality collapse under large-context summarization-style workloads.
The native 262K baseline is usable below the sequence boundary, but a benchmark input of exactly 262144 random tokens is not a safe user prompt size. After chat template and tokenizer overhead, that request grew to 271524 tokens, exceeded the 262144 limit, and was rejected. Near-limit probes showed that 245,760 to 253,000 benchmark input tokens can pass, so the practical guidance is to reserve prompt budget for the chat template, system prompt, requested output, and tokenizer expansion.
The customer-style 1M RoPE configuration can process 800K and
900K benchmark inputs, but the service quality is not acceptable
for ordinary production summarization SLOs on this hardware. The
900K/c1 case had 584.29s TTFT and
918.35s end-to-end duration, with only
0.07 output tok/s.
The “service looks alive but inference barely moves” symptom was reproduced. In native 262K supplement tests, 64K/c16 and 128K/c16 produced minute-level P99 TTFT while the server logs showed queued requests and very low generation throughput. In the 1M RoPE tests, 262K/c4, 800K, and 900K cases showed the same operational pattern.
The current tested startup recommendation for this hardware
and workload is conservative: keep
--max-num-batched-tokens=4096. Increasing it to
8192 made the 1M RoPE 262K/c4 pain point worse.
Primary Trend Dashboard
This overview is the most useful chart for customer discussion. It shows that several bad cases still had 100% request success, while P99 first-token latency and output throughput were already unacceptable.
Test Environment
| Item | Result |
|---|---|
| OpenShift server | 4.18.21 / Kubernetes v1.31.10 |
| OpenShift client | 4.21.16 |
| GPU node | ip-10-0-71-58.us-east-2.compute.internal |
| Instance type | AWS g6e.12xlarge |
| GPU | 4 x NVIDIA L40S |
| GPU memory label | 46068 MiB per GPU |
| GPU capacity / allocatable | 4 / 4 |
| NVIDIA driver | 580.126.20 |
| CUDA runtime label | 13.0 |
| Namespace | qwen35-rhaiis |
| Service account | rhaiis-vllm |
| Model cache | HostPath `/var/lib/rhaiis-model-cache/qwen35-122b-fp8` |
| Security posture | Privileged SCC approved for this validation because of hostPath and GPU runtime requirements |
Runtime Parameters Tested
Common vLLM settings
| Parameter | Value |
|---|---|
| `--model` | `RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic` |
| `--served-model-name` | `RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic` |
| `--tensor-parallel-size` | `4` |
| `--gpu-memory-utilization` | `0.90` for long-context tests |
| `--kv-cache-dtype` | `fp8` |
| `--enable-prefix-caching` | enabled |
| `--max-num-batched-tokens` | `4096` default tested setting |
| `--reasoning-parser` | `qwen3` |
| Host / port | 0.0.0.0:8000 |
Native 262K baseline
| Parameter | Value |
|---|---|
| RoPE override | none |
| `--max-model-len` | `262144` |
| Reported GPU KV cache | 343,744 tokens |
| Reported max concurrency for full model length | 4.98x |
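
The settings in the two tables above can be collected into a single argument list. The sketch below is only an illustration: the function name `native_262k_args` is hypothetical, the launcher or container entrypoint is deliberately left out because the report does not state it, and the flags and values are copied from the tables.

```python
import shlex

MODEL = "RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic"


def native_262k_args(max_num_batched_tokens: int = 4096) -> list[str]:
    """Return the server flags listed for the native 262K baseline.

    Only flags named in the report are included; the executable that
    receives them is not part of this list.
    """
    return [
        "--model", MODEL,
        "--served-model-name", MODEL,
        "--tensor-parallel-size", "4",
        "--gpu-memory-utilization", "0.90",
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--reasoning-parser", "qwen3",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--max-model-len", "262144",
    ]


if __name__ == "__main__":
    # Print a shell-quoted form that can be pasted into a container args field.
    print(shlex.join(native_262k_args()))
```

Keeping the flags in one place makes it easier to diff the native 262K baseline against other startup variants during review.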
1M RoPE customer-style configuration
| Parameter | Value |
|---|---|
| `--max-model-len` | `1010000` |
| Long-length guard | VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 |
| RoPE type | YaRN |
| RoPE factor | 4.0 |
| Original max position embeddings | 262144 |
| `rope_theta` | `10000000` |
| `partial_rotary_factor` | `0.25` |
| `mrope_interleaved` | `true` |
| `mrope_section` | `[11, 11, 10]` |
| Reported chunked prefill | automatically enabled with `max_num_batched_tokens=4096` |
| Reported GPU KV cache | 326,976 tokens |
| Reported max concurrency for 1,010,000 tokens | 1.28x |
Startup parameter probe
| Setting | GPU KV cache | Max concurrency for 1,010,000 tokens | Result |
|---|---|---|---|
| `max_num_batched_tokens=4096` | 326,976 tokens | 1.28x | Current tested recommendation |
| `max_num_batched_tokens=8192` | 310,208 tokens | 1.22x | Worse latency and token pace in the 262K/c4 pain point |
Benchmark Tools
| Tool | Purpose |
|---|---|
| `vllm bench serve` | Main serving benchmark for TTFT, TPOT, ITL, throughput, request success/failure, and long-context stress |
| GuideLLM 0.6.0 | Load-shape corroboration with concurrent and constant profiles |
The GuideLLM results are used as corroborating evidence, not
as a direct apples-to-apples replacement for
vllm bench serve, because its workload and latency
semantics differ.
Key Findings
1. Native 262K is not a safe full-prompt budget
The native 262K deployment started successfully. However,
setting the random benchmark input to exactly
262144 failed before generation, because the
effective tokenized request exceeded the configured limit:
| Native 262K exact-boundary case | Result |
|---|---|
| Input payload requested | 262144 random input tokens |
| Effective tokenized sequence | 271524 tokens |
| Configured max model length | 262144 |
| API result | Bad Request |
Customer-facing interpretation:
--max-model-len=262144 is the total sequence
budget, not a safe prompt payload size. Production workloads
must reserve margin for chat template, system prompt, requested
output, and tokenizer expansion.
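
One way to turn this guidance into a client-side limit is to subtract the reserved items from the total budget and then divide by an expansion ratio. The sketch below assumes the single expansion ratio observed in the exact-boundary case (271524 / 262144); the function name, the 128-token output reservation, and the zero fixed overhead are illustrative assumptions, not measured values for every workload.

```python
import math


def safe_payload_tokens(
    max_model_len: int,
    output_tokens: int,
    fixed_overhead_tokens: int,
    expansion_ratio: float,
) -> int:
    """Estimate the largest client payload that should still fit.

    max_model_len          total sequence budget (prompt + output)
    output_tokens          tokens reserved for the requested output
    fixed_overhead_tokens  system prompt and template tokens counted separately
    expansion_ratio        effective tokens / payload tokens (assumed constant)
    """
    available = max_model_len - output_tokens - fixed_overhead_tokens
    if available <= 0:
        return 0
    return math.floor(available / expansion_ratio)


# Ratio taken from the one rejected request in this report (an assumption
# when applied to other prompts; re-measure it for real customer text).
OBSERVED_EXPANSION = 271524 / 262144

budget = safe_payload_tokens(
    max_model_len=262144,
    output_tokens=128,          # assumed output reservation
    fixed_overhead_tokens=0,    # assumed: overhead folded into the ratio
    expansion_ratio=OBSERVED_EXPANSION,
)

if __name__ == "__main__":
    print(f"estimated safe payload: {budget} tokens")
```

The ratio comes from a single random-token request, and the near-limit probes passed at up to 253,000 input tokens, so the real overhead varies by request. Treat the result as a starting margin to validate, not as a hard limit.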
2. Native 262K high concurrency reproduces the stuck-feeling symptom
The supplement matrix intentionally expanded the native 262K baseline to c16 and near-limit c4 cases.
| Scenario | Input tokens | Concurrency | Success / failure | Mean TTFT | P99 TTFT | Mean ITL | Output tok/s |
|---|---|---|---|---|---|---|---|
| native262k-supplement-ctx64k-c16 | 65,536 | 16 | 16 / 0 | 139.53s | 249.27s | 477.66ms | 1.88 |
| native262k-supplement-ctx128k-c16 | 131,072 | 16 | 16 / 0 | 279.37s | 517.27s | 476.41ms | 0.89 |
| native262k-supplement-ctx196k-c8 | 196,608 | 8 | 8 / 0 | 238.58s | 409.01s | 454.01ms | 0.53 |
| native262k-supplement-ctx240k-c4 | 245,760 | 4 | 4 / 0 | 155.87s | 235.55s | 408.69ms | 0.38 |
| native262k-supplement-ctx250k-c4 | 250,000 | 4 | 4 / 0 | 160.74s | 243.22s | 409.97ms | 0.37 |
Server logs during this run showed the operational signature we were looking for: requests were still running, but the waiting queue grew and generation throughput dropped close to zero in multiple samples.
Examples captured during the run:
| Observed server-side signal | Interpretation |
|---|---|
| `Running: 2 reqs, Waiting: 14 reqs, Avg generation throughput: 0.5 tokens/s` | The API server is alive, but requests are queueing and generation progress is very slow |
| `Running: 1 reqs, Waiting: 14 reqs, Avg generation throughput: 0.0 tokens/s` | Client-visible behavior can look stuck even without a pod crash |
| `Running: 2 reqs, Waiting: 10 reqs, Avg generation throughput: 0.4 tokens/s` | Long-prefill concurrency is the pressure point |
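
The signals in the table above can be detected mechanically from server logs. The following is a minimal sketch that parses the three fields shown in the captured samples and flags a sample as stalled. The thresholds (`min_waiting=4`, `max_gen_tps=1.0`) and all names are hypothetical choices for illustration, and the field wording is assumed to match the fragments quoted in this report.

```python
import re
from typing import NamedTuple, Optional

# Each field is matched independently so field order in the log line does not matter.
RUNNING = re.compile(r"Running: (\d+) reqs")
WAITING = re.compile(r"Waiting: (\d+) reqs")
GEN_TPS = re.compile(r"Avg generation throughput: (\d+(?:\.\d+)?) tokens/s")


class Sample(NamedTuple):
    running: int
    waiting: int
    gen_tps: float


def parse_sample(line: str) -> Optional[Sample]:
    """Extract running, waiting, and generation throughput from one log line."""
    r, w, g = RUNNING.search(line), WAITING.search(line), GEN_TPS.search(line)
    if not (r and w and g):
        return None
    return Sample(int(r.group(1)), int(w.group(1)), float(g.group(1)))


def looks_stalled(s: Sample, min_waiting: int = 4, max_gen_tps: float = 1.0) -> bool:
    """Hypothetical rule: a deep queue combined with near-zero generation."""
    return s.waiting >= min_waiting and s.gen_tps <= max_gen_tps


CAPTURED = [
    "Running: 2 reqs, Waiting: 14 reqs, Avg generation throughput: 0.5 tokens/s",
    "Running: 1 reqs, Waiting: 14 reqs, Avg generation throughput: 0.0 tokens/s",
    "Running: 2 reqs, Waiting: 10 reqs, Avg generation throughput: 0.4 tokens/s",
]

if __name__ == "__main__":
    for line in CAPTURED:
        sample = parse_sample(line)
        print(sample, "stalled" if sample and looks_stalled(sample) else "ok")
```

A rule of this kind is useful as an alert alongside pod health checks, because the pod stays healthy throughout the stall.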
3. 1M RoPE can process 800K and 900K inputs, but service quality is poor
| Scenario | Input tokens | Concurrency | Success / failure | Mean TTFT | P99 TTFT | Output tok/s | Benchmark duration |
|---|---|---|---|---|---|---|---|
| one-mctx-rope-1010k-ctx262k-c1 | 262,144 | 1 | 1 / 0 | 96.89s | 96.89s | 1.71 | 149.96s |
| one-mctx-rope-1010k-ctx262k-c2 | 262,144 | 2 | 2 / 0 | 43.96s | 85.78s | 3.67 | 139.68s |
| one-mctx-rope-1010k-ctx262k-c4 | 262,144 | 4 | 4 / 0 | 114.14s | 218.18s | 1.88 | 271.88s |
| one-mctx-rope-1010k-ctx512k-c1 | 524,288 | 1 | 1 / 0 | 234.01s | 234.01s | 0.76 | 337.48s |
| one-mctx-rope-1010k-ctx512k-c2 | 524,288 | 2 | 2 / 0 | 118.29s | 231.58s | 0.76 | 335.44s |
| one-mctx-rope-1010k-ctx800k-c1 | 819,200 | 1 | 1 / 0 | 478.97s | 478.97s | 0.19 | 680.74s |
| one-mctx-rope-1010k-ctx800k-c2 | 819,200 | 2 | 2 / 0 | 242.53s | 474.73s | 0.19 | 680.15s |
| one-mctx-rope-1010k-ctx900k-c1 | 921,600 | 1 | 1 / 0 | 584.29s | 584.29s | 0.07 | 918.35s |
Customer-facing interpretation: “It can run a 900K input” is true, but it does not imply acceptable service behavior for large-document summarization. On this node, 800K and 900K inputs are stress or boundary cases, not healthy production settings.
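
The c1 rows above also allow a rough estimate of effective prefill speed. The sketch below divides input tokens by mean TTFT for each single-request case, under the assumption that TTFT at concurrency 1 is dominated by prefill and not by queueing. The names are illustrative; the numbers are copied from the table.

```python
# (label, input tokens, mean TTFT in seconds) for the c1 rows of the 1M RoPE table.
C1_RUNS = [
    ("262k", 262_144, 96.89),
    ("512k", 524_288, 234.01),
    ("800k", 819_200, 478.97),
    ("900k", 921_600, 584.29),
]


def prefill_rate(input_tokens: int, ttft_s: float) -> float:
    """Effective prefill throughput in tokens per second (assumes TTFT ~ prefill time)."""
    return input_tokens / ttft_s


rates = {label: prefill_rate(tokens, ttft) for label, tokens, ttft in C1_RUNS}

if __name__ == "__main__":
    for label, rate in rates.items():
        print(f"{label}: {rate:,.0f} prefill tok/s")
```

Under that assumption the effective rate falls from roughly 2,700 tok/s at 262K to under 1,600 tok/s at 900K, so TTFT grows faster than linearly with prompt length on this node.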
4. Chunked prefill is required for the tested 1M RoPE setup
The 1M RoPE deployment did not explicitly pass
--enable-chunked-prefill, but vLLM enabled it
automatically. A controlled test with
--no-enable-chunked-prefill entered
CrashLoopBackOff.
The relevant validation error was:
max_num_batched_tokens (4096) is smaller than max_model_len (1010000)
Customer-facing interpretation: chunked prefill is not just
an optional tuning knob in this configuration. It is required
for the 1M context configuration when
max_model_len=1010000 and
max_num_batched_tokens=4096.
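
The constraint behind that error can be restated as a small check. This is a simplified sketch inferred from the error message above, not vLLM's actual validation code, and the function names are hypothetical. The second function estimates how many prefill chunks a prompt needs, assuming each scheduler step can spend the full token budget on that one request.

```python
import math


def validate_prefill_config(
    max_model_len: int, max_num_batched_tokens: int, chunked_prefill: bool
) -> None:
    """Raise if a full-length prompt cannot be prefilled in one batch without chunking."""
    if not chunked_prefill and max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len})"
        )


def prefill_chunks(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Lower bound on prefill steps when the prompt is split into fixed-size chunks."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


if __name__ == "__main__":
    validate_prefill_config(1_010_000, 4096, chunked_prefill=True)  # passes
    print(prefill_chunks(921_600, 4096))  # chunks for the 900K benchmark input
    try:
        validate_prefill_config(1_010_000, 4096, chunked_prefill=False)
    except ValueError as err:
        print(err)
```

With a 4096-token budget, a 921,600-token prompt needs at least 225 prefill chunks before the first output token, which helps explain why concurrent long prompts queue behind each other.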
5. Increasing max_num_batched_tokens to 8192 made the pain point worse
The tested representative pain point was 1M RoPE with 262K input, c4 concurrency, and 128 output tokens.
| Parameter setting | Success / failure | Mean TTFT | P99 TTFT | Mean TPOT | P99 TPOT | Mean ITL | P99 ITL | Output tok/s |
|---|---|---|---|---|---|---|---|---|
| `max_num_batched_tokens=4096` | 4 / 0 | 114.14s | 218.18s | 514.73ms | 687.86ms | 521.92ms | 917.66ms | 1.88 |
| `max_num_batched_tokens=8192` | 4 / 0 | 205.59s | 311.97s | 1007.01ms | 1982.24ms | 1031.38ms | 2767.14ms | 1.40 |
Recommendation: keep
--max-num-batched-tokens=4096 as the current tested
starting point for this hardware, model, and customer-like
workload. This is not claimed to be a global optimum; it is the
safer result among the tested 4096 vs 8192 pair.
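
To make the size of the regression explicit, the two rows can be compared as relative changes. The sketch below is plain arithmetic on the values in the table; the dictionary keys and function name are illustrative.

```python
# Values copied from the 4096 vs 8192 comparison table (262K input, c4, 128 output tokens).
MBT_4096 = {
    "mean_ttft_s": 114.14, "p99_ttft_s": 218.18,
    "mean_tpot_ms": 514.73, "p99_tpot_ms": 687.86,
    "mean_itl_ms": 521.92, "p99_itl_ms": 917.66,
    "output_tok_s": 1.88,
}
MBT_8192 = {
    "mean_ttft_s": 205.59, "p99_ttft_s": 311.97,
    "mean_tpot_ms": 1007.01, "p99_tpot_ms": 1982.24,
    "mean_itl_ms": 1031.38, "p99_itl_ms": 2767.14,
    "output_tok_s": 1.40,
}


def pct_change(base: float, candidate: float) -> float:
    """Relative change of candidate versus base, in percent."""
    return (candidate - base) / base * 100.0


changes = {key: pct_change(MBT_4096[key], MBT_8192[key]) for key in MBT_4096}

if __name__ == "__main__":
    for key, value in changes.items():
        print(f"{key}: {value:+.1f}%")
```

Every latency metric is higher with 8192 (mean TTFT by about 80%, P99 TPOT by nearly 190%), while output throughput drops by about a quarter.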
Recommended Guardrails
| Area | Recommendation |
|---|---|
| Native 262K prompt size | Do not allow client prompt payloads to consume the full `262144` budget. Reserve safety margin for template, system prompt, output, and tokenizer expansion. |
| Long-document summarization | Cap concurrency by prompt length. Short-prompt concurrency results do not predict 64K, 128K, 250K, or 900K behavior. |
| 1M RoPE | Treat 262K/c4, 512K/c2, 800K, and 900K as stress or boundary conditions unless the SLO accepts minute-level TTFT. |
| Chunked prefill | Do not disable it for the tested 1M RoPE configuration. |
| `max_num_batched_tokens` | Start with `4096` on this node. The tested `8192` setting was worse for the representative pain point. |
| Customer success criteria | Do not evaluate only request success rate. Tail TTFT, output throughput, and ITL are required to detect the bad user experience. |
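
The last guardrail can be illustrated with a small acceptance check that looks beyond success rate. In the sketch below, the results are the native 262K supplement rows from this report, while the SLO thresholds (P99 TTFT of at most 60 s and at least 1.0 output tok/s) are hypothetical placeholders that a customer would replace with their own targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    succeeded: int
    failed: int
    p99_ttft_s: float
    output_tok_s: float


def success_rate(r: Result) -> float:
    return r.succeeded / (r.succeeded + r.failed)


def meets_slo(r: Result, max_p99_ttft_s: float = 60.0, min_output_tok_s: float = 1.0) -> bool:
    """Hypothetical SLO: no failures, bounded tail TTFT, minimum output throughput."""
    return (
        r.failed == 0
        and r.p99_ttft_s <= max_p99_ttft_s
        and r.output_tok_s >= min_output_tok_s
    )


# Rows copied from the native 262K supplement matrix.
RESULTS = [
    Result("native262k-supplement-ctx64k-c16", 16, 0, 249.27, 1.88),
    Result("native262k-supplement-ctx128k-c16", 16, 0, 517.27, 0.89),
    Result("native262k-supplement-ctx196k-c8", 8, 0, 409.01, 0.53),
    Result("native262k-supplement-ctx240k-c4", 4, 0, 235.55, 0.38),
    Result("native262k-supplement-ctx250k-c4", 4, 0, 243.22, 0.37),
]

if __name__ == "__main__":
    for r in RESULTS:
        print(f"{r.name}: success={success_rate(r):.0%} slo_ok={meets_slo(r)}")
```

With these placeholder thresholds, every scenario reports 100% success and still fails the SLO check, which is the pattern the guardrail is meant to catch.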
Primary Multi-Metric Trend Dashboards
Native 262K Combined Dashboard
1M RoPE Combined Dashboard
Startup Parameter Comparison
GuideLLM Combined Dashboard
Internal Review Notes
| Topic | Note |
|---|---|
| Runtime compatibility | Earlier startup attempts failed with an NVIDIA driver too old for the RHAIIS 3.4 CUDA/PyTorch stack. The successful validation used driver 580.126.20. |
| HostPath security | The persistent model cache required a privileged route in this validation because the node path was under `/var/lib` and needed GPU/runtime access. |
| Scope | This report validates `RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic` on 4 x L40S. Do not automatically generalize the numeric thresholds to larger models or different GPU topology. |
| 397B customer mentions | If the customer discusses a different 397B/1M setup, use this report as evidence about long-context behavior and tuning principles, not as a direct capacity result for that larger model. |
| Current live cluster state after testing | The deployment was left on native 262K with `--max-model-len=262144` and `--max-num-batched-tokens=4096`. |
Image Manifest
The following images are included in this standalone package:
| Image | Purpose |
|---|---|
| `images/01-customer-argument-overview-dashboard.png` | Cross-scenario customer-facing overview |
| `images/02-native262k-combined-dashboard.png` | Native 262K multi-metric dashboard |
| `images/03-one-mctx-rope-combined-dashboard.png` | 1M RoPE multi-metric dashboard |
| `images/04-startup-4096-vs-8192-dashboard.png` | Startup parameter comparison |
| `images/05-guidellm-combined-dashboard.png` | GuideLLM load-shape dashboard |