Final Integrated Benchmark Report: Qwen3.6-27B-FP8 on Upstream vLLM and Red Hat RHAI vLLM
| Field | Value |
|---|---|
| Test date | 2026-06-04 |
| Model | Qwen/Qwen3.6-27B-FP8 |
| Target host | RHEL AI / rpm-ostree-like RHEL host reached over SSH through proxy 127.0.0.1:5085 |
| GPU | 4 x NVIDIA L4, 23034 MiB per GPU, driver 550.163.01, CUDA 12.4 |
| Storage used | 4 x approximately 800G NVMe XFS devices; benchmark cache/results under /mnt/bench-nvme* |
| Runtime rule | Podman only; Docker was not used |
| Serving API | OpenAI-compatible API on port 8000 during each serving run |
Executive Summary
All three serving stacks ran
Qwen/Qwen3.6-27B-FP8 successfully under the
8K-input, 512-output, concurrency-4 GuideLLM long-reasoning
workload. Upstream vLLM 0.21, Red Hat RHAI vLLM 3.4, and
upstream latest, which resolved to vLLM 0.22.0 during this test,
each completed 12/12 GuideLLM requests without errors.
The recommended Red Hat RHAI vLLM 3.4 customer starting point on this 4 x L4 host remains:
tensor_parallel_size=4
gpu_memory_utilization=0.84
max_model_len=32768
max_num_seqs=64
max_num_batched_tokens=32768
This configuration produced the best observed RHAI throughput
in the main GuideLLM run: 55.3 output tokens/s and 927.9 total
tokens/s. The gpu_memory_utilization=0.88,
max_num_seqs=64,
max_num_batched_tokens=32768 variant also passed
the main long run and reduced RHAI p95 latency from 76.7s to
70.8s, but with lower output throughput.
The destructive runtime risk is now also measured. RHAI vLLM 3.4 can start successfully and pass health checks, then degrade or fail later under long-context, agent-like concurrency when the scheduling envelope is pushed too far:
- gpu_memory_utilization=0.88, max_num_seqs=128, max_num_batched_tokens=65536 stayed healthy through c4, then at 12K c8 completed 8/16 target requests, produced 8 request errors, health became not_ready, and server logs captured EngineDeadError plus HTTP 500 responses.
- gpu_memory_utilization=0.84, max_num_seqs=256, max_num_batched_tokens=32768 stayed healthy through c4, then at 12K c8 completed 8/16 target requests, health became not_ready, p95 end-to-end latency reached 315.0s, and p95 TPOT reached 724.2ms.
- The 16K-input pilot for 0.88 / 128 / 65536 did not crash, but it became extremely slow at c8: p95 latency 200.6s, p95 TTFT 82018ms, and p95 TPOT 209.0ms while health still returned ready.
Target Host and Runtime
The test target was an AWS EC2-style GPU host. It was a RHEL
AI / rpm-ostree-like system, so the test avoided mutable-root
assumptions: no changes were made under /usr, and
persistent high-write data stayed under /var-backed
paths and the NVMe mounts.
| Component | Observed value |
|---|---|
| GPU | 4 x NVIDIA L4 |
| GPU memory | 23034 MiB per GPU |
| NVIDIA driver | 550.163.01 |
| CUDA reported by driver | 12.4 |
| CPU | 48 vCPU AMD EPYC 7R13 |
| System memory | Approximately 181 GiB |
| Container runtime | Podman 4.9.4-rhel |
| Benchmark storage | 4 x approximately 800G NVMe XFS |
| Mount | Purpose |
|---|---|
| /mnt/bench-nvme1 | Hugging Face cache and workspace root |
| /mnt/bench-nvme2 | Temporary directories, XDG cache, vLLM compile/cache directories |
| /mnt/bench-nvme3 | Benchmark logs, JSON/CSV/HTML outputs, result archives |
| /mnt/bench-nvme4 | Scratch/profile-ready space |
Serving Stacks
| Stack | Image | Observed version |
|---|---|---|
| Upstream vLLM 0.21 | docker.io/vllm/vllm-openai:v0.21.0 | vLLM 0.21.0 |
| Red Hat RHAI vLLM 3.4 | registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0 | vLLM 0.18.0+rhaiv.7 |
| Upstream latest | docker.io/vllm/vllm-openai:latest | vLLM 0.22.0 during this test |
All serving runs exposed the model through the OpenAI-compatible API and used tensor parallelism across four NVIDIA L4 GPUs.
Main Long-Reasoning Benchmark
Workload: approximately 8K input tokens, 512 output tokens, concurrency 4, 12 GuideLLM requests, 359-second measurement window.
| Stack | Scenario | OK/Err | Output tok/s | Total tok/s | Latency p50/p95 | TTFT p50/p95 | TPOT p50/p95 |
|---|---|---|---|---|---|---|---|
| upstream-v0.21 | long-8k-512-c4 | 12/0 | 52.5 | 881.4 | 40.9s / 56.0s | 4746ms / 18127ms | 76.3ms / 92.1ms |
| redhat-rhai-3.4-gmu84-seq64 | long-8k-512-c4 | 12/0 | 55.3 | 927.9 | 45.4s / 76.7s | 4890ms / 37097ms | 88.8ms / 126.1ms |
| redhat-rhai-3.4-gmu88-seq64 | long-8k-512-c4 | 12/0 | 51.4 | 863.0 | 41.9s / 70.8s | 4907ms / 31196ms | 88.9ms / 116.4ms |
| upstream-latest-0.22 | long-8k-512-c4 | 12/0 | 53.3 | 894.8 | 40.2s / 56.9s | 5086ms / 18973ms | 77.3ms / 99.3ms |
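As a consistency check on the table above, the two throughput columns can be used to estimate the mean input length per request. The sketch below assumes that total tok/s counts input plus output tokens over the same window and that every request generated exactly 512 output tokens; neither assumption is confirmed by the benchmark tooling output shown here, and the function name is hypothetical.

```python
# (stack, output tok/s, total tok/s) copied from the main benchmark table.
ROWS = [
    ("upstream-v0.21", 52.5, 881.4),
    ("redhat-rhai-3.4-gmu84-seq64", 55.3, 927.9),
    ("redhat-rhai-3.4-gmu88-seq64", 51.4, 863.0),
    ("upstream-latest-0.22", 53.3, 894.8),
]

OUTPUT_TOKENS = 512  # per-request output length stated for this workload


def implied_input_tokens(out_tps: float, total_tps: float,
                         output_tokens: int = OUTPUT_TOKENS) -> float:
    """Estimate mean input tokens per request from the throughput columns.

    Assumes total/output throughput equals (input + output) / output
    tokens per request, which holds only if both rates cover the same
    requests and time window.
    """
    return (total_tps / out_tps - 1.0) * output_tokens


estimates = {name: implied_input_tokens(out, total) for name, out, total in ROWS}
for name, tokens in estimates.items():
    print(f"{name}: ~{tokens:.0f} input tokens per request")
```

Under these assumptions all four stacks land close to the same input length, in line with the "approximately 8K" workload description, which suggests the throughput columns are comparable across stacks.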
vLLM Built-In Benchmark Reference
The vLLM built-in benchmark was run for upstream vLLM 0.21 as an additional reference point, separate from GuideLLM.
| Stack | Scenario | OK/Err | Output tok/s | Total tok/s | Latency p50/p95 | TTFT p50/p95 | TPOT p50/p95 |
|---|---|---|---|---|---|---|---|
| upstream-v0.21 | agent-concurrency-8k-512-c4 | 24/0 | 49.3 | 870.1 | 38.9s / 47.9s | 4895ms / 12230ms | 72.1ms / 76.4ms |
| upstream-v0.21 | long-single-4k-512 | 8/0 | 23.6 | 194.9 | 22.5s / 23.9s | 2118ms / 2464ms | 37.5ms / 37.9ms |
Red Hat RHAI vLLM 3.4 Parameter Search and Recommended Range
The first RHAI 3.4 default-style attempt with
max_num_seqs=256 failed during sampler warmup with
CUDA OOM. Lowering max_num_seqs was required on
this 4 x L4 host before full long-run validation.
Startup and short chat probes passed for these RHAI 3.4 candidates:
| Case | GPU memory utilization | max_num_seqs | max_num_batched_tokens | Ready | Exit code | OOMKilled | Chat HTTP |
|---|---|---|---|---|---|---|---|
| gmu84-seq64-bt32768 | 0.84 | 64 | 32768 | 1 | 0 | false | 200 |
| gmu86-seq64-bt32768 | 0.86 | 64 | 32768 | 1 | 0 | false | 200 |
| gmu88-seq64-bt32768 | 0.88 | 64 | 32768 | 1 | 0 | false | 200 |
| gmu84-seq128-bt32768 | 0.84 | 128 | 32768 | 1 | 0 | false | 200 |
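A startup probe of this kind can be sketched as a bounded polling loop. The example below is an illustration only, not the script used in this test: the helper names are hypothetical, and it assumes the server exposes the vLLM /health endpoint on the serving port and that HTTP 200 means ready.

```python
import time
import urllib.request
from typing import Callable


def http_health_ok(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True when GET {base_url}/health answers HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all derive
        # from OSError; any of them means the server is not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int, delay_s: float,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A caller might use `wait_until_ready(lambda: http_health_ok("http://127.0.0.1:8000"), attempts=120, delay_s=5.0)`, where the loopback address and the retry budget are assumptions. A passing probe only shows that the server started; it says nothing about behavior under long-context concurrency.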
Two RHAI parameter sets received full GuideLLM long-run validation:
| Configuration | Full long run status | Interpretation |
|---|---|---|
| gpu_memory_utilization=0.84, max_num_seqs=64, max_num_batched_tokens=32768 | Passed, 12/12 successful, 0 errors | Best observed RHAI throughput in the main long benchmark. Recommended throughput-first setting. |
| gpu_memory_utilization=0.88, max_num_seqs=64, max_num_batched_tokens=32768 | Passed, 12/12 successful, 0 errors | Lower p95 latency than the 0.84 run, but lower output throughput. Useful as a latency-balanced alternative. |
The 0.84 / 128 / 32768 case passed startup and
short chat probing, but it should not replace the recommended
setting without full long-run validation at the customer’s
target concurrency.
Runtime Degradation Matrix
This matrix tested Red Hat RHAI vLLM 3.4 startup parameters against a fixed long-agent workload shape: approximately 12K input tokens, 640 output tokens, concurrency 1/2/4/8. It is the key evidence for the destructive pattern that appears after successful startup.
| Startup parameters | c | OK/Err/Target | Success | Health | GPU MiB | Out tok/s | Latency p95 | TTFT p95 | TPOT p95 |
|---|---|---|---|---|---|---|---|---|---|
| gmu .84 / seq 64 / bt 32K | 1 | 4/0/4 | 100% | ready | 76340 | 21.7 | 50.6s | 18687ms | 60.8ms |
| gmu .84 / seq 64 / bt 32K | 2 | 8/0/8 | 100% | ready | 81188 | 33.5 | 57.6s | 14833ms | 80.6ms |
| gmu .84 / seq 64 / bt 32K | 4 | 12/0/12 | 100% | ready | 85540 | 46.8 | 88.3s | 28406ms | 109.3ms |
| gmu .84 / seq 64 / bt 32K | 8 | 16/0/16 | 100% | ready | 85540 | 55.8 | 152.6s | 60166ms | 191.0ms |
| gmu .84 / seq 128 / bt 32K | 1 | 4/0/4 | 100% | ready | 78380 | 21.7 | 50.5s | 18607ms | 60.7ms |
| gmu .84 / seq 128 / bt 32K | 2 | 8/0/8 | 100% | ready | 78380 | 30.9 | 57.4s | 14558ms | 73.1ms |
| gmu .84 / seq 128 / bt 32K | 4 | 12/0/12 | 100% | ready | 83836 | 44.3 | 88.3s | 28268ms | 109.5ms |
| gmu .84 / seq 128 / bt 32K | 8 | 16/0/16 | 100% | ready | 88188 | 55.6 | 152.9s | 59303ms | 191.3ms |
| gmu .88 / seq 64 / bt 32K | 1 | 4/0/4 | 100% | ready | 79924 | 21.6 | 51.2s | 19010ms | 61.6ms |
| gmu .88 / seq 64 / bt 32K | 2 | 8/0/8 | 100% | ready | 84772 | 33.6 | 57.5s | 14792ms | 80.6ms |
| gmu .88 / seq 64 / bt 32K | 4 | 11/0/12 | 92% | ready | 84772 | 45.4 | 86.6s | 29329ms | 104.1ms |
| gmu .88 / seq 64 / bt 32K | 8 | 16/0/16 | 100% | ready | 89124 | 57.6 | 152.1s | 58641ms | 200.9ms |
| gmu .88 / seq 128 / bt 64K | 1 | 4/0/4 | 100% | ready | 70388 | 21.7 | 50.7s | 18616ms | 60.9ms |
| gmu .88 / seq 128 / bt 64K | 2 | 8/0/8 | 100% | ready | 75004 | 33.5 | 57.6s | 14867ms | 80.7ms |
| gmu .88 / seq 128 / bt 64K | 4 | 12/0/12 | 100% | ready | 76628 | 45.2 | 87.6s | 29084ms | 109.5ms |
| gmu .88 / seq 128 / bt 64K | 8 | 8/8/16 | 50% | not_ready | 21590 | 1083.1 | 11.7s | 7563ms | 26.9ms |
| gmu .84 / seq 256 / bt 32K | 1 | 4/0/4 | 100% | ready | 81760 | 21.6 | 51.0s | 18695ms | 61.3ms |
| gmu .84 / seq 256 / bt 32K | 2 | 8/0/8 | 100% | ready | 86766 | 33.5 | 57.4s | 14792ms | 80.6ms |
| gmu .84 / seq 256 / bt 32K | 4 | 12/0/12 | 100% | ready | 87914 | 47.3 | 86.8s | 29516ms | 117.4ms |
| gmu .84 / seq 256 / bt 32K | 8 | 8/0/16 | 50% | not_ready | 22170 | 9.8 | 315.0s | 22468ms | 724.2ms |
The throughput number for failed c8 rows should be read
together with success rate and health. For example,
0.88 / 128 / 65536 reports high completed-request
throughput at c8 because half the workload failed quickly; the
governing signals are 50% success rate, not_ready
health, and EngineDeadError.
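The reading rule above can be made mechanical. The following sketch, with a hypothetical RunResult type, treats a throughput figure as comparable only when every target request completed and the server was still ready after the run; the two sample rows are copied from the c8 rows of the matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    label: str
    ok: int
    errors: int
    target: int
    health: str
    out_tok_s: float

    @property
    def success_rate(self) -> float:
        """Fraction of target requests that completed successfully."""
        return self.ok / self.target

    @property
    def throughput_is_trustworthy(self) -> bool:
        # Throughput is only comparable when the full workload completed
        # and the server still reported ready afterwards.
        return self.ok == self.target and self.health == "ready"


# Two c8 rows copied from the runtime degradation matrix.
rows = [
    RunResult("gmu .84 / seq 64 / bt 32K", 16, 0, 16, "ready", 55.8),
    RunResult("gmu .88 / seq 128 / bt 64K", 8, 8, 16, "not_ready", 1083.1),
]

for r in rows:
    verdict = "usable" if r.throughput_is_trustworthy else "discard throughput"
    print(f"{r.label}: {r.success_rate:.0%} success, health={r.health}, {verdict}")
```

With this rule the 1083.1 tok/s figure is discarded rather than ranked, because it describes only the half of the workload that finished before the engine died.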
Longer 16K Degradation Pilot
The 16K-input pilot used
gpu_memory_utilization=0.88,
max_num_seqs=128,
max_num_batched_tokens=65536 with approximately 16K
input tokens and 768 output tokens. It demonstrates the
slow-response version of the same risk: at c8 the service still
answered health checks, but p95 latency and TTFT became
customer-visible failures.
| Startup parameters | c | OK/Err/Target | Health | GPU MiB | Out tok/s | Latency p95 | TTFT p95 | TPOT p95 |
|---|---|---|---|---|---|---|---|---|
| 16K pilot: gmu .88 / seq 128 / bt 64K | 1 | 4/0/4 | ready | 74620 | 21.4 | 58.0s | 21407ms | 60.4ms |
| 16K pilot: gmu .88 / seq 128 / bt 64K | 2 | 8/0/8 | ready | 74620 | 29.9 | 69.6s | 19469ms | 75.1ms |
| 16K pilot: gmu .88 / seq 128 / bt 64K | 4 | 12/0/12 | ready | 89492 | 44.3 | 114.3s | 40865ms | 119.1ms |
| 16K pilot: gmu .88 / seq 128 / bt 64K | 8 | 16/0/16 | ready | 89634 | 51.4 | 200.6s | 82018ms | 209.0ms |
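Because health stayed ready in every pilot row, a health check alone would not have flagged the c8 result. One way to catch it is to gate on latency limits as well. In the sketch below the two thresholds are purely illustrative assumptions, not limits defined by this report; the row values are copied from the pilot table.

```python
# Hypothetical interactive-agent limits. These thresholds are illustrative
# assumptions, not values defined or measured by this report.
MAX_P95_LATENCY_S = 120.0
MAX_P95_TTFT_MS = 45_000.0

# (concurrency, health, p95 latency in s, p95 TTFT in ms) from the 16K pilot table.
PILOT_ROWS = [
    (1, "ready", 58.0, 21407.0),
    (2, "ready", 69.6, 19469.0),
    (4, "ready", 114.3, 40865.0),
    (8, "ready", 200.6, 82018.0),
]


def slo_violations(latency_s: float, ttft_ms: float) -> list[str]:
    """List which of the assumed p95 limits a run exceeds."""
    violations = []
    if latency_s > MAX_P95_LATENCY_S:
        violations.append("latency")
    if ttft_ms > MAX_P95_TTFT_MS:
        violations.append("ttft")
    return violations


# A row is a silent failure when health is ready but an assumed limit is broken.
silent_failures = [
    c for c, health, latency_s, ttft_ms in PILOT_ROWS
    if health == "ready" and slo_violations(latency_s, ttft_ms)
]
print(silent_failures)
```

With these assumed limits only the c8 row is flagged. Different limits would move the boundary, so the thresholds should come from the customer's own interactive requirements.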
Destructive Boundary Summary
| Stack | Destructive parameters | Failure mode |
|---|---|---|
| Upstream vLLM 0.21 | gpu_memory_utilization=0.999, max_model_len=65536, max_num_seqs=256, max_num_batched_tokens=131072 | Startup admission failure. Container exited with code 1, OOMKilled=false; free GPU memory was below the target requested by utilization 0.999. |
| Upstream latest / vLLM 0.22.0 | gpu_memory_utilization=0.999, max_model_len=65536, max_num_seqs=256, max_num_batched_tokens=131072 | Same startup admission failure as upstream v0.21. |
| Red Hat RHAI vLLM 3.4 | gpu_memory_utilization=0.88, max_num_seqs=128, max_num_batched_tokens=65536 | Successful startup, then runtime degradation/failure at 12K c8: 8/16 target requests completed, 8 errors, health not_ready, EngineDeadError, HTTP 500 responses. |
| Red Hat RHAI vLLM 3.4 | gpu_memory_utilization=0.84, max_num_seqs=256, max_num_batched_tokens=32768 | Successful startup, then runtime degradation at 12K c8: 8/16 target requests completed, health not_ready, p95 latency 315.0s, p95 TPOT 724.2ms. |
Recommendation
Use this RHAI vLLM 3.4 configuration as the customer-facing starting point on this 4 x NVIDIA L4 host:
tensor_parallel_size=4
gpu_memory_utilization=0.84
max_model_len=32768
max_num_seqs=64
max_num_batched_tokens=32768
For latency-sensitive tests, evaluate this variant under the customer’s actual workload:
tensor_parallel_size=4
gpu_memory_utilization=0.88
max_model_len=32768
max_num_seqs=64
max_num_batched_tokens=32768
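The parameter names above correspond to vLLM engine arguments, which the vllm serve command line accepts in dashed form (for example --max-num-seqs). The sketch below shows that mapping for the two recommended configurations. The ServeConfig class and to_cli_flags helper are hypothetical names, and the exact Podman invocation used in this test is not reproduced here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServeConfig:
    """Startup parameters named in this report; defaults are the
    throughput-first recommendation."""
    tensor_parallel_size: int = 4
    gpu_memory_utilization: float = 0.84
    max_model_len: int = 32768
    max_num_seqs: int = 64
    max_num_batched_tokens: int = 32768


def to_cli_flags(cfg: ServeConfig) -> list[str]:
    """Translate snake_case fields into dashed command-line flags."""
    flags: list[str] = []
    for name, value in asdict(cfg).items():
        flags += [f"--{name.replace('_', '-')}", str(value)]
    return flags


THROUGHPUT_FIRST = ServeConfig()
LATENCY_BALANCED = ServeConfig(gpu_memory_utilization=0.88)

print(" ".join(to_cli_flags(THROUGHPUT_FIRST)))
print(" ".join(to_cli_flags(LATENCY_BALANCED)))
```

Keeping both variants as named values makes it harder to change one knob in isolation by accident, which matters given how the three parameters interact on this host.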
Avoid treating max_num_seqs,
max_num_batched_tokens, and
gpu_memory_utilization as independent “more is
better” knobs on 4 x L4. This test shows three distinct risk
zones:
- Too much startup memory pressure can fail admission before serving starts.
- Larger scheduling envelopes can pass startup but degrade at high long-context concurrency.
- A service can remain health-ready while p95 latency and TTFT are already unacceptable for interactive agent use.
Evidence Package
| Artifact | Path |
|---|---|
| Integrated final report | solution/solution-2026.06.04.15.54.md |
| Main benchmark archive | solution/files/solution-2026-06-04-12-44/round2-benchmark-results-20260604T1244Z.tar.gz |
| Main benchmark manifest | solution/files/solution-2026-06-04-12-44/round2-benchmark-results-20260604T1244Z.manifest.txt |
| Main benchmark CSV | solution/files/solution-2026-06-04-12-44/round2-summary.csv |
| Runtime degradation archive | solution/files/solution-2026-06-04-13-40/round4-degradation-results-20260604T072632Z.tar.gz |
| Runtime degradation manifest | solution/files/solution-2026-06-04-13-40/round4-degradation-results-20260604T072632Z.manifest.txt |
| Runtime degradation CSV | solution/files/solution-2026-06-04-13-40/round4-degradation-summary.csv |
| Command audit logs | steps/steps-2026.06.04.11.04.md, steps/steps-2026.06.04.12.44.md, steps/steps-2026.06.04.13.36.md, steps/steps-2026.06.04.13.40.md, steps/steps-2026.06.04.15.54.md |
Final cleanup was verified after the benchmark and degradation work: no vLLM containers remained, no GPU processes were running, and port 8000 had no listener.