
Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe #1390

Open
hjjq wants to merge 2 commits into main from hjjq/gb200-mtp

Conversation

@hjjq
Collaborator

@hjjq hjjq commented May 15, 2026

Sweep higher concurrency and use vLLM v0.21.0.

@hjjq hjjq requested a review from a team May 15, 2026 18:13
@hjjq hjjq requested a review from kedarpotdar-nv as a code owner May 15, 2026 18:13
@hjjq hjjq requested a review from jgangani as a code owner May 15, 2026 18:13
# MegaMOE mid curve: 1 prefill (DEP=8) + 1 decode (DEP=8).
# 5 nodes total with a dedicated NATS/etcd infra node.
- conc-list: [128]
+ conc-list: [128, 256, 512, 1024]
Contributor

🔴 The recipe's model.container/identity.container.image is bumped to vllm/vllm-openai:v0.21.0-ubuntu2404, but the image: field in .github/configs/nvidia-master.yaml for dsv4-fp4-gb200-dynamo-vllm-mtp2 (line 8102) is still v0.20.1-ubuntu2404 — per AGENTS.md line 111 these must match, since the launcher registers ${IMAGE} as the container-alias key in srtslurm.yaml and srtctl resolves model.container against it. Note that the three sibling recipes under this same config-key (agg-gb200-low-latency-mtp2, disagg-gb200-low-latency-mtp2, disagg-gb200-high-tpt-megamoe-mtp2) are all still on v0.20.1, so the single top-level image: can't satisfy all four — either bump every sibling recipe together with the master image:, or split this recipe into its own config-key entry with its own image:.

Extended reasoning...

What's broken

This PR bumps the mid-curve recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve-megamoe-mtp2.yaml from vLLM v0.20.1 to v0.21.0 in two places (model.container and identity.container.image), but leaves the top-level image: field in .github/configs/nvidia-master.yaml (line 8102, under dsv4-fp4-gb200-dynamo-vllm-mtp2) unchanged at vllm/vllm-openai:v0.20.1-ubuntu2404.

Why this is a problem

AGENTS.md line 111 spells out the rule directly:

For image bumps, model.container must equal image:, since the launcher uses the latter as the container-alias key.

The mechanism is concrete. In runners/launch_gb200-nv.sh:

  • Line 71: SQUASH_FILE=".../$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — the on-disk squash file path is derived from $IMAGE (sourced from nvidia-master.yaml's top-level image:).

  • Line 74: enroot import -o $SQUASH_FILE docker://$IMAGE — the actual container pulled and squashed is the master's image:.

  • Lines 223–226: the generated srtslurm.yaml contains:

    containers:
      dynamo-trtllm: ${SQUASH_FILE}
      dynamo-sglang: ${SQUASH_FILE}
      "${IMAGE}": ${SQUASH_FILE}

    So the alias key in the containers: map is literally ${IMAGE} — i.e., the master image: string. srtctl then resolves the recipe's model.container against this map.
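The alias mechanism the bullets above describe can be condensed into a few lines of Python (a hypothetical re-implementation for illustration only; the real logic lives in launch_gb200-nv.sh and srtctl):

```python
import re

def squash_file_name(image: str) -> str:
    # Mirror the launcher's sed 's/[\/:@#]/_/g' transform on $IMAGE.
    return re.sub(r"[/:@#]", "_", image) + ".sqsh"

def build_containers_map(image: str) -> dict:
    # The generated srtslurm.yaml registers the raw ${IMAGE} string
    # itself as a container-alias key, alongside two fixed aliases.
    squash = squash_file_name(image)
    return {
        "dynamo-trtllm": squash,
        "dynamo-sglang": squash,
        image: squash,
    }
```

With the master image: still at v0.20.1, the only image-string key this map ever contains is the v0.20.1 one.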

Step-by-step proof this breaks

  1. The sweep workflow reads nvidia-master.yaml. For config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 it picks up image: vllm/vllm-openai:v0.20.1-ubuntu2404.
  2. The launcher (launch_gb200-nv.sh) sets IMAGE=vllm/vllm-openai:v0.20.1-ubuntu2404, imports it into a squash file, and emits srtslurm.yaml with containers: { "vllm/vllm-openai:v0.20.1-ubuntu2404": <squash> }.
  3. srtctl opens the mid-curve recipe yaml after this PR and reads model.container: vllm/vllm-openai:v0.21.0-ubuntu2404.
  4. It looks up that key in containers: — there is no v0.21.0-ubuntu2404 entry — alias resolution fails (or, worse, silently falls back to the v0.20.1 squash, producing benchmark results that report as v0.21.0 but actually ran v0.20.1).
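The failing lookup in step 4 can be modeled the same way (again a sketch; the function name and error type are illustrative, not srtctl's actual API):

```python
def resolve_container(recipe_container: str, containers: dict) -> str:
    # srtctl-style alias resolution: the recipe's model.container must be
    # an exact key in the containers: map emitted into srtslurm.yaml.
    if recipe_container not in containers:
        raise KeyError(f"no container alias registered for {recipe_container!r}")
    return containers[recipe_container]

# Map as registered by the launcher for the master image (still v0.20.1):
registered = {
    "vllm/vllm-openai:v0.20.1-ubuntu2404": "vllm_vllm-openai_v0.20.1-ubuntu2404.sqsh",
}
```

Resolving the PR's v0.21.0 container string against this map has no matching key, which is exactly the failure (or silent-fallback hazard) described above.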

Why bumping master image: alone won't fix it

The same config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 has four sibling recipes wired in via additional-settings CONFIG_FILE=... entries (lines 8124, 8140, 8157, 8174):

  • agg-gb200-low-latency-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-low-latency-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-high-tpt-megamoe-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-mid-curve-megamoe-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.21.0-ubuntu2404" (this PR)

There is exactly one top-level image: field for the config-key; it cannot match both v0.20.1 and v0.21.0 simultaneously. Whichever value is chosen, three of the four recipes will mismatch.

How to fix

Two acceptable resolutions per the AGENTS.md rule:

  1. Bump all four sibling recipes and the master image: together to v0.21.0-ubuntu2404 so the alias key matches every recipe under the config-key. Note that the comment on line 8099 (# ... Uses the vLLM 0.20.1 image ...) should be updated to match.
  2. Split disagg-gb200-mid-curve-megamoe-mtp2 into its own config-key entry in nvidia-master.yaml with its own image: vllm/vllm-openai:v0.21.0-ubuntu2404, leaving the other three siblings under the existing v0.20.1 config-key.
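Option 2 would look roughly like the following fragment (a hypothetical sketch: only the image: values and the CONFIG_FILE wiring come from the discussion above; the new config-key name and elided fields are illustrative):

```yaml
# Existing config-key keeps the three v0.20.1 siblings:
dsv4-fp4-gb200-dynamo-vllm-mtp2:
  image: vllm/vllm-openai:v0.20.1-ubuntu2404
  ...

# New config-key carries only the bumped mid-curve recipe:
dsv4-fp4-gb200-dynamo-vllm-mtp2-midcurve:   # hypothetical name
  image: vllm/vllm-openai:v0.21.0-ubuntu2404
  additional-settings: CONFIG_FILE=.../disagg-gb200-mid-curve-megamoe-mtp2.yaml
```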

Either way, this is benchmark-blocking as it stands — the mid-curve sweep that this PR is supposed to enable cannot resolve its container alias against the registered srtslurm.yaml containers map.

Comment thread: perf-changelog.yaml (Outdated)
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271
- "Upgrade GB200 DSV4 MTP2 1P/1D DEP8 MegaMOE mid-curve recipe to vLLM v0.21.0"
- "Extend mid-curve concurrency sweep to also benchmark conc=256, conc=512, and conc=1024"
pr-link: TBD
Contributor

🟡 The new perf-changelog entry at line 2495 has pr-link: TBD, but the repo convention (per AGENTS.md and utils/merge_with_reuse.sh) is to use https://github.com/SemiAnalysisAI/InferenceX/pull/XXX as the placeholder, which the merge tooling auto-replaces with the real PR number. Replace TBD with either https://github.com/SemiAnalysisAI/InferenceX/pull/1390 or the /pull/XXX placeholder before merge.

Extended reasoning...

What is the bug

perf-changelog.yaml line 2495 sets pr-link: TBD for the new dsv4-fp4-gb200-dynamo-vllm-mtp2 entry. Every other entry in the file uses a fully-qualified URL of the form https://github.com/SemiAnalysisAI/InferenceX/pull/<n>, and the documented placeholder convention (see AGENTS.md and benchmarks/multi_node/srt-slurm-recipes/RECIPES.md) is https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

Why existing tooling does not catch it

The Pydantic ChangelogEntry model in utils/matrix_logic/validation.py types pr_link as a plain str — it does not assert URL shape — so TBD passes validation. The merge automation in utils/merge_with_reuse.sh is the piece that normally substitutes placeholders: it scans for the literal substring XXX in pr-link and rewrites it to the merged PR's number, then asserts the final entry's pr-link ends with /pull/<PR-number>. TBD contains no XXX, so the substitution is skipped, and the trailing-PR-number assertion would then fail on merge-with-reuse paths.

Step-by-step proof

  1. The PR is merged as #1390 (Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe) with pr-link: TBD on line 2495.
  2. utils/merge_with_reuse.sh walks changelog entries and applies its substitution guard, if "XXX" not in link and not link.endswith(f"/pull/{pr}"): a link is rewritten only when it contains the literal "XXX", and left alone when it already ends with /pull/{pr}. TBD matches neither branch, so no rewrite happens.
  3. The post-condition assertion assert last['pr-link'].endswith('/pull/1390') fails because 'TBD' does not end with /pull/1390.
  4. Even on the non-merge-with-reuse path (direct merge), the changelog is left with a dangling TBD string that breaks any downstream consumer that parses pr-link as a URL (e.g. anything that constructs a clickable link from the field).
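The substitution rule and the post-condition can be condensed into a small Python sketch (a reconstruction of the behavior described above, not the actual utils/merge_with_reuse.sh code; the function name is illustrative):

```python
def finalize_pr_link(link: str, pr_number: int) -> str:
    # Rewrite the documented placeholder, if present, to the merged PR number.
    placeholder = "XXX"
    if placeholder in link:
        link = link.replace(placeholder, str(pr_number))
    # Post-condition asserted by the merge tooling: the final link must
    # end with /pull/<PR-number>. "TBD" is never rewritten, so it fails here.
    assert link.endswith(f"/pull/{pr_number}"), f"bad pr-link: {link}"
    return link
```

Running this against the /pull/XXX placeholder succeeds; running it against TBD trips the assertion, which is the merge-with-reuse failure mode flagged above.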

Addressing the refutation

A reviewer argued this is a duplicate of an already-refuted earlier bug and that TBD is just a pre-merge placeholder the author will swap before merging. Two points respond to that: (1) the repo already has an established placeholder convention — /pull/XXX — and the merge tooling depends on it; using TBD defeats that automation and is inconsistent with every other entry in the file, so it is a real (if small) deviation, not a stylistic preference. (2) Pydantic validation accepts TBD but the merge-with-reuse assertion does not, so the "author will fix it before merge" assumption is exactly the failure mode worth flagging in review — that is the moment to flip TBD to /pull/1390 (or /pull/XXX).

How to fix

Change line 2495 to pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1390 (or .../pull/XXX to let the merge tooling fill in the number). Severity is nit — a one-line find/replace, but worth doing before merge to keep the changelog consistent and the merge-with-reuse path green.
