Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe #1390
Conversation
```diff
  # MegaMOE mid curve: 1 prefill (DEP=8) + 1 decode (DEP=8).
  # 5 nodes total with a dedicated NATS/etcd infra node.
- conc-list: [128]
+ conc-list: [128, 256, 512, 1024]
```
🔴 The recipe's model.container/identity.container.image is bumped to vllm/vllm-openai:v0.21.0-ubuntu2404, but the image: field in .github/configs/nvidia-master.yaml for dsv4-fp4-gb200-dynamo-vllm-mtp2 (line 8102) is still v0.20.1-ubuntu2404 — per AGENTS.md line 111 these must match, since the launcher registers ${IMAGE} as the container-alias key in srtslurm.yaml and srtctl resolves model.container against it. Note that the three sibling recipes under this same config-key (agg-gb200-low-latency-mtp2, disagg-gb200-low-latency-mtp2, disagg-gb200-high-tpt-megamoe-mtp2) are all still on v0.20.1, so the single top-level image: can't satisfy all four — either bump every sibling recipe together with the master image:, or split this recipe into its own config-key entry with its own image:.
Extended reasoning...
What's broken
This PR bumps the mid-curve recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve-megamoe-mtp2.yaml from vLLM v0.20.1 to v0.21.0 in two places (model.container and identity.container.image), but leaves the top-level image: field in .github/configs/nvidia-master.yaml (line 8102, under dsv4-fp4-gb200-dynamo-vllm-mtp2) unchanged at vllm/vllm-openai:v0.20.1-ubuntu2404.
Why this is a problem
AGENTS.md line 111 spells out the rule directly:

> For image bumps, `model.container` must equal `image:`, since the launcher uses the latter as the container-alias key.
The mechanism is concrete. In runners/launch_gb200-nv.sh:

- Line 71: `SQUASH_FILE=".../$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"` — the on-disk squash file path is derived from `$IMAGE` (sourced from nvidia-master.yaml's top-level `image:`).
- Line 74: `enroot import -o $SQUASH_FILE docker://$IMAGE` — the actual container pulled and squashed is the master's `image:`.
- Lines 223–226: the generated srtslurm.yaml contains:

  ```yaml
  containers:
    dynamo-trtllm: ${SQUASH_FILE}
    dynamo-sglang: ${SQUASH_FILE}
    "${IMAGE}": ${SQUASH_FILE}
  ```

So the alias key in the `containers:` map is literally `${IMAGE}` — i.e., the master `image:` string. `srtctl` then resolves the recipe's `model.container` against this map.
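The failing lookup can be sketched in a few lines of Python. Here `resolve_container` is a hypothetical stand-in for srtctl's alias resolution, assumed to be a plain key lookup into the `containers:` map that the launcher writes into srtslurm.yaml; the squash-file path is likewise illustrative:

```python
# Master image: from nvidia-master.yaml (assumed value from the report above).
IMAGE = "vllm/vllm-openai:v0.20.1-ubuntu2404"
SQUASH_FILE = "/scratch/vllm_vllm-openai_v0.20.1-ubuntu2404.sqsh"  # illustrative path

# What launch_gb200-nv.sh registers (srtslurm.yaml lines 223-226):
containers = {
    "dynamo-trtllm": SQUASH_FILE,
    "dynamo-sglang": SQUASH_FILE,
    IMAGE: SQUASH_FILE,  # the alias key is literally ${IMAGE}
}

def resolve_container(alias: str) -> str:
    """Hypothetical sketch of srtctl resolving model.container (plain dict lookup)."""
    try:
        return containers[alias]
    except KeyError:
        raise KeyError(f"container alias {alias!r} not registered in srtslurm.yaml")

# The v0.20.1 sibling recipes still resolve:
assert resolve_container("vllm/vllm-openai:v0.20.1-ubuntu2404") == SQUASH_FILE

# The bumped mid-curve recipe does not:
try:
    resolve_container("vllm/vllm-openai:v0.21.0-ubuntu2404")
except KeyError as e:
    print(e)
```

The sketch assumes strict lookup; as noted above, a fallback path that quietly reuses the v0.20.1 squash would be the worse outcome, since results would be mislabeled rather than failing loudly.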
Step-by-step proof this breaks

1. The sweep workflow reads nvidia-master.yaml. For config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 it picks up `image: vllm/vllm-openai:v0.20.1-ubuntu2404`.
2. The launcher (launch_gb200-nv.sh) sets `IMAGE=vllm/vllm-openai:v0.20.1-ubuntu2404`, imports it into a squash file, and emits srtslurm.yaml with `containers: { "vllm/vllm-openai:v0.20.1-ubuntu2404": <squash> }`.
3. `srtctl` opens the mid-curve recipe yaml after this PR and reads `model.container: vllm/vllm-openai:v0.21.0-ubuntu2404`.
4. It looks up that key in `containers:` — there is no v0.21.0-ubuntu2404 entry — alias resolution fails (or, worse, silently falls back to the v0.20.1 squash, producing benchmark results that report as v0.21.0 but actually ran v0.20.1).
Why bumping master image: alone won't fix it
The same config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 has four sibling recipes wired in via additional-settings `CONFIG_FILE=...` entries (lines 8124, 8140, 8157, 8174):

- agg-gb200-low-latency-mtp2.yaml — line 5: `container: "vllm/vllm-openai:v0.20.1-ubuntu2404"`
- disagg-gb200-low-latency-mtp2.yaml — line 5: `container: "vllm/vllm-openai:v0.20.1-ubuntu2404"`
- disagg-gb200-high-tpt-megamoe-mtp2.yaml — line 5: `container: "vllm/vllm-openai:v0.20.1-ubuntu2404"`
- disagg-gb200-mid-curve-megamoe-mtp2.yaml — line 5: `container: "vllm/vllm-openai:v0.21.0-ubuntu2404"` (this PR)
There is exactly one top-level image: field for the config-key; it cannot match both v0.20.1 and v0.21.0 simultaneously. Whichever value is chosen, three of the four recipes will mismatch.
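A quick check of that constraint, as a sketch (recipe filenames and container values copied from the list above):

```python
# One top-level image: per config-key versus four recipe container values.
# Count how many recipes mismatch for each candidate image choice.
recipes = {
    "agg-gb200-low-latency-mtp2.yaml": "vllm/vllm-openai:v0.20.1-ubuntu2404",
    "disagg-gb200-low-latency-mtp2.yaml": "vllm/vllm-openai:v0.20.1-ubuntu2404",
    "disagg-gb200-high-tpt-megamoe-mtp2.yaml": "vllm/vllm-openai:v0.20.1-ubuntu2404",
    "disagg-gb200-mid-curve-megamoe-mtp2.yaml": "vllm/vllm-openai:v0.21.0-ubuntu2404",
}

for image in (
    "vllm/vllm-openai:v0.20.1-ubuntu2404",
    "vllm/vllm-openai:v0.21.0-ubuntu2404",
):
    mismatched = [name for name, container in recipes.items() if container != image]
    print(f"{image}: {len(mismatched)} mismatching recipe(s)")
# Whichever image is chosen, at least one recipe mismatches (1 or 3 here).
```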
How to fix
Two acceptable resolutions per the AGENTS.md rule:

1. Bump all four sibling recipes and the master `image:` together to v0.21.0-ubuntu2404 so the alias key matches every recipe under the config-key. Note that the comment on line 8099 (`# ... Uses the vLLM 0.20.1 image ...`) should be updated to match.
2. Split disagg-gb200-mid-curve-megamoe-mtp2 into its own config-key entry in nvidia-master.yaml with its own `image: vllm/vllm-openai:v0.21.0-ubuntu2404`, leaving the other three siblings under the existing v0.20.1 config-key.
Either way, this is benchmark-blocking as it stands — the mid-curve sweep that this PR is supposed to enable cannot resolve its container alias against the registered srtslurm.yaml containers map.
```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271
- "Upgrade GB200 DSV4 MTP2 1P/1D DEP8 MegaMOE mid-curve recipe to vLLM v0.21.0"
- "Extend mid-curve concurrency sweep to also benchmark conc=256, conc=512, and conc=1024"
pr-link: TBD
```
🟡 The new perf-changelog entry at line 2495 has pr-link: TBD, but the repo convention (per AGENTS.md and utils/merge_with_reuse.sh) is to use https://github.com/SemiAnalysisAI/InferenceX/pull/XXX as the placeholder, which the merge tooling auto-replaces with the real PR number. Replace TBD with either https://github.com/SemiAnalysisAI/InferenceX/pull/1390 or the /pull/XXX placeholder before merge.
Extended reasoning...
What is the bug
perf-changelog.yaml line 2495 sets pr-link: TBD for the new dsv4-fp4-gb200-dynamo-vllm-mtp2 entry. Every other entry in the file uses a fully-qualified URL of the form https://github.com/SemiAnalysisAI/InferenceX/pull/<n>, and the documented placeholder convention (see AGENTS.md and benchmarks/multi_node/srt-slurm-recipes/RECIPES.md) is https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
Why existing tooling does not catch it
The Pydantic ChangelogEntry model in utils/matrix_logic/validation.py types pr_link as a plain str — it does not assert URL shape — so TBD passes validation. The merge automation in utils/merge_with_reuse.sh is the piece that normally substitutes placeholders: it scans for the literal substring XXX in pr-link and rewrites it to the merged PR's number, then asserts the final entry's pr-link ends with /pull/<PR-number>. TBD contains no XXX, so the substitution is skipped, and the trailing-PR-number assertion would then fail on merge-with-reuse paths.
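The substitution behavior can be sketched in Python. Here `substitute_pr_link` is a hypothetical name: the guard condition is the one quoted from utils/merge_with_reuse.sh above, while the rewrite step itself is assumed for illustration:

```python
def substitute_pr_link(link: str, pr: int) -> str:
    """Sketch of the merge tooling's placeholder substitution (assumed shape)."""
    # Guard quoted from utils/merge_with_reuse.sh: entries with no XXX
    # placeholder and no matching PR suffix are passed through untouched.
    if "XXX" not in link and not link.endswith(f"/pull/{pr}"):
        return link
    return link.replace("XXX", str(pr))

# The documented placeholder is rewritten to the merged PR number:
print(substitute_pr_link(
    "https://github.com/SemiAnalysisAI/InferenceX/pull/XXX", 1390))

# "TBD" matches neither branch, so it survives unchanged — which is exactly
# why the post-merge endswith('/pull/1390') assertion would then fail:
print(substitute_pr_link("TBD", 1390))
```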
Step-by-step proof

1. PR is merged as PR #1390 (Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe) with `pr-link: TBD` on line 2495.
2. utils/merge_with_reuse.sh walks changelog entries and applies its substitution rule `if "XXX" not in link and not link.endswith(f"/pull/{pr}")` — `TBD` matches neither branch, so no rewrite happens.
3. The post-condition assertion `assert last['pr-link'].endswith('/pull/1390')` fails because `'TBD'` does not end with `/pull/1390`.
4. Even on the non-merge-with-reuse path (direct merge), the changelog is left with a dangling `TBD` string that breaks any downstream consumer that parses `pr-link` as a URL (e.g. anything that constructs a clickable link from the field).
Addressing the refutation
A reviewer argued this is a duplicate of an already-refuted earlier bug and that TBD is just a pre-merge placeholder the author will swap before merging. Two points respond to that: (1) the repo already has an established placeholder convention — /pull/XXX — and the merge tooling depends on it; using TBD defeats that automation and is inconsistent with every other entry in the file, so it is a real (if small) deviation, not a stylistic preference. (2) Pydantic validation accepts TBD but the merge-with-reuse assertion does not, so the "author will fix it before merge" assumption is exactly the failure mode worth flagging in review — that is the moment to flip TBD to /pull/1390 (or /pull/XXX).
How to fix
Change line 2495 to pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1390 (or .../pull/XXX to let the merge tooling fill in the number). Severity is nit — a one-line find/replace, but worth doing before merge to keep the changelog consistent and the merge-with-reuse path green.
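Since the Pydantic model types pr_link as a plain str, a lightweight shape check would reject `TBD` up front. A sketch, not the repo's actual `ChangelogEntry` validator; the accepted placeholder form is taken from the convention described above:

```python
import re

# Accept either a concrete PR URL or the documented /pull/XXX placeholder;
# reject anything else, such as the bare "TBD" flagged in this review.
PR_LINK_RE = re.compile(
    r"^https://github\.xiaomo-station\.top/SemiAnalysisAI/InferenceX/pull/(\d+|XXX)$"
)

def pr_link_is_valid(link: str) -> bool:
    return PR_LINK_RE.match(link) is not None

assert pr_link_is_valid("https://github.com/SemiAnalysisAI/InferenceX/pull/1390")
assert pr_link_is_valid("https://github.com/SemiAnalysisAI/InferenceX/pull/XXX")
assert not pr_link_is_valid("TBD")
```

Wiring a check like this into the existing Pydantic model (e.g. as a field validator) would make the inconsistency a validation failure instead of a merge-time surprise.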
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25945612428
Force-pushed 48c5660 to 6ed44ff
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25948327237
Sweep higher concurrency and use vLLM v0.21.0.