
Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe #1390

Open
hjjq wants to merge 2 commits into main from hjjq/gb200-mtp

Conversation

@hjjq
Collaborator

@hjjq hjjq commented May 15, 2026

Sweep higher concurrency and use vLLM v0.21.0.

@hjjq hjjq requested a review from a team May 15, 2026 18:13
@hjjq hjjq requested a review from kedarpotdar-nv as a code owner May 15, 2026 18:13
@hjjq hjjq requested a review from jgangani as a code owner May 15, 2026 18:13
# MegaMOE mid curve: 1 prefill (DEP=8) + 1 decode (DEP=8).
# 5 nodes total with a dedicated NATS/etcd infra node.
- conc-list: [128]
+ conc-list: [128, 256, 512, 1024]
Contributor

🔴 The recipe's model.container/identity.container.image is bumped to vllm/vllm-openai:v0.21.0-ubuntu2404, but the image: field in .github/configs/nvidia-master.yaml for dsv4-fp4-gb200-dynamo-vllm-mtp2 (line 8102) is still v0.20.1-ubuntu2404 — per AGENTS.md line 111 these must match, since the launcher registers ${IMAGE} as the container-alias key in srtslurm.yaml and srtctl resolves model.container against it. Note that the three sibling recipes under this same config-key (agg-gb200-low-latency-mtp2, disagg-gb200-low-latency-mtp2, disagg-gb200-high-tpt-megamoe-mtp2) are all still on v0.20.1, so the single top-level image: can't satisfy all four — either bump every sibling recipe together with the master image:, or split this recipe into its own config-key entry with its own image:.

Extended reasoning...

What's broken

This PR bumps the mid-curve recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve-megamoe-mtp2.yaml from vLLM v0.20.1 to v0.21.0 in two places (model.container and identity.container.image), but leaves the top-level image: field in .github/configs/nvidia-master.yaml (line 8102, under dsv4-fp4-gb200-dynamo-vllm-mtp2) unchanged at vllm/vllm-openai:v0.20.1-ubuntu2404.

Why this is a problem

AGENTS.md line 111 spells out the rule directly:

For image bumps, model.container must equal image:, since the launcher uses the latter as the container-alias key.

The mechanism is concrete. In runners/launch_gb200-nv.sh:

  • Line 71: SQUASH_FILE=".../$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — the on-disk squash file path is derived from $IMAGE (sourced from nvidia-master.yaml's top-level image:).

  • Line 74: enroot import -o $SQUASH_FILE docker://$IMAGE — the actual container pulled and squashed is the master's image:.

  • Lines 223–226: the generated srtslurm.yaml contains:

    containers:
      dynamo-trtllm: ${SQUASH_FILE}
      dynamo-sglang: ${SQUASH_FILE}
      "${IMAGE}": ${SQUASH_FILE}

    So the alias key in the containers: map is literally ${IMAGE} — i.e., the master image: string. srtctl then resolves the recipe's model.container against this map.
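The alias mechanism the bullets above describe can be condensed into a few lines of Python (a hypothetical re-implementation for illustration only; the real logic lives in launch_gb200-nv.sh and srtctl):

```python
import re

def squash_file_name(image: str) -> str:
    # Mirror the launcher's sed 's/[\/:@#]/_/g' transform on $IMAGE.
    return re.sub(r"[/:@#]", "_", image) + ".sqsh"

def build_containers_map(image: str) -> dict:
    # The generated srtslurm.yaml registers the raw ${IMAGE} string
    # itself as a container-alias key, alongside two fixed aliases.
    squash = squash_file_name(image)
    return {
        "dynamo-trtllm": squash,
        "dynamo-sglang": squash,
        image: squash,
    }
```

With the master image: still at v0.20.1, the only image-string key this map ever contains is the v0.20.1 one.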

Step-by-step proof this breaks

  1. The sweep workflow reads nvidia-master.yaml. For config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 it picks up image: vllm/vllm-openai:v0.20.1-ubuntu2404.
  2. The launcher (launch_gb200-nv.sh) sets IMAGE=vllm/vllm-openai:v0.20.1-ubuntu2404, imports it into a squash file, and emits srtslurm.yaml with containers: { "vllm/vllm-openai:v0.20.1-ubuntu2404": <squash> }.
  3. srtctl opens the mid-curve recipe yaml after this PR and reads model.container: vllm/vllm-openai:v0.21.0-ubuntu2404.
  4. It looks up that key in containers: — there is no v0.21.0-ubuntu2404 entry — alias resolution fails (or, worse, silently falls back to the v0.20.1 squash, producing benchmark results that report as v0.21.0 but actually ran v0.20.1).
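The failing lookup in step 4 can be modeled the same way (again a sketch; the function name and error type are illustrative, not srtctl's actual API):

```python
def resolve_container(recipe_container: str, containers: dict) -> str:
    # srtctl-style alias resolution: the recipe's model.container must be
    # an exact key in the containers: map emitted into srtslurm.yaml.
    if recipe_container not in containers:
        raise KeyError(f"no container alias registered for {recipe_container!r}")
    return containers[recipe_container]

# Map as registered by the launcher for the master image (still v0.20.1):
registered = {
    "vllm/vllm-openai:v0.20.1-ubuntu2404": "vllm_vllm-openai_v0.20.1-ubuntu2404.sqsh",
}
```

Resolving the PR's v0.21.0 container string against this map has no matching key, which is exactly the failure (or silent-fallback hazard) described above.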

Why bumping master image: alone won't fix it

The same config-key dsv4-fp4-gb200-dynamo-vllm-mtp2 has four sibling recipes wired in via additional-settings CONFIG_FILE=... entries (lines 8124, 8140, 8157, 8174):

  • agg-gb200-low-latency-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-low-latency-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-high-tpt-megamoe-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.20.1-ubuntu2404"
  • disagg-gb200-mid-curve-megamoe-mtp2.yaml — line 5: container: "vllm/vllm-openai:v0.21.0-ubuntu2404" (this PR)

There is exactly one top-level image: field for the config-key; it cannot match both v0.20.1 and v0.21.0 simultaneously. Whichever value is chosen, three of the four recipes will mismatch.

How to fix

Two acceptable resolutions per the AGENTS.md rule:

  1. Bump all four sibling recipes and the master image: together to v0.21.0-ubuntu2404 so the alias key matches every recipe under the config-key. Note that the comment on line 8099 (# ... Uses the vLLM 0.20.1 image ...) should be updated to match.
  2. Split disagg-gb200-mid-curve-megamoe-mtp2 into its own config-key entry in nvidia-master.yaml with its own image: vllm/vllm-openai:v0.21.0-ubuntu2404, leaving the other three siblings under the existing v0.20.1 config-key.
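Option 2 would look roughly like the following fragment (a hypothetical sketch: only the image: values and the CONFIG_FILE wiring come from the discussion above; the new config-key name and elided fields are illustrative):

```yaml
# Existing config-key keeps the three v0.20.1 siblings:
dsv4-fp4-gb200-dynamo-vllm-mtp2:
  image: vllm/vllm-openai:v0.20.1-ubuntu2404
  ...

# New config-key carries only the bumped mid-curve recipe:
dsv4-fp4-gb200-dynamo-vllm-mtp2-midcurve:   # hypothetical name
  image: vllm/vllm-openai:v0.21.0-ubuntu2404
  additional-settings: CONFIG_FILE=.../disagg-gb200-mid-curve-megamoe-mtp2.yaml
```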

Either way, this is benchmark-blocking as it stands — the mid-curve sweep that this PR is supposed to enable cannot resolve its container alias against the registered srtslurm.yaml containers map.

Comment thread: perf-changelog.yaml (Outdated)
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271
- "Upgrade GB200 DSV4 MTP2 1P/1D DEP8 MegaMOE mid-curve recipe to vLLM v0.21.0"
- "Extend mid-curve concurrency sweep to also benchmark conc=256, conc=512, and conc=1024"
pr-link: TBD
Contributor

🟡 The new perf-changelog entry at line 2495 has pr-link: TBD, but the repo convention (per AGENTS.md and utils/merge_with_reuse.sh) is to use https://github.com/SemiAnalysisAI/InferenceX/pull/XXX as the placeholder, which the merge tooling auto-replaces with the real PR number. Replace TBD with either https://github.com/SemiAnalysisAI/InferenceX/pull/1390 or the /pull/XXX placeholder before merge.

Extended reasoning...

What is the bug

perf-changelog.yaml line 2495 sets pr-link: TBD for the new dsv4-fp4-gb200-dynamo-vllm-mtp2 entry. Every other entry in the file uses a fully-qualified URL of the form https://github.com/SemiAnalysisAI/InferenceX/pull/<n>, and the documented placeholder convention (see AGENTS.md and benchmarks/multi_node/srt-slurm-recipes/RECIPES.md) is https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

Why existing tooling does not catch it

The Pydantic ChangelogEntry model in utils/matrix_logic/validation.py types pr_link as a plain str — it does not assert URL shape — so TBD passes validation. The merge automation in utils/merge_with_reuse.sh is the piece that normally substitutes placeholders: it scans for the literal substring XXX in pr-link and rewrites it to the merged PR's number, then asserts the final entry's pr-link ends with /pull/<PR-number>. TBD contains no XXX, so the substitution is skipped, and the trailing-PR-number assertion would then fail on merge-with-reuse paths.

Step-by-step proof

  1. The PR is merged as #1390 (Update DSV4 GB200 MTP 8k/1k vLLM 1p1d-dep8-dep8 recipe) with pr-link: TBD on line 2495.
  2. utils/merge_with_reuse.sh walks changelog entries and applies its substitution guard, if "XXX" not in link and not link.endswith(f"/pull/{pr}"): a link is rewritten only when it contains the literal "XXX", and left alone when it already ends with /pull/{pr}. TBD matches neither branch, so no rewrite happens.
  3. The post-condition assertion assert last['pr-link'].endswith('/pull/1390') fails because 'TBD' does not end with /pull/1390.
  4. Even on the non-merge-with-reuse path (direct merge), the changelog is left with a dangling TBD string that breaks any downstream consumer that parses pr-link as a URL (e.g. anything that constructs a clickable link from the field).
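The substitution rule and the post-condition can be condensed into a small Python sketch (a reconstruction of the behavior described above, not the actual utils/merge_with_reuse.sh code; the function name is illustrative):

```python
def finalize_pr_link(link: str, pr_number: int) -> str:
    # Rewrite the documented placeholder, if present, to the merged PR number.
    placeholder = "XXX"
    if placeholder in link:
        link = link.replace(placeholder, str(pr_number))
    # Post-condition asserted by the merge tooling: the final link must
    # end with /pull/<PR-number>. "TBD" is never rewritten, so it fails here.
    assert link.endswith(f"/pull/{pr_number}"), f"bad pr-link: {link}"
    return link
```

Running this against the /pull/XXX placeholder succeeds; running it against TBD trips the assertion, which is the merge-with-reuse failure mode flagged above.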

Addressing the refutation

A reviewer argued this is a duplicate of an already-refuted earlier bug and that TBD is just a pre-merge placeholder the author will swap before merging. Two points respond to that: (1) the repo already has an established placeholder convention — /pull/XXX — and the merge tooling depends on it; using TBD defeats that automation and is inconsistent with every other entry in the file, so it is a real (if small) deviation, not a stylistic preference. (2) Pydantic validation accepts TBD but the merge-with-reuse assertion does not, so the "author will fix it before merge" assumption is exactly the failure mode worth flagging in review — that is the moment to flip TBD to /pull/1390 (or /pull/XXX).

How to fix

Change line 2495 to pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1390 (or .../pull/XXX to let the merge tooling fill in the number). Severity is nit — a one-line find/replace, but worth doing before merge to keep the changelog consistent and the merge-with-reuse path green.
