Trigger H200 multinode evals & revert MI355X image to mori-0227-3 #1094
Oseltamivir wants to merge 8 commits into main
Missed staging this change before merging #1000.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM — simple image tag revert with no bugs found.
Overview
Two-line change in .github/configs/amd-master.yaml reverting the Docker image tag for dsr1-fp4-mi355x-sglang-disagg and dsr1-fp4-mi355x-sglang-disagg-mtp from mori-0313-2 back to mori-0227-3. No logic, model, or search-space configuration is touched.
Security risks
None. This is a config-only image tag change with no auth, crypto, or permission implications.
Level of scrutiny
Minimal — this is a mechanical revert of a missed staging step from a prior merge, as described in the PR summary. The change is exactly two lines and purely declarative.
Other factors
No bugs were reported by the automated system. The PR timeline contains only the standard bot reminder, and there are no outstanding reviewer comments. The change pattern matches other image-tag reverts in this repo.
Same approach as the B200 launcher: override max_attempts to 720 in the srt-slurm config before submitting. The default of 180 attempts (30 min) is too short for a disagg SGLang EAGLE cold start.
H200 recipes don't have a health_check section (unlike B200), so the sed replacement was a silent no-op. The block is now appended if max_attempts isn't found in the config file.
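The override-or-append behaviour described in the two notes above can be sketched in Python. This is only an illustration, not the launcher's actual code: the function names, the `health_check:` / `max_attempts:` block layout, and the 10-second polling interval (inferred from 180 attempts taking 30 min) are all assumptions.

```python
import re

# Hypothetical sketch: the real launcher uses sed on the srt-slurm config,
# and the exact YAML layout is not shown in this PR.
MAX_ATTEMPTS = 720
_PATTERN = re.compile(r"^([ \t]*max_attempts:[ \t]*)\d+[ \t]*$", re.MULTILINE)


def ensure_max_attempts(config_text: str, value: int = MAX_ATTEMPTS) -> str:
    """Override max_attempts if present, otherwise append a health_check block."""
    if _PATTERN.search(config_text):
        # Key exists (B200-style recipe): rewrite its value in place.
        return _PATTERN.sub(lambda m: f"{m.group(1)}{value}", config_text)
    # Key missing (H200-style recipe): a plain substitution would silently do
    # nothing, so append the block instead. The block shape is an assumption.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f"health_check:\n  max_attempts: {value}\n"


def timeout_minutes(attempts: int, interval_seconds: int = 10) -> float:
    """Total wait time, assuming a fixed interval between health-check attempts."""
    return attempts * interval_seconds / 60
```

Under the assumed 10-second interval, `timeout_minutes(180)` gives 30 minutes and `timeout_minutes(720)` gives 120 minutes. Checking for the key before substituting is what keeps a recipe without a health_check section from slipping through unchanged.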
Summary
dsr1-fp4-mi355x-sglang-disagg and dsr1-fp4-mi355x-sglang-disagg-mtp image from mori-0313-2 back to mori-0227-3