Add b300-cw (CoreWeave B300) runner launch script and pool by JordanNanos · Pull Request #1730 · SemiAnalysisAI/InferenceX

JordanNanos · 2026-06-12T22:10:22Z

Summary

Adds the launcher and runner-pool entry for the new CoreWeave B300 cluster (b300-cw), so its login-node runner(s) can pick up B300 single-node benchmark jobs through Slurm.

The cluster is 5 nodes × 8× B300 (x86_64) on Slurm partition b300; compute nodes have enroot only (no Docker). The runner lives on the login node with _work on shared NFS; the launcher sallocs a node and sruns the benchmark into it via pyxis.
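As a rough illustration of that flow, the following dry-run sketch prints the kind of Slurm commands such a launcher might issue; it is not the contents of launch_b300-cw.sh. The image URI, the squash-file naming scheme, the GPU request flags, the mount target, and the placeholder job id are all assumptions made for the example. Only the partition name, the node-local /tmp location, the flock guard, and the use of srun --container-image come from this PR.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the Slurm commands rather than executing them.
set -euo pipefail

PARTITION="b300"                                            # partition named in this PR
IMAGE="${IMAGE:-docker://registry.example#team/image:tag}"  # hypothetical image URI
WORKSPACE="${GITHUB_WORKSPACE:-$PWD}"                       # runner checkout on shared NFS
# Node-local squash file; deriving the name from the image URI is an assumed scheme.
SQUASH="/tmp/${IMAGE//[^A-Za-z0-9._-]/_}.sqsh"

# Request one exclusive node with 8 GPUs; --no-shell returns once it is granted.
alloc_cmd() {
  echo "salloc --partition=${PARTITION} --nodes=1 --gres=gpu:8 --exclusive --no-shell"
}

# Import on the worker under flock so concurrent jobs do not import twice.
import_cmd() {
  local job_id="$1"
  echo "srun --jobid=${job_id} flock ${SQUASH}.lock sh -c '[ -f ${SQUASH} ] || enroot import -o ${SQUASH} ${IMAGE}'"
}

# Run the benchmark in the same allocation; the squash path is resolved on the worker.
run_cmd() {
  local job_id="$1" script="$2"
  echo "srun --jobid=${job_id} --container-image=${SQUASH} --container-mounts=${WORKSPACE}:/workspace bash ${script}"
}

job_id="${SLURM_JOB_ID:-12345}"   # placeholder id for the dry run
alloc_cmd
import_cmd "$job_id"
run_cmd "$job_id" "benchmarks/example.sh"
```

Printing the commands keeps the sketch runnable on a machine without Slurm; a real launcher would execute them and take the job id from the allocation it was granted.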

Changes

New launcher (runners/launch_b300-cw.sh):

Follows the launch_b200-cw.sh CoreWeave template: salloc one node, import the container to that node's local /tmp under flock (which serializes concurrent imports), then srun --container-image in the same allocation. The squash path is passed as-is because it lives on the worker's /tmp and is not visible from the login host.
Importing to node-local /tmp rather than the shared /mnt/vast NFS avoids the enroot aufs-whiteout failures that root-squash NFS triggers (documented in launch_b300-nv.sh).
Benchmark-script selection: framework-tagged name first (<model>_<prec>_b300_<fw>.sh), then the legacy bare/_trt fallback.
Single-node only; the multi-node (srt-slurm/dynamo) path is not wired up here.
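The benchmark-script selection order can be sketched as below. This is a minimal illustration, not the launcher's actual code: the function name, its arguments, and the exact order of the two legacy fallbacks (bare name before the _trt suffix) are assumptions, since the description only says the framework-tagged name is tried first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# select_benchmark_script DIR MODEL PRECISION FRAMEWORK   (hypothetical helper)
# Prints the first existing candidate, or returns 1 if none exists.
select_benchmark_script() {
  local dir="$1" model="$2" prec="$3" fw="$4"
  local candidate
  for candidate in \
    "${dir}/${model}_${prec}_b300_${fw}.sh" \
    "${dir}/${model}_${prec}_b300.sh" \
    "${dir}/${model}_${prec}_b300_trt.sh"; do
    # Framework-tagged name first, then the legacy names (assumed order).
    if [[ -f "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no benchmark script for ${model}/${prec}/${fw} in ${dir}" >&2
  return 1
}
```

A caller might use it as `script="$(select_benchmark_script benchmarks gptoss fp4 vllm)"`, where the directory and argument values are placeholders; failing loudly when nothing matches avoids silently running the wrong benchmark.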

Runner pool (.github/configs/runners.yaml):

New b300-cw key listing the registered runner b300-cw_0. Kept as its own pool (not folded into b300) so CoreWeave jobs stay separate from the NVIDIA B300 fleet.

Validation

This exact launcher was exercised end-to-end by a gpt-oss agentic smoke test on the live b300-cw runner: the job matched the b300-cw label, the launcher's salloc granted a node, and srun ran the container import + benchmark on it. (Smoke run was off a separate branch based on the agentx-v0.4 agentic harness; this PR is the runner-plumbing half.)
bash -n clean.
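For reviewers who want to repeat the syntax check, a small sketch follows. The helper name is hypothetical; `bash -n` itself only parses a script without executing it, so it catches syntax errors but not runtime failures.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_syntax FILE...   (hypothetical helper)
# Parses each script with `bash -n`; returns 1 if any of them fails to parse.
check_syntax() {
  local status=0 script
  for script in "$@"; do
    if ! bash -n "$script"; then
      echo "syntax error: ${script}" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_syntax runners/launch_b300-cw.sh` would cover the script added in this PR.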

Notes for reviewers

The registered runner is named b300-cw_0 (single digit), matching the gb300-cw convention; targeting is by the shared b300-cw label, so the exact name only matters for run-sweep.yml distribution.
Its label set must be slurm,b300-cw — not bare b300, which would make it eligible for NVIDIA-fleet B300 jobs.
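The label constraint above can be expressed as a quick check. This is a sketch under assumptions: the helper name and the comma-separated input format are illustrative, and nothing like it is claimed to exist in the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# labels_ok "slurm,b300-cw"   (hypothetical helper)
# Succeeds only if the set contains b300-cw and does not contain bare b300.
labels_ok() {
  local csv="$1" label has_pool=1
  local -a labels
  [[ -n "$csv" ]] || return 1
  IFS=',' read -r -a labels <<< "$csv"
  for label in "${labels[@]}"; do
    label="${label// /}"   # tolerate spaces after commas
    case "$label" in
      b300)
        echo "bare 'b300' label would match NVIDIA-fleet B300 jobs" >&2
        return 1
        ;;
      b300-cw) has_pool=0 ;;
    esac
  done
  return "$has_pool"
}
```

With this, `labels_ok "slurm,b300-cw"` succeeds, while `labels_ok "slurm,b300,b300-cw"` fails because the bare label is present.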

🤖 Generated with Claude Code

New CoreWeave B300 cluster: 5 nodes of 8x B300, Slurm partition b300, shared storage on /mnt/vast. Single-node launcher adapted from launch_h200-cw.sh (same CoreWeave salloc + enroot/pyxis pattern) with the framework-tagged benchmark-script selection from launch_b300-nv.sh. Multi-node is not wired up yet and exits with a clear error. Registers pool key b300-cw with one runner (b300-cw_0), following the gb300-cw naming convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the initial /mnt/vast (shared NFS) import with the launch_b200-cw.sh node-local /tmp pattern: import the container on the allocated worker under flock and pass the squash path as-is. Avoids the enroot aufs-whiteout failures root-squash NFS triggers (documented in launch_b300-nv.sh), and matches the launcher exercised by the b300-cw smoke test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

JordanNanos requested a review from a team June 12, 2026 22:10

github-project-automation Bot added this to InferenceMAX Board Jun 12, 2026

JordanNanos wants to merge 2 commits into main from jordan/b300-cw-runner
