Add b300-cw (CoreWeave B300) runner launch script and pool#1730

Open
JordanNanos wants to merge 2 commits into main from jordan/b300-cw-runner

JordanNanos (Collaborator) commented Jun 12, 2026

Summary

Adds the launcher and runner-pool entry for the new CoreWeave B300 cluster (b300-cw), so its login-node runner(s) can pick up B300 single-node benchmark jobs through Slurm.

The cluster has 5 nodes with 8× B300 each (x86_64) and uses Slurm partition b300; compute nodes have enroot only (no Docker). The runner lives on the login node with _work on shared NFS; the launcher uses salloc to reserve a node and srun to run the benchmark on it via pyxis.
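
The flow described above can be sketched roughly as follows. This is an illustrative outline, not the contents of runners/launch_b300-cw.sh: the function names (`run`, `launch_single_node`), the job-name scheme, the `/workspace` mount point, and the exact flag set are assumptions made for the example. The image string is assumed to already be in enroot's `REGISTRY#IMAGE:TAG` form. Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, job naming, and mount points are hypothetical,
# not taken from the actual launcher.
set -euo pipefail

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# launch_single_node IMAGE SQUASH_PATH BENCH_SCRIPT
launch_single_node() {
  local image="$1" squash="$2" bench="$3"
  local job_name="b300-cw-bench-$$"   # hypothetical naming scheme

  # 1. Reserve one node in the b300 partition without opening a shell.
  run salloc --partition=b300 --nodes=1 --exclusive --no-shell \
    --job-name="$job_name"

  # 2. Look up the job ID of that allocation (placeholder in dry-run mode).
  local job_id="JOBID"
  if [[ "${DRY_RUN:-0}" != "1" ]]; then
    job_id="$(squeue --name="$job_name" --noheader --format=%i | head -n1)"
  fi

  # 3. On the worker, import the image to node-local storage under flock so
  #    concurrent jobs on the same node do not import at the same time.
  #    The import is skipped if the squash file already exists.
  run srun --jobid="$job_id" flock "${squash}.lock" bash -c \
    '[[ -f "$1" ]] || enroot import -o "$1" "docker://$2"' _ "$squash" "$image"

  # 4. Run the benchmark in the same allocation via pyxis. The squash path is
  #    passed as-is because it only exists on the worker.
  local rc=0
  run srun --jobid="$job_id" --container-image="$squash" \
    --container-mounts="$PWD:/workspace" bash "/workspace/$bench" || rc=$?

  # 5. Release the allocation and report the benchmark's exit status.
  run scancel "$job_id"
  return "$rc"
}
```

Calling `DRY_RUN=1 launch_single_node IMAGE /tmp/example.sqsh benchmarks/example.sh` prints the four Slurm commands in order, which is a cheap way to review the plan on the login node before touching the cluster.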

Changes

New launcher (runners/launch_b300-cw.sh):

  • Follows the launch_b200-cw.sh CoreWeave template: salloc one node, import the container to that node's local /tmp under flock (serializes concurrent imports), then srun --container-image in the same allocation, passing the squash path as-is (it lives on the worker's /tmp, not visible from the login host).
  • Importing to node-local /tmp rather than shared /mnt/vast NFS avoids the enroot aufs-whiteout failures root-squash NFS triggers (documented in launch_b300-nv.sh).
  • Benchmark-script selection: framework-tagged name first (<model>_<prec>_b300_<fw>.sh), then the legacy bare/_trt fallback.
  • Single-node only; the multi-node (srt-slurm/dynamo) path is not wired up here.
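
The benchmark-script selection in the third bullet amounts to a first-match lookup over an ordered candidate list. A minimal sketch, assuming a hypothetical helper named `select_benchmark_script` and assuming the bare name is tried before the `_trt` name within the legacy fallback (the description does not state that order):

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the bare-before-_trt order are
# assumptions, not taken from the actual launcher.
set -euo pipefail

# select_benchmark_script DIR MODEL PREC FRAMEWORK
# Prints the first candidate that exists; returns 1 if none does.
select_benchmark_script() {
  local dir="$1" model="$2" prec="$3" fw="$4"
  local candidates=(
    "${dir}/${model}_${prec}_b300_${fw}.sh"   # framework-tagged name
    "${dir}/${model}_${prec}_b300.sh"         # legacy bare name
    "${dir}/${model}_${prec}_b300_trt.sh"     # legacy _trt name
  )
  local candidate
  for candidate in "${candidates[@]}"; do
    if [[ -f "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no benchmark script for ${model}/${prec}/${fw} in ${dir}" >&2
  return 1
}
```

Failing loudly when no candidate exists keeps a mislabeled job from silently running the wrong script.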

Runner pool (.github/configs/runners.yaml):

  • New b300-cw key listing the registered runner b300-cw_0. Kept as its own pool (not folded into b300) so CoreWeave jobs stay separate from the NVIDIA B300 fleet.

Validation

  • This exact launcher was exercised end-to-end by a gpt-oss agentic smoke test on the live b300-cw runner: the job matched the b300-cw label, the launcher's salloc granted a node, and srun ran the container import + benchmark on it. (Smoke run was off a separate branch based on the agentx-v0.4 agentic harness; this PR is the runner-plumbing half.)
  • bash -n on the launcher reports no syntax errors.

Notes for reviewers

  • The registered runner is named b300-cw_0 (single digit), matching the gb300-cw convention; targeting is by the shared b300-cw label, so the exact name only matters for run-sweep.yml distribution.
  • Its label set must be slurm,b300-cw, not bare b300, which would make it eligible for NVIDIA-fleet B300 jobs.
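
The label rule above could be checked before registering a runner with something like the following. The helper name `labels_ok` is hypothetical and is not part of this PR. It accepts extra labels but requires both slurm and b300-cw, and it rejects any set that contains bare b300.

```shell
#!/usr/bin/env bash
# Sketch only: labels_ok is a hypothetical helper, not part of the launcher.
set -euo pipefail

# labels_ok "label1,label2,..."
# Succeeds only if the set has slurm and b300-cw and lacks bare b300.
labels_ok() {
  local labels="${1// /}"   # drop any spaces around the commas
  labels=",${labels},"      # pad so every label is comma-delimited
  [[ "$labels" == *",slurm,"* \
    && "$labels" == *",b300-cw,"* \
    && "$labels" != *",b300,"* ]]
}
```

Padding the string with commas makes the match exact per label, so b300-cw does not accidentally count as b300.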

🤖 Generated with Claude Code

New CoreWeave B300 cluster: 5 nodes of 8x B300, Slurm partition b300,
shared storage on /mnt/vast. Single-node launcher adapted from
launch_h200-cw.sh (same CoreWeave salloc + enroot/pyxis pattern) with
the framework-tagged benchmark-script selection from launch_b300-nv.sh.
Multi-node is not wired up yet and exits with a clear error.

Registers pool key b300-cw with one runner (b300-cw_0), following the
gb300-cw naming convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the initial /mnt/vast (shared NFS) import with the launch_b200-cw.sh
node-local /tmp pattern: import the container on the allocated worker under
flock and pass the squash path as-is. Avoids the enroot aufs-whiteout
failures root-squash NFS triggers (documented in launch_b300-nv.sh), and
matches the launcher exercised by the b300-cw smoke test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>