roachprod: wait for NodeID before starting next node by Dev-Kyle · Pull Request #170809 · cockroachdb/cockroach

Dev-Kyle · 2026-05-22T14:20:34Z

Previously, SyncedCluster.Start launched cockroach processes sequentially but did not wait for each node to finish joining the cluster before starting the next. In the vast majority of cases the join sequence is fast enough to serialize naturally on n1's node-idgen counter. However, if anything delays n1's lease acquisition on the system range (e.g., the leader-lease store liveness gap that was fixed in #142150), JoinNodeRequests from later nodes can queue up and be processed out of order. The result is that a node's host suffix no longer matches its cockroach NodeID.

This commit adds a per-node wait after each startNode call. The new waitForNodeID helper polls the just-started node via SELECT crdb_internal.node_id() until it reports its expected NodeID, with a one-minute retry budget. On a 16-node local cluster the additional startup time is ~3s.

Resolves: #142313
Epic: none

Release note: None



golgeek

The direction makes sense: roachprod start should not continue starting later nodes until the just-started node has actually joined, otherwise JoinNodeRequests can still race and assign NodeIDs out of roachprod-host order.

I left a few comments because the current implementation applies that invariant more broadly than it holds. In particular, roachprod supports startup modes where the VM suffix is not the Cockroach NodeID (InitTarget != 1, subset clusters, copied stores), and modes where Start() intentionally leaves the cluster uninitialized (SkipInit). Those cases need to be handled explicitly so the fix for #142313 does not regress existing roachtests.

I think the complete fix should make the “wait for expected NodeID” check apply only where the expected mapping is actually known, or thread the expected mapping/opt-out through StartOpts. The SQL polling should also use parseable output and a bounded overall wait so failures are quick and actionable.
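To make the "parseable output" point concrete, here is a small sketch of parsing the NodeID strictly instead of string-matching free-form output. It assumes the query is run with a CSV-style format (for example `cockroach sql --format=csv`), so the output is an optional header line followed by one integer line; `parseNodeID` is a hypothetical helper name, not something in this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNodeID extracts the NodeID from the output of a single-column,
// single-row query. It assumes CSV-style output: an optional header line
// followed by one line holding the integer value.
func parseNodeID(out string) (int, error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	last := strings.TrimSpace(lines[len(lines)-1])
	id, err := strconv.Atoi(last)
	if err != nil {
		return 0, fmt.Errorf("unexpected node_id output %q: %w", out, err)
	}
	if id <= 0 {
		// NodeIDs start at 1; anything else means the node has not joined.
		return 0, fmt.Errorf("invalid NodeID %d in output %q", id, out)
	}
	return id, nil
}

func main() {
	id, err := parseNodeID("crdb_internal.node_id\n3\n")
	fmt.Println(id, err)

	_, err = parseNodeID("ERROR: connection refused")
	fmt.Println(err)
}
```

Returning an error that quotes the raw output keeps a failure actionable: the caller sees what the node actually printed rather than only "retries exhausted".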

golgeek · 2026-05-22T15:09:59Z

 			}
+			// Wait for this node to persist its NodeID before starting the next
+			// one, so concurrent JoinNodeRequests cannot race (#142313).
+			if err := c.waitForNodeID(ctx, l, node); err != nil {


This wait now runs unconditionally after every startNode, but Start() also supports SkipInit=true with IsRestart=false. In that path shouldInit is false, Start() intentionally does not run cockroach init, and the test body may initialize the cluster later.

A concrete example is pkg/cmd/roachtest/tests/cluster_init.go: the test sets SkipInit=true, starts the nodes, and then explicitly runs cockroach init from the test body. With this change, Start() tries to execute SELECT crdb_internal.node_id() before the cluster has been initialized. That SQL query cannot succeed in the intended pre-init state, so Start() will retry until it fails before the test gets a chance to call init.

There are also restart-style callers that use SkipInit=true because the cluster was already initialized earlier. Those should not be treated the same way as fresh bootstrap: on restart, the expected NodeID is whatever is persisted in the existing store, not necessarily int(node).

This wait should be limited to modes where SQL is expected to be available and the just-started process is expected to have joined as part of this Start() call. At minimum, it should not run for SkipInit && !IsRestart, and restarts need separate handling from fresh bootstrap.

Agreed on containing the waitForNodeID logic within a shouldInit clause. I think the restart logic should be fine in this case, since the restart path is handled separately and returns early:

	if startOpts.IsRestart {
		return c.Parallel(ctx, l, WithNodes(c.Nodes).WithDisplay("starting nodes"),
			func(ctx context.Context, node Node) (*RunResultDetails, error) {
				return c.startNodeWithResult(ctx, l, node, &startOpts)
			})
	}

so waitForNodeID should never run in the event of a restart.
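The gating discussed in this thread could be expressed as a single predicate. The sketch below is only an illustration: `startOpts` here is a hypothetical, trimmed-down struct that borrows the field names mentioned in the review (SkipInit, IsRestart, InitTarget), and the InitTarget check comes from the review summary rather than from this thread. It is not roachprod's actual shouldInit logic.

```go
package main

import "fmt"

// startOpts is a hypothetical, trimmed-down mirror of the options named in
// the review; the real roachprod type has many more fields.
type startOpts struct {
	SkipInit   bool
	IsRestart  bool
	InitTarget int
}

// shouldWaitForNodeID reports whether polling for NodeID == host suffix is
// meaningful for this Start() call.
func shouldWaitForNodeID(o startOpts) bool {
	switch {
	case o.IsRestart:
		// The NodeID is whatever the existing store persisted.
		return false
	case o.SkipInit:
		// The cluster may be uninitialized, so SQL is not expected to work.
		return false
	case o.InitTarget != 1:
		// The host suffix is not expected to match the NodeID.
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldWaitForNodeID(startOpts{InitTarget: 1}))
	fmt.Println(shouldWaitForNodeID(startOpts{InitTarget: 1, SkipInit: true}))
	fmt.Println(shouldWaitForNodeID(startOpts{InitTarget: 1, IsRestart: true}))
}
```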

golgeek · 2026-05-22T15:11:09Z

+	retryOpts := retry.Options{
+		InitialBackoff: 100 * time.Millisecond,
+		MaxBackoff:     time.Second,
+		MaxRetries:     60,


With InitialBackoff: 100ms, MaxBackoff: 1s, and MaxRetries: 60, this is roughly a 60s per-node budget. Since the loop is serial, several bad nodes in a selected range can compound and significantly delay the error return.

The retry loop should be bounded by an overall startup deadline, and we should prefer a single overall deadline via ctx (e.g. ctx, cancel := context.WithTimeout(ctx, ...)) rather than a per-node retry count.

The serial loop is tied to the number of nodes, so I'm not sure a fixed overall deadline would fit all cases. A much larger cluster (say, 50 nodes) would take significantly longer to start than a standard 4-node cluster. I agree, however, that a 60s per-node budget could significantly delay the error return. Realistically each node should only take a couple of seconds to start, so maybe a 5-10s per-node budget would be more appropriate and would minimize the delay?

blathers-crl · 2026-05-22T16:21:55Z

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs.


Dev-Kyle requested a review from a team as a code owner May 22, 2026 14:20

Dev-Kyle requested review from cpj2195 and shailendra-patel and removed request for a team May 22, 2026 14:20

golgeek reviewed May 22, 2026


Dev-Kyle force-pushed the nodeid branch from 6ab99b0 to abb2dd8 Compare May 22, 2026 18:42

Dev-Kyle wants to merge 1 commit into cockroachdb:master from Dev-Kyle:nodeid

3 participants
