Skip to content

roachprod: wait for NodeID before starting next node#170809

Open
Dev-Kyle wants to merge 1 commit into
cockroachdb:masterfrom
Dev-Kyle:nodeid
Open

roachprod: wait for NodeID before starting next node#170809
Dev-Kyle wants to merge 1 commit into
cockroachdb:masterfrom
Dev-Kyle:nodeid

Conversation

@Dev-Kyle
Copy link
Copy Markdown
Contributor

Previously, SyncedCluster.Start launched cockroach processes sequentially but did not wait for each node to finish joining the cluster before starting the next. In the vast majority of cases the join sequence is fast enough to serialize naturally on n1's node-idgen counter. However, if anything delays n1's lease acquisition on the system range (e.g., the leader-lease store liveness gap that was fixed in #142150), JoinNodeRequests from later nodes can queue up and be processed out of order. The result is that a node's host suffix no longer matches its cockroach NodeID.

This commit adds a per-node wait after each startNode call. The new waitForNodeID helper polls the just-started node via SELECT crdb_internal.node_id() until it reports its expected NodeID, with a one-minute retry budget. On a 16-node local cluster the additional startup time is ~3s.

Resolves: #142313
Epic: none

Release note: None

@Dev-Kyle Dev-Kyle requested a review from a team as a code owner May 22, 2026 14:20
@Dev-Kyle Dev-Kyle requested review from cpj2195 and shailendra-patel and removed request for a team May 22, 2026 14:20
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 22, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@golgeek golgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direction makes sense: roachprod start should not continue starting later nodes until the just-started node has actually joined, otherwise JoinNodeRequests can still race and assign NodeIDs out of roachprod-host order.

I left a few comments because the current implementation applies that invariant more broadly than it holds. In particular, roachprod supports startup modes where the VM suffix is not the Cockroach NodeID (InitTarget != 1, subset clusters, copied stores), and modes where Start() intentionally leaves the cluster uninitialized (SkipInit). Those cases need to be handled explicitly so the fix for #142313 does not regress existing roachtests.

I think the complete fix should make the “wait for expected NodeID” check apply only where the expected mapping is actually known, or thread the expected mapping/opt-out through StartOpts. The SQL polling should also use parseable output and a bounded overall wait so failures are quick and actionable.

Comment thread pkg/roachprod/install/cockroach.go Outdated
Comment thread pkg/roachprod/install/cockroach.go Outdated
}
// Wait for this node to persist its NodeID before starting the next
// one, so concurrent JoinNodeRequests cannot race (#142313).
if err := c.waitForNodeID(ctx, l, node); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wait now runs unconditionally after every startNode, but Start() also supports SkipInit=true with IsRestart=false. In that path shouldInit is false, Start() intentionally does not run cockroach init, and the test body may initialize the cluster later.

A concrete example is pkg/cmd/roachtest/tests/cluster_init.go: the test sets SkipInit=true, starts the nodes, and then explicitly runs cockroach init from the test body. With this change, Start() tries to execute SELECT crdb_internal.node_id() before the cluster has been initialized. That SQL query cannot succeed in the intended pre-init state, so Start() will retry until it fails before the test gets a
chance to call init.

There are also restart-style callers that use SkipInit=true because the cluster was already initialized earlier. Those should not be treated the same way as fresh bootstrap: on restart, the expected NodeID is whatever is persisted in the existing store, not necessarily int(node).

This wait should be limited to modes where SQL is expected to be available and the just-started process is expected to have joined as part of this Start() call. At minimum, it should not run for SkipInit && !IsRestart, and restarts need separate handling from fresh bootstrap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with containing the waitForNodeID logic within a shouldInit clause. I think restart logic should be fine in this case, as the restart process is handled separately

   if startOpts.IsRestart {
		return c.Parallel(ctx, l, WithNodes(c.Nodes).WithDisplay("starting nodes"), func(ctx context.Context, node Node) (*RunResultDetails, error) {
			return c.startNodeWithResult(ctx, l, node, &startOpts)
		})
	}

so waitForNodeID should never even occur in the event of a restart.

Comment thread pkg/roachprod/install/cockroach.go Outdated
retryOpts := retry.Options{
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: time.Second,
MaxRetries: 60,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With InitialBackoff: 100ms and MaxBackoff: 1s, this is roughly a ~60s per-node budget. Since the loop is serial, several bad nodes in a selected range can compound and significantly delay the error return.

The retry loop should be bounded by an overall startup deadline, and we should prefer a single overall deadline via ctx (e.g. ctx, cancel := context.WithTimeout(ctx, ...)) rather than a per-node retry count.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serial loop is tied to the number of nodes, so I'm not sure a fixed overall deadline would fit all cases. If we had a much larger cluster (50 nodes), it would take significantly longer than a standard 4 node cluster. I agree however that with a 60s per-node budget this could significantly delay the error return. Realistically each node should only take a couple of seconds to start, so maybe a 5-10s per-node budget would be more appropriate and minimize the delay?

Comment thread pkg/roachprod/install/cockroach.go Outdated
@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented May 22, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

Previously, SyncedCluster.Start launched cockroach processes
sequentially but did not wait for each node to finish joining the
cluster before starting the next. In the vast majority of cases the join
sequence is fast enough to serialize naturally on n1's node-idgen
counter. However, if anything delays n1's lease acquisition on the
system range (e.g., the leader-lease store liveness gap that was fixed
in cockroachdb#142150), JoinNodeRequests from later nodes can queue up and be
processed out of order. The result is that a node's host suffix no
longer matches its cockroach NodeID.

This commit adds a per-node wait after each startNode call. The new
waitForNodeID helper polls the just-started node via SELECT
crdb_internal.node_id() until it reports its expected NodeID, with a
one-minute retry budget. On a 16-node local cluster the additional
startup time is ~3s.

Resolves: cockroachdb#142313
Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachprod: potential race in node id allocation

3 participants