roachprod: wait for NodeID before starting next node#170809
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
golgeek
left a comment
There was a problem hiding this comment.
The direction makes sense: roachprod start should not continue starting later nodes until the just-started node has actually joined, otherwise JoinNodeRequests can still race and assign NodeIDs out of roachprod-host order.
I left a few comments because the current implementation applies that invariant more broadly than it holds. In particular, roachprod supports startup modes where the VM suffix is not the Cockroach NodeID (InitTarget != 1, subset clusters, copied stores), and modes where Start() intentionally leaves the cluster uninitialized (SkipInit). Those cases need to be handled explicitly so the fix for #142313 does not regress existing roachtests.
I think the complete fix should make the “wait for expected NodeID” check apply only where the expected mapping is actually known, or thread the expected mapping/opt-out through StartOpts. The SQL polling should also use parseable output and a bounded overall wait so failures are quick and actionable.
| } | ||
| // Wait for this node to persist its NodeID before starting the next | ||
| // one, so concurrent JoinNodeRequests cannot race (#142313). | ||
| if err := c.waitForNodeID(ctx, l, node); err != nil { |
There was a problem hiding this comment.
This wait now runs unconditionally after every startNode, but Start() also supports SkipInit=true with IsRestart=false. In that path shouldInit is false, Start() intentionally does not run cockroach init, and the test body may initialize the cluster later.
A concrete example is pkg/cmd/roachtest/tests/cluster_init.go: the test sets SkipInit=true, starts the nodes, and then explicitly runs cockroach init from the test body. With this change, Start() tries to execute SELECT crdb_internal.node_id() before the cluster has been initialized. That SQL query cannot succeed in the intended pre-init state, so Start() will retry until it fails before the test gets a
chance to call init.
There are also restart-style callers that use SkipInit=true because the cluster was already initialized earlier. Those should not be treated the same way as fresh bootstrap: on restart, the expected NodeID is whatever is persisted in the existing store, not necessarily int(node).
This wait should be limited to modes where SQL is expected to be available and the just-started process is expected to have joined as part of this Start() call. At minimum, it should not run for SkipInit && !IsRestart, and restarts need separate handling from fresh bootstrap.
There was a problem hiding this comment.
Agree with containing the waitForNodeID logic within a shouldInit clause. I think restart logic should be fine in this case, as the restart process is handled separately
if startOpts.IsRestart {
return c.Parallel(ctx, l, WithNodes(c.Nodes).WithDisplay("starting nodes"), func(ctx context.Context, node Node) (*RunResultDetails, error) {
return c.startNodeWithResult(ctx, l, node, &startOpts)
})
}
so waitForNodeID should never even occur in the event of a restart.
| retryOpts := retry.Options{ | ||
| InitialBackoff: 100 * time.Millisecond, | ||
| MaxBackoff: time.Second, | ||
| MaxRetries: 60, |
There was a problem hiding this comment.
With InitialBackoff: 100ms and MaxBackoff: 1s, this is roughly a ~60s per-node budget. Since the loop is serial, several bad nodes in a selected range can compound and significantly delay the error return.
The retry loop should be bounded by an overall startup deadline, and we should prefer a single overall deadline via ctx (e.g. ctx, cancel := context.WithTimeout(ctx, ...)) rather than a per-node retry count.
There was a problem hiding this comment.
The serial loop is tied to the number of nodes, so I'm not sure a fixed overall deadline would fit all cases. If we had a much larger cluster (50 nodes), it would take significantly longer than a standard 4 node cluster. I agree however that with a 60s per-node budget this could significantly delay the error return. Realistically each node should only take a couple of seconds to start, so maybe a 5-10s per-node budget would be more appropriate and minimize the delay?
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
Previously, SyncedCluster.Start launched cockroach processes sequentially but did not wait for each node to finish joining the cluster before starting the next. In the vast majority of cases the join sequence is fast enough to serialize naturally on n1's node-idgen counter. However, if anything delays n1's lease acquisition on the system range (e.g., the leader-lease store liveness gap that was fixed in cockroachdb#142150), JoinNodeRequests from later nodes can queue up and be processed out of order. The result is that a node's host suffix no longer matches its cockroach NodeID. This commit adds a per-node wait after each startNode call. The new waitForNodeID helper polls the just-started node via SELECT crdb_internal.node_id() until it reports its expected NodeID, with a one-minute retry budget. On a 16-node local cluster the additional startup time is ~3s. Resolves: cockroachdb#142313 Epic: none Release note: None
Previously, SyncedCluster.Start launched cockroach processes sequentially but did not wait for each node to finish joining the cluster before starting the next. In the vast majority of cases the join sequence is fast enough to serialize naturally on n1's node-idgen counter. However, if anything delays n1's lease acquisition on the system range (e.g., the leader-lease store liveness gap that was fixed in #142150), JoinNodeRequests from later nodes can queue up and be processed out of order. The result is that a node's host suffix no longer matches its cockroach NodeID.
This commit adds a per-node wait after each startNode call. The new waitForNodeID helper polls the just-started node via SELECT crdb_internal.node_id() until it reports its expected NodeID, with a one-minute retry budget. On a 16-node local cluster the additional startup time is ~3s.
Resolves: #142313
Epic: none
Release note: None