refactor(tablegroup): make shard reconciliation observed-state truthful#537
Open
Verolop wants to merge 3 commits into
Open
refactor(tablegroup): make shard reconciliation observed-state truthful#537Verolop wants to merge 3 commits into
Verolop wants to merge 3 commits into
Conversation
- Split the TableGroup reconcile path along the lifecycle boundaries called out by the remediation spec, so parent fetch/deletion, child apply/prune, status aggregation, and cleanup requeue decisions are reviewable as separate local phases. - Replace the old apply/list/prune/re-list/status flow with one normal child snapshot plus explicit post-write status semantics, reducing the cache/resourceVersion hazard without pretending that “one read” is sufficient by itself. - Retain the existing PendingDeletion/ReadyForDeletion handshake as the spec’s transitional compatibility path, while ensuring in-flight cleanup publishes current Progressing status before the controller requeues. - Treat TableGroup status as an observed-state API promise: desired-only children count, stale or orphaned child status does not, and a just-applied child is not ready until its own observed generation catches up. Signed-off-by: Verónica López <gveronicalg@gmail.com>
e313dbb to
f474d19
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
niconosenzo
reviewed
Jun 30, 2026
| // generations produced by the apiserver. Later status aggregation uses those | ||
| // generations to avoid treating a just-updated child as ready before the child | ||
| // controller observes its new spec. | ||
| func stepApplyDesiredShards(ctx context.Context, rc *reconcileContext) (stepResult, error) { |
Contributor
There was a problem hiding this comment.
Is there any chance we end up trying to re-add a shard that has been previously annotated with PendingDeletion ?
Contributor
|
Awesome work! Left a small comment, will stamp it after knowing your thoughts. |
niconosenzo
reviewed
Jun 30, 2026
| // parent deletion before normal work, read the child snapshot once, apply the | ||
| // desired children, retire orphans through their drain handshake, then publish | ||
| // parent status from the observed child state. | ||
| rawSteps := []step{ |
Contributor
There was a problem hiding this comment.
This is great. Should we make this list of steps a contract that every controller will eventually implement ?
- Keep TableGroup readiness tied to child state the operator has actually observed, rather than desired intent or freshly applied specs. - Preserve the Shard cleanup lifecycle before same-name replacement so draining or terminating children are not overwritten or counted as active. - Make cleanup waiting explicit and testable so future changes do not regress Kubernetes deletion semantics. Signed-off-by: Verónica López <gveronicalg@gmail.com>
🔬 Go Test Coverage ReportSummary
Status✅ PASS DetailShow New Coverage |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
this PR refactors the TableGroup reconciler around the lifecycle it owns: applying desired child Shards, retiring removed or replaced children through the existing drain handshake, and publishing aggregate status from observed child state.
The previous implementation kept child apply, orphan cleanup, child re-listing, status aggregation, event emission, and requeue decisions in one long reconcile flow. That made the controller’s state transitions implicit, especially around removed children and same-reconcile child updates. In those cases, TableGroup could leave stale parent status visible or count child state that no longer represented the current desired generation.
The new structure keeps TableGroup’s responsibility unchanged. It still applies owned Shard specs with Server-Side Apply and it still preserves the PendingDeletion / ReadyForDeletion cleanup flow. The refactor makes the lifecycle boundaries explicit in code and tightens status aggregation so the parent reports convergence honestly while children are draining, terminating or catching up.
Main Changes
Main Changes
Split the TableGroup reconcile flow into package-local phases for parent lifecycle handling, child apply/cleanup, status aggregation, status patching, and cleanup requeue decisions.
Replaced the previous normal-path child re-list with one observed child snapshot, so status is computed from child state the controller actually saw.
Preserved the existing
PendingDeletion/ReadyForDeletioncleanup handshake while preventing removed, replaced, or terminating children from being applied over or counted active.Counted readiness only from desired child Shards whose observed status has caught up to the generation applied by this reconcile.
Added focused unit and envtest coverage for child cleanup, same-name replacement, terminating-child handling, generation-gated readiness, and stable convergence.
Testing
Added coverage around the lifecycle and status invariants that make this refactor risky.
Status aggregation is tested directly: total comes from spec, readiness and degradation come from observed desired children, cleanup children do not make the parent look ready, pending cleanup forces
Progressing, and readiness is gated on observed generation.Child cleanup is tested at the lifecycle boundaries: removed Shards are annotated once, existing annotations are preserved, children are not deleted before
ReadyForDeletion, same-name replacements do not apply over children already in cleanup, and terminating children are waited on rather than restamped.Full reconcile tests verify the controller lists child Shards once on the normal path, publishes cleanup status before requeueing, requeues while parent cleanup is waiting on a terminating child, and does not flap an already-Healthy TableGroup.
Envtest coverage exercises convergence against a real apiserver/cache, including initial child creation, stable terminal status, add/remove reconciliation, retained child cleanup, same-name replacement after deletion, and reconvergence after the replacement child reports Healthy.