Skip to content

feat(controlplane): drift-correct iceberg.enabled in reconcileReady#555

Merged
fuziontech merged 2 commits into
mainfrom
feat/iceberg-drift-correction
May 9, 2026
Merged

feat(controlplane): drift-correct iceberg.enabled in reconcileReady#555
fuziontech merged 2 commits into
mainfrom
feat/iceberg-drift-correction

Conversation

@fuziontech
Copy link
Copy Markdown
Member

Summary

Closes the gap left by #554: when an operator flips iceberg_enabled in the configstore on a warehouse already in Ready state, the provisioner controller now patches spec.iceberg.enabled on the existing Duckling CR. Mirrors the pgbouncer.enabled drift correction pattern.

Why

Without this, the admin-API path to enable Iceberg on an existing warehouse takes two steps: configstore flip + manual kubectl patch. Now the configstore flip alone is sufficient.

What's drift-corrected and what isn't

  • iceberg.enabled — both directions (false→true and true→false). Symmetric with pgbouncer.
  • iceberg.namespace — the XRD's CEL admission rule rejects post-create namespace changes (#11032), so a drift attempt would just hit a 422. Namespace changes require warehouse re-creation.
  • ❌ ACU, image, etc. — same deliberately-narrow scope as today.

What about the table bucket itself?

  • The Crossplane Workspace MR for the table bucket has managementPolicies: ["Observe", "Create"] (set-once). Flipping iceberg.enabled=true provisions the bucket; flipping back to false makes the GoTemplate skip rendering the Workspace, which removes the K8s MR but does NOT delete the AWS table bucket (no Delete in management policies). Data is preserved; the worker just stops attaching the catalog.

Test plan

  • New unit tests TestReconcileReadyPatchesCRWhenIcebergFlipped{On,Off} covering both drift directions.
  • Existing TestReconcileReadyNoDriftDoesNotPatch still passes (no spurious patch).
  • go test -tags kubernetes ./controlplane/... green.
  • go build ./... and go build -tags kubernetes ./... clean.
  • Manual: deploy to dev, flip a test warehouse's iceberg_enabled via admin API, verify the Duckling CR's spec.iceberg.enabled gets patched and the Workspace + bucket get created.

🤖 Generated with Claude Code

fuziontech and others added 2 commits May 8, 2026 19:52
Mirror the pgbouncer.enabled drift correction pattern for iceberg so
flipping iceberg_enabled in the configstore (admin API) is sufficient
to propagate to a warehouse already in Ready state. Without this,
operators have to also kubectl patch the Duckling CR.

iceberg.namespace is intentionally not drift-corrected — the XRD's CEL
admission rule rejects post-create namespace changes, so a drift attempt
would just hit a 422. Namespace changes require warehouse re-creation.

Adds GetIcebergEnabled / SetIcebergEnabled on DucklingClient (RFC 7396
merge patch, only touches the iceberg block) and parallel
TestReconcileReadyPatchesCRWhenIcebergFlipped{On,Off} tests covering both
directions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
client.Register was using context.Background() — no timeout. Any
environment that can't reach Let's Encrypt (LE outage, CI runner
without outbound network, etc.) would hang duckgres startup forever.

Failure was already handled as warn-and-continue further down; the
timeout just lets us reach that path instead of blocking. CI was
hitting this deterministically on TestACMEDNSCertCache (the test
runs through NewACMEDNSManager which goes out to LE before reaching
the AWS-config-failure path the test is actually exercising).

Bundled into this PR rather than separate so the iceberg drift
correction can land — same root unblocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fuziontech fuziontech merged commit 833f140 into main May 9, 2026
22 checks passed
@fuziontech fuziontech deleted the feat/iceberg-drift-correction branch May 9, 2026 04:29
fuziontech added a commit that referenced this pull request May 10, 2026
PR #554 added the Iceberg fields to ManagedWarehouse (configstore) and
to the upsert column allowlist, but missed the strict-decode struct
managedWarehouseRequest that gates which top-level keys a PUT body may
carry. The doc comment on that struct literally warns: "If you add a
field here without a matching tag on ManagedWarehouse, strict decode
will accept it and the merge will silently drop it."

Without this fix, `PUT /orgs/<id>/warehouse` with `{"iceberg":{...}}`
returns 400 "unknown field iceberg" — making it impossible to enable
Iceberg via admin API on existing warehouses (the whole reason the
drift correction in #555 exists).

Adds the matching test mirroring TestPutWarehouseDisablesPgBouncerWhenSetToFalse.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fuziontech added a commit that referenced this pull request May 12, 2026
The Duckling-status → configstore propagation for iceberg (ARN, region,
namespace, state=ready) currently only fires in `reconcileProvisioning`.
Once a warehouse transitions to `Ready`, the controller stops pulling
status fields from the Duckling CR.

That's wrong for the late-iceberg-enable case: an admin flips
iceberg.enabled=true via the admin API on a warehouse that's already
Ready; #555's drift correction patches `spec.iceberg.enabled` on the
Duckling CR; the Crossplane composition provisions the bucket and
writes `status.iceberg.tableBucketArn`; but the controller never reads
that status back because the warehouse is still Ready (never moved
back through Provisioning). The configstore's iceberg.table_bucket_arn
stays empty and the worker activator's IcebergConfig.TableBucket stays
empty — AttachIcebergCatalog gets gated out, iceberg never actually
attaches at the worker.

This was observed on prod-us today: the cascade completed at the
Crossplane layer (Duckling.status.iceberg.tableBucketArn populated)
but `iceberg_state` in the configstore never flipped from "" to
"ready". Recovery required a manual admin API PUT.

Extract the propagation into `addIcebergStatusUpdates`; call from both
`reconcileProvisioning` (initial turn-up) and `reconcileReady`
(late-enable). Added TestReconcileReadyPropagatesIcebergStatusToConfigstore
covering the bug.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant