[PLACEHOLDER — needs review] Fresh-tunnel: HTTPProxy Programmed/ConnectorMetadataProgrammed blocked ~3min on 'Downstream EnvoyPatchPolicy has no status yet' #175

@drewr

⚠ Placeholder — unverified. This issue was opened by Claude from CLI-side observations in a debugging session. @drewr — please verify against the operator code before triaging. I do not have direct read access to EnvoyPatchPolicy resources or to the controller logs, so I'm reporting symptoms + a controller-emitted reason string rather than confirmed root cause. Adjust scope / close as not-a-bug if my read is wrong.

Scope (narrowed 2026-06-07): This issue is now scoped specifically to the fresh-tunnel case — first creation of an HTTPProxy + ConnectorAdvertisement, where the controller's Programmed and ConnectorMetadataProgrammed conditions sit at False while the downstream EnvoyPatchPolicy has no status yet. The originally-related symptom of "tunnel resumed with all conditions Ready at 0.1s but proxy URL serves 5xx for ~140s" is a different problem (iroh peer-handshake latency between the edge and a freshly-resumed agent, possibly compounded by Envoy outlier-detection ejection windows) and will be filed separately. See the clarification comment below for the data that motivated the split.

Symptom

A successful tunnel-listen run today produced the following progress timeline (each line is the first moment that condition transitioned to True):

✓ connector ready (3.0s)
✓ tunnel accepted (4.6s)
✓ TLS certificate issued (4.6s)
✓ iroh DNS published (4.6s)
… route programmed still pending after 30s: Downstream EnvoyPatchPolicy has no status yet
… envoy metadata propagated still pending after 30s: Downstream EnvoyPatchPolicy has no status yet
✓ route programmed (199.1s)
✓ envoy metadata propagated (199.1s)

Everything cleared in under 5 seconds except HTTPProxy.status.conditions[Programmed] and HTTPProxy.status.conditions[ConnectorMetadataProgrammed], both of which stayed False for ~194 seconds. The message below is verbatim; the exact reason string was not captured:

reason:  (whatever the controller is emitting for "downstream not ready")
message: Downstream EnvoyPatchPolicy has no status yet

Then both flipped to True simultaneously. The end-user impact today was a ~3-minute setup time for a feature whose other conditions all clear in under 5 seconds.
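The ~194 s figure can be reproduced from the timeline above. This is a minimal sketch, assuming the progress lines keep the `✓ <label> (<seconds>s)` shape shown; the `first_true_times` helper and its regular expression are illustrative and not part of the CLI.

```python
import re

# Progress lines copied from the timeline above.
TIMELINE = """\
✓ connector ready (3.0s)
✓ tunnel accepted (4.6s)
✓ TLS certificate issued (4.6s)
✓ iroh DNS published (4.6s)
✓ route programmed (199.1s)
✓ envoy metadata propagated (199.1s)
"""

# Assumed line shape: "✓ <label> (<seconds>s)".
LINE_RE = re.compile(r"^✓ (?P<label>.+) \((?P<secs>\d+(?:\.\d+)?)s\)$")


def first_true_times(text: str) -> dict[str, float]:
    """Hypothetical helper: map each condition label to the time it first went True."""
    times = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # "still pending" lines and blanks do not match and are skipped
            times[match["label"]] = float(match["secs"])
    return times


times = first_true_times(TIMELINE)
slow = {"route programmed", "envoy metadata propagated"}
last_fast = max(t for label, t in times.items() if label not in slow)
gap = times["route programmed"] - last_fast
print(f"fast conditions done at {last_fast:.1f}s, gap to Programmed: {gap:.1f}s")
```

Measuring from the last fast condition rather than from t=0 separates the time spent waiting on the downstream resource from the time the rest of the setup took.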

What this likely means (best-effort interpretation — please verify)

The HTTPProxy controller appears to gate Programmed and ConnectorMetadataProgrammed on the readiness of an EnvoyPatchPolicy (Envoy Gateway resource) that it creates as a downstream artifact. While the EnvoyPatchPolicy has no status populated, the HTTPProxy controller surfaces "Downstream EnvoyPatchPolicy has no status yet" and waits.

The 3-minute gap suggests one of:

  1. Envoy Gateway reconciler is slow / running on a long interval. If the EnvoyPatchPolicy controller's resync period is ~3 minutes (or if its work queue is backed up), this is what you'd see — every new EnvoyPatchPolicy waits a full reconcile cycle before status appears.
  2. Status is only written on a trigger the HTTPProxy controller doesn't watch. If the HTTPProxy controller doesn't have a watcher on EnvoyPatchPolicy and only re-reconciles on its own timer (or on its own resource changes), it might not pick up the EnvoyPatchPolicy's status promptly even after the downstream controller writes it.
  3. Cold-start gateway provisioning. If this EnvoyPatchPolicy is the first one in this gateway / data plane, there may be a one-time provisioning step that's slow. Subsequent tunnels would be fast.
  4. Some genuine wait-on-actual-Envoy-rollout latency. Envoy Gateway pushes configuration via xDS; the policy's status may legitimately wait for a downstream Envoy to ack the new config.

The precise 199.1s value (roughly 194.5s after the other conditions cleared) hints at something more deterministic than "random reconcile backlog", such as a resync interval or a programming-rollout SLA, though a single run cannot confirm this.
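Hypotheses 1 and 2 predict a recognisable timing signature, which a toy model can make concrete. The sketch below illustrates the reasoning and is not the controller's logic. It assumes the HTTPProxy controller reconciles at t=0 and then re-checks only on a fixed requeue interval, and compares that with a watch-driven reconcile. The function names and all numbers are hypothetical.

```python
import math


def observed_at_with_polling(status_written_at: float, requeue_interval: float) -> float:
    """Time at which a controller that reconciles at 0, interval, 2*interval, ...
    first sees a downstream status written at `status_written_at` seconds."""
    if status_written_at <= 0:
        return 0.0
    ticks = math.ceil(status_written_at / requeue_interval)
    return ticks * requeue_interval


def observed_at_with_watch(status_written_at: float, event_delay: float = 0.0) -> float:
    """With a watch, the status update itself triggers the reconcile."""
    return status_written_at + event_delay


# Hypothetical numbers: the upstream controller requeues every 60s.
for written in (12.0, 61.0, 130.0):
    polled = observed_at_with_polling(written, requeue_interval=60.0)
    watched = observed_at_with_watch(written)
    print(f"status at {written:5.1f}s -> polling sees it at {polled:5.1f}s, watch at {watched:5.1f}s")
```

Under polling, observed latencies cluster on multiples of the requeue interval. If repeated fresh-tunnel runs land near the same value, that would favour an interval-driven explanation; widely varying values would point elsewhere.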

Concrete instance from today's session

  • HTTPProxy: tunnel-gchhg (namespace default, project drewr-y4nd1b)
  • Hostname: career-replay-kyxnz.datumproxy.net
  • Connector: datum-connect-mhxj5 (endpoint id c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8)
  • Tunnel created at 2026-06-07T18:30:30Z-ish (local CLI clock)
  • Programmed / ConnectorMetadataProgrammed transitioned True at ~2026-06-07T18:33:50Z

This was on the staging environment. The relevant EnvoyPatchPolicy (or whatever the HTTPProxy controller derived) should still be on the cluster if it has not been GC'd.
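As a sanity check before correlating against controller logs, the two wall-clock timestamps above can be compared with the CLI-reported duration. This is a small sketch that treats both timestamps as exact, although the issue gives them only approximately.

```python
from datetime import datetime, timezone

# Timestamps from the list above; both are approximate ("-ish" / "~").
created = datetime(2026, 6, 7, 18, 30, 30, tzinfo=timezone.utc)
programmed = datetime(2026, 6, 7, 18, 33, 50, tzinfo=timezone.utc)

wall_clock_gap = (programmed - created).total_seconds()
cli_reported = 199.1  # seconds, from the progress timeline

print(f"wall-clock gap: {wall_clock_gap:.0f}s, CLI-reported: {cli_reported}s")
print(f"difference: {abs(wall_clock_gap - cli_reported):.1f}s")
```

The two figures agree to within about a second, so the window to search in the controller logs is roughly 18:30:30Z to 18:33:50Z, subject to skew between the local CLI clock and the cluster.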

How to diagnose this in the future

Things that would have made today's investigation faster:

  1. Make the controller's wait reason point at the resource. The message Downstream EnvoyPatchPolicy has no status yet doesn't include the name or namespace of the EnvoyPatchPolicy. If the message named it (EnvoyPatchPolicy default/tunnel-gchhg-...) the user / next operator could kubectl describe it directly without grepping the controller source.
  2. Watch the EnvoyPatchPolicy from the HTTPProxy controller if it isn't already, so a downstream status change triggers an immediate re-reconcile rather than waiting for the HTTPProxy controller's own interval.
  3. Surface the downstream controller's progress. If the EnvoyPatchPolicy itself has interim conditions (e.g. Translated, Programmed, Acked), the HTTPProxy could roll the most recent transition message up into its own status message. "Waiting for downstream" tells a user nothing; "Waiting for downstream to reach Acked (currently Translated, last update 12s ago)" tells them what's happening.
  4. Log every reconcile decision with the reason at INFO. A controller log line like tunnel-gchhg: deferring Programmed: downstream EnvoyPatchPolicy default/x has no status (resourceVersion=N, gen=M) would have given a clear timeline to correlate against client-side timing.
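Items 1, 3, and 4 can be sketched together. The following is an illustration in Python rather than the operator's own code. It assumes the downstream object is available as a plain dict in the usual Kubernetes shape (`metadata`, `status.conditions`). The `wait_message` helper, the `example-policy` name, the condition types `Translated` / `Acked`, and the sample `resourceVersion` and `generation` values are all hypothetical.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("httpproxy-controller-sketch")


def parse_ts(value: str) -> datetime:
    # Kubernetes timestamps end in "Z"; normalise for older fromisoformat versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def wait_message(policy: dict, target: str, now: datetime) -> str:
    """Build a status message that names the downstream resource and, when
    available, rolls up its most recent condition transition."""
    meta = policy["metadata"]
    ref = f"EnvoyPatchPolicy {meta['namespace']}/{meta['name']}"
    conditions = policy.get("status", {}).get("conditions", [])
    if not conditions:
        return f"Downstream {ref} has no status yet"
    latest = max(conditions, key=lambda c: parse_ts(c["lastTransitionTime"]))
    age = int((now - parse_ts(latest["lastTransitionTime"])).total_seconds())
    return (f"Waiting for downstream {ref} to reach {target} "
            f"(currently {latest['type']}, last update {age}s ago)")


# Hypothetical downstream objects: one without status, one partway through.
metadata = {"namespace": "default", "name": "example-policy",
            "resourceVersion": "12345", "generation": 1}
no_status = {"metadata": metadata}
translated = {"metadata": metadata, "status": {"conditions": [
    {"type": "Translated", "status": "True",
     "lastTransitionTime": "2026-06-07T18:30:48Z"},
]}}

now = datetime(2026, 6, 7, 18, 31, 0, tzinfo=timezone.utc)
for policy in (no_status, translated):
    message = wait_message(policy, target="Acked", now=now)
    # One INFO line per reconcile decision, carrying enough to build a timeline.
    log.info("tunnel-gchhg: deferring Programmed: %s (resourceVersion=%s, gen=%s)",
             message, metadata["resourceVersion"], metadata["generation"])
```

Putting the namespace and name in the message lets whoever reads the HTTPProxy status go straight to the downstream object, and the same string can be reused for the log line so the two stay consistent.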

Why it caught us

CLI-side, datum-cloud/app@90a7ab3 added a progress streaming view that surfaces controller condition reasons, and that's how this latency became visible to a user for the first time. Without that streaming the user just sees Setting up tunnel... and then Tunnel ready after 199 sec — annoying but not actionable. With it, the controller's own "Downstream EnvoyPatchPolicy has no status yet" message is the actionable signal that points here.

Cross-references


Action for @drewr: verify the symptom against the controller behavior, dig into the actual EnvoyPatchPolicy reconcile path (is it slow, or is the HTTPProxy controller just slow to notice?), and retitle/scope this issue once root cause is known. The diagnostic improvements listed above may also warrant a separate small issue on the operator side (item 1 in the "How to diagnose" list, naming the resource in the message, is cheap and high-leverage).
