Recover ama-logs workspace key from extension protected secret + checksum restart #1717

Open: jhmadhav wants to merge 3 commits into microsoft:ci_prod from jhmadhav:hejakkam/fallback-to-existing-amalogs-secret

@jhmadhav

Summary

Recover the Container Insights workspace key from the extension manager's
protected-parameters secret when it is not re-delivered to the chart, and roll
the agent pods (Linux and Windows) when the effective secret changes.

This prevents a class of failures where ama-logs-secret gets overwritten with
an empty workspace KEY, which silently breaks log ingestion ~10-14 days later
when the previously-cached credentials expire.

Background / root cause

For workspace-key (non-AAD) clusters, the workspace key is delivered to the
chart as a protected parameter. The config agent persists it on-cluster in
protected-ext-parameters-<release> and passes it to Helm as
OmsAgent.workspaceKey (or amalogs.secret.key for Arc).

During an extension auto-update the extension manager can stop re-delivering the
protected parameters. When that happens, .Values.OmsAgent.workspaceKey renders
empty, and the chart overwrites the live ama-logs-secret KEY with an
empty value. The ama-logs agent keeps working until its cached credentials
expire, then fails, which makes the root cause hard to correlate with the
triggering update.

Importantly, the config agent does not clear protected-ext-parameters-<release>
when it drops the CR reference, so that secret remains a valid source of truth
for the key.

What this change does

1. Conditional workspace-key fallback (ama-logs-secret.yaml)

When the incoming workspace key is empty/placeholder, fall back to the existing
on-cluster protected-ext-parameters-<release> secret (data key
OmsAgent.workspaceKey for AKS, amalogs.secret.key for Arc) via Helm lookup.

Semantics (intentionally conditional, not unconditional):

  • Incoming real key always wins → key rotation is preserved.
  • Incoming empty key → use the key stored in the existing
    protected-ext-parameters-<release> secret.

Only the KEY is recovered. The workspace ID (WSID) is a non-protected
parameter that continues to be delivered, so it is taken from values as-is.
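
The conditional semantics above can be modeled outside Helm. The following is a minimal Python sketch, not the chart's template code: `resolve_workspace_key`, the `PLACEHOLDERS` set, and the dict standing in for the result of Helm `lookup` are hypothetical names chosen for illustration. It assumes the looked-up Secret carries base64-encoded values under `data`, as Kubernetes Secrets do.

```python
import base64

# Hypothetical placeholder values; the chart's real placeholder string
# is not shown in this PR.
PLACEHOLDERS = {"", "<your_workspace_key>"}


def resolve_workspace_key(incoming, protected_secret, data_key):
    """Return the workspace key to render into ama-logs-secret.

    incoming:         key delivered through chart values (may be empty or None)
    protected_secret: dict shaped like a Kubernetes Secret, or {} when the
                      lookup finds nothing
    data_key:         "OmsAgent.workspaceKey" (AKS) or "amalogs.secret.key" (Arc)
    """
    # A real incoming key always wins, so key rotation is preserved.
    if (incoming or "").strip() not in PLACEHOLDERS:
        return incoming
    # Otherwise fall back to whatever the protected secret still holds.
    data = (protected_secret or {}).get("data") or {}
    encoded = data.get(data_key, "")
    return base64.b64decode(encoded).decode("utf-8") if encoded else ""


stored = {"data": {"OmsAgent.workspaceKey": base64.b64encode(b"old-key").decode()}}
print(resolve_workspace_key("", stored, "OmsAgent.workspaceKey"))         # old-key (recovered)
print(resolve_workspace_key("new-key", stored, "OmsAgent.workspaceKey"))  # new-key (rotation)
```

When the lookup returns nothing (for example `{}`), the function returns an empty string, which matches the pre-change behavior.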

2. checksum/secret on the AKS pod templates (Linux + Windows)

The Arc path already had checksum/secret; the AKS (non-Arc) path only had a
WSID annotation, which does not change when the workspace KEY changes.
Added checksum/secret to all three AKS workloads — ama-logs-daemonset.yaml,
ama-logs-daemonset-windows.yaml, and ama-logs-deployment.yaml — so pods roll
when the effective secret actually changes.
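
To see why a WSID-only annotation misses key changes, consider this small Python model. It is a sketch under assumptions, not the chart's code: `render_secret` is a hypothetical stand-in for the rendered ama-logs-secret manifest, the manifest layout is invented, and the `wsid` annotation name is illustrative. The hash mirrors the common Helm pattern of annotating a pod template with the SHA-256 of a rendered template.

```python
import hashlib


def render_secret(wsid, key):
    # Hypothetical stand-in for the rendered ama-logs-secret manifest.
    return f"kind: Secret\ndata:\n  WSID: {wsid}\n  KEY: {key}\n"


def annotations(wsid, key):
    # The hash covers the whole rendered secret, so a change to either
    # WSID or KEY alters the checksum/secret annotation.
    digest = hashlib.sha256(render_secret(wsid, key).encode("utf-8")).hexdigest()
    return {"wsid": wsid, "checksum/secret": digest}


before = annotations("workspace-1", "old-key")
after = annotations("workspace-1", "new-key")

print(before["wsid"] == after["wsid"])                        # True: WSID annotation unchanged
print(before["checksum/secret"] == after["checksum/secret"])  # False: checksum changed
```

Because the annotation lives on the pod template, a changed value alters the template itself, which is what causes the DaemonSet or Deployment controller to roll the pods.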

Important nuance: AAD/MSI clusters

For OmsAgent.isUsingAADAuth: "true" clusters the workspace key is empty by
design (auth uses a managed-identity token, not a workspace key). The fix does
not special-case this and does not need to:

  • On an AAD cluster the protected-ext-parameters-<release> secret is itself
    empty, so the fallback resolves to empty — exactly the value we want sent to
    the chart. No guard on isUsingAADAuth is required.

So the KEY-fallback is a no-op for AAD clusters and a fix for legacy
workspace-key clusters that dropped protected settings. The checksum/secret
change benefits both.

Behavior matrix

| Cluster | Incoming key | Existing protected secret | Rendered KEY | Correct |
| --- | --- | --- | --- | --- |
| AAD auth (healthy) | empty (intended) | empty | empty | ✅ |
| Key auth (healthy) | real key | real key | real key (incoming wins) | ✅ |
| Key auth (dropped protected params) | empty | retained key | recovered key | ✅ |
| Key auth (rotation) | new key | old key | new key (incoming wins) | ✅ |

Testing

Validated against a live AKS cluster running the real azure-monitor-logs
extension (chart 3.3.0, isUsingAADAuth: true) using
helm template --dry-run=server (executes lookup server-side):

  • Empty incoming key + key present in protected-ext-parameters-* → recovered
  • Real incoming key → takes precedence (rotation safe) ✅
  • Current published rendering (no server lookup) + empty key → KEY: ""
    (reproduces the bug) ✅
  • helm lint passes; Arc empty-key path still fails loudly via required.

Notes

  • lookup returns empty during helm template/client-side dry-run/first
    install; on a real helm upgrade (extension manager flow) it returns the live
    secret. On first install the incoming key is populated, so the fallback is a
    no-op there.
  • lookup requires get on Secrets in the release namespace; the extension's
    Helm operator already has this.

…ksum restart

When the extension manager stops re-delivering protectedParameters (e.g. during an
extension auto-update), the rendered .Values workspace key becomes empty and the chart
would overwrite the live ama-logs-secret KEY with an empty value, breaking the agent
~10-14 days later when cached credentials expire.

Changes:
- ama-logs-secret.yaml: when the incoming workspace key is empty/placeholder, fall back
  to the extension manager's on-cluster protected-parameters secret
  'protected-ext-parameters-<release>' (data key 'OmsAgent.workspaceKey' for AKS or
  'amalogs.secret.key' for Arc). This is the secret the config agent persists from
  protectedParameters; the agent does not clear it when it drops the CR reference, so it
  remains the source of truth. Only the KEY is recovered - WSID is non-protected and still
  delivered. lookup is a no-op on first install (incoming key populated).
- ama-logs-daemonset.yaml / ama-logs-daemonset-windows.yaml / ama-logs-deployment.yaml:
  add checksum/secret annotation to the AKS (non-Arc) Linux and Windows pod templates so
  pods roll when the effective secret changes. The previous WSID-only annotation did not
  detect workspace KEY changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jhmadhav jhmadhav requested a review from a team as a code owner June 17, 2026 22:29
Madhav Jakkampudi and others added 2 commits June 18, 2026 21:14
The protected-parameters key fallback only applies to workspace-key (non-AAD) clusters.
On AAD/managed-identity clusters the workspace key is empty by design (auth uses a token,
not a key), so recovering a key there is meaningless. Evaluate isUsingAADAuth (AKS) /
useAADAuth (Arc) the same way the daemonset/deployment templates do, and skip the lookup +
fallback entirely when AAD auth is in use.

Validated on a live AKS cluster: with a key injected into protected-ext-parameters-*,
isUsingAADAuth=true does NOT recover it (gated), isUsingAADAuth=false does.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Gate the protected-parameters lookup on BOTH non-AAD auth AND the incoming workspace key
being empty/placeholder, so the recovery runs only when the mandatory key is missing
(the cluster is already broken / about to be). Healthy clusters (key supplied) and AAD
clusters skip the live secret read entirely — no unnecessary get-secret API call on the
common path.

Uses an explicit if/else (not ternary) to pick the active path's incoming key:
OmsAgent.workspaceKey for AKS, amalogs.secret.key for Arc.

Validated on a live AKS cluster: non-AAD+empty recovers; non-AAD+real key uses the
supplied key (lookup skipped); AAD+empty stays empty (gated). helm lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
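
The gating described in this last commit can be sketched as follows. This is a hypothetical Python model, not the template's if/else: `effective_key` is an invented name, and `lookup_secret` stands in for Helm `lookup`, passed as a callable so the sketch can show that the live secret is read only on the non-AAD, empty-key path. Placeholder detection is reduced to an emptiness check for brevity.

```python
import base64


def effective_key(incoming, uses_aad_auth, lookup_secret, data_key):
    """Pick the workspace key, reading the live secret only when needed."""
    if uses_aad_auth:
        # AAD clusters: the key is empty by design, so never read the secret.
        return incoming or ""
    if incoming:
        # Healthy key-auth cluster: the supplied key is used, lookup skipped.
        return incoming
    # Key-auth cluster with a missing key: recover it from the live secret.
    data = (lookup_secret() or {}).get("data") or {}
    encoded = data.get(data_key, "")
    return base64.b64decode(encoded).decode("utf-8") if encoded else ""


calls = []


def lookup_secret():
    # Records each call so the number of secret reads can be checked.
    calls.append(1)
    value = base64.b64encode(b"retained-key").decode()
    return {"data": {"OmsAgent.workspaceKey": value}}


key_name = "OmsAgent.workspaceKey"
print(effective_key("", False, lookup_secret, key_name))          # retained-key
print(effective_key("real-key", False, lookup_secret, key_name))  # real-key
print(repr(effective_key("", True, lookup_secret, key_name)))     # ''
print(len(calls))                                                 # 1
```

Only the first call reaches `lookup_secret`, which corresponds to the commit's goal of avoiding a get-secret API call on the common path.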
@github-actions

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.
