Recover ama-logs workspace key from extension protected secret + checksum restart #1717
Open
jhmadhav wants to merge 3 commits into
…ksum restart
When the extension manager stops re-delivering protectedParameters (e.g. during an extension auto-update), the rendered .Values workspace key becomes empty and the chart would overwrite the live ama-logs-secret KEY with an empty value, breaking the agent ~10-14 days later when cached credentials expire.
Changes:
- ama-logs-secret.yaml: when the incoming workspace key is empty/placeholder, fall back to the extension manager's on-cluster protected-parameters secret 'protected-ext-parameters-<release>' (data key 'OmsAgent.workspaceKey' for AKS or 'amalogs.secret.key' for Arc). This is the secret the config agent persists from protectedParameters; the agent does not clear it when it drops the CR reference, so it remains the source of truth. Only the KEY is recovered - WSID is non-protected and still delivered. lookup is a no-op on first install (incoming key populated).
- ama-logs-daemonset.yaml / ama-logs-daemonset-windows.yaml / ama-logs-deployment.yaml: add checksum/secret annotation to the AKS (non-Arc) Linux and Windows pod templates so pods roll when the effective secret changes. The previous WSID-only annotation did not detect workspace KEY changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The protected-parameters key fallback only applies to workspace-key (non-AAD) clusters. On AAD/managed-identity clusters the workspace key is empty by design (auth uses a token, not a key), so recovering a key there is meaningless. Evaluate isUsingAADAuth (AKS) / useAADAuth (Arc) the same way the daemonset/deployment templates do, and skip the lookup + fallback entirely when AAD auth is in use.
Validated on a live AKS cluster: with a key injected into protected-ext-parameters-*, isUsingAADAuth=true does NOT recover it (gated), isUsingAADAuth=false does.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gate the protected-parameters lookup on BOTH non-AAD auth AND the incoming workspace key being empty/placeholder, so the recovery runs only when the mandatory key is missing (the cluster is already broken / about to be). Healthy clusters (key supplied) and AAD clusters skip the live secret read entirely, so there is no unnecessary get-secret API call on the common path. Uses an explicit if/else (not ternary) to pick the active path's incoming key: OmsAgent.workspaceKey for AKS, amalogs.secret.key for Arc.
Validated on a live AKS cluster: non-AAD+empty recovers; non-AAD+real key uses the supplied key (lookup skipped); AAD+empty stays empty (gated). helm lint clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
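The gate described in this commit could be sketched roughly as the Helm fragment below. This is an illustration under stated assumptions, not the PR's actual template: the $isArc switch (.Values.isArc), the placeholder literal, and the full path .Values.amalogs.useAADAuth are hypothetical, since the PR text names only isUsingAADAuth / useAADAuth and the two key paths.

```yaml
{{- /* Sketch only. .Values.isArc and the "<placeholder>" literal are hypothetical. */ -}}
{{- $isArc := .Values.isArc | default false -}}
{{- $incoming := "" -}}
{{- $aad := false -}}
{{- /* Explicit if/else picks the active path's incoming key and AAD flag. */ -}}
{{- if $isArc -}}
  {{- $incoming = .Values.amalogs.secret.key | default "" -}}
  {{- $aad = eq (toString .Values.amalogs.useAADAuth) "true" -}}
{{- else -}}
  {{- $incoming = .Values.OmsAgent.workspaceKey | default "" -}}
  {{- $aad = eq (toString .Values.OmsAgent.isUsingAADAuth) "true" -}}
{{- end -}}
{{- /* Recovery is attempted only for non-AAD clusters with a missing key. */ -}}
{{- $missing := or (eq $incoming "") (eq $incoming "<placeholder>") -}}
{{- $recover := and (not $aad) $missing -}}
{{- /* Only when $recover is true would the template call lookup on
       protected-ext-parameters-<release>; otherwise $incoming is used as-is. */ -}}
```

Comparing through toString keeps the check tolerant of the flag arriving either as a boolean or as the string "true".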
This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Summary
Recover the Container Insights workspace key from the extension manager's
protected-parameters secret when it is not re-delivered to the chart, and roll
the agent pods (Linux and Windows) when the effective secret changes.
This prevents a class of failures where ama-logs-secret gets overwritten with
an empty workspace KEY, which silently breaks log ingestion ~10-14 days later
when the previously-cached credentials expire.
Background / root cause
For workspace-key (non-AAD) clusters, the workspace key is delivered to the
chart as a protected parameter. The config agent persists it on-cluster in
protected-ext-parameters-<release> and passes it to Helm as
OmsAgent.workspaceKey (or amalogs.secret.key for Arc).
During an extension auto-update the extension manager can stop re-delivering the
protected parameters. When that happens, .Values.OmsAgent.workspaceKey renders
empty, and the chart overwrites the live ama-logs-secret KEY with an
empty value. The agent keeps working until its cached credentials expire, then
fails, making the root cause hard to correlate with the triggering update.
Importantly, the agent does not clear protected-ext-parameters-<release>
when it drops the CR reference, so that secret remains a valid source of truth
for the key.
What this change does
1. Conditional workspace-key fallback (ama-logs-secret.yaml)
When the incoming workspace key is empty/placeholder, fall back to the existing
on-cluster protected-ext-parameters-<release> secret (data key
OmsAgent.workspaceKey for AKS, amalogs.secret.key for Arc) via Helm lookup.
Semantics (intentionally conditional, not unconditional):
Only the KEY is recovered. The workspace ID (WSID) is a non-protected
parameter that continues to be delivered, so it is taken from values as-is.
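A minimal sketch of how such a fallback might be written for the AKS path is shown below. It is not the PR's actual ama-logs-secret.yaml: the placeholder literal, the .Values.OmsAgent.workspaceID path, and the choice to base64-encode the values in the template are assumptions made for illustration.

```yaml
{{- /* Sketch only: placeholder literal and workspaceID path are assumptions. */ -}}
{{- $incoming := .Values.OmsAgent.workspaceKey | default "" -}}
{{- $key := $incoming -}}
{{- if or (eq $incoming "") (eq $incoming "<placeholder>") -}}
  {{- /* lookup returns an empty map when the secret is absent or when
         rendering client-side, so every access below is guarded. */ -}}
  {{- $name := printf "protected-ext-parameters-%s" .Release.Name -}}
  {{- $existing := lookup "v1" "Secret" .Release.Namespace $name -}}
  {{- if and $existing $existing.data -}}
    {{- /* The data key contains a dot, so it must be read with index. */ -}}
    {{- $raw := index $existing.data "OmsAgent.workspaceKey" | default "" -}}
    {{- if $raw -}}
      {{- $key = b64dec $raw -}}
    {{- end -}}
  {{- end -}}
{{- end }}
apiVersion: v1
kind: Secret
metadata:
  name: ama-logs-secret
  namespace: {{ .Release.Namespace }}
type: Opaque
data:
  # WSID is non-protected and still delivered, so it comes from values as-is.
  WSID: {{ .Values.OmsAgent.workspaceID | default "" | b64enc | quote }}
  # KEY is the incoming value, or the recovered one when the incoming is empty.
  KEY: {{ $key | b64enc | quote }}
```

Secret data returned by lookup is base64-encoded, which is why the sketch decodes it before re-encoding it alongside the other values.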
2. checksum/secret on the AKS pod templates (Linux + Windows)
The Arc path already had checksum/secret; the AKS (non-Arc) path only had a
WSID annotation, which does not change when the workspace KEY changes.
Added checksum/secret to all three AKS workloads (ama-logs-daemonset.yaml,
ama-logs-daemonset-windows.yaml, and ama-logs-deployment.yaml) so pods roll
when the effective secret actually changes.
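The usual Helm idiom for this kind of annotation hashes the rendered secret template, as sketched below. The exact expression used in the chart is not shown in this PR, so treat the include path and the surrounding pod-template skeleton as assumptions.

```yaml
spec:
  template:
    metadata:
      annotations:
        # Hash of the rendered secret manifest: any change to the effective
        # WSID or KEY changes the annotation, which changes the pod template
        # and triggers a rollout.
        checksum/secret: {{ include (print $.Template.BasePath "/ama-logs-secret.yaml") . | sha256sum | quote }}
```

Because the hash is taken over the rendered template, a key recovered through lookup is reflected in the checksum in the same way as a key supplied through values.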
Important nuance: AAD/MSI clusters
For OmsAgent.isUsingAADAuth: "true" clusters the workspace key is empty by
design (auth uses a managed-identity token, not a workspace key). The fix does
not special-case this and does not need to: the
protected-ext-parameters-<release> secret is itself empty, so the fallback
resolves to empty, which is exactly the value we want sent to the chart. No
guard on isUsingAADAuth is required.
So the KEY-fallback is a no-op for AAD clusters and a fix for legacy
workspace-key clusters that dropped protected settings. The checksum/secret
change benefits both.
Behavior matrix
Testing
Validated against a live AKS cluster running the real azure-monitor-logs
extension (chart 3.3.0, isUsingAADAuth: true) using
helm template --dry-run=server (executes lookup server-side):
- protected-ext-parameters-* → recovered ✅
- KEY: "" (reproduces the bug) ✅
helm lint passes; Arc empty-key path still fails loudly via required.
Notes
- lookup returns empty during helm template / client-side dry-run / first
install; on a real helm upgrade (extension manager flow) it returns the live
secret. On first install the incoming key is populated, so the fallback is a
no-op there.
- lookup requires get on Secrets in the release namespace; the extension's
Helm operator already has this.
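For anyone reproducing the server-side dry run with credentials other than the extension's Helm operator, the permission named above corresponds to a Role along the following lines. The Role name and namespace are hypothetical; substitute the namespace the release is actually installed in, and bind the Role to the identity running Helm.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: helm-lookup-secret-reader   # hypothetical name
  namespace: my-release-namespace   # assumption: the release namespace
rules:
  - apiGroups: [""]                 # core API group, where Secrets live
    resources: ["secrets"]
    verbs: ["get"]                  # the verb the note says lookup needs
```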