diff --git a/docs/en/solutions/forklift_cli_download_Pod_OOMKilled_After_Migration_Toolkit_Upgrade_Operator_Reverts_Limit_Patches.md b/docs/en/solutions/forklift_cli_download_Pod_OOMKilled_After_Migration_Toolkit_Upgrade_Operator_Reverts_Limit_Patches.md new file mode 100644 index 00000000..74ae5e26 --- /dev/null +++ b/docs/en/solutions/forklift_cli_download_Pod_OOMKilled_After_Migration_Toolkit_Upgrade_Operator_Reverts_Limit_Patches.md @@ -0,0 +1,149 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Issue + +After upgrading the migration toolkit operator (Forklift / MTV style) from a 2.9.x release to a 2.10.x release, the `forklift-cli-download` pod — a small helper pod the toolkit runs to let operators download its CLI binary — enters `CrashLoopBackOff`. The pod is terminated repeatedly by the kernel OOM-killer: + +```yaml +containerStatuses: + - name: forklift-cli-download + ready: false + lastState: + terminated: + exitCode: 137 + reason: OOMKilled + finishedAt: "..." + resources: + limits: { cpu: 100m, memory: 128Mi } + requests: { cpu: 50m, memory: 64Mi } +``` + +The resource limit (`memory: 128Mi`) is lower than the working set the post-upgrade binary needs at start-up, so the kernel kills it during initialisation. Every attempt to hand-edit the `Deployment` to raise `limits.memory` works for seconds — and then the toolkit operator reconciles the Deployment back to its canonical spec, which has the same 128Mi limit. The reconcile loop keeps the limit tight regardless of user edits. + +A cluster alert accompanies the crash: `forklift-cli-download has not matched the expected number of replicas`. + +## Root Cause + +The toolkit's operator owns the `Deployment` for the CLI download pod and reconciles its `resources` block from a hard-coded template. The 2.9.x → 2.10.x upgrade bumped the binary shipped inside the pod (different build, different startup working set) but did not increase the Deployment's resource limits in parallel — the old 128Mi limit that was comfortably above the old binary's working set is now below the new binary's. + +Because the Deployment is operator-managed, every direct `kubectl patch` against the Deployment (or against the underlying template in whatever the operator exposes) is reverted on the next reconcile tick. Users see: + +1. `kubectl patch deployment forklift-cli-download ... memory=512Mi` → pod rolls with 512Mi, briefly runs. +2. Operator reconcile tick (typically within a minute) → Deployment's memory reverts to 128Mi. +3. Rollout starts, new pods lose the OOM budget, CrashLoop resumes. + +The cycle is deterministic; no direct edit to the Deployment or the pod is stable. + +The correct fix is at the operator's own configuration surface — either a `ForkliftController` CR (or whichever top-level CR the toolkit exposes) has a field for overriding the CLI pod's resources, or a newer operator release carries the corrected template. The bug is tracked; fix status depends on the toolkit's release notes. + +## Resolution + +### Preferred — upgrade the toolkit operator to a release where the limits are adequate + +The tracked fix adjusts the Deployment template so the 2.10.x binary's working set fits. Follow the operator-upgrade channel to the release that carries the fix. After the upgrade, the CLI pod reaches `Ready` within its first reconcile. + +Verify: + +```bash +kubectl -n get csv -o custom-columns='NAME:.metadata.name,VERSION:.spec.version' +kubectl -n get pod -l app=forklift-cli-download +# forklift-cli-download- 1/1 Running 0 3m +``` + +No more OOMKilled events on the pod over a full reconcile cycle confirms the fix took. + +### Workaround — override resources via the ForkliftController CR + +Recent operator builds expose a spec field on the `ForkliftController` (or equivalent top-level CR) that lets operators set resource overrides for the managed Deployments. The exact field path varies by release; common locations include: + +```yaml +apiVersion: forklift.konveyor.io/v1beta1 +kind: ForkliftController +metadata: + name: forklift-controller + namespace: +spec: + controller_cli_resources: # name varies — check CRD schema + limits: + memory: 512Mi + cpu: 100m + requests: + memory: 256Mi + cpu: 50m +``` + +Verify by inspecting the CRD's schema to find the right path: + +```bash +kubectl get crd forkliftcontrollers.forklift.konveyor.io -o json | \ + jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties | keys' +``` + +If a field that looks like a resources override exists for the CLI download component, set it; the operator renders the Deployment with the new values and stops reverting. + +### Workaround — suppress the alert, accept the unavailable component + +The CLI-download pod serves a single purpose: letting users download the toolkit's CLI binary. If the cluster has alternative distribution channels for the CLI (internal artifact repo, bundled in a different container image), the CrashLoop is functionally tolerable until the upgrade arrives. Silence the accompanying alert: + +```yaml +# Alertmanager silence +matchers: + - name: alertname + value: KubeDeploymentReplicasMismatch + - name: deployment + value: forklift-cli-download +startsAt: ... +endsAt: ... +comment: "Known OOM bug in ; tracked upstream. Silence until upgrade." +``` + +Short-lived; re-evaluate when the operator upgrade lands. + +### Do not + +- **Do not patch the Deployment directly.** The operator reverts the patch. Every edit is a hall of mirrors — it looks applied, then reverts, which masks whether the change helped. +- **Do not scale the Deployment to zero replicas.** That silences the alert but stops the CLI from being served to users, which may break other tooling that expects the endpoint. + +## Diagnostic Steps + +Confirm the OOM signature is on the specific pod: + +```bash +NS= +POD=$(kubectl -n "$NS" get pod -l app=forklift-cli-download \ + -o jsonpath='{.items[0].metadata.name}') +kubectl -n "$NS" describe pod "$POD" | grep -A5 -E 'Last State|Reason|Exit Code' +``` + +`OOMKilled` with `exitCode: 137` and `Last State: Terminated` pattern is the symptom. + +Verify the operator is reverting limit patches. Edit the Deployment (reversibly, for diagnostic only): + +```bash +kubectl -n "$NS" patch deployment forklift-cli-download \ + --type=merge -p '{"spec":{"template":{"spec":{"containers":[{"name":"forklift-cli-download","resources":{"limits":{"memory":"512Mi"}}}]}}}}' + +# Wait 60 seconds then check the rendered spec. +sleep 60 +kubectl -n "$NS" get deployment forklift-cli-download -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}' +# If it still says 128Mi, the operator reverted. +``` + +A reverted patch confirms the operator-managed nature of the Deployment; the fix has to be through the CR (workaround) or the upgrade (preferred), not hand-patching. + +Read the toolkit operator's log for reconcile activity on the Deployment: + +```bash +kubectl -n "$NS" logs -l name=forklift-operator --tail=200 | \ + grep -iE 'cli-download|Deployment.*forklift' | tail -20 +``` + +Lines about reconciling the Deployment back to its desired state confirm the operator is authoritative. + +After the fix (upgrade or CR override), monitor the pod's stability across a meaningful window — the first minute after the fix rolls should show `Running` with `restartCount` stable, and no further OOM events through a business day.