From 4856bc5761cf8b29e774b7273f5ed3c415e9e59f Mon Sep 17 00:00:00 2001 From: Komh Date: Fri, 24 Apr 2026 11:17:53 +0000 Subject: [PATCH] =?UTF-8?q?[configure]=20Node=20Stuck=20"booting=20up"=20W?= =?UTF-8?q?ith=20SSH=20Refused=20and=20Kubelet=20NotReady=20=E2=80=94=20Mi?= =?UTF-8?q?ssing=20`/var/lib/kubelet/config.json`?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...otReady_Missing_varlibkubeletconfigjson.md | 149 ++++++++++++++++++ 1 file changed, 149 insertions(+) create mode 100644 docs/en/solutions/Node_Stuck_booting_up_With_SSH_Refused_and_Kubelet_NotReady_Missing_varlibkubeletconfigjson.md diff --git a/docs/en/solutions/Node_Stuck_booting_up_With_SSH_Refused_and_Kubelet_NotReady_Missing_varlibkubeletconfigjson.md b/docs/en/solutions/Node_Stuck_booting_up_With_SSH_Refused_and_Kubelet_NotReady_Missing_varlibkubeletconfigjson.md new file mode 100644 index 00000000..d9596dcb --- /dev/null +++ b/docs/en/solutions/Node_Stuck_booting_up_With_SSH_Refused_and_Kubelet_NotReady_Missing_varlibkubeletconfigjson.md @@ -0,0 +1,149 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Issue + +A cluster node stops serving workloads and the control plane reports it as `NotReady`. Attempting to SSH to the node to investigate returns a boot-time refusal message instead of a shell prompt: + +```text +System is booting up. Unprivileged users are not permitted to log in yet. +Please come back later. +For technical details, see pam_nologin(8). +``` + +The node responds to `ping`, so networking to it is basically fine. But: + +- SSH refuses non-root logins. +- The API server reports the node as `NotReady` (kubelet is not sending successful status updates). +- Standard `kubectl debug node/` cycles and may also stall. +- Running services on the node appear to be holding, waiting for something. + +The message from PAM is not a red herring: `pam_nologin(8)` rejects unprivileged logins while `/run/nologin` (or `/etc/nologin`) exists. Systemd creates `/run/nologin` while essential boot-time services have not yet reported `ready`. A service that **hangs during start-up** keeps `/run/nologin` present indefinitely, which is why ordinary SSH stays refused even long after the node's initial boot time. + +## Root Cause + +On cluster nodes, the kubelet's in-cluster identity and image-pull credentials are supplied through a kubelet-side pull-secret at `/var/lib/kubelet/config.json`. Multiple services that run early in the boot sequence depend on that file: + +- `nodeip-configuration.service` — determines the node's node-IP by talking to the platform (which requires authenticated access to the cluster's image registry / API). +- `kubelet.service` — reads the pull-secret to authenticate image pulls for static pods and for any in-cluster image the kubelet fetches directly. + +When `/var/lib/kubelet/config.json` is missing, unreadable, or corrupted: + +- `nodeip-configuration.service` blocks waiting for a response that can never come. +- `kubelet.service` cannot pull static-pod images and stays not-ready. +- systemd keeps `/run/nologin` present because critical boot targets have not completed. +- PAM honours the `nologin` file and refuses non-root interactive logins. + +The node appears completely stuck from the operator's perspective. It is actually not stuck — it is faithfully waiting for the pull-secret to appear, which it never will without intervention. + +Causes of the file going missing: + +- An overly enthusiastic maintenance script deleted files under `/var/lib/kubelet/` (a cleanup that thought it was removing cached container images). +- Disk corruption (filesystem check at next boot truncated or orphaned the file). +- An operator action touching the file without understanding its boot-path criticality. +- Node re-provisioned from an image missing the file; the installer / ignition that should have written it did not. + +## Resolution + +### Restore the pull-secret file on the affected node + +The file's content is identical across nodes that joined the same cluster with the same image-pull credentials. Copy it from a healthy node. + +**If some nodes are still healthy and accessible:** + +```bash +# From a healthy node, read the file content. +HEALTHY_NODE= +kubectl debug node/$HEALTHY_NODE --image=busybox -- \ + chroot /host cat /var/lib/kubelet/config.json > /tmp/kubelet-config.json + +# Verify it's a parseable JSON docker-config. +jq '.auths | keys' /tmp/kubelet-config.json +``` + +**If the affected node can be reached as root via a management channel (ILO, iDRAC, vSphere console):** + +Log in through the out-of-band console, escalate to root (PAM only blocks unprivileged logins; root is unaffected), and write the file: + +```bash +# On the affected node, as root: +install -m 0600 -o root -g root /path/to/healthy-config.json /var/lib/kubelet/config.json +``` + +**If no healthy node exists, reconstruct from the cluster-side source:** + +The file derives from the cluster's own image-pull-secret. Extract it from the cluster and upload to the node: + +```bash +# Capture the pull-secret from the cluster (from any machine with kubectl access). +kubectl -n get secret \ + -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/kubelet-config.json +``` + +Transfer `/tmp/kubelet-config.json` to the node through whatever channel remains (out-of-band console, boot into maintenance mode, mount the disk from a neighbouring node). Place it at `/var/lib/kubelet/config.json` with the same permissions as above. + +### After the file is restored + +The dependent services unblock themselves on their next retry tick, but a restart is faster. From the node's root shell: + +```bash +systemctl restart nodeip-configuration.service +systemctl restart kubelet +``` + +`systemctl status nodeip-configuration.service` should reach `active (exited)` within a few seconds. `systemctl status kubelet` should reach `active (running)`. + +Once those two services reach `active`, systemd completes the rest of the boot targets. `/run/nologin` is removed, PAM accepts unprivileged logins, and the kubelet resumes sending status updates. The API server sees the node transition to `Ready` within one kubelet-report interval. + +### Verify recovery + +```bash +# SSH as a normal user should now succeed. +ssh @ + +# kubelet's own status. +kubectl get node # should show Ready + +# Pull-secret is present with sane permissions. +ls -la /var/lib/kubelet/config.json +# -rw------- 1 root root ... /var/lib/kubelet/config.json +``` + +### Prevent recurrence + +The file is node-critical and should not be deleted by any automation. Two preventive postures: + +- **Back it up as part of the node image / initial-config payload.** If nodes are re-provisioned from a golden image or a provisioning template, ensure the template includes the pull-secret so a rebuild from the image always has it. +- **Monitor for the file's absence.** A small node-exporter textfile collector (or any node-level monitoring agent) can run `test -f /var/lib/kubelet/config.json` periodically and alert on absence. The alert fires before a reboot exposes the problem as a full outage. + +## Diagnostic Steps + +If the node is accessible as root (through an out-of-band console), verify the specific failure shape: + +```bash +# /run/nologin present — boot-target not yet complete. +ls -la /run/nologin + +# The kubelet's pull-secret. +ls -la /var/lib/kubelet/config.json 2>/dev/null || echo "MISSING: pull-secret" + +# Which services are stuck. +systemctl list-units --state=activating --no-pager +``` + +If `/var/lib/kubelet/config.json` is missing, the diagnosis is confirmed. Any service in `activating` for more than a couple of minutes is likely the one blocking the boot target. + +Cross-reference with journal entries for the services named in Root Cause: + +```bash +journalctl -u nodeip-configuration.service -u kubelet --since "30 min ago" --no-pager | tail -40 +``` + +`nodeip-configuration.service` retries forever until its registry/API calls succeed; you will see repeated `waiting for …` log lines. `kubelet` reports it cannot pull images because the authoritative config is missing. + +After restoring the file and restarting the services, the journal should show `nodeip-configuration.service` reporting the discovered IP and `kubelet` reporting a successful `Register node …` event.