---
kind:
- How To
products:
- Alauda Container Platform
ProductsVersion:
- 4.1.0,4.2.x
---
## Issue

The DNS servers and `/etc/resolv.conf` options on a node need to change — a new internal resolver has been rolled out, the node is being moved onto a different search domain, or options like `rotate` / `attempts` / `timeout` need to be tuned. Editing `/etc/resolv.conf` by hand is not durable: NetworkManager regenerates the file on every reboot (and on every network event), so hand-edits vanish.

The clean path is to declare the desired DNS shape as a `NodeNetworkConfigurationPolicy` (NNCP) handled by the kubernetes-nmstate operator. NMState reconciles the node's NetworkManager configuration to match the declared state; the node picks up a new `/etc/resolv.conf` and keeps it across reboots.

## Resolution

### Prerequisites

The cluster must have the kubernetes-nmstate operator installed and a `NMState` CR created so its DaemonSet is running on every target node. Verify with:

```bash
kubectl get nmstate -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'
kubectl -n nmstate get pod -l component=kubernetes-nmstate-handler -o wide
```

If either command returns nothing, the operator is not yet active; install it through the cluster's operator management interface before proceeding.

### Capture the current node DNS state

Before making any change, record what the node currently serves. On the target node:

```bash
NODE=<node-name>
kubectl debug node/$NODE --image=busybox -- \
  chroot /host sh -c 'cat /etc/resolv.conf; echo ---; nmcli dev show | grep -E "DNS|DOMAIN"'
```

Typical "before" shape:

```text
# Generated by NetworkManager
search example.internal
nameserver 192.168.1.249
```

This baseline is what the NNCP will replace on the next reconcile.

### Declare the desired state via NNCP

Target a single node first with a nodeSelector so the change is bounded. After the policy reconciles cleanly on that node, widen the selector to the rest of the fleet.

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dns-worker-2
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-2
  desiredState:
    dns-resolver:
      config:
        server:
          - 192.168.1.249
          - 192.168.1.1
        search:
          - example.internal
          - corp.example.com
        options:
          - "rotate"
          - "attempts:3"
          - "timeout:2"
```

Apply and watch the reconcile:

```bash
kubectl apply -f dns-worker-2.yaml

kubectl get nodenetworkconfigurationpolicy dns-worker-2 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}'
# Available=True Degraded=False
```

NMState publishes per-node progress on a companion `NodeNetworkConfigurationEnactment` (NNCE); check the one that matches the target node:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=dns-worker-2 \
  -o custom-columns='NODE:.status.nodeName,STATUS:.status.conditions[?(@.type=="Available")].status,MSG:.status.conditions[?(@.type=="Available")].message'
```

`STATUS=True` means NetworkManager has been reconfigured and the change is live.

### Verify the effect on the node

Re-check the node's `/etc/resolv.conf`:

```bash
kubectl debug node/$NODE --image=busybox -- \
  chroot /host cat /etc/resolv.conf
```

Expected "after" shape:

```text
# Generated by NetworkManager
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
options rotate attempts:3 timeout:2
```
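The "after" shape can be verified mechanically instead of by eye. A minimal sketch, run here against a sample file; the `check_resolv` helper name and the literal values (taken from the policy above) are illustrative, not part of the operator:

```shell
# Hypothetical helper: verify a resolv.conf body contains the servers,
# search list, and options declared in the NNCP above.
check_resolv() {
  conf="$1"
  for ns in 192.168.1.249 192.168.1.1; do
    grep -q "^nameserver $ns\$" "$conf" || { echo "missing nameserver $ns"; return 1; }
  done
  grep -q '^search example\.internal corp\.example\.com$' "$conf" || { echo "search list wrong"; return 1; }
  grep -q '^options rotate attempts:3 timeout:2$' "$conf" || { echo "options wrong"; return 1; }
  echo "resolv.conf matches desired state"
}

# Exercise against a sample of the expected "after" shape:
cat > /tmp/resolv.after <<'EOF'
# Generated by NetworkManager
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
options rotate attempts:3 timeout:2
EOF
check_resolv /tmp/resolv.after
```

On a live node, feed the helper the output of the `kubectl debug` capture above instead of the sample file.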

Test resolution from inside a workload pod to confirm the cluster-side DNS path is unaffected (pods continue to resolve through CoreDNS; the NNCP only changes the node's own resolver):

```bash
kubectl run dns-probe -it --rm --restart=Never \
  --image=busybox -- \
  sh -c 'nslookup kubernetes.default.svc; nslookup external-host.example.com'
```

The first lookup must succeed regardless, since pods resolve through CoreDNS rather than the node resolver. The second exercises the new servers indirectly: with the default CoreDNS configuration, queries for external names are forwarded to the upstream servers CoreDNS reads from the node's `/etc/resolv.conf`.

### Widen the rollout once validated

After validation on the pilot node, remove the single-host `nodeSelector` (or set a label selector covering the broader fleet). Stagger by node label or by the operator's own batching to avoid reconfiguring every node simultaneously — one node at a time is safest.

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

Monitor the `NodeNetworkConfigurationEnactment` list for any node that reports `Available=False`; inspect that node's NetworkManager state before proceeding with the rollout.

### Roll back

Deleting the NNCP stops NMState from enforcing the declared DNS state, but it does not automatically revert configuration that has already been applied to the nodes. To restore the previous resolver settings, apply a policy whose `dns-resolver` section declares the baseline values recorded at the start of this procedure, wait for it to reconcile, and then remove the policies:

```bash
kubectl delete nodenetworkconfigurationpolicy dns-worker-2
```

Confirm each node's `/etc/resolv.conf` matches the recorded baseline before treating the rollback as complete.
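That confirmation can be scripted as a diff against the baseline captured earlier. A self-contained sketch, with sample data standing in for the two `kubectl debug` captures (file paths are illustrative):

```shell
# Illustrative rollback check: the baseline capture and a fresh capture
# taken after the rollback should be identical.
cat > /tmp/resolv.baseline <<'EOF'
# Generated by NetworkManager
search example.internal
nameserver 192.168.1.249
EOF
# In practice /tmp/resolv.current would come from the node; here it is a copy.
cp /tmp/resolv.baseline /tmp/resolv.current
if diff -q /tmp/resolv.baseline /tmp/resolv.current >/dev/null; then
  echo "node restored to baseline"
else
  echo "resolv.conf still differs from baseline" >&2
fi
```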

## Diagnostic Steps

If the NNCP goes to `Degraded=True`, read the specific enactment for the failing node to see which NetworkManager operation was rejected:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=<policy-name>,nmstate.io/node=<node-name> \
  -o yaml
```

Common failures:

- `dns-resolver.config.server` contains an IP that does not belong to any reachable subnet from the node. NetworkManager applies the resolver list regardless, but the node cannot resolve anything and subsequent probes fail.
- `search` domains exceed glibc's `MAXDNSRCH` limit (6 entries). The resolver silently ignores entries beyond the limit, so put the domains the application depends on first.
- `options` list contains a token that the glibc resolver does not recognise. NetworkManager writes the option verbatim; invalid options are silently ignored by glibc. Verify each option against `man 5 resolv.conf`.
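The search-domain limit above can be checked before the policy is applied. A sketch, assuming the intended resolv.conf content is available locally; `count_search` is a hypothetical helper:

```shell
# Sketch: count declared search domains and warn when glibc's MAXDNSRCH (6)
# would silently drop the tail of the list.
count_search() {
  awk '/^search/ { print NF - 1 }' "$1"
}

cat > /tmp/resolv.demo <<'EOF'
search a.example b.example c.example d.example e.example f.example g.example
nameserver 192.168.1.249
EOF

n=$(count_search /tmp/resolv.demo)
if [ "$n" -gt 6 ]; then
  echo "WARNING: $n search domains declared; glibc consults only the first 6"
fi
```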

If `/etc/resolv.conf` on the node still shows the old content after `Available=True`, NetworkManager may have been overridden by another service (cloud-init, a cluster addon writing directly to `/etc/resolv.conf`, or a static file). Check the node's `nmcli dev show | grep DNS` versus the file — they should agree. If they disagree, the file is being written after NetworkManager; remove that other writer or let NNCP manage a compatible file instead of `/etc/resolv.conf`.
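The comparison between the two views can be sketched as follows, with a sample file and a hard-coded `nm_dns` value standing in for the live `nmcli dev show` output:

```shell
# Sketch: compare the nameservers NetworkManager believes it set with the
# ones actually present in the file; a mismatch points at another writer.
cat > /tmp/resolv.node <<'EOF'
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
EOF

nm_dns="192.168.1.249 192.168.1.1"   # stand-in for: nmcli dev show | grep DNS
file_dns=$(awk '/^nameserver/ { printf "%s%s", sep, $2; sep=" " }' /tmp/resolv.node)

if [ "$nm_dns" = "$file_dns" ]; then
  echo "NetworkManager and /etc/resolv.conf agree"
else
  echo "MISMATCH: something else is writing /etc/resolv.conf" >&2
fi
```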