diff --git a/docs/en/solutions/Change_a_Nodes_DNS_Servers_and_etcresolvconf_Options_via_NMState_NodeNetworkConfigurationPolicy.md b/docs/en/solutions/Change_a_Nodes_DNS_Servers_and_etcresolvconf_Options_via_NMState_NodeNetworkConfigurationPolicy.md
new file mode 100644
index 00000000..07282eea
--- /dev/null
+++ b/docs/en/solutions/Change_a_Nodes_DNS_Servers_and_etcresolvconf_Options_via_NMState_NodeNetworkConfigurationPolicy.md
@@ -0,0 +1,162 @@
---
kind:
  - How To
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.1.0,4.2.x
---

## Issue

The DNS servers and `/etc/resolv.conf` options on a node need to change — a new internal resolver has been rolled out, the node is being moved onto a different search domain, or options like `rotate` / `attempts` / `timeout` need to be tuned. Editing `/etc/resolv.conf` by hand is not durable: NetworkManager regenerates the file on every reboot (and on every network event), so hand-edits vanish.

The clean path is to declare the desired DNS shape as a `NodeNetworkConfigurationPolicy` (NNCP) handled by the kubernetes-nmstate operator. NMState reconciles the node's NetworkManager configuration to match the declared state; the node picks up a new `/etc/resolv.conf` and keeps it across reboots.

## Resolution

### Prerequisites

The cluster must have the kubernetes-nmstate operator installed and a `NMState` CR created so its DaemonSet is running on every target node. Verify with:

```bash
kubectl get nmstate -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'
kubectl -n nmstate get pod -l component=kubernetes-nmstate-handler -o wide
```

If either command returns empty, the operator is not yet active; install it through the cluster's operator-management surface before proceeding.

### Capture the current node DNS state

Before making any change, record what the node currently serves.
On the target node:

```bash
NODE=            # set to the target node's name
kubectl debug node/$NODE --image=busybox -- \
  chroot /host sh -c 'cat /etc/resolv.conf; echo ---; nmcli dev show | grep -E "DNS|DOMAIN"'
```

Typical "before" shape:

```text
# Generated by NetworkManager
search example.internal
nameserver 192.168.1.249
```

This baseline is what the NNCP will replace on the next reconcile.

### Declare the desired state via NNCP

Target a single node first with a `nodeSelector` so the change is bounded. After the policy reconciles cleanly on that node, widen the selector to the rest of the fleet.

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dns-worker-2
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-2
  desiredState:
    dns-resolver:
      config:
        server:
          - 192.168.1.249
          - 192.168.1.1
        search:
          - example.internal
          - corp.example.com
        options:
          - "rotate"
          - "attempts:3"
          - "timeout:2"
```

Apply and watch the reconcile:

```bash
kubectl apply -f dns-worker-2.yaml

kubectl get nodenetworkconfigurationpolicy dns-worker-2 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}'
# Available=True Degraded=False
```

NMState publishes per-node progress on a companion `NodeNetworkConfigurationEnactment` (NNCE); check the one that matches the target node:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=dns-worker-2 \
  -o custom-columns='NODE:.status.nodeName,STATUS:.status.conditions[?(@.type=="Available")].status,MSG:.status.conditions[?(@.type=="Available")].message'
```

`STATUS=True` means NetworkManager has been reconfigured and the change is live.
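
For scripted checks it can help to render the `resolv.conf` body the policy should produce and diff it against the node's live file later. A small sketch using the dns-worker-2 values (the file name is an assumption, and NetworkManager's exact header comment and line order may differ):

```shell
# Render the resolv.conf content the dns-worker-2 policy is expected
# to produce, for later comparison with the node's live file.
# Values mirror the policy above; adjust them to match your own spec.
servers="192.168.1.249 192.168.1.1"
search="example.internal corp.example.com"
options="rotate attempts:3 timeout:2"

{
  printf 'search %s\n' "$search"
  for s in $servers; do
    printf 'nameserver %s\n' "$s"
  done
  printf 'options %s\n' "$options"
} > expected-resolv.conf

cat expected-resolv.conf
```

The node's live file (fetched via `kubectl debug` as above) can then be diffed against `expected-resolv.conf`, ignoring the comment lines NetworkManager adds.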

### Verify the effect on the node

Re-check the node's `/etc/resolv.conf`:

```bash
kubectl debug node/$NODE --image=busybox -- \
  chroot /host cat /etc/resolv.conf
```

Expected "after" shape:

```text
# Generated by NetworkManager
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
options rotate attempts:3 timeout:2
```

Test resolution from inside a workload pod to confirm the cluster-side DNS path is unaffected (pods continue to resolve through CoreDNS; the NNCP only changes the node's own resolver):

```bash
kubectl run dns-probe -it --rm --restart=Never \
  --image=busybox -- \
  sh -c 'nslookup kubernetes.default.svc; nslookup external-host.example.com'
```

The first lookup must succeed (pods use CoreDNS, not the node resolver). The second exercises the new node-level resolvers indirectly: CoreDNS typically forwards out-of-cluster names to the nameservers in the node's `/etc/resolv.conf`, so it should succeed once the new servers are live.

### Widen the rollout once validated

After validation on the pilot node, remove the single-host `nodeSelector` (or set a label selector covering the broader fleet). Stagger the rollout so nodes are not all reconfigured at once; one node at a time is safest, and the NNCP `spec.maxUnavailable` field caps how many matching nodes NMState reconfigures in parallel.

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

Monitor the `NodeNetworkConfigurationEnactment` list for any node that reports `Available=False`; those nodes need their NetworkManager state inspected before proceeding.

### Roll back

Deleting the NNCP stops NMState from managing the node's DNS settings, but it does not revert configuration that has already been applied: NetworkManager keeps serving the policy's resolvers. To roll back, first apply a policy whose `desiredState` declares the baseline DNS configuration captured at the start of this procedure (or an empty `dns-resolver` config to return to interface-provided DNS), wait for it to report `Available=True`, and then delete the policies:

```bash
kubectl delete nodenetworkconfigurationpolicy dns-worker-2
```

Confirm each node's `/etc/resolv.conf` returns to the baseline recorded at the start of this procedure.
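
If a node needs to be pushed back to the baseline explicitly, a revert policy declaring the recorded values can be applied. A sketch; the policy name is illustrative, and the values mirror the baseline captured earlier:

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dns-worker-2-revert
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-2
  desiredState:
    dns-resolver:
      config:
        server:
          - 192.168.1.249        # baseline resolver recorded before the change
        search:
          - example.internal
```

Once the node's `/etc/resolv.conf` matches the baseline again, the revert policy itself can be deleted.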

## Diagnostic Steps

If the NNCP goes to `Degraded=True`, read the specific enactment for the failing node to see which NetworkManager operation was rejected:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=,nmstate.io/node= \
  -o yaml
```

Common failures:

- `dns-resolver.config.server` contains an IP that is not reachable from any of the node's subnets. NetworkManager applies the resolver list regardless, but the node cannot resolve anything and subsequent probes fail.
- `search` domains exceed the glibc resolver's historical `MAXDNSRCH` limit of 6 entries (the limit was lifted in glibc 2.26). On older glibc only the first six entries are used, so keep the list short and ordered by precedence if that matters for the application.
- The `options` list contains a token that the glibc resolver does not recognise. NetworkManager writes the option verbatim; invalid options are silently ignored by glibc. Verify each option against `man 5 resolv.conf`.

If `/etc/resolv.conf` on the node still shows the old content after `Available=True`, NetworkManager may have been overridden by another writer (cloud-init, a cluster addon writing directly to `/etc/resolv.conf`, or a static file). Compare the node's `nmcli dev show | grep DNS` output with the file — they should agree. If they disagree, something is rewriting the file after NetworkManager; remove that other writer or let NNCP manage a compatible file instead of `/etc/resolv.conf`.
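
Because glibc ignores unrecognised options silently, a pre-flight check of the policy's `options` list catches typos before they reach the fleet. A sketch: the token list follows `resolv.conf(5)`, and `check_options` is a hypothetical helper, not part of NMState:

```shell
# Validate resolv.conf option tokens against the names glibc documents
# in resolv.conf(5); value-carrying tokens like "attempts:3" are
# matched on the name before the colon.
known="ndots timeout attempts rotate no-check-names inet6 edns0"
known="$known single-request single-request-reopen no-tld-query"
known="$known use-vc no-reload trust-ad"

check_options() {
  rc=0
  for tok in "$@"; do
    name=${tok%%:*}
    case " $known " in
      *" $name "*) echo "ok: $tok" ;;
      *) echo "unknown option: $tok" >&2; rc=1 ;;
    esac
  done
  return $rc
}

# Tokens from the dns-worker-2 policy:
check_options rotate attempts:3 timeout:2
```

Run this against the `options` list before applying any DNS NNCP; a non-zero exit means at least one token would be written to `/etc/resolv.conf` and then ignored by the resolver.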