---
kind:
- How To
products:
- Alauda Container Platform
ProductsVersion:
- 4.1.0,4.2.x
---
## Issue

The DNS servers and `/etc/resolv.conf` options on a node need to change — a new internal resolver has been rolled out, the node is being moved onto a different search domain, or options like `rotate` / `attempts` / `timeout` need to be tuned. Editing `/etc/resolv.conf` by hand is not durable: NetworkManager regenerates the file on every reboot (and on every network event), so hand-edits vanish.

The clean path is to declare the desired DNS shape as a `NodeNetworkConfigurationPolicy` (NNCP) handled by the kubernetes-nmstate operator. NMState reconciles the node's NetworkManager configuration to match the declared state; the node picks up a new `/etc/resolv.conf` and keeps it across reboots.

## Resolution

### Prerequisites

The cluster must have the kubernetes-nmstate operator installed and a `NMState` CR created so its DaemonSet is running on every target node. Verify with:

```bash
kubectl get nmstate -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'
kubectl -n nmstate get pod -l component=kubernetes-nmstate-handler -o wide
```

If either command returns nothing, the operator is not yet active; install it through the cluster's operator management interface before proceeding.

### Capture the current node DNS state

Before making any change, record what the node currently serves. On the target node:

```bash
NODE=<node-name>
kubectl debug node/$NODE --image=busybox -- \
  chroot /host sh -c 'cat /etc/resolv.conf; echo ---; nmcli dev show | grep -E "DNS|DOMAIN"'
```

Typical "before" shape:

```text
# Generated by NetworkManager
search example.internal
nameserver 192.168.1.249
```

This baseline is what the NNCP will replace on the next reconcile.

### Declare the desired state via NNCP

Target a single node first with a nodeSelector so the change is bounded. After the policy reconciles cleanly on that node, widen the selector to the rest of the fleet.

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dns-worker-2
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-2
  desiredState:
    dns-resolver:
      config:
        server:
          - 192.168.1.249
          - 192.168.1.1
        search:
          - example.internal
          - corp.example.com
        options:
          - "rotate"
          - "attempts:3"
          - "timeout:2"
```

Apply and watch the reconcile:

```bash
kubectl apply -f dns-worker-2.yaml

kubectl get nodenetworkconfigurationpolicy dns-worker-2 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}'
# Available=True Degraded=False
```

NMState publishes per-node progress on a companion `NodeNetworkConfigurationEnactment` (NNCE); check the one that matches the target node:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=dns-worker-2 \
  -o custom-columns='NODE:.status.nodeName,STATUS:.status.conditions[?(@.type=="Available")].status,MSG:.status.conditions[?(@.type=="Available")].message'
```

`STATUS=True` means NetworkManager has been reconfigured and the change is live.

### Verify the effect on the node

Re-check the node's `/etc/resolv.conf`:

```bash
kubectl debug node/$NODE --image=busybox -- \
  chroot /host cat /etc/resolv.conf
```

Expected "after" shape:

```text
# Generated by NetworkManager
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
options rotate attempts:3 timeout:2
```
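The "after" shape can be verified mechanically instead of by eye. A minimal sketch, run here against a sample file; the `check_resolv` helper name and the literal values (taken from the policy above) are illustrative, not part of the operator:

```shell
# Hypothetical helper: verify a resolv.conf body contains the servers,
# search list, and options declared in the NNCP above.
check_resolv() {
  conf="$1"
  for ns in 192.168.1.249 192.168.1.1; do
    grep -q "^nameserver $ns\$" "$conf" || { echo "missing nameserver $ns"; return 1; }
  done
  grep -q '^search example\.internal corp\.example\.com$' "$conf" || { echo "search list wrong"; return 1; }
  grep -q '^options rotate attempts:3 timeout:2$' "$conf" || { echo "options wrong"; return 1; }
  echo "resolv.conf matches desired state"
}

# Exercise against a sample of the expected "after" shape:
cat > /tmp/resolv.after <<'EOF'
# Generated by NetworkManager
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
options rotate attempts:3 timeout:2
EOF
check_resolv /tmp/resolv.after
```

On a live node, feed the helper the output of the `kubectl debug` capture above instead of the sample file.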

Test resolution from inside a workload pod to confirm the cluster-side DNS path is unaffected (pods continue to resolve through CoreDNS; the NNCP only changes the node's own resolver):

```bash
kubectl run dns-probe -it --rm --restart=Never \
  --image=busybox -- \
  sh -c 'nslookup kubernetes.default.svc; nslookup external-host.example.com'
```

The first lookup must succeed regardless, since pods resolve through CoreDNS rather than the node resolver. The second exercises the new servers indirectly: with the default CoreDNS configuration, queries for external names are forwarded to the upstream servers CoreDNS reads from the node's `/etc/resolv.conf`.

### Widen the rollout once validated

After validation on the pilot node, remove the single-host `nodeSelector` (or set a label selector covering the broader fleet). Stagger by node label or by the operator's own batching to avoid reconfiguring every node simultaneously — one node at a time is safest.

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

Monitor the `NodeNetworkConfigurationEnactment` list for any node that reports `Available=False`; inspect that node's NetworkManager state before proceeding with the rollout.

### Roll back

Deleting the NNCP stops NMState from enforcing the declared DNS state, but it does not automatically revert configuration that has already been applied to the nodes. To restore the previous resolver settings, apply a policy whose `dns-resolver` section declares the baseline values recorded at the start of this procedure, wait for it to reconcile, and then remove the policies:

```bash
kubectl delete nodenetworkconfigurationpolicy dns-worker-2
```

Confirm each node's `/etc/resolv.conf` matches the recorded baseline before treating the rollback as complete.
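That confirmation can be scripted as a diff against the baseline captured earlier. A self-contained sketch, with sample data standing in for the two `kubectl debug` captures (file paths are illustrative):

```shell
# Illustrative rollback check: the baseline capture and a fresh capture
# taken after the rollback should be identical.
cat > /tmp/resolv.baseline <<'EOF'
# Generated by NetworkManager
search example.internal
nameserver 192.168.1.249
EOF
# In practice /tmp/resolv.current would come from the node; here it is a copy.
cp /tmp/resolv.baseline /tmp/resolv.current
if diff -q /tmp/resolv.baseline /tmp/resolv.current >/dev/null; then
  echo "node restored to baseline"
else
  echo "resolv.conf still differs from baseline" >&2
fi
```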

## Diagnostic Steps

If the NNCP goes to `Degraded=True`, read the specific enactment for the failing node to see which NetworkManager operation was rejected:

```bash
kubectl get nodenetworkconfigurationenactment \
  -l nmstate.io/policy=<policy-name>,nmstate.io/node=<node-name> \
  -o yaml
```

Common failures:

- `dns-resolver.config.server` contains an IP that does not belong to any reachable subnet from the node. NetworkManager applies the resolver list regardless, but the node cannot resolve anything and subsequent probes fail.
- `search` domains exceed glibc's `MAXDNSRCH` limit (6 entries). The resolver silently ignores entries beyond the limit, so put the domains the application depends on first.
- `options` list contains a token that the glibc resolver does not recognise. NetworkManager writes the option verbatim; invalid options are silently ignored by glibc. Verify each option against `man 5 resolv.conf`.
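The search-domain limit above can be checked before the policy is applied. A sketch, assuming the intended resolv.conf content is available locally; `count_search` is a hypothetical helper:

```shell
# Sketch: count declared search domains and warn when glibc's MAXDNSRCH (6)
# would silently drop the tail of the list.
count_search() {
  awk '/^search/ { print NF - 1 }' "$1"
}

cat > /tmp/resolv.demo <<'EOF'
search a.example b.example c.example d.example e.example f.example g.example
nameserver 192.168.1.249
EOF

n=$(count_search /tmp/resolv.demo)
if [ "$n" -gt 6 ]; then
  echo "WARNING: $n search domains declared; glibc consults only the first 6"
fi
```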

If `/etc/resolv.conf` on the node still shows the old content after `Available=True`, NetworkManager may have been overridden by another service (cloud-init, a cluster addon writing directly to `/etc/resolv.conf`, or a static file). Check the node's `nmcli dev show | grep DNS` versus the file — they should agree. If they disagree, the file is being written after NetworkManager; remove that other writer or let NNCP manage a compatible file instead of `/etc/resolv.conf`.
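The comparison between the two views can be sketched as follows, with a sample file and a hard-coded `nm_dns` value standing in for the live `nmcli dev show` output:

```shell
# Sketch: compare the nameservers NetworkManager believes it set with the
# ones actually present in the file; a mismatch points at another writer.
cat > /tmp/resolv.node <<'EOF'
search example.internal corp.example.com
nameserver 192.168.1.249
nameserver 192.168.1.1
EOF

nm_dns="192.168.1.249 192.168.1.1"   # stand-in for: nmcli dev show | grep DNS
file_dns=$(awk '/^nameserver/ { printf "%s%s", sep, $2; sep=" " }' /tmp/resolv.node)

if [ "$nm_dns" = "$file_dns" ]; then
  echo "NetworkManager and /etc/resolv.conf agree"
else
  echo "MISMATCH: something else is writing /etc/resolv.conf" >&2
fi
```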