rename to vsphere-passthrough-node-controller + add crash-fence controller#7
rename to vsphere-passthrough-node-controller + add crash-fence controller#7Varashi wants to merge 2 commits into
Conversation
…rash) Adds fence.py — a second, optional Deployment (fence.enabled, off by default) sharing this image and reusing the vCenter client + node↔VM map. Automates non-graceful node shutdown for passthrough-GPU workers that vSphere HA can't restart elsewhere during a host crash: applies the node.kubernetes.io/out-of-service taint to a node confirmed dead by BOTH k8s (NotReady) and vCenter (VM connectionState disconnected/inaccessible/ orphaned), sustained graceSeconds, so RWO volumes force-detach and pods reschedule; removes it on recovery (connected + Ready). Disjoint from the maintenance controller (clean power-off stays 'connected', only a crash goes 'disconnected') — no coordination needed. Taint/un-taint only; power-on stays with vSphere HA. Own SA + least-privilege ClusterRole (nodes only) + kill switch + dryRun. test_fence.py covers the two-gate guard, grace hold, no-double-apply, un-fence on recovery, and the partition/notfound guards (8/8). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe pull request adds a new optional crash-fence controller ( ChangesCrash-fence controller feature
Sequence Diagram(s)sequenceDiagram
participant FenceController
participant K8sClient
participant VSphereClient
FenceController->>K8sClient: list nodes (GPU label)
K8sClient-->>FenceController: node list
loop for each node
FenceController->>K8sClient: get node Ready status
K8sClient-->>FenceController: Ready state
FenceController->>VSphereClient: get_vm_connection_state
VSphereClient-->>FenceController: "connected" or "disconnected"/"inaccessible"/"orphaned"
alt both gates active + grace elapsed
FenceController->>K8sClient: apply_out_of_service_taint
K8sClient-->>FenceController: taint applied
else both gates inactive + taint present
FenceController->>K8sClient: remove_out_of_service_taint
K8sClient-->>FenceController: taint removed
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 10: The changelog declares "## [0.5.0]" but the reference links at the
bottom still treat 0.4.3 as the latest; add a "[0.5.0]:" link definition
pointing to the compare URL for v0.4.3...v0.5.0 (or the repo tag URL for v0.5.0)
and update the "[Unreleased]:" reference to compare against v0.5.0 (change its
right-hand/latest tag from v0.4.3 to v0.5.0); locate the "## [0.5.0]" header and
the link definitions named "[0.5.0]:" and "[Unreleased]:" to make these edits so
the release compare navigation is correct.
In `@chart/templates/fence.yaml`:
- Around line 27-30: The ClusterRole rules in fence.yaml grant both "patch" and
"update" on the "nodes" resource; remove the "update" verb and keep only "patch"
(and any existing verbs like "get","list","watch") to enforce least-privilege
for taint operations—locate the rules block (apiGroups: [""], resources:
["nodes"], verbs: [...]) and delete "update" from the verbs array.
- Line 5: The conditional in the template uses the redundant Go template "and"
operator with a single argument ({{- if and .Values.serviceAccount.create }});
change it to a direct boolean check by replacing the use of "and" with a simple
if on .Values.serviceAccount.create (i.e., use {{- if
.Values.serviceAccount.create }}), ensuring any matching {{ end }} remains
unchanged.
In `@controller.py`:
- Around line 484-488: has_out_of_service_taint currently matches taints by key
only which can erroneously detect/remove taints; update it and the related
functions apply_out_of_service_taint and remove_out_of_service_taint to match
the full taint identity (key, value, effect) — or at minimum key+effect —
instead of key-only. Locate uses of OUT_OF_SERVICE_TAINT_KEY and change the
predicate in has_out_of_service_taint to compare (t.key, t.value, t.effect) to
the controller's OUT_OF_SERVICE taint triple, ensure apply_out_of_service_taint
creates the taint with the same (key,value,effect) tuple, and change
remove_out_of_service_taint to filter/remove only taints matching that full
tuple so you don't touch taints the controller didn't create.
In `@README.md`:
- Around line 121-129: Update the README examples to reference the 0.5.0
chart/tag wherever examples show 0.4.3 so the documented fence.* fields (dryRun,
graceSeconds, pollSeconds) match the supported release, and update the
source-layout description that currently says “single controller.py” to reflect
the new split (mention both controller.py and fence.py) so readers know fence.py
exists; search for occurrences of the chart tag/version, the phrase "single
controller.py", and the fence.* example block to make the changes consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 8030bd06-6a49-48c2-bef3-c134b0815e0d
📒 Files selected for processing (10)
.gitignoreCHANGELOG.mdDockerfileREADME.mdchart/Chart.yamlchart/templates/fence.yamlchart/values.yamlcontroller.pyfence.pytest_fence.py
|
|
||
| ## [Unreleased] | ||
|
|
||
| ## [0.5.0] — 2026-06-03 |
There was a problem hiding this comment.
Add the 0.5.0 reference links and move Unreleased baseline forward.
0.5.0 is declared, but the link definitions still treat 0.4.3 as latest. This breaks compare navigation for the new release.
Suggested link fixes
-[Unreleased]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.3...HEAD
+[Unreleased]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.5.0...HEAD
+[0.5.0]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.4...v0.5.0
+[0.4.4]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.3...v0.4.4
[0.4.3]: https://github.com/Varashi/gpu-node-vsphere-maintenance-controller/compare/v0.4.2...v0.4.3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@CHANGELOG.md` at line 10, The changelog declares "## [0.5.0]" but the
reference links at the bottom still treat 0.4.3 as the latest; add a "[0.5.0]:"
link definition pointing to the compare URL for v0.4.3...v0.5.0 (or the repo tag
URL for v0.5.0) and update the "[Unreleased]:" reference to compare against
v0.5.0 (change its right-hand/latest tag from v0.4.3 to v0.5.0); locate the "##
[0.5.0]" header and the link definitions named "[0.5.0]:" and "[Unreleased]:" to
make these edits so the release compare navigation is correct.
| {{- $fullname := include "gpu-node-vsphere-maintenance-controller.fullname" . -}} | ||
| {{- $fenceName := printf "%s-fence" $fullname -}} | ||
| {{- $fenceSA := ternary $fenceName (default "default" .Values.serviceAccount.name) .Values.serviceAccount.create -}} | ||
| {{- if and .Values.serviceAccount.create }} |
There was a problem hiding this comment.
Remove redundant and operator from conditional.
The and operator with a single argument is non-idiomatic Go template syntax. This should be simplified to a direct boolean check.
🔧 Proposed fix
-{{- if and .Values.serviceAccount.create }}
+{{- if .Values.serviceAccount.create }}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {{- if and .Values.serviceAccount.create }} | |
| {{- if .Values.serviceAccount.create }} |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@chart/templates/fence.yaml` at line 5, The conditional in the template uses
the redundant Go template "and" operator with a single argument ({{- if and
.Values.serviceAccount.create }}); change it to a direct boolean check by
replacing the use of "and" with a simple if on .Values.serviceAccount.create
(i.e., use {{- if .Values.serviceAccount.create }}), ensuring any matching {{
end }} remains unchanged.
| rules: | ||
| - apiGroups: [""] | ||
| resources: ["nodes"] | ||
| verbs: ["get", "list", "watch", "patch", "update"] |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Consider removing "update" verb for stricter least-privilege.
The ClusterRole includes both "patch" and "update" verbs, but Kubernetes controllers performing taint operations typically only need "patch". The "update" verb enables full object replacement and is more permissive than necessary for this use case.
🔒 Proposed change for stricter RBAC
rules:
- apiGroups: [""]
resources: ["nodes"]
- verbs: ["get", "list", "watch", "patch", "update"]
+ verbs: ["get", "list", "watch", "patch"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rules: | |
| - apiGroups: [""] | |
| resources: ["nodes"] | |
| verbs: ["get", "list", "watch", "patch", "update"] | |
| rules: | |
| - apiGroups: [""] | |
| resources: ["nodes"] | |
| verbs: ["get", "list", "watch", "patch"] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@chart/templates/fence.yaml` around lines 27 - 30, The ClusterRole rules in
fence.yaml grant both "patch" and "update" on the "nodes" resource; remove the
"update" verb and keep only "patch" (and any existing verbs like
"get","list","watch") to enforce least-privilege for taint operations—locate the
rules block (apiGroups: [""], resources: ["nodes"], verbs: [...]) and delete
"update" from the verbs array.
| def has_out_of_service_taint(self, node_name) -> bool: | ||
| node = self.get_node(node_name) | ||
| return any( | ||
| t.key == OUT_OF_SERVICE_TAINT_KEY for t in (node.spec.taints or []) | ||
| ) |
There was a problem hiding this comment.
Match fence taints by full identity, not key-only.
The current checks/removal use only t.key == OUT_OF_SERVICE_TAINT_KEY. If a node has the same key with a different effect/value, fencing can be skipped and un-fencing can remove taints this controller did not create.
Use (key, value, effect) (or at least key + effect) consistently in has_out_of_service_taint, apply_out_of_service_taint, and remove_out_of_service_taint.
Suggested fix
+ def _is_fence_taint(self, t) -> bool:
+ return (
+ t.key == OUT_OF_SERVICE_TAINT_KEY
+ and getattr(t, "value", None) == OUT_OF_SERVICE_TAINT_VALUE
+ and getattr(t, "effect", None) == "NoExecute"
+ )
+
def has_out_of_service_taint(self, node_name) -> bool:
node = self.get_node(node_name)
- return any(
- t.key == OUT_OF_SERVICE_TAINT_KEY for t in (node.spec.taints or [])
- )
+ return any(self._is_fence_taint(t) for t in (node.spec.taints or []))
@@
- if any(t.key == OUT_OF_SERVICE_TAINT_KEY for t in taints):
+ if any(self._is_fence_taint(t) for t in taints):
return # already fenced
@@
- if not any(t.key == OUT_OF_SERVICE_TAINT_KEY for t in taints):
+ if not any(self._is_fence_taint(t) for t in taints):
return
@@
- kept = [t for t in taints if t.key != OUT_OF_SERVICE_TAINT_KEY]
+ kept = [t for t in taints if not self._is_fence_taint(t)]
self._patch_taints(node_name, kept)Also applies to: 500-501, 518-525
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller.py` around lines 484 - 488, has_out_of_service_taint currently
matches taints by key only which can erroneously detect/remove taints; update it
and the related functions apply_out_of_service_taint and
remove_out_of_service_taint to match the full taint identity (key, value,
effect) — or at minimum key+effect — instead of key-only. Locate uses of
OUT_OF_SERVICE_TAINT_KEY and change the predicate in has_out_of_service_taint to
compare (t.key, t.value, t.effect) to the controller's OUT_OF_SERVICE taint
triple, ensure apply_out_of_service_taint creates the taint with the same
(key,value,effect) tuple, and change remove_out_of_service_taint to
filter/remove only taints matching that full tuple so you don't touch taints the
controller didn't create.
| Enable it (and start with `dryRun` to watch its decisions): | ||
|
|
||
| ```yaml | ||
| fence: | ||
| enabled: true | ||
| dryRun: true # logs "would fence" without tainting; flip to false when confident | ||
| graceSeconds: 60 # both gates must hold this long before fencing | ||
| pollSeconds: 20 | ||
| ``` |
There was a problem hiding this comment.
Update adjacent README examples to 0.5.0 to keep fence docs actionable.
This section documents fence.* values, but later examples still pin chart tag 0.4.3; that version won’t include this feature. Also, the README still says the source layout is a “single controller.py”, which is now outdated after adding fence.py.
Suggested doc updates
- --version 0.4.3 \
+ --version 0.5.0 \
- tag: 0.4.3
+ tag: 0.5.0
-Source layout is deliberately tiny — a single `controller.py` plus a
+Source layout is deliberately tiny — `controller.py` + `fence.py` plus a🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 121 - 129, Update the README examples to reference
the 0.5.0 chart/tag wherever examples show 0.4.3 so the documented fence.*
fields (dryRun, graceSeconds, pollSeconds) match the supported release, and
update the source-layout description that currently says “single controller.py”
to reflect the new split (mention both controller.py and fence.py) so readers
know fence.py exists; search for occurrences of the chart tag/version, the
phrase "single controller.py", and the fence.* example block to make the changes
consistently.
The product now covers two passthrough-node host-lifecycle concerns — planned maintenance and (new) crash fencing — so "maintenance-controller" no longer fits. Renames repo/image/chart/templates/helpers/docs to vsphere-passthrough-node-controller (image + chart at ghcr.io/varashi/[charts/]vsphere-passthrough-node-controller). GitHub repo renamed (redirects cover old git/web refs). The internal maintenance-state annotation domain (vsphere-maintenance.boeye.net/*) is left unchanged — it's controller state, not user-facing, and no nodes carry it currently. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Folded in the product rename gpu-node-vsphere-maintenance-controller → vsphere-passthrough-node-controller (GitHub repo renamed; redirects cover old refs). The product now spans two passthrough-node host-lifecycle concerns (planned maintenance + crash fencing), so the old name no longer fit. New artifacts on release: |
There was a problem hiding this comment.
♻️ Duplicate comments (3)
chart/templates/fence.yaml (2)
27-30: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winConsider removing "update" verb for stricter least-privilege.
The ClusterRole includes both "patch" and "update" verbs, but Kubernetes controllers performing taint operations typically only need "patch". The "update" verb enables full object replacement and is more permissive than necessary for this use case.
🔒 Proposed change for stricter RBAC
rules: - apiGroups: [""] resources: ["nodes"] - verbs: ["get", "list", "watch", "patch", "update"] + verbs: ["get", "list", "watch", "patch"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/templates/fence.yaml` around lines 27 - 30, The ClusterRole rules for resource "nodes" currently list both "patch" and "update"; remove the "update" verb to adhere to least-privilege (keep "patch" for taint operations) by editing the rules block that targets resources: ["nodes"] (the ClusterRole/rules stanza in the fence template) so it no longer includes "update" while retaining "get", "list", "watch", and "patch".
5-5:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove redundant
andoperator from conditional.The
andoperator with a single argument is non-idiomatic Go template syntax. This should be simplified to a direct boolean check.🔧 Proposed fix
-{{- if and .Values.serviceAccount.create }} +{{- if .Values.serviceAccount.create }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/templates/fence.yaml` at line 5, The conditional uses the redundant Go template "and" operator with a single argument; replace the expression that references "and .Values.serviceAccount.create" with a direct boolean check of ".Values.serviceAccount.create" in the template conditional (i.e., remove the "and" operator) so the if statement becomes a simple "{{- if .Values.serviceAccount.create }}" style check.CHANGELOG.md (1)
189-200:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd the
0.5.0and0.4.4reference links and move Unreleased baseline forward.The changelog declares
## [0.5.0]and## [0.4.4]but the link definitions still treat0.4.3as latest. This breaks compare navigation for the new releases.🔗 Suggested link fixes
-[Unreleased]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...HEAD +[Unreleased]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD +[0.5.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.4...v0.5.0 +[0.4.4]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4 [0.4.3]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.2...v0.4.3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` around lines 189 - 200, Update the changelog link definitions so the new releases are referenced and Unreleased compares against the newest tag: add link definitions for [0.4.4] and [0.5.0] (e.g. [0.4.4]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4 and [0.5.0]: https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.5.0), then change the [Unreleased] baseline to compare from v0.5.0 (https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD) so the "Unreleased" link points to the correct latest released tag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@CHANGELOG.md`:
- Around line 189-200: Update the changelog link definitions so the new releases
are referenced and Unreleased compares against the newest tag: add link
definitions for [0.4.4] and [0.5.0] (e.g. [0.4.4]:
https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.4.4
and [0.5.0]:
https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.4.3...v0.5.0),
then change the [Unreleased] baseline to compare from v0.5.0
(https://github.com/Varashi/vsphere-passthrough-node-controller/compare/v0.5.0...HEAD)
so the "Unreleased" link points to the correct latest released tag.
In `@chart/templates/fence.yaml`:
- Around line 27-30: The ClusterRole rules for resource "nodes" currently list
both "patch" and "update"; remove the "update" verb to adhere to least-privilege
(keep "patch" for taint operations) by editing the rules block that targets
resources: ["nodes"] (the ClusterRole/rules stanza in the fence template) so it
no longer includes "update" while retaining "get", "list", "watch", and "patch".
- Line 5: The conditional uses the redundant Go template "and" operator with a
single argument; replace the expression that references "and
.Values.serviceAccount.create" with a direct boolean check of
".Values.serviceAccount.create" in the template conditional (i.e., remove the
"and" operator) so the if statement becomes a simple "{{- if
.Values.serviceAccount.create }}" style check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 714fa2e1-53cb-49df-8080-13f3f34992ac
📒 Files selected for processing (16)
.github/workflows/ci.yaml.github/workflows/release.yamlCHANGELOG.mdDockerfileREADME.mdchart/Chart.yamlchart/templates/NOTES.txtchart/templates/_helpers.tplchart/templates/clusterrole.yamlchart/templates/clusterrolebinding.yamlchart/templates/configmap.yamlchart/templates/deployment.yamlchart/templates/fence.yamlchart/templates/secret.yamlchart/templates/serviceaccount.yamlchart/values.yaml
Adds a second, optional controller (
fence.py,fence.enabled— off by default) sharing this image and reusing the vCenter client + node↔VM map.Why
A passthrough-GPU VM can't be vSphere-HA-restarted elsewhere during a host crash, so the node stays down and its RWO volume stays attached to the dead node — k8s won't auto-detach an unreachable node's volume (split-brain protection). A rescheduled stateful pod then hangs on
Multi-Attachindefinitely. Thenode.kubernetes.io/out-of-servicetaint fixes this (force-detach + force-delete), but must only ever hit a confirmed-dead node.What it does
Two-gate fence, sustained for
fence.graceSeconds:NotReadyruntime.connectionState∈ {disconnected, inaccessible, orphaned}A clean (maintenance) power-off leaves the VM
connected; only a real host loss makes itdisconnected— so this is disjoint from the maintenance controller and the two never collide. Un-fences on recovery (connected + Ready).Taint/un-taint only — power-on stays with vSphere HA (it restarts passthrough VMs on the original host once it reconnects); eviction is handled by
tolerationSeconds+ the taint.Isolation
Separate Deployment, own ServiceAccount, least-privilege ClusterRole (
nodesget/list/watch/patch only — no pods/eviction), independent kill switch (fence.enabled) andfence.dryRun. Off by default — a mis-fire is destructive.Validation
test_fence.py— 8/8: two-gate guard (NotReady-only / disconnected-only → no fence), grace hold, fence-on-grace, no-double-apply, un-fence on recovery, stay-fenced while VM still disconnected,notfound/partition guard.helm lint+ template render (gated; 0 fence resources when disabled).Rollout
v0.5.0(CI builds image incl.fence.py+ publishes chart).Varashi/k8s, setfence.enabled: true+fence.dryRun: true, watch its decisions.dryRun: falseonce trusted.Closes the controller design in Varashi/k8s#166.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores
Tests
Misc