Add pass-1 all-disk health check to check-bm-server #4

Open: guettli wants to merge 4 commits into main from tg/check-all-disks-health

@guettli (Collaborator) commented May 26, 2026

Adds a `pass-1-check-all-disks-health` step that checks every disk visible in rescue (not just `rootDeviceHints`) by collecting all disk WWNs via `GetHardwareDetailsStorage()` and passing them to `ssh.CheckDisk()`. Runs in pass 1 only.

Removes the now-redundant `pass-N-check-disk-in-rescue` step: the `rootDeviceHints` disks it checked are a subset of the disks covered by the new all-disk check.

Checked with bm-8:

go run . check-bm-server ../cluster-stacks/bm-8.yaml 
WARNING: this will delete all data on disks with WWN(s): eui.0025388b01b5c3df
host "bm-8" (serverID=2928603) 
Type "yes" to continue: yes
overall=0m00s destructive action confirmed for WWN(s): eui.0025388b01b5c3df
overall=0m00s step=load-input state=start timeout=0m30s
overall=0m00s step=load-input state=running elapsed=0m00s used=0.0% remaining=0m30s selected host "bm-8" (serverID=2928603)
overall=0m00s step=load-input state=running elapsed=0m00s used=0.0% remaining=0m30s loaded Robot + SSH credentials from environment
overall=0m00s step=load-input state=success duration=0m00s used=0.0% remaining=0m30s

overall=0m00s step=ensure-robot-ssh-key state=start timeout=1m00s
overall=0m00s step=ensure-robot-ssh-key state=running elapsed=0m00s used=0.4% remaining=1m00s using robot key="shared-2024-07-08" fingerprint="e5:18:e3:69:70:c9:fc:42:ba:1f:0f:eb:8a:16:7a:47"
overall=0m00s step=ensure-robot-ssh-key state=success duration=0m00s used=0.4% remaining=1m00s

overall=0m00s step=fetch-server-details state=start timeout=0m30s
overall=0m01s step=fetch-server-details state=running elapsed=0m00s used=1.5% remaining=0m30s server ip=65.21.93.126
overall=0m01s step=fetch-server-details state=success duration=0m00s used=1.5% remaining=0m30s

overall=0m01s step=pass-1-activate-rescue state=start timeout=0m45s
overall=0m02s step=pass-1-activate-rescue state=running elapsed=0m02s used=3.4% remaining=0m43s rescue mode activated
overall=0m02s step=pass-1-activate-rescue state=success duration=0m02s used=3.4% remaining=0m43s

overall=0m02s step=pass-1-reboot-to-rescue state=start timeout=0m45s
overall=0m07s step=pass-1-reboot-to-rescue state=running elapsed=0m05s used=10.4% remaining=0m40s hardware reboot requested
overall=0m07s step=pass-1-reboot-to-rescue state=success duration=0m05s used=10.4% remaining=0m40s

overall=0m07s step=pass-1-wait-rescue state=start timeout=8m00s
overall=0m12s step=pass-1-wait-rescue state=running elapsed=0m05s used=1.0% remaining=7m55s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=0m22s step=pass-1-wait-rescue state=running elapsed=0m15s used=3.1% remaining=7m45s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=0m32s step=pass-1-wait-rescue state=running elapsed=0m25s used=5.2% remaining=7m35s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=0m42s step=pass-1-wait-rescue state=running elapsed=0m35s used=7.3% remaining=7m25s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=0m52s step=pass-1-wait-rescue state=running elapsed=0m45s used=9.4% remaining=7m15s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=1m02s step=pass-1-wait-rescue state=running elapsed=0m55s used=11.5% remaining=7m05s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=1m12s step=pass-1-wait-rescue state=running elapsed=1m05s used=13.5% remaining=6m55s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=1m18s step=pass-1-wait-rescue state=running elapsed=1m11s used=14.8% remaining=6m49s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: connect: connection refused
overall=1m27s step=pass-1-wait-rescue state=running elapsed=1m20s used=16.7% remaining=6m40s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: connect: connection refused
overall=1m38s step=pass-1-wait-rescue state=running elapsed=1m31s used=18.9% remaining=6m29s rescue reachable (hostname="rescue")
overall=1m38s step=pass-1-wait-rescue state=success duration=1m31s used=18.9% remaining=6m29s

overall=1m38s step=pass-1-check-all-disks-health state=start timeout=3m00s
overall=1m39s step=pass-1-check-all-disks-health state=running elapsed=0m01s used=0.6% remaining=2m59s check-all-disks ok: 
Checking WWN=eui.0025388b01b5c3df device=nvme0n1
Checking WWN=eui.0025388b01b5c3e8 device=nvme1n1
check-disk passed. Provided WWNs look healthy.

eui.0025388b01b5c3df (/dev/nvme0n1): SMART overall-health self-assessment test result: PASSED
eui.0025388b01b5c3e8 (/dev/nvme1n1): SMART overall-health self-assessment test result: PASSED
overall=1m39s step=pass-1-check-all-disks-health state=success duration=0m01s used=0.6% remaining=2m59s

overall=1m39s step=pass-1-install-ubuntu-24.04 state=start timeout=9m00s
overall=1m40s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m01s used=0.2% remaining=8m59s install target devices: nvme0n1
overall=1m40s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m02s used=0.3% remaining=8m58s autosetup uploaded
overall=1m41s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m03s used=0.5% remaining=8m57s post-install script uploaded
overall=1m42s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m03s used=0.6% remaining=8m57s installimage files uploaded
overall=1m45s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m06s used=1.1% remaining=8m54s installimage started
overall=1m53s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m14s used=2.5% remaining=8m46s installimage is still running
overall=2m03s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m24s used=4.4% remaining=8m36s installimage is still running
overall=2m13s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m34s used=6.3% remaining=8m26s installimage is still running
overall=2m23s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m44s used=8.1% remaining=8m16s installimage is still running
overall=2m33s step=pass-1-install-ubuntu-24.04 state=running elapsed=0m54s used=10.0% remaining=8m06s installimage is still running
overall=2m43s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m04s used=11.8% remaining=7m56s installimage is still running
overall=2m53s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m14s used=13.7% remaining=7m46s installimage is still running
overall=3m03s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m24s used=15.5% remaining=7m36s installimage is still running
overall=3m14s step=pass-1-install-ubuntu-24.04 state=running elapsed=1m35s used=17.7% remaining=7m25s installimage finished and marker found
overall=3m14s step=pass-1-install-ubuntu-24.04 state=success duration=1m35s used=17.7% remaining=7m25s

overall=3m14s step=pass-1-reboot-to-os state=start timeout=0m45s
overall=3m15s step=pass-1-reboot-to-os state=running elapsed=0m01s used=1.2% remaining=0m44s reboot command sent from rescue
overall=3m15s step=pass-1-reboot-to-os state=success duration=0m01s used=1.2% remaining=0m44s

overall=3m15s step=pass-1-wait-os state=start timeout=6m00s
overall=3m15s step=pass-1-wait-os state=running elapsed=0m00s used=0.0% remaining=6m00s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): ssh: handshake failed: read tcp 192.168.178.33:39918->65.21.93.126:22: read: connection reset by peer
overall=3m30s step=pass-1-wait-os state=running elapsed=0m15s used=4.2% remaining=5m45s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=3m40s step=pass-1-wait-os state=running elapsed=0m25s used=6.9% remaining=5m35s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=3m50s step=pass-1-wait-os state=running elapsed=0m35s used=9.7% remaining=5m25s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=4m00s step=pass-1-wait-os state=running elapsed=0m45s used=12.5% remaining=5m15s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=4m10s step=pass-1-wait-os state=running elapsed=0m55s used=15.3% remaining=5m05s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=4m15s step=pass-1-wait-os state=running elapsed=1m00s used=16.7% remaining=5m00s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: connect: connection refused
overall=4m26s step=pass-1-wait-os state=running elapsed=1m11s used=19.8% remaining=4m49s os reachable (hostname="bm-8")
overall=4m26s step=pass-1-wait-os state=success duration=1m11s used=19.8% remaining=4m49s

overall=4m26s step=pass-2-activate-rescue state=start timeout=0m45s
overall=4m28s step=pass-2-activate-rescue state=running elapsed=0m02s used=3.8% remaining=0m43s rescue mode activated
overall=4m28s step=pass-2-activate-rescue state=success duration=0m02s used=3.8% remaining=0m43s

overall=4m28s step=pass-2-reboot-to-rescue state=start timeout=0m45s
overall=4m32s step=pass-2-reboot-to-rescue state=running elapsed=0m05s used=10.6% remaining=0m40s hardware reboot requested
overall=4m32s step=pass-2-reboot-to-rescue state=success duration=0m05s used=10.6% remaining=0m40s

overall=4m32s step=pass-2-wait-rescue state=start timeout=8m00s
overall=4m37s step=pass-2-wait-rescue state=running elapsed=0m05s used=1.0% remaining=7m55s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=4m47s step=pass-2-wait-rescue state=running elapsed=0m15s used=3.1% remaining=7m45s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=4m57s step=pass-2-wait-rescue state=running elapsed=0m25s used=5.2% remaining=7m35s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=5m07s step=pass-2-wait-rescue state=running elapsed=0m35s used=7.3% remaining=7m25s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=5m17s step=pass-2-wait-rescue state=running elapsed=0m45s used=9.4% remaining=7m15s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=5m27s step=pass-2-wait-rescue state=running elapsed=0m55s used=11.5% remaining=7m05s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=5m37s step=pass-2-wait-rescue state=running elapsed=1m05s used=13.5% remaining=6m55s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=5m42s step=pass-2-wait-rescue state=running elapsed=1m10s used=14.6% remaining=6m50s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: connect: connection refused
overall=5m52s step=pass-2-wait-rescue state=running elapsed=1m20s used=16.7% remaining=6m40s waiting for rescue ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: connect: connection refused
overall=6m03s step=pass-2-wait-rescue state=running elapsed=1m31s used=18.9% remaining=6m29s rescue reachable (hostname="rescue")
overall=6m03s step=pass-2-wait-rescue state=success duration=1m31s used=18.9% remaining=6m29s

overall=6m03s step=pass-2-install-ubuntu-24.04 state=start timeout=9m00s
overall=6m04s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m01s used=0.2% remaining=8m59s install target devices: nvme0n1
overall=6m05s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m02s used=0.3% remaining=8m58s autosetup uploaded
overall=6m06s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m03s used=0.5% remaining=8m57s post-install script uploaded
overall=6m06s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m03s used=0.6% remaining=8m57s installimage files uploaded
overall=6m09s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m06s used=1.1% remaining=8m54s installimage started
overall=6m17s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m14s used=2.6% remaining=8m46s installimage is still running
overall=6m27s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m24s used=4.4% remaining=8m36s installimage is still running
overall=6m37s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m34s used=6.3% remaining=8m26s installimage is still running
overall=6m47s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m44s used=8.1% remaining=8m16s installimage is still running
overall=6m57s step=pass-2-install-ubuntu-24.04 state=running elapsed=0m54s used=10.0% remaining=8m06s installimage is still running
overall=7m07s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m04s used=11.8% remaining=7m56s installimage is still running
overall=7m17s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m14s used=13.7% remaining=7m46s installimage is still running
overall=7m27s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m24s used=15.5% remaining=7m36s installimage is still running
overall=7m38s step=pass-2-install-ubuntu-24.04 state=running elapsed=1m36s used=17.7% remaining=7m24s installimage finished and marker found
overall=7m38s step=pass-2-install-ubuntu-24.04 state=success duration=1m36s used=17.7% remaining=7m24s

overall=7m38s step=pass-2-reboot-to-os state=start timeout=0m45s
overall=7m39s step=pass-2-reboot-to-os state=running elapsed=0m01s used=1.2% remaining=0m44s reboot command sent from rescue
overall=7m39s step=pass-2-reboot-to-os state=success duration=0m01s used=1.2% remaining=0m44s

overall=7m39s step=pass-2-wait-os state=start timeout=6m00s
overall=7m39s step=pass-2-wait-os state=running elapsed=0m00s used=0.0% remaining=6m00s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): ssh: handshake failed: read tcp 192.168.178.33:50572->65.21.93.126:22: read: connection reset by peer
overall=7m54s step=pass-2-wait-os state=running elapsed=0m15s used=4.2% remaining=5m45s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=8m04s step=pass-2-wait-os state=running elapsed=0m25s used=6.9% remaining=5m35s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=8m14s step=pass-2-wait-os state=running elapsed=0m35s used=9.7% remaining=5m25s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=8m24s step=pass-2-wait-os state=running elapsed=0m45s used=12.5% remaining=5m15s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=8m34s step=pass-2-wait-os state=running elapsed=0m55s used=15.3% remaining=5m05s waiting for os ssh: failed to dial ssh (user=root host=65.21.93.126 port=22 timeout=5s): dial tcp 65.21.93.126:22: i/o timeout
overall=8m40s step=pass-2-wait-os state=running elapsed=1m01s used=17.0% remaining=4m59s os reachable (hostname="bm-8")
overall=8m40s step=pass-2-wait-os state=success duration=1m01s used=17.0% remaining=4m59s

overall=8m40s all checks passed: machine "bm-8" (serverID=2928603) completed two rescue+install+boot cycles
Hint: set spec.maintenanceMode back to false in ../cluster-stacks/bm-8.yaml now.

Run smartctl -H on every disk visible in rescue (not just rootDeviceHints)
during the first pass only, so secondary drives and other attached disks
are validated without slowing down the second reliability pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@guettli force-pushed the tg/check-all-disks-health branch 2 times, most recently from d5edbfd to 71fc06e, on May 26, 2026 13:46
Run smartctl -H on every disk in rescue (pass 1 only) by enumerating all
disk WWNs via GetHardwareDetailsStorage() and reusing CAPH's existing
check-disk.sh via ssh.CheckDisk(). Removes the redundant
pass-N-check-disk-in-rescue step that only checked rootDeviceHints disks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@guettli force-pushed the tg/check-all-disks-health branch from 71fc06e to fa1d225 on May 26, 2026 13:56
@guettli requested a review from janiskemper on May 26, 2026 14:07
@guettli requested a review from batistein on May 26, 2026 14:58
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>