OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname #61
jaypoulz wants to merge 1 commit into openshift-eng:main
Conversation
@jaypoulz: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jaypoulz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold
Walkthrough
Adds an executable Bash script to resolve a VM's IP via libvirt interface and DHCP lease queries, and updates an Ansible task to call that script per VM, make SSH host formatting IPv6-safe, and increase the SSH connect timeout to 15s.
Sequence Diagram(s)
sequenceDiagram
participant Ansible
participant ResolveScript as resolve_vm_ip.sh
participant Libvirt as virsh/libvirt
participant DHCP as net-dhcp-leases
participant VM
Ansible->>ResolveScript: invoke with VM_NAME
ResolveScript->>Libvirt: virsh domiflist VM_NAME (get interfaces/MACs)
ResolveScript->>DHCP: virsh net-dhcp-leases <network> (per network)
DHCP-->>ResolveScript: lease rows (IP/CIDR, MAC, hostname)
ResolveScript-->>Ansible: resolved IP (CIDR stripped)
Ansible->>Ansible: compute SSH_HOST (wrap IPv6 in brackets)
Ansible->>VM: ssh-keyscan / ssh using SSH_HOST (timeout 15s)
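The IPv6-safe SSH host formatting in the last two diagram steps can be sketched as a small helper (hypothetical function name; the PR computes this inline in the Ansible task):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Wrap IPv6 addresses in brackets so the host string composes safely with
# ports and user@host syntax; IPv4 addresses pass through unchanged.
# Any colon in the address marks it as IPv6.
format_ssh_host() {
  local ip="$1"
  if [[ "$ip" == *:* ]]; then
    printf '[%s]' "$ip"
  else
    printf '%s' "$ip"
  fi
}

format_ssh_host "192.168.111.20"; echo   # -> 192.168.111.20
format_ssh_host "fd2e:6f44::14"; echo    # -> [fd2e:6f44::14]
```

Normalizing once keeps every downstream consumer (ssh, ssh-keyscan, inventory lines) from having to special-case the address family.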
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
@jaypoulz: This pull request references OCPEDGE-2489 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)
20-25: ⚠️ Potential issue | 🟠 Major: Guard the inventory rewrite when IP resolution returns nothing.
Line 25 suppresses every resolver failure. Combined with the filter on Line 34, a bad resolve_vm_ip.sh run turns into an empty parsed_vm_entries, and this block later rewrites [cluster_vms] from that empty list. That can silently wipe the existing VM inventory. Please fail or skip the rewrite when no addresses were resolved.
Suggested guard:
  - name: Get VM IP addresses from DHCP leases (hostname preferred)
    script: resolve_vm_ip.sh {{ item | quote }}
    register: vm_ips
    loop: "{{ cluster_vms.stdout_lines }}"
    changed_when: false
    failed_when: false
+
+ - name: Abort when no VM IPs were resolved
+   fail:
+     msg: >-
+       Discovered {{ cluster_vms.stdout_lines | length }} cluster VM(s), but
+       none of their IPs could be resolved from libvirt DHCP leases.
+   when: >-
+     (vm_ips.results | map(attribute='stdout') | map('trim') |
+     reject('equalto', '') | list | length) == 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml` around lines 20 - 25, The current task running resolve_vm_ip.sh (script: resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which can lead to parsed_vm_entries being empty and rewrite the [cluster_vms] inventory from nothing; update the play to guard the rewrite by checking vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory update when no IPs were resolved—e.g., after the script task add a conditional task that inspects vm_ips.results to ensure at least one non-empty address (or set failed_when: true on the script when all results are empty), and only allow the subsequent block that rewrites [cluster_vms] (the rewrite task that uses parsed_vm_entries) to run when that check passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh`:
- Line 4: The grep -i "$MAC" pipelines currently cause the script to exit under
set -euo pipefail when a match fails; locate the command substitutions that pipe
into grep -i "$MAC" (the ones that populate the IP lookup fallbacks using the
MAC variable) and replace those grep pipelines with an awk match against the MAC
(case-insensitive) so the pipeline returns 0 even when no match is found and
fallbacks continue; ensure awk extracts the same fields as the original pipeline
(preserve the same capture/field selection) and use tolower comparison or
IGNORECASE in awk to implement case-insensitive matching.
---
Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task running resolve_vm_ip.sh (script:
resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which
can lead to parsed_vm_entries being empty and rewrite the [cluster_vms]
inventory from nothing; update the play to guard the rewrite by checking
vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory
update when no IPs were resolved—e.g., after the script task add a conditional
task that inspects vm_ips.results to ensure at least one non-empty address (or
set failed_when: true on the script when all results are empty), and only allow
the subsequent block that rewrites [cluster_vms] (the rewrite task that uses
parsed_vm_entries) to run when that check passes.
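The grep-under-pipefail failure mode called out in the inline comment above can be shown with a minimal, self-contained sketch (sample lease text shaped like virsh net-dhcp-leases output, not a live query):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sample lease row: expiry(date time)  MAC  protocol  IP/CIDR  hostname
leases='2024-01-01 00:00:00  52:54:00:aa:bb:01  ipv4  192.168.111.20/24  master-0'

# grep exits 1 on no match, which aborts the script under set -e + pipefail.
# An awk filter exits 0 either way, so later fallback tiers still get a chance.
mac="52:54:00:ff:ff:ff"   # deliberately absent from the sample data
ip=$(printf '%s\n' "$leases" | awk -v mac="$mac" '
  tolower($3) == tolower(mac) { print $5 }')
echo "no match, script still alive: ip='${ip}'"
```

With `grep -i "$MAC"` in place of the awk filter, the command substitution would propagate grep's exit status 1 and the script would die before any fallback could run.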
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2eaf1dbb-74e9-4e5e-af69-922b609907bd
📒 Files selected for processing (2)
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)
20-25: ⚠️ Potential issue | 🟠 Major: Don't silently rewrite cluster_vms to zero hosts on resolver failure.
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh exits non-zero with empty stdout on its no-result paths, and this block now turns those cases into "skip this VM". If the resolver breaks globally, the later inventory rewrite still removes the existing [cluster_vms] section and writes back an empty group even though virsh list found running VMs. Please fail loudly when VMs were discovered but zero IPs were resolved.
Also applies to: 27-38
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml` around lines 20 - 25, The current task "Get VM IP addresses from DHCP leases (hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides resolver failures and lets the inventory be rewritten empty; change the flow so that you still collect per-VM results in vm_ips but explicitly fail if virsh found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results contain a usable IP. Concretely: keep the script task but remove or limit the global failed_when:false behavior, then add a follow-up assertion/fail task that examines vm_ips.results and fails with a clear message when cluster_vms.stdout_lines | length > 0 and vm_ips.results | selectattr('stdout','truthy') | list | length == 0 (apply same check for the other similar block referenced). Ensure the failure path triggers before the inventory rewrite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task "Get VM IP addresses from DHCP leases
(hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides
resolver failures and lets the inventory be rewritten empty; change the flow so
that you still collect per-VM results in vm_ips but explicitly fail if virsh
found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results
contain a usable IP. Concretely: keep the script task but remove or limit the
global failed_when:false behavior, then add a follow-up assertion/fail task that
examines vm_ips.results and fails with a clear message when
cluster_vms.stdout_lines | length > 0 and vm_ips.results |
selectattr('stdout','truthy') | list | length == 0 (apply same check for the
other similar block referenced). Ensure the failure path triggers before the
inventory rewrite.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7991a4bc-ac01-4a44-b04f-5780bdc69110
📒 Files selected for processing (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
fonta-rh left a comment:
Two issues worth addressing before merge.
1. Hostname regex misses arbiter and kcli VMs (resolve_vm_ip.sh:10)
The regex _(master|worker)_([0-9]+)$ only matches dev-scripts master/worker naming. Arbiter VMs (ostest_arbiter_0) and kcli VMs (tnt-cluster-ctlplane-0) don't match, so EXPECTED_HOSTNAME stays empty and the hostname-preference tier — the core fix in this PR — is skipped entirely for those VMs.
if [[ "$VM_NAME" =~ _(master|worker|arbiter)_([0-9]+)$ ]]; then
EXPECTED_HOSTNAME="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
elif [[ "$VM_NAME" =~ -(ctlplane|arbiter)-?([0-9]*)$ ]]; then
EXPECTED_HOSTNAME="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
fi

2. Dual-stack clusters get non-deterministic address family (resolve_vm_ip.sh, lease_ip_for_mac)
With IP_STACK=v4v6, virsh net-dhcp-leases returns both IPv4 and IPv6 rows for the same MAC. The script picks whichever appears first — no address family preference. This can produce mixed-family entries in inventory.ini (one node IPv4, another IPv6).
A simple fix: filter by $4 (Protocol column) in the awk, defaulting to ipv4 with fallback to ipv6.
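That protocol preference can be sketched in awk against sample dual-stack lease rows (column layout assumed to match virsh net-dhcp-leases: expiry takes two fields, then MAC, protocol, IP/CIDR, hostname; the MAC and addresses here are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sample dual-stack leases for one MAC.
leases='2024-01-01 00:00:00  52:54:00:aa:bb:01  ipv6  fd2e:6f44::14/64   master-0
2024-01-01 00:00:00  52:54:00:aa:bb:01  ipv4  192.168.111.20/24  master-0'
mac="52:54:00:AA:BB:01"   # mixed case on purpose; the match is case-insensitive

ip=$(printf '%s\n' "$leases" | awk -v mac="$mac" '
  tolower($3) == tolower(mac) {
    if ($4 == "ipv4" && v4 == "") v4 = $5   # remember first ipv4 row
    if ($4 == "ipv6" && v6 == "") v6 = $5   # remember first ipv6 row
  }
  END { print (v4 != "" ? v4 : v6) }')
ip="${ip%%/*}"            # strip the CIDR suffix
echo "$ip"                # -> 192.168.111.20
```

Whichever row virsh emits first, the ipv4 lease wins; the ipv6 lease is only used when no ipv4 row exists for that MAC.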
Force-pushed: 3051452 to b94c146
fonta-rh left a comment:
Previous findings addressed — hostname regex and dual-stack preference both look good.
One small remaining issue: when a kcli VM has no numeric suffix (e.g. tnt-cluster-arbiter), BASH_REMATCH[2] is empty and EXPECTED_HOSTNAME becomes "arbiter-" (trailing hyphen). The DHCP hostname would be "arbiter", so the hostname-preference tier silently misses.
elif [[ "$VM_NAME" =~ -(ctlplane|arbiter)-?([0-9]*)$ ]]; then
if [[ -n "${BASH_REMATCH[2]}" ]]; then
EXPECTED_HOSTNAME="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
else
EXPECTED_HOSTNAME="${BASH_REMATCH[1]}"
fi
fi
elif [[ "$VM_NAME" =~ -(ctlplane|arbiter)-?([0-9]*)$ ]]; then
  EXPECTED_HOSTNAME="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
fi
When a kcli VM has no numeric suffix (e.g. tnt-cluster-arbiter from prepare.yml:31), BASH_REMATCH[2] is empty and EXPECTED_HOSTNAME becomes "arbiter-" (trailing hyphen). The DHCP hostname would be "arbiter", so $6 == host in tier 1 never matches.
In practice tier 2 ($6 != "-") covers this correctly — same MAC filter, same proto preference, same result. So the runtime impact is zero, but it does make tier 1 dead code for numberless VMs.
Suggested change:
elif [[ "$VM_NAME" =~ -(ctlplane|arbiter)-?([0-9]*)$ ]]; then
  if [[ -n "${BASH_REMATCH[2]}" ]]; then
    EXPECTED_HOSTNAME="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
  else
    EXPECTED_HOSTNAME="${BASH_REMATCH[1]}"
  fi
fi
I think this is addressed now if you want to double-check
Add resolve_vm_ip.sh to map virsh VM names to lease hostnames (dev-scripts master/worker/arbiter, kcli ctlplane/arbiter) and pick the matching row per MAC. Prefer ipv4 over ipv6 when dual-stack leases list both. Parse lease output in awk (no grep under set -e pipefail). Wire the script into update-cluster-inventory.yml so inventory IPs stay stable when anonymous DUID rows appear. Made-with: Cursor
Force-pushed: b94c146 to 1f3a758
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-21: The VM IP discovery script was changed to DHCP-only and now
drops hosts that lack leases; update resolve_vm_ip.sh so it first attempts the
previous neighbor-table/ARP lookup (e.g., ip neigh / arp cache lookup or the
earlier neighbor-table code) and only falls back to parsing virsh
net-dhcp-leases if the neighbor lookup fails, returning a valid IP and exiting 0
when any method finds an address; ensure the script only exits non-zero when
neither neighbor-table nor DHCP returns an address so the playbook task ("Get VM
IP addresses from DHCP leases (hostname preferred)" / script: resolve_vm_ip.sh)
and the downstream parsed_vm_entries filter will receive those non-DHCP IPs
instead of silently omitting them.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 088ef9ee-08af-47b6-8688-7c28bcda4232
📒 Files selected for processing (2)
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
- name: Get VM IP addresses from DHCP leases (hostname preferred)
  script: resolve_vm_ip.sh {{ item | quote }}
Regression: VM discovery is now DHCP-only.
Line 21 now delegates to resolve_vm_ip.sh, but deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh:84-94 only returns an address from virsh net-dhcp-leases and exits 1 otherwise. That drops the previous neighbor-table fallback, so guests with expired/missing leases or SLAAC/static addressing will be silently omitted from parsed_vm_entries because this task and the later filter both ignore failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`
around lines 20 - 21, The VM IP discovery script was changed to DHCP-only and
now drops hosts that lack leases; update resolve_vm_ip.sh so it first attempts
the previous neighbor-table/ARP lookup (e.g., ip neigh / arp cache lookup or the
earlier neighbor-table code) and only falls back to parsing virsh
net-dhcp-leases if the neighbor lookup fails, returning a valid IP and exiting 0
when any method finds an address; ensure the script only exits non-zero when
neither neighbor-table nor DHCP returns an address so the playbook task ("Get VM
IP addresses from DHCP leases (hostname preferred)" / script: resolve_vm_ip.sh)
and the downstream parsed_vm_entries filter will receive those non-DHCP IPs
instead of silently omitting them.
This was intended for reliability purposes. The neighbor table had multiple entries, so finding the right peer IP requires knowing which IP is actually bound to the intended peer.
@jaypoulz, that makes sense — with multiple neighbor-table entries for a given MAC, there's no reliable way to know which IP is actually bound to the target VM, whereas the DHCP lease with hostname matching gives a deterministic correlation. Thanks for the clarification!
/lgtm

/hold remove
libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease.
Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.
Made-with: Cursor
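The three-tier preference described in the commit message can be sketched as one awk pass over sample lease rows (hypothetical MAC, addresses, and hostname; "-" is how net-dhcp-leases renders a missing hostname):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two leases for the same MAC: an anonymous-DUID row (hostname '-') and a named one.
leases='2024-01-01 00:00:00  52:54:00:aa:bb:01  ipv4  192.168.111.30/24  -
2024-01-01 00:00:00  52:54:00:aa:bb:01  ipv4  192.168.111.20/24  master-0'
mac="52:54:00:aa:bb:01"
host="master-0"           # hostname expected for, e.g., ostest_master_0

ip=$(printf '%s\n' "$leases" | awk -v mac="$mac" -v host="$host" '
  tolower($3) != tolower(mac) { next }
  $6 == host && t1 == "" { t1 = $5 }   # tier 1: hostname matches the VM
  $6 != "-"  && t2 == "" { t2 = $5 }   # tier 2: any lease with a hostname
  t3 == ""               { t3 = $5 }   # tier 3: any lease for this MAC
  END { print (t1 != "" ? t1 : (t2 != "" ? t2 : t3)) }')
echo "${ip%%/*}"          # -> 192.168.111.20
```

The anonymous row is recorded only as the tier-3 fallback, so the hostname-matched lease wins and the inventory IP stays stable when duplicate DUID rows appear.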