fix(e2e): retry localdns restart in cold-start validator #8649
jingwenw15 wants to merge 4 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a bounded-retry wrapper around systemctl restart localdns in the localdns cold-start validator to reduce test flakiness from transient restart failures during async systemd/network cleanup.
Changes:
- Introduce `restart_localdns_with_retry` helper with logging and bounded retries.
- Replace direct `systemctl restart localdns` calls with the retry helper across the cold-start validation flow.
        echo "$1" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
    }

    # Helper: restart localdns with bounded retries.
The number of failures we are getting should be investigated instead of adding a retry; this just hides the root cause. If you need to wait, isn't it best to wait for stability?
A single restart can fail transiently. What's the underlying root cause of these failures? Can we focus on reducing flakiness instead?
Updated in 60d02fc. I removed the restart retry and changed the validator to an explicit stop -> wait-for-teardown -> start flow.
The intermittent failure was at the stop/start boundary of the test, not because LocalDNS could not work with an empty hosts file. localdns.service runs localdns.sh, which tears down DNS/network state during stop. Part of that teardown converges asynchronously outside the unit, especially around networkctl reload, /run/systemd/resolve/resolv.conf refresh, and dummy interface removal. A start issued immediately after stop could race that teardown and fail transiently.
The validator now waits for those teardown signals directly before starting again: the unit is no longer active/activating/deactivating, 169.254.10.10 is gone from /run/systemd/resolve/resolv.conf, and the localdns dummy interface has been removed. So the change is aimed at reducing the flakiness at the underlying stability boundary rather than masking it with retries.
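A minimal sketch of the teardown wait described above, assuming the three signals are checked by polling. The function names, the attempt/delay parameters, and the `LOCALDNS_IP`, `RESOLV_CONF`, and `LOCALDNS_IFACE` variables are illustrative assumptions, not the code from the commit.

```shell
# Hypothetical sketch; names and defaults are assumptions, not the PR's code.
LOCALDNS_IP="${LOCALDNS_IP:-169.254.10.10}"
RESOLV_CONF="${RESOLV_CONF:-/run/systemd/resolve/resolv.conf}"
LOCALDNS_IFACE="${LOCALDNS_IFACE:-localdns}"

# True once the unit is no longer active, activating, or deactivating.
unit_is_settled() {
    local state
    # `systemctl is-active` prints the state and exits non-zero unless active.
    state="$(systemctl is-active localdns 2>/dev/null || true)"
    case "$state" in
        active|activating|deactivating) return 1 ;;
        *) return 0 ;;
    esac
}

# True once the LocalDNS address is absent from the resolved resolv.conf.
# -w avoids matching a longer address such as 169.254.10.100.
resolv_conf_clean() {
    ! grep -qwF "$LOCALDNS_IP" "$RESOLV_CONF" 2>/dev/null
}

# True once the dummy interface no longer exists.
iface_gone() {
    ! ip link show "$LOCALDNS_IFACE" >/dev/null 2>&1
}

# Poll until all three teardown signals hold, or give up after N checks.
wait_for_localdns_teardown() {
    local attempts="${1:-30}" delay="${2:-1}" i=0
    while [ "$i" -lt "$attempts" ]; do
        if unit_is_settled && resolv_conf_clean && iface_gone; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "localdns teardown did not settle after ${attempts} checks" >&2
    return 1
}
```

Under these assumptions, the cold start would read roughly as `systemctl stop localdns && wait_for_localdns_teardown 30 1 && systemctl start localdns`. Unlike a retry, a timeout here fails the test with a message pointing at the teardown boundary.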
@djsly Updated in a10ac1b. I agreed the blind retry was not the best shape for this. The validator no longer uses a one-shot The underlying issue in the flaky cases was the stop/start boundary, not LocalDNS behavior with an empty hosts file. The new wait focuses on that stability boundary directly: before starting again, the validator now waits for the unit to leave active/deactivating state, for
What this PR does / why we need it:
This PR hardens `ValidateLocalDNSHostsPluginColdStart` against a false-negative cold-start failure in the E2E itself.

The cold-start validator needs to exercise the case where LocalDNS starts with an empty `/etc/localdns/hosts` file and later picks up populated entries via the hosts plugin reload path. The flaky part was not LocalDNS behavior with an empty hosts file; it was the validator's forced service restart path.

`localdns.service` runs a wrapper script that performs multi-step teardown and startup work, including DNS drop-in cleanup, `networkctl reload`, dummy interface teardown/recreate, and readiness gating. Some of that state converges asynchronously outside the unit, especially around DNS/network refresh.

Instead of using a one-shot `systemctl restart localdns` and retrying on failure, this change now cold-starts LocalDNS explicitly by:
- stopping `localdns`
- waiting for teardown to stabilize
- starting `localdns`

The teardown stability check waits for the unit to leave active/deactivating state, for `169.254.10.10` to disappear from `/run/systemd/resolve/resolv.conf`, and for the `localdns` dummy interface to be removed. This directly targets the flake instead of masking it with a blind retry.

This PR only changes the E2E validator. It does not change LocalDNS product behavior, bootstrap order, or scriptless CSE timing.
Which issue(s) this PR fixes:
N/A (test-only flake)