Skip to content

fix(e2e): retry localdns restart in cold-start validator#8649

Open
jingwenw15 wants to merge 4 commits into
mainfrom
jingwenwu/localdns-coldstart-restart-retry-pr
Open

fix(e2e): retry localdns restart in cold-start validator#8649
jingwenw15 wants to merge 4 commits into
mainfrom
jingwenwu/localdns-coldstart-restart-retry-pr

Conversation

@jingwenw15

@jingwenw15 jingwenw15 commented Jun 5, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:

This PR hardens ValidateLocalDNSHostsPluginColdStart against a false-negative cold-start failure in the E2E itself.

The cold-start validator needs to exercise the case where LocalDNS starts with an empty /etc/localdns/hosts file and later picks up populated entries via the hosts plugin reload path. The flaky part was not LocalDNS behavior with an empty hosts file; it was the validator's forced service restart path. localdns.service runs a wrapper script that performs multi-step teardown and startup work, including DNS drop-in cleanup, networkctl reload, dummy interface teardown/recreate, and readiness gating. Some of that state converges asynchronously outside the unit, especially around DNS/network refresh.

Instead of using a one-shot systemctl restart localdns and retrying on failure, this change now cold-starts LocalDNS explicitly by:

  • stopping localdns
  • waiting for teardown stability before starting again
  • starting localdns

The teardown stability check waits for the unit to leave active/deactivating state, for 169.254.10.10 to disappear from /run/systemd/resolve/resolv.conf, and for the localdns dummy interface to be removed. This directly targets the flake instead of masking it with a blind retry.

This PR only changes the E2E validator. It does not change LocalDNS product behavior, bootstrap order, or scriptless CSE timing.

Which issue(s) this PR fixes:

N/A (test-only flake)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a bounded-retry wrapper around systemctl restart localdns in the localdns cold-start validator to reduce test flakiness from transient restart failures during async systemd/network cleanup.

Changes:

  • Introduce restart_localdns_with_retry helper with logging and bounded retries.
  • Replace direct systemctl restart localdns calls with the retry helper across the cold-start validation flow.

Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go Outdated
echo "$1" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Helper: restart localdns with bounded retries.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of failures we are getting would be best to be investigated instead of adding a retry... this is just hiding the root cause. if you need to wait, best to wait for stability ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single restart can fail transiently - what's the underlaying root cause of these failures? can we try to focus on reducing flakiness instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 60d02fc. I removed the restart retry and changed the validator to an explicit stop -> wait-for-teardown -> start flow.

The intermittent failure was at the stop/start boundary of the test, not because LocalDNS could not work with an empty hosts file. localdns.service runs localdns.sh, which tears down DNS/network state during stop. Part of that teardown converges asynchronously outside the unit, especially around networkctl reload, /run/systemd/resolve/resolv.conf refresh, and dummy interface removal. A start issued immediately after stop could race that teardown and fail transiently.

The validator now waits for those teardown signals directly before starting again: the unit is no longer active/activating/deactivating, 169.254.10.10 is gone from /run/systemd/resolve/resolv.conf, and the localdns dummy interface has been removed. So the change is aimed at reducing the flakiness at the underlying stability boundary rather than masking it with retries.

@jingwenw15

jingwenw15 commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

@djsly Updated in a10ac1b. I agreed the blind retry was not the best shape for this.

The validator no longer uses a one-shot systemctl restart localdns with retry. It now does an explicit cold start: stop localdns, wait for teardown stability, then start localdns again.

The underlying issue in the flaky cases was the stop/start boundary, not LocalDNS behavior with an empty hosts file. localdns.service runs a wrapper script that tears down DNS/network state during stop. Part of that cleanup converges asynchronously outside the unit, especially around networkctl reload and /run/systemd/resolve/resolv.conf refresh. So the first restart could fail even though the same empty-hosts-file state worked a few seconds later.

The new wait focuses on that stability boundary directly: before starting again, the validator now waits for the unit to leave active/deactivating state, for 169.254.10.10 to disappear from /run/systemd/resolve/resolv.conf, and for the localdns dummy interface to be removed. So this is still a test-only change, but it is now addressing the flake by waiting for the relevant teardown state rather than retrying the same restart call.

Copilot AI review requested due to automatic review settings June 10, 2026 21:54
@jingwenw15 jingwenw15 requested a review from runzhen as a code owner June 10, 2026 21:54

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment thread e2e/validators.go
Comment thread e2e/validators.go
Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants