
test: Fix flaky BasicAuthMSQTest #19593

Open
amaechler wants to merge 7 commits into apache:master from amaechler:fix-basic-auth-msq-propagation-flake

amaechler commented Jun 17, 2026

Contributor

Description

BasicAuthMSQTest is intermittently flaky: a test occasionally fails with 401 Unauthorized instead of the expected 403 Forbidden.

Permission updates made by the tests are only eventually consistent: they propagate to other services (such as the Broker) asynchronously, so the MSQ task in the test can reach the Broker before its auth cache has caught up.

Fix

Retry the task submission while it fails with a transient auth error, so the assertions only run once the Broker's auth cache reflects the test setup. Other failures are not retried, so genuine errors still fail fast. This follows the retry-on-propagation pattern already used by sibling tests (e.g. TLSTest).

Verified by compiling, running checkstyle, and running the test; a fault-injection run that forces a transient 401 then 403 confirms the retries fire and all four tests recover.
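For readers who want the shape of the approach, here is a minimal sketch of a retry that fires only on transient auth errors. It is not the code in this PR: the class `TransientAuthRetrySketch`, the helpers `retryWhile` and `messageHasAny`, and the matched message fragments are all hypothetical, and the real test may bound its retries differently.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from the Druid codebase.
final class TransientAuthRetrySketch
{
  private TransientAuthRetrySketch() {}

  /** Builds a predicate that matches when the message contains any fragment. */
  static Predicate<Throwable> messageHasAny(String... fragments)
  {
    return t -> {
      String message = t.getMessage();
      if (message == null) {
        return false;
      }
      for (String fragment : fragments) {
        if (message.contains(fragment)) {
          return true;
        }
      }
      return false;
    };
  }

  /**
   * Calls the action until it succeeds, the failure is not retryable,
   * or maxAttempts is exhausted. The last exception is rethrown as-is.
   */
  static <T> T retryWhile(
      Callable<T> action,
      Predicate<Throwable> shouldRetry,
      int maxAttempts,
      long sleepMillis
  ) throws Exception
  {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return action.call();
      }
      catch (Exception e) {
        // Non-transient failures and the final attempt fail immediately.
        if (attempt >= maxAttempts || !shouldRetry.test(e)) {
          throw e;
        }
        Thread.sleep(sleepMillis);
      }
    }
  }
}
```

Under these assumptions, a test that expects 403 Forbidden would pass `messageHasAny("401 Unauthorized")` so that the expected 403 is never swallowed, while a test that expects success could also treat a 403 as transient.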

Analysis and implementation done with the help of Claude Code.


This PR has:

  • been self-reviewed.

FrankChen021 left a comment

Member

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 1 of 1 changed files.


This is an automated review by Codex GPT-5.5

Basic-auth state propagates from the Coordinator to other services
asynchronously: an async push from the Coordinator plus a poll on each
service every druid.auth.basic.common.pollingPeriod. The Broker never
reads the security metadata store directly, so there is a window after a
security API call during which its auth cache is stale.

BasicAuthMSQTest creates the test user/role in @BeforeEach and grants
permissions in each test body, then immediately submits an MSQ task to a
Broker. When the request beats the propagation, the test sees:

- 401 Unauthorized instead of the expected 403 Forbidden, when the newly
  created user has not yet propagated (authentication), and
- a transient 403 Forbidden in the positive tests, before the granted
  permission has propagated (authorization).
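The two failure modes above can be reproduced with a toy model. The sketch below is an illustration only and uses no Druid code: `StaleAuthCacheSketch`, its methods, and the `Outcome` enum are hypothetical, and `propagate()` stands in for whichever of the push or the poll reaches the Broker first.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of a stale auth cache; not Druid code.
final class StaleAuthCacheSketch
{
  enum Outcome { UNAUTHORIZED, FORBIDDEN, ACCEPTED }

  // Stands in for the authoritative state behind the security API.
  private final Map<String, Set<String>> coordinatorState = new HashMap<>();
  // Stands in for the Broker's cached copy, which lags behind.
  private Map<String, Set<String>> brokerCache = new HashMap<>();

  void createUser(String user)
  {
    coordinatorState.computeIfAbsent(user, u -> new HashSet<>());
  }

  void grant(String user, String permission)
  {
    coordinatorState.computeIfAbsent(user, u -> new HashSet<>()).add(permission);
  }

  /** Copies the authoritative state into the cache, like a push or poll. */
  void propagate()
  {
    Map<String, Set<String>> copy = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : coordinatorState.entrySet()) {
      copy.put(e.getKey(), new HashSet<>(e.getValue()));
    }
    brokerCache = copy;
  }

  /** Evaluates a request against the cache only, never the authoritative state. */
  Outcome submit(String user, String permission)
  {
    Set<String> permissions = brokerCache.get(user);
    if (permissions == null) {
      return Outcome.UNAUTHORIZED; // user not yet known: authentication fails
    }
    return permissions.contains(permission) ? Outcome.ACCEPTED : Outcome.FORBIDDEN;
  }
}
```

In this model, submitting right after `createUser` yields `UNAUTHORIZED`, submitting after one `propagate()` but before the grant has propagated yields `FORBIDDEN`, and only a submission after the second `propagate()` is `ACCEPTED`.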

Retry the task submission while it fails with these transient auth
errors so the assertions only run once the Broker's auth cache reflects
the test setup. Other failures are not retried, so real errors still
fail fast.

Replace the hand-rolled messageContains cause-chain walk with the
ExceptionMatcher already used for the assertion, so the retry predicate
and the assertion inspect the exception through the same mechanism.
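For context, a cause-chain walk of the kind being replaced might look roughly like the sketch below. This is a hypothetical reconstruction for illustration, not the removed helper and not the ExceptionMatcher API; the class and method names are made up.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical sketch of a hand-rolled cause-chain walk; not the PR's code.
final class CauseChainSketch
{
  private CauseChainSketch() {}

  /** Returns true if the throwable or any of its causes has a message containing the fragment. */
  static boolean anyCauseMessageContains(Throwable throwable, String fragment)
  {
    // Identity-based set guards against a cyclic cause chain.
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    for (Throwable current = throwable;
         current != null && seen.add(current);
         current = current.getCause()) {
      String message = current.getMessage();
      if (message != null && message.contains(fragment)) {
        return true;
      }
    }
    return false;
  }
}
```

A helper like this can disagree with the matcher used in the assertion, for example on how deep it looks or how it treats null messages, which is one reason to route both the retry predicate and the assertion through a single mechanism.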
@amaechler amaechler force-pushed the fix-basic-auth-msq-propagation-flake branch from a78fbab to dfd90f3 Compare June 19, 2026 17:37