Skip to content

[FLINK-39931] Fix flaky AdaptiveBatchSchedulerTest due to unexpected ComponentMainThreadExecutor behavior in async execution#28481

Open
ocb3916 wants to merge 4 commits into
apache:masterfrom
ocb3916:FLINK-39931
Open

[FLINK-39931] Fix flaky AdaptiveBatchSchedulerTest due to unexpected ComponentMainThreadExecutor behavior in async execution#28481
ocb3916 wants to merge 4 commits into
apache:masterfrom
ocb3916:FLINK-39931

Conversation

@ocb3916

@ocb3916 ocb3916 commented Jun 18, 2026

Copy link
Copy Markdown

What is the purpose of the change

This PR fixes flaky failures in AdaptiveBatchSchedulerTest (FLINK-38970) caused by main-thread constraint violations when CompletableFuture callbacks are dispatched from background IO executor threads.

The test previously used SynchronousComponentMainThreadExecutorServiceAdapter, which enforces strict thread-identity checks via assertRunningInMainThread(). Because some CompletableFuture callbacks in the code under test can be completed from IO threads rather than the main thread, this strict check occasionally failed nondeterministically.

The fix replaces it with NoMainThreadCheckComponentMainThreadExecutor, which preserves synchronous execution semantics (tasks still run immediately on the calling thread) while skipping the thread-identity assertion, eliminating the race condition without weakening the actual scheduling behavior under test.


Brief change log

  • AdaptiveBatchSchedulerTest.java
    • Replaced import org.apache.flink.runtime.concurrent.ComponentMainThreadExecutorServiceAdapter with org.apache.flink.runtime.concurrent.NoMainThreadCheckComponentMainThreadExecutor
    • Removed now-unused imports: org.apache.flink.runtime.testutils.DirectScheduledExecutorService, java.util.concurrent.Callable, java.util.concurrent.ScheduledFuture, java.util.concurrent.TimeUnit
    • In setUp(): replaced mainThreadExecutor = new SynchronousComponentMainThreadExecutor(); with mainThreadExecutor = new NoMainThreadCheckComponentMainThreadExecutor();, with a comment explaining the FLINK-38970 rationale
    • Removed the private inner class SynchronousComponentMainThreadExecutor — this logic is now provided by the shared NoMainThreadCheckComponentMainThreadExecutor utility instead of being duplicated locally

Verifying this change

Run the following test cases and confirm all repetitions pass without failure:

# AdaptiveBatchSchedulerTest
./mvnw test -pl flink-runtime \
  -Dtest=AdaptiveBatchSchedulerTest#testVertexInitializationFailureIsLabeled \
  -Denforcer.skip=true

# ExecutionTimeBasedSlowTaskDetectorTest
./mvnw test -pl flink-runtime \
  -Dtest=ExecutionTimeBasedSlowTaskDetectorTest#testUnbalancedInput \
  -Denforcer.skip=true
image image

Expected result: all 10,000 repetitions of testVertexInitializationFailureIsLabeled and all 1,000 repetitions of testUnbalancedInput complete successfully, with no main-thread constraint violation failures.


Does this pull request potentially affect one of the following parts?

  • Dependencies (flink-dist or flink-parent pom)
  • The public API, i.e. changes to interfaces or classes that are @Public / @PublicEvolving
  • Test infrastructure / test execution stability — Replaces the main-thread executor used in test setup to fix flaky behavior; does not affect production scheduling logic.
  • Documentation — No documentation changes are required. This PR only modifies test code, test executor setup, and IDE configuration files. No user-facing behavior is altered.
Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

@flinkbot

flinkbot commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@ocb3916 ocb3916 changed the title [Flink 39931][runtime/tests] Verify stability of testVertexInitializationFailureIsLabeled and testUnbalancedInput with RepeatedTest [FLINK-39931][runtime/tests] Verify stability of testVertexInitializationFailureIsLabeled and testUnbalancedInput with RepeatedTest Jun 18, 2026
@ocb3916 ocb3916 force-pushed the FLINK-39931 branch 2 times, most recently from f99a293 to c14c61c Compare June 18, 2026 12:25
@ocb3916 ocb3916 changed the title [FLINK-39931][runtime/tests] Verify stability of testVertexInitializationFailureIsLabeled and testUnbalancedInput with RepeatedTest [FLINK-39931] Verify stability of testVertexInitializationFailureIsLabeled and testUnbalancedInput with RepeatedTest ---> Fix flaky AdaptiveBatchSchedulerTest due to unexpected ComponentMainThreadExecutor behavior in async execution Jun 18, 2026
@ocb3916 ocb3916 changed the title [FLINK-39931] Verify stability of testVertexInitializationFailureIsLabeled and testUnbalancedInput with RepeatedTest ---> Fix flaky AdaptiveBatchSchedulerTest due to unexpected ComponentMainThreadExecutor behavior in async execution [FLINK-39931] Fix flaky AdaptiveBatchSchedulerTest due to unexpected ComponentMainThreadExecutor behavior in async execution Jun 18, 2026
@ocb3916 ocb3916 closed this Jun 18, 2026
@ocb3916 ocb3916 reopened this Jun 20, 2026
@ocb3916 ocb3916 force-pushed the FLINK-39931 branch 7 times, most recently from 6023c28 to dd03175 Compare June 20, 2026 06:39
@ocb3916 ocb3916 marked this pull request as draft June 20, 2026 06:52
@ocb3916 ocb3916 closed this Jun 20, 2026
@ocb3916 ocb3916 reopened this Jun 20, 2026
… to unexpected ComponentMainThreadExecutor behavior in async execution

Co-authored-by: Yuepeng Pan <hipanyuepeng@gmail.com>
@ocb3916 ocb3916 marked this pull request as ready for review June 20, 2026 09:38
@ocb3916

ocb3916 commented Jun 20, 2026

Copy link
Copy Markdown
Author

Hi @RocMarshal! I have completed the PR and it's now ready for your review.
Whenever you have some time, please take a look. Thank you! 🙏

@RocMarshal RocMarshal self-assigned this Jun 20, 2026

@spuru9 spuru9 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is out of sync with the actual diff — just please update the change log to match what's actually here.

ocb3916 and others added 2 commits June 21, 2026 16:51
Co-authored-by: Purushottam Sinha <sinhapurushottam911@gmail.com>
Co-authored-by: Purushottam Sinha <sinhapurushottam911@gmail.com>
// CompletableFuture callbacks are dispatched from background IO executor threads.
// The synchronous execution is preserved while eliminating the race condition.
mainThreadExecutor = new SynchronousComponentMainThreadExecutor();
// No-op thread check avoids flakiness from FLINK-38970: scheduler callbacks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocb3916 I think the suggestion didnt work as intended, might have to do locally.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@github-actions github-actions Bot added the community-reviewed PR has been reviewed by the community. label Jun 21, 2026
Co-authored-by: Purushottam Sinha <sinhapurushottam911@gmail.com>

@spuru9 spuru9 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants