Avoid dropping metrics when forceFlush() races with the periodic export #8437

Open
mahitha-ada wants to merge 1 commit into open-telemetry:main from mahitha-ada:fix/8433-forceflush-drops-metrics

Conversation

@mahitha-ada

Avoid dropping metrics when forceFlush() races with the periodic export

Fixes #8433

What's the problem this PR is trying to solve?

PeriodicMetricReader.forceFlush() delegates to Scheduled.doRun(), which acquires the export slot via exportAvailable.compareAndSet(true, false). When a force flush happens while the periodic export (or a previous force flush) is still in progress, doRun() takes the else branch:

} else {
  logger.log(Level.FINE, "Exporter busy. Dropping metrics.");
  flushResult.fail();
}

So the force flush logs "Exporter busy. Dropping metrics." at FINE level and returns a failed result, dropping the very metrics the caller asked to flush. The reporter saw this happen on a recurring basis (~0.2% of force flushes on a health-check path that flushes every few seconds), with no exception surfaced to the forceFlush() API.
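
To make the contention concrete, here is a minimal, self-contained sketch of a compareAndSet-guarded export slot. `ExportSlotDemo` and its boolean return value are hypothetical stand-ins invented for illustration; only the `exportAvailable.compareAndSet(true, false)` check comes from the description above, and the real `Scheduled.doRun()` does more (collection, logging, result codes).

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the export slot; not the SDK's actual class. */
final class ExportSlotDemo {
  private final AtomicBoolean exportAvailable = new AtomicBoolean(true);

  /** Returns true if this caller won the slot and ran the export, false if it was dropped. */
  boolean doRun(Runnable export) {
    if (exportAvailable.compareAndSet(true, false)) {
      try {
        export.run();
        return true;
      } finally {
        exportAvailable.set(true); // release the slot once the export finishes
      }
    }
    return false; // corresponds to the "Exporter busy. Dropping metrics." branch
  }
}
```

Any call to `doRun` that arrives while another export still holds the slot, whether from a second thread or from inside the running export, takes the `false` path without waiting, which is the behavior this PR changes for force flushes.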

What is the proposed solution?

Make forceFlush() wait for the in-flight export to finish and then retry, rather than dropping. This mirrors the pattern shutdown() already uses: it joins on flushInProgress before performing its final collection so it doesn't lose the last batch.

  • Scheduled.doRun() is split into:
    • tryDoRun() — returns null when an export is already in progress (instead of dropping).
    • doRun() — preserves the previous drop-and-fail behavior for the periodic schedule (which simply retries on the next tick, so dropping a contended periodic collection is harmless).
  • forceFlush() calls tryDoRun(). On contention (null), it chains off flushInProgress.whenComplete(...) and retries, so the flush reflects the latest metrics.

No public API changes; the periodic path behavior is unchanged.
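
The split described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: it uses the JDK's `CompletableFuture<Boolean>` in place of the SDK's result type, the `exporter` supplier and the class name `FlushRetrySketch` are assumptions, and metric collection, logging, and scheduling are omitted. The method and field names (`tryDoRun`, `doRun`, `forceFlush`, `exportAvailable`, `flushInProgress`) follow the PR text.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Simplified model of the proposed flow; names mirror the PR text, bodies are assumptions. */
final class FlushRetrySketch {
  private final AtomicBoolean exportAvailable = new AtomicBoolean(true);
  private volatile CompletableFuture<Boolean> flushInProgress =
      CompletableFuture.completedFuture(true);
  private final Supplier<CompletableFuture<Boolean>> exporter; // hypothetical async export

  FlushRetrySketch(Supplier<CompletableFuture<Boolean>> exporter) {
    this.exporter = exporter;
  }

  /** Returns null when an export is already in flight, instead of dropping. */
  CompletableFuture<Boolean> tryDoRun() {
    if (!exportAvailable.compareAndSet(true, false)) {
      return null;
    }
    CompletableFuture<Boolean> result = new CompletableFuture<>();
    flushInProgress = result;
    exporter.get().whenComplete((ok, err) -> {
      exportAvailable.set(true); // release the slot first so a waiting retry can acquire it
      result.complete(err == null && Boolean.TRUE.equals(ok));
    });
    return result;
  }

  /** Periodic path: keeps the drop-and-fail behavior on contention. */
  CompletableFuture<Boolean> doRun() {
    CompletableFuture<Boolean> result = tryDoRun();
    return result != null ? result : CompletableFuture.completedFuture(false);
  }

  /** Force-flush path: waits for the in-flight export, then retries. */
  CompletableFuture<Boolean> forceFlush() {
    CompletableFuture<Boolean> result = tryDoRun();
    if (result != null) {
      return result;
    }
    CompletableFuture<Boolean> retried = new CompletableFuture<>();
    flushInProgress.whenComplete((ignoredOk, ignoredErr) ->
        forceFlush().whenComplete((ok, err) ->
            retried.complete(err == null && Boolean.TRUE.equals(ok))));
    return retried;
  }
}
```

In this model, a second `forceFlush()` issued while the first export is pending returns a future that stays incomplete until the first export finishes, then triggers its own export, while `doRun()` under the same contention still completes immediately with `false`.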

Tests

Adds forceFlush_whileExportInFlight_waitsAndExportsLatest, which starts a blocking export, fires a second forceFlush() that collides with it, and asserts the second flush:

  1. does not complete/fail while the export is in flight, and
  2. once the in-flight export is released, succeeds and performs its own export of the latest metrics (export count = 2).

This test fails against the current code (the second flush fails immediately with "Dropping metrics") and passes with the fix. All 18 existing PeriodicMetricReaderTest cases continue to pass (19 total).

Verified locally: ./gradlew :sdk:metrics:test --tests "*PeriodicMetricReaderTest" → 19 passed, 0 failed; ./gradlew :sdk:metrics:spotlessApply clean.

@mahitha-ada mahitha-ada requested a review from a team as a code owner May 30, 2026 02:47
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 30, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: mahitha-ada / name: Mahitha Adapa (4d50446)

@codecov

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.00%. Comparing base (2f1d950) to head (54dafb3).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8437      +/-   ##
============================================
+ Coverage     90.96%   91.00%   +0.04%     
- Complexity     7809     7814       +5     
============================================
  Files           892      892              
  Lines         23702    23711       +9     
  Branches       2361     2363       +2     
============================================
+ Hits          21561    21579      +18     
+ Misses         1420     1411       -9     
  Partials        721      721              

PeriodicMetricReader.forceFlush() delegated to Scheduled.doRun(), which
acquires the exportAvailable flag with compareAndSet(true, false). When a
force flush happened while the periodic export (or a previous force flush)
was still in progress, doRun() took the else branch, logged 'Exporter
busy. Dropping metrics.' and returned a failed result -- silently dropping
the metrics the caller asked to flush.

Make forceFlush() wait for the in-flight export to complete and then
retry, mirroring how shutdown() already uses flushInProgress to wait for
an in-flight export before its final collection. doRun() is split into
tryDoRun(), which returns null when an export is already in progress, and
doRun(), which preserves the previous drop-and-fail behavior for the
periodic schedule (it retries on the next tick). forceFlush() uses
tryDoRun() and, on contention, chains off flushInProgress to retry.

Adds a regression test that fails without this change: a forceFlush()
racing an in-flight export now waits and exports the latest metrics
instead of dropping them.

Fixes open-telemetry#8433
@mahitha-ada mahitha-ada force-pushed the fix/8433-forceflush-drops-metrics branch from 4d50446 to 54dafb3 Compare May 30, 2026 04:13

Development

Successfully merging this pull request may close these issues.

Force flush conflicts with periodic background exporting
