Skip to content

Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion#7386

Open
archy-rock3t-cloud wants to merge 1 commit intocortexproject:masterfrom
sophotechlabs:fix/flaky-blocks-cleaner-heartbeat
Open

Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion#7386
archy-rock3t-cloud wants to merge 1 commit intocortexproject:masterfrom
sophotechlabs:fix/flaky-blocks-cleaner-heartbeat

Conversation

@archy-rock3t-cloud
Copy link
Copy Markdown
Contributor

What this PR does:
Fixes flaky TestBlocksCleaner by ensuring the HeartBeat goroutine completes before the function returns.

Which issue(s) this PR fixes:
Fixes flaky TestBlocksCleaner CI failures.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Artem Muterko <artem@sopho.tech>
Copy link
Copy Markdown
Member

@CharlieTLe CharlieTLe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Overview

This PR fixes a goroutine leak / race in BlocksCleaner. The HeartBeat goroutine is launched to periodically update visit markers, and when the parent function returns, it signals the goroutine to stop via errChan <- nil. However, HeartBeat still does meaningful work after receiving from errChan — it calls MarkWithStatus(ctx, Completed), logs, and (since deleteOnExit=true) calls DeleteVisitMarker. Without this fix, the parent function could return and the test could tear down resources while HeartBeat is still accessing them, causing flaky failures.

The fix adds a doneChan that is closed when HeartBeat returns, and the defer waits on <-doneChan after sending the nil error. This ensures the goroutine has fully completed before the function returns.

Analysis

Correctness: The fix is correct. Both operations (errChan <- nil then <-doneChan) are in the same defer closure, so they execute sequentially in the right order. The errChan is buffered (size 1), so the send is non-blocking even if HeartBeat already exited via ctx.Done(). In that case, doneChan is already closed, so <-doneChan returns immediately. All code paths are safe.

Scope: Minimal and focused — only the two call sites in cleanUpActiveUsers and cleanDeletedUsers are changed, with identical fixes.

Looks good to me. Two minor suggestions:

  1. CHANGELOG: The PR checklist shows CHANGELOG as not updated. Consider adding a [BUGFIX] entry — this addresses a real goroutine leak (not just a test issue). The HeartBeat goroutine could outlive its parent and access stale resources in production too.
  2. Optional refactor: See inline comment about extracting the duplicated pattern into a helper.

defer func() {
errChan <- nil
<-doneChan
}()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This exact pattern is duplicated verbatim in cleanDeletedUsers below. Consider extracting a small helper to reduce the duplication, e.g.:

func runWithHeartBeat(ctx context.Context, mgr *VisitMarkerManager, interval time.Duration, fn func() error) error {
    errChan := make(chan error, 1)
    doneChan := make(chan struct{})
    go func() {
        mgr.HeartBeat(ctx, errChan, interval, true)
        close(doneChan)
    }()
    defer func() {
        errChan <- nil
        <-doneChan
    }()
    return fn()
}

This is optional — the duplication is small and the current code is clear.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Mar 31, 2026
@archy-rock3t-cloud
Copy link
Copy Markdown
Contributor Author

@SungJin1212 @CharlieTLe — friendly bump. CI is green and this has been approved for ~4 weeks. Could one of you flip the merge lever when convenient? Thanks!

@SungJin1212 SungJin1212 requested a review from CharlieTLe April 29, 2026 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size/S type/flaky-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants