[CCIP-11717..11727] Fix CCIPReader test 12m-timeout hang: dedicated DB per parallel test#22970

Draft
KodeyThomas wants to merge 1 commit into develop from fix/ccip-reader-test-parallel-db-hang

KodeyThomas (Member) commented Jun 26, 2026

Problem

8 sibling flaky-test tickets (CCIP-11717/11718/11719/11720/11721/11723/11724/11727), all t.Parallel() tests in integration-tests/smoke/ccip/ccip_reader_test.go, share one root cause: the package intermittently hangs with panic: test timed out after 12m0s.

A CI goroutine dump showed logPoller.pollAndSaveLogs → InsertBlocks stuck in a pgx network read for 11 minutes holding conn.Mutex, 6 other LogPoller DB goroutines blocked behind it, and logPoller.Close() (in t.Cleanup) blocked forever on its WaitGroup.

Root cause

These tests get their LogPoller DB from pgtest.NewSqlxDB, backed by the txdb driver. Under txdb every test in the package shares one physical Postgres database and tables — each caller only gets its own uncommitted transaction, not its own tables. The tests run in parallel and reuse the same evm_chain_id (chainD) while their simulated backends mint blocks 1,2,3…, so parallel tests insert the same (block_number, evm_chain_id) primary key into the shared evm.log_poller_blocks table (PRIMARY KEY (block_number, evm_chain_id), 0115_log_poller.sql).

The duplicate INSERT … ON CONFLICT DO NOTHING takes a speculative-insert lock that waits on another test's never-committed txdb transaction: Postgres cannot tell whether the row conflicts until that transaction commits or rolls back. txdb's conn.ExecContext/QueryContext discard the caller's context, so the blocked insert never times out, holds conn.Mutex, serializes the other LogPoller goroutines, and logPoller.Close() hangs → the 12m package timeout.

Fix

Replace pgtest.NewSqlxDB with heavyweight.FullTestDBV2 (via a small newHeavyTestDB helper) at every DB-acquisition site. Each test now gets its own migrated database (no cross-test PK contention) and a real connection that honors query timeouts. This mirrors the existing usage in ccip_reader_bench_test.go.
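
Why a dedicated database removes the contention can be shown with an in-memory stand-in for evm.log_poller_blocks. This is only a sketch: `blockTable` and `insertBlocks` are hypothetical, the chain ID 1337 is an assumed value, and the code counts overlapping keys sequentially rather than modeling Postgres locking.

```go
package main

import "fmt"

// blockKey mirrors the composite primary key (block_number, evm_chain_id).
type blockKey struct {
	blockNumber int64
	evmChainID  int64
}

// blockTable is an in-memory stand-in for evm.log_poller_blocks.
type blockTable map[blockKey]string

// insertBlocks writes blocks 1..n for chainID on behalf of owner and returns
// how many keys were already present, i.e. how many inserts would conflict.
func insertBlocks(tbl blockTable, owner string, chainID, n int64) int {
	conflicts := 0
	for b := int64(1); b <= n; b++ {
		k := blockKey{b, chainID}
		if _, taken := tbl[k]; taken {
			conflicts++
			continue
		}
		tbl[k] = owner
	}
	return conflicts
}

func main() {
	const chainID, blocks = 1337, 3 // assumed values for illustration

	// Shared table (the txdb situation): both tests write the same keys.
	shared := blockTable{}
	c := insertBlocks(shared, "testA", chainID, blocks)
	c += insertBlocks(shared, "testB", chainID, blocks)
	fmt.Println("shared table conflicts:", c) // prints 3

	// Dedicated table per test: identical keys never meet.
	c = insertBlocks(blockTable{}, "testA", chainID, blocks)
	c += insertBlocks(blockTable{}, "testB", chainID, blocks)
	fmt.Println("dedicated table conflicts:", c) // prints 0
}
```

With one table per test, the same chain ID and block numbers can be reused freely, which is what giving each test its own migrated database achieves.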

Validation

Local diagnose harness (scoped to the CI test set to exclude unrelated Aptos/E2E tests):

  • TestCCIPReader* (10 parallel tests = max chain-4 collision pressure): 10/10 iterations green, 0 broken/timeout/slow, p50 35s.
  • Test_Get* (MemoryEnvironment group): 5/5 iterations green, 0 broken/timeout/slow, p50 45s.

Zero hangs across 15 collision-forcing iterations, versus the prior 12-minute timeout.

Fixes CCIP-11717, CCIP-11718, CCIP-11719, CCIP-11720, CCIP-11721, CCIP-11723, CCIP-11724, CCIP-11727.

…B per parallel test

The parallel CCIPReader tests obtained their LogPoller DB via pgtest.NewSqlxDB,
which is backed by the txdb driver. Under txdb every test in the package shares
one physical Postgres database and tables — each caller only gets its own
uncommitted transaction, not its own tables. The tests run with t.Parallel()
and reuse the same evm_chain_id (chainD) while their simulated backends mint
blocks 1,2,3..., so parallel tests insert the same (block_number, evm_chain_id)
primary key into the shared evm.log_poller_blocks table
(PRIMARY KEY (block_number, evm_chain_id), 0115_log_poller.sql).

The duplicate INSERT ... ON CONFLICT DO NOTHING takes a speculative-insert lock
that blocks on another test's never-committed txdb transaction. txdb's
conn.ExecContext/QueryContext discard the caller context, so the blocked
LogPoller insert never times out, holds conn.Mutex, serializes the other
LogPoller goroutines, and logPoller.Close() blocks on its WaitGroup forever ->
panic: test timed out after 12m0s (whole package fails).

Replace pgtest.NewSqlxDB with heavyweight.FullTestDBV2 (via a newHeavyTestDB
helper) at every site so each test gets its own migrated database (no cross-test
PK contention) and a real connection that honors query timeouts. Mirrors the
existing usage in ccip_reader_bench_test.go.

Validated with the diagnose harness (sandbox disabled): scoped to the CI test
set, TestCCIPReader* 10/10 iterations green (p50 35s) and Test_Get* 5/5 green
(p50 45s) — zero hangs vs the prior 12m timeout.

Fixes: CCIP-11717, CCIP-11718, CCIP-11719, CCIP-11720, CCIP-11721, CCIP-11723,
CCIP-11724, CCIP-11727

@github-actions (Contributor)

✅ No conflicts with other open PRs targeting develop

@trunk-io

trunk-io Bot commented Jun 26, 2026


@KodeyThomas KodeyThomas marked this pull request as draft June 26, 2026 09:49

@pavel-raykov (Collaborator)

Sorry, I don't follow "The duplicate INSERT ... ON CONFLICT takes a speculative-insert lock that blocks on the other test's never-committed txdb transaction, and txdb strips the per-query context deadline". Are you saying that the Postgres DB cannot properly serialize concurrent requests?
