[CCIP-11717..11727] Fix CCIPReader test 12m-timeout hang: dedicated DB per parallel test by KodeyThomas · Pull Request #22970 · smartcontractkit/chainlink

KodeyThomas · 2026-06-26T09:23:36Z

Problem

8 sibling flaky-test tickets (CCIP-11717/11718/11719/11720/11721/11723/11724/11727), all t.Parallel() tests in integration-tests/smoke/ccip/ccip_reader_test.go, share one root cause: the package intermittently hangs with panic: test timed out after 12m0s.

A CI goroutine dump showed logPoller.pollAndSaveLogs → InsertBlocks stuck in a pgx network read for 11 minutes holding conn.Mutex, 6 other LogPoller DB goroutines blocked behind it, and logPoller.Close() (in t.Cleanup) blocked forever on its WaitGroup.

Root cause

These tests get their LogPoller DB from pgtest.NewSqlxDB, backed by the txdb driver. Under txdb every test in the package shares one physical Postgres database and tables — each caller only gets its own uncommitted transaction, not its own tables. The tests run in parallel and reuse the same evm_chain_id (chainD) while their simulated backends mint blocks 1,2,3…, so parallel tests insert the same (block_number, evm_chain_id) primary key into the shared evm.log_poller_blocks table (PRIMARY KEY (block_number, evm_chain_id), 0115_log_poller.sql).
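The key collision described above can be reproduced without a database. The following is a minimal sketch, not code from this PR: `blockKey`, `sharedTable`, `insertBlocks`, and the chain ID value are hypothetical names chosen for illustration, and a map stands in for the shared evm.log_poller_blocks table.

```go
package main

import "fmt"

// blockKey mirrors the composite primary key (block_number, evm_chain_id).
type blockKey struct {
	blockNumber int64
	evmChainID  int64
}

// sharedTable is an in-memory stand-in for a table that every test writes to.
// The value records which test inserted the row first.
type sharedTable map[blockKey]string

// insertBlocks inserts blocks 1..n for one test and returns how many keys had
// already been inserted by a different test. In this sketch a conflicting
// insert is simply skipped; in the real setup it waits on the other
// transaction instead.
func insertBlocks(tbl sharedTable, testName string, chainID, n int64) int {
	collisions := 0
	for b := int64(1); b <= n; b++ {
		k := blockKey{blockNumber: b, evmChainID: chainID}
		if owner, ok := tbl[k]; ok && owner != testName {
			collisions++
			continue
		}
		tbl[k] = testName
	}
	return collisions
}

func main() {
	const chainID = 1337 // placeholder value, not the real chainD ID
	tbl := sharedTable{}
	fmt.Println(insertBlocks(tbl, "TestA", chainID, 3)) // 0: table was empty
	fmt.Println(insertBlocks(tbl, "TestB", chainID, 3)) // 3: every key collides
}
```

Because both simulated tests start at block 1 and use the same chain ID, every key the second test produces already exists.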

The duplicate INSERT … ON CONFLICT DO NOTHING takes a speculative-insert lock that waits on another test's never-committed txdb transaction. txdb's conn.ExecContext/QueryContext discard the caller's context, so the blocked insert never times out, holds conn.Mutex, serializes the other LogPoller goroutines, and Close() hangs → the 12m package timeout.
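Why a discarded context turns a lock wait into an unbounded hang can be shown with a channel in place of the Postgres lock. This is a sketch under stated assumptions: `waitForLock`, `waitDiscardingContext`, and `timesOut` are hypothetical helpers, and the second one only models the behavior attributed to txdb above; it is not txdb's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForLock blocks until the lock is released or ctx is done,
// whichever happens first.
func waitForLock(ctx context.Context, released <-chan struct{}) error {
	select {
	case <-released:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// waitDiscardingContext models a driver that ignores the caller's context:
// it waits with context.Background(), so the caller's deadline is never seen.
func waitDiscardingContext(_ context.Context, released <-chan struct{}) error {
	return waitForLock(context.Background(), released)
}

// timesOut reports whether wait returns context.DeadlineExceeded when the
// lock is never released and the caller supplies a 20ms deadline. It gives up
// observing after 200ms and reports false if wait is still blocked.
func timesOut(wait func(context.Context, <-chan struct{}) error) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()

	neverReleased := make(chan struct{}) // stands in for the never-committed transaction
	done := make(chan error, 1)
	go func() { done <- wait(ctx, neverReleased) }()

	select {
	case err := <-done:
		return errors.Is(err, context.DeadlineExceeded)
	case <-time.After(200 * time.Millisecond):
		return false // still blocked; the goroutine leaks, which is acceptable in a demo
	}
}

func main() {
	fmt.Println("honoring context times out:", timesOut(waitForLock))
	fmt.Println("discarding context times out:", timesOut(waitDiscardingContext))
}
```

The honoring variant returns as soon as the deadline passes, while the discarding variant stays blocked until something else releases the lock, which in the scenario above never happens.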

Fix

Replace pgtest.NewSqlxDB with heavyweight.FullTestDBV2 (via a small newHeavyTestDB helper) at every DB-acquisition site. Each test now gets its own migrated database (no cross-test PK contention) and a real connection that honors query timeouts. This mirrors the existing usage in ccip_reader_bench_test.go.
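The effect of giving each parallel test its own database can be sketched with in-memory stores. This is an illustration only: `store`, `runParallel`, `sharedConflicts`, and `dedicatedConflicts` are hypothetical names, a mutex-guarded map stands in for a database, and none of it reflects how heavyweight.FullTestDBV2 or newHeavyTestDB is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

type blockKey struct {
	blockNumber int64
	evmChainID  int64
}

// store is an in-memory stand-in for one database's blocks table.
type store struct {
	mu   sync.Mutex
	rows map[blockKey]struct{}
}

func newStore() *store { return &store{rows: map[blockKey]struct{}{}} }

// insert reports false when the key already exists (a primary-key conflict).
func (s *store) insert(k blockKey) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.rows[k]; exists {
		return false
	}
	s.rows[k] = struct{}{}
	return true
}

// runParallel starts one goroutine per test. Each inserts blocks 1..blocks
// for the same chain ID into the store returned by storeFor. The result is
// the total number of primary-key conflicts observed.
func runParallel(tests int, blocks int64, storeFor func(test int) *store) int {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		conflicts int
	)
	for i := 0; i < tests; i++ {
		wg.Add(1)
		go func(test int) {
			defer wg.Done()
			s := storeFor(test)
			for b := int64(1); b <= blocks; b++ {
				if !s.insert(blockKey{blockNumber: b, evmChainID: 1}) {
					mu.Lock()
					conflicts++
					mu.Unlock()
				}
			}
		}(i)
	}
	wg.Wait()
	return conflicts
}

// sharedConflicts hands every test the same store.
func sharedConflicts(tests int, blocks int64) int {
	shared := newStore()
	return runParallel(tests, blocks, func(int) *store { return shared })
}

// dedicatedConflicts hands every test a fresh store of its own.
func dedicatedConflicts(tests int, blocks int64) int {
	return runParallel(tests, blocks, func(int) *store { return newStore() })
}

func main() {
	fmt.Println("shared store conflicts:", sharedConflicts(10, 3))       // 27
	fmt.Println("dedicated store conflicts:", dedicatedConflicts(10, 3)) // 0
}
```

With 10 tests and 3 blocks, the shared store accepts each of the 3 keys once and rejects the other 27 attempts, while dedicated stores never conflict.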

Validation

Local diagnose harness (scoped to the CI test set to exclude unrelated Aptos/E2E tests):

TestCCIPReader* (10 parallel tests = max chain-4 collision pressure): 10/10 iterations green, 0 broken/timeout/slow, p50 35s.
Test_Get* (MemoryEnvironment group): 5/5 iterations green, 0 broken/timeout/slow, p50 45s.

Zero hangs across 15 collision-forcing iterations, versus the prior 12-minute timeout.

Fixes CCIP-11717, CCIP-11718, CCIP-11719, CCIP-11720, CCIP-11721, CCIP-11723, CCIP-11724, CCIP-11727.

…B per parallel test

The parallel CCIPReader tests obtained their LogPoller DB via pgtest.NewSqlxDB, which is backed by the txdb driver. Under txdb every test in the package shares one physical Postgres database and tables: each caller only gets its own uncommitted transaction, not its own tables. The tests run with t.Parallel() and reuse the same evm_chain_id (chainD) while their simulated backends mint blocks 1,2,3..., so parallel tests insert the same (block_number, evm_chain_id) primary key into the shared evm.log_poller_blocks table (PRIMARY KEY (block_number, evm_chain_id), 0115_log_poller.sql).

The duplicate INSERT ... ON CONFLICT DO NOTHING takes a speculative-insert lock that blocks on another test's never-committed txdb transaction. txdb's conn.ExecContext/QueryContext discard the caller context, so the blocked LogPoller insert never times out, holds conn.Mutex, serializes the other LogPoller goroutines, and logPoller.Close() blocks on its WaitGroup forever -> panic: test timed out after 12m0s (whole package fails).

Replace pgtest.NewSqlxDB with heavyweight.FullTestDBV2 (via a newHeavyTestDB helper) at every site so each test gets its own migrated database (no cross-test PK contention) and a real connection that honors query timeouts. Mirrors the existing usage in ccip_reader_bench_test.go.

Validated with the diagnose harness (sandbox disabled): scoped to the CI test set, TestCCIPReader* 10/10 iterations green (p50 35s) and Test_Get* 5/5 green (p50 45s), with zero hangs vs the prior 12m timeout.

Fixes: CCIP-11717, CCIP-11718, CCIP-11719, CCIP-11720, CCIP-11721, CCIP-11723, CCIP-11724, CCIP-11727

github-actions · 2026-06-26T09:24:49Z

✅ No conflicts with other open PRs targeting develop

pavel-raykov · 2026-06-26T10:03:17Z

Sorry, I don't follow this part: "The duplicate INSERT ... ON CONFLICT takes a speculative-insert lock that blocks on the other test's never-committed txdb transaction, and txdb strips the per-query context deadline". Are you saying that Postgres cannot properly serialize concurrent requests?

KodeyThomas requested review from a team as code owners June 26, 2026 09:23

KodeyThomas requested review from AmrMohamedRezk, agusaldasoro, asoliman92, carte7000, chris-de-leon-cll, matYang and winder June 26, 2026 09:23

product-security-plaid-production Bot requested a review from george-dorin June 26, 2026 09:23

KodeyThomas marked this pull request as draft June 26, 2026 09:49

KodeyThomas wants to merge 1 commit into develop from fix/ccip-reader-test-parallel-db-hang
