[PECOBLR-2461] fix(tests/e2e): de-collide MST + UC Volume tests across concurrent CI jobs#805
Merged
vikrantpuppala merged 2 commits intoMay 26, 2026
Conversation
…urrent CI jobs `_unique_table_name` derived the table name purely from the test node id, so two CI runs racing on the same warehouse + catalog (e.g. PR + push to main landing within seconds, as happened in run 26410038645) would both target `peco.default.mst_pysql_<test_name>` and step on each other's CREATE / DROP / DML. This caused the three "flaky" failures we kept seeing in `test_transactions.py`: - test_multi_table_commit → TRANSACTION_ROLLBACK_REQUIRED_AFTER_ABORT - test_executemany_rollback_in_txn → TABLE_OR_VIEW_NOT_FOUND - test_write_conflict_single_table → TABLE_OR_VIEW_ALREADY_EXISTS The companion helper `_unique_table_name_raw` already appended a uuid4 suffix for exactly this reason; the fixture helper was an oversight from #775. Add the same suffix here, reserving room so the result still fits in the 80-char cap. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
29d9003 to
ce5b170
Compare
gopalldb
approved these changes
May 26, 2026
…ation
While verifying the transactions fix, a second flake surfaced — the same
cross-CI race pattern, this time on a UC Volume file path:
tests/e2e/test_driver.py::TestPySQLCoreSuite::
test_uc_volume_put_fails_if_file_exists_and_overwrite_not_set
→ Failed: DID NOT RAISE <class 'databricks.sql.exc.ServerOperationError'>
Two tests in `tests/e2e/common/uc_volume_tests.py` PUT/GET/REMOVE against
the hardcoded path `/Volumes/{catalog}/{schema}/e2etests/file1.csv`. When
two CI runs overlap (as happened on this PR after the force-push for DCO
sign-off), one run's REMOVE deletes the other run's file between its two
PUTs, so the expected `FILE_IN_STAGING_PATH_ALREADY_EXISTS` never fires.
Suffix the volume path with `uuid4().hex[:8]` in the two affected tests
(test_uc_volume_life_cycle and test_uc_volume_put_fails_if_file_exists_
and_overwrite_not_set). The other UC Volume tests that reference the same
path are exercising client-side / server-parse failure paths that never
touch the file — leaving those alone keeps the diff focused.
Also add a `concurrency` block to the E2E workflow:
- On pull_request, cancel-in-progress: a new push / force-push on a PR
cancels the previous run on that ref. Eliminates the entire class of
"force-pushed during CI → two runs racing on shared warehouse state"
failures.
- On push to main, runs are NOT cancelled — each merge commit keeps its
own clean CI signal so a regression on commit N can't be hidden by
commit N+1 landing seconds later. Concurrent main runs can still
collide on shared state, but the uuid-suffix conventions in the e2e
tests are what keep that isolated.
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
|
Integration test approval reset. New commits were pushed to this PR. The A maintainer must re-review the changes and re-add the label to trigger tests again. Latest commit: 665b370 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three changes, all targeting the same underlying problem: e2e tests share a single Databricks warehouse + catalog + UC volume across CI runs, and were racing on shared paths.
tests/e2e/test_transactions.py—_unique_table_namederived the Delta table name purely from the pytest node id, so two CI jobs racing on the same warehouse would both targetpeco.default.mst_pysql_<test_name>and step on each other's CREATE / DROP / DML. Apply the sameuuid4().hex[:8]suffix that the sibling helper_unique_table_name_rawalready used. (Oversight from [PECOBLR-2461] Add comprehensive MST transaction E2E tests #775.)tests/e2e/common/uc_volume_tests.py—test_uc_volume_life_cycleandtest_uc_volume_put_fails_if_file_exists_and_overwrite_not_setPUT/GET/REMOVE against the hardcoded path/Volumes/{catalog}/{schema}/e2etests/file1.csv. When runs overlap, one run's REMOVE deletes the other's file between its two PUTs and the expectedFILE_IN_STAGING_PATH_ALREADY_EXISTSnever fires. Suffix the volume path the same way. The other UC Volume tests exercise client-side / server-parse failure paths that never touch the file, so they're left alone..github/workflows/code-coverage.yml— add aconcurrencygroup so a new push to a PR cancels the previous run on that ref.cancel-in-progressis enabled only forpull_requestevents, not for pushes tomain, so each merge commit keeps its own clean CI signal (a regression on commit N can't be hidden by commit N+1 landing seconds later).Why this matters
The "flaky" failures we've been seeing on
main(e.g. run 26410038645, which started 43 s after another E2E run onmainfinished) are all explained by this one shared-warehouse collision:TestMstCorrectness::test_multi_table_commitTRANSACTION_ROLLBACK_REQUIRED_AFTER_ABORT.PREVIOUS_QUERY_FAILUREfinallyblock hit the aborted stateTestMstExecuteVariants::test_executemany_rollback_in_txnTABLE_OR_VIEW_NOT_FOUNDTestMstCorrectness::test_write_conflict_single_tableTABLE_OR_VIEW_ALREADY_EXISTS(setup ERROR)DROP IF EXISTSandCREATETestPySQLCoreSuite::test_uc_volume_put_fails_if_file_exists_and_overwrite_not_setFailed: DID NOT RAISE ServerOperationErrorThe fourth row was caught on this PR itself — the force-push to add DCO sign-off triggered a second E2E run on the same PR ref while the first was still running. That's the trigger the new
concurrencyblock targets.I also audited the rest of the e2e suite (
test_complex_types,test_variant_types,test_parameterized_queries,test_driver) — they all already useuuid4()-suffixed names.test_transactions.pyanduc_volume_tests.pywere the only outliers.Test plan
yaml.safe_load).concurrencycancels the in-progress run on the PR ref (and that pushes to main still queue independently).