Implement full UUID instance/node IDs (#129) #238

Open
crprashant wants to merge 2 commits into microsoft:main from crprashant:crprashant/issue-129-full-uuid

@crprashant crprashant commented Jun 16, 2026

Fixes #129.

Summary

Replaces the collision-prone 8-character short IDs (32 bits of entropy, ~40% PK-collision probability near 65k instances and ~50% near 77k) with full canonical UUIDs (122 bits) for df.instances and df.nodes IDs.
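
The collision risk quoted above can be checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The sketch below is illustrative only and is not part of this PR; the function name `collision_probability` is hypothetical.

```rust
/// Birthday approximation: probability that at least two of `n` uniformly
/// random IDs collide in a space of 2^bits values.
/// (Hypothetical helper for illustration, not code from this PR.)
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    let exponent = -n * (n - 1.0) / (2.0 * space);
    // exp_m1(x) = e^x - 1, so negating it gives 1 - e^x and stays
    // accurate when the exponent is extremely small (the 122-bit case).
    -exponent.exp_m1()
}

fn main() {
    for n in [10_000.0, 65_536.0, 77_163.0, 200_000.0] {
        println!(
            "n = {:>7}: 32-bit p = {:.4}, 122-bit p = {:.3e}",
            n,
            collision_probability(n, 32),
            collision_probability(n, 122)
        );
    }
}
```

Under this approximation a 32-bit ID space reaches roughly 0.39 at 65,536 instances and crosses 0.5 near 77,000, while the 122-bit figure stays negligible at any of these instance counts.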

Changes

  • types.rs: add full_uuid(); retain short_id() for pre-0.2.3 schemas
  • dsl.rs: gate the ID format on full_uuid_ids_enabled() (installed extversion >= 0.2.3) and thread the choice through start()/insert_nodes() so node IDs and node references also become UUIDs
  • lib.rs: widen the six id columns to TEXT and relax the *_format_chk regexes to accept a legacy 8-char id or a canonical UUID; add a node-id format test
  • explain.rs: classify both legacy 8-char and full-UUID instance IDs
  • load_function_graph.rs: cast the id columns to text in the BGW's cached SELECTs so an ALTER EXTENSION UPDATE that widens them to TEXT under a live pooled connection cannot trip cached plan must not change result type
  • fold the UUID-widening DDL (drop FKs/checks, ALTER COLUMN ... TYPE text, re-add checks/FKs NOT VALID) into the existing, still-unreleased upgrade script sql/pg_durable--0.2.2--0.2.3.sql; no extension version bump
  • docs: USER_GUIDE, ARCHITECTURE, upgrade-testing, and a CHANGELOG entry
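
As a rough illustration of the version gate described for dsl.rs, the sketch below compares a parsed extversion against 0.2.3. It is an assumption-laden sketch rather than the PR's code: how full_uuid_ids_enabled() actually obtains and compares the installed version is not shown in this thread, and `parse_version` and `uuid_ids_enabled` are hypothetical names. Falling back to the legacy format on an unparsable version is also an assumption, chosen because 8-char IDs are accepted by both the old and the widened schema.

```rust
/// Parse "major.minor.patch" into a tuple that compares numerically.
/// Returns None for anything that is not exactly three integers.
/// (Hypothetical helper for illustration.)
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    let patch = parts.next()?.parse::<u32>().ok()?;
    if parts.next().is_some() {
        return None; // trailing components such as "0.2.3.1" are rejected
    }
    Some((major, minor, patch))
}

/// Hypothetical stand-in for the gate: emit full UUIDs only when the
/// installed extension version is at least 0.2.3.
fn uuid_ids_enabled(installed_extversion: &str) -> bool {
    match parse_version(installed_extversion) {
        Some(version) => version >= (0, 2, 3),
        // Assumption: an unknown version keeps the legacy 8-char format,
        // which is valid under both the old and the widened schema.
        None => false,
    }
}

fn main() {
    for v in ["0.2.2", "0.2.3", "0.10.0", "not-a-version"] {
        println!("{v}: uuid ids enabled = {}", uuid_ids_enabled(v));
    }
}
```

Comparing parsed tuples rather than raw strings matters once a component reaches two digits: as text, "0.10.0" sorts before "0.2.3".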

Backward compatibility

The 0.2.3 .so keeps emitting 8-char IDs against pre-0.2.3 schemas until ALTER EXTENSION UPDATE runs (Scenario B1); existing 8-char rows stay valid and coexist with UUIDs after the upgrade (Scenario B2). The gate flips to UUIDs only inside the same transaction that widens the columns.

Testing

  • cargo fmt -p pg_durable -- --check: clean
  • cargo clippy --features pg17: clean (only the pre-existing extract_host dead-code warning)
  • unit (cargo pgrx test pg17): 171 passed / 0 failed
  • e2e (scripts/test-e2e-local.sh): 33/33
  • upgrade (scripts/test-upgrade.sh): 20/20 — schema comparison + binary backward compatibility
  • pgspot: clean

@pinodeca pinodeca left a comment

We have not yet released v0.2.3, so my primary request is to reverse the version bump in Cargo.toml and to fold the 0.2.3->0.2.4 upgrade script into 0.2.2->0.2.3.

Opus 4.7 also suggested these changes, which we can discuss individually and potentially address as follow-ups:

Minor suggestions (non-blocking)

  1. explain.rs UUID detection is too permissive. explain.rs uses uuid::Uuid::parse_str(trimmed).is_ok(), which accepts simple (32-char no hyphens), hyphenated, braced ({...}), and URN (urn:uuid:...) forms. Only canonical hyphenated UUIDs ever appear in df.instances.id, so non-canonical forms will silently fall through to "not found". Tightening to require length 36 + parse_str would make df.explain('garbage32hexcharsthatlookslikeauuid') behave as a DSL-parse error rather than an instance lookup miss. Trivial impact.

  2. Schema snapshot doesn't compare CHECK bodies. Per the PR's own constraint-drift note, test-upgrade.sh doesn't pin pg_get_constraintdef() / convalidated. The new id-format regex is duplicated byte-for-byte in lib.rs and sql/pg_durable--0.2.3--0.2.4.sql; a future drift in one but not the other would only be caught by pgspot + the functional B1/B2 path. Hardening the snapshot to include pg_get_constraintdef() is filed implicitly in the docs — worth a follow-up issue.

  3. No CHANGELOG.md entry. CHANGELOG.md exists at repo root but isn't in the diff. The upgrade-testing.md "v0.2.3 → v0.2.4" section is great, but a CHANGELOG line for the version bump is conventional.

  4. full_uuid() could just be Uuid::new_v4().to_string() inline. It's a one-liner used in exactly one place (new_id). The doc comment is valuable, but the wrapper is borderline. Optional.

  5. Existing-row revalidation. The upgrade re-adds all six format CHECKs as NOT VALID to skip the table scan, which is appropriate for a maintenance-window upgrade. Worth a brief operator-doc line that anyone wanting full constraint validation later can run ALTER TABLE … VALIDATE CONSTRAINT … (rows already conform since 8-char IDs match the relaxed regex).

Suggested follow-ups (separate PRs)

  • Tighten explain.rs UUID detection to canonical form only.
  • Extend test-upgrade.sh schema snapshot to include pg_get_constraintdef() and convalidated.
  • Add a CHANGELOG.md entry.

@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from cc4f375 to 74ad427 Compare June 16, 2026 18:55

crprashant commented Jun 16, 2026

Thanks for the review, @pinodeca! Addressed the blocking item plus the cheap niceties.

Blocking — version fold

  • Reverted the version bump. There's no longer any Cargo.toml/Cargo.lock change, so the diff is now net-zero against main (0.2.3 is the in-progress version, not a new one).
  • Folded the UUID-widening DDL into the existing sql/pg_durable--0.2.2--0.2.3.sql — it now carries both the df.duroxide_schema() change and the id-column widening (disjoint objects, so order is irrelevant) — and deleted the separate 0.2.3--0.2.4.sql.
  • The runtime gate is now extversion >= 0.2.3.

Minor suggestions (numbered as in your review)

  • Suggestion 1 (explain.rs) — tightened to the canonical 36-char form (trimmed.len() == 36 && Uuid::parse_str(...).is_ok()), so non-canonical UUID spellings now surface as a DSL-parse error instead of a silent instance-lookup miss. ✅
  • Suggestion 3 (CHANGELOG) — added a [0.2.3] - Unreleased entry. ✅
  • Suggestion 5 (VALIDATE CONSTRAINT) — added an operator note in docs/upgrade-testing.md: because the relaxed regex also matches every legacy 8-char id, an operator can run ALTER TABLE … VALIDATE CONSTRAINT … later to clear the NOT VALID marker, and the validating scan cannot fail. ✅
  • Suggestion 4 (full_uuid() wrapper) — left as-is for the doc comment; happy to inline it if you'd prefer.
  • Suggestion 2 (schema-snapshot CHECK-body / convalidated hardening) — agree this is worth doing. I'll file it as a separate follow-up as you suggested, rather than expanding this PR.
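
For readers following Suggestion 1, here is a dependency-free sketch of the canonical-only classification. It is not the code in explain.rs, which per the note above uses trimmed.len() == 36 together with Uuid::parse_str; the enum and function names are hypothetical. It also assumes legacy IDs are 8 hex characters (as the linked issue title describes them) and accepts upper-case hex, which the real CHECK regex may or may not do.

```rust
/// Hypothetical classification of an ID string passed to an explain call.
#[derive(Debug, PartialEq)]
enum IdKind {
    LegacyShort,   // pre-0.2.3 style: 8 hex characters (assumed)
    CanonicalUuid, // 36 characters, hyphens at offsets 8, 13, 18 and 23
    NotAnId,       // anything else falls through to DSL parsing
}

fn classify_id(input: &str) -> IdKind {
    let bytes = input.trim().as_bytes();
    match bytes.len() {
        8 if bytes.iter().all(|b| b.is_ascii_hexdigit()) => IdKind::LegacyShort,
        36 => {
            let canonical = bytes.iter().enumerate().all(|(i, b)| match i {
                8 | 13 | 18 | 23 => *b == b'-',
                _ => b.is_ascii_hexdigit(),
            });
            if canonical {
                IdKind::CanonicalUuid
            } else {
                IdKind::NotAnId
            }
        }
        // Simple (32), braced (38) and URN (45) spellings end up here.
        _ => IdKind::NotAnId,
    }
}

fn main() {
    let samples = [
        "a1b2c3d4",
        "123e4567-e89b-12d3-a456-426614174000",
        "123e4567e89b12d3a456426614174000",
        "{123e4567-e89b-12d3-a456-426614174000}",
        "urn:uuid:123e4567-e89b-12d3-a456-426614174000",
    ];
    for s in samples {
        println!("{s} -> {:?}", classify_id(s));
    }
}
```

The length check is what excludes the non-canonical spellings: each of them is a valid UUID to a permissive parser, but none of them is 36 characters long.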

Re-ran the full CONTRIBUTING dev workflow locally: fmt/build/clippy clean, unit 171/0, e2e 33/33, and upgrade 20/20 — Scenario A confirms the folded 0.2.2→0.2.3 upgrade converges with a fresh install, and B1/B2 confirm a 0.2.3 .so still emits 8-char ids against an un-upgraded 0.2.2 schema and that existing rows survive.

Replace the collision-prone 8-character short IDs (32 bits of entropy,
~40% PK-collision probability near 65k instances, ~50% near 77k) with
full canonical UUIDs (122 bits) for df.instances and df.nodes IDs.

- types.rs: add full_uuid(); retain short_id() for the not-yet-upgraded
  pre-0.2.3 schema
- dsl.rs: gate ID format on full_uuid_ids_enabled() (extversion >= 0.2.3)
  and thread the choice through start()/insert_nodes() so node IDs and
  node references also become UUIDs
- lib.rs: widen the six id columns to TEXT and relax the *_format_chk
  regexes to accept a legacy 8-char id or a canonical UUID; add a
  node-id format test
- explain.rs: classify both legacy 8-char and canonical 36-char UUID
  instance IDs
- load_function_graph.rs: cast the id columns to text in the BGW's cached
  SELECTs so an ALTER EXTENSION UPDATE that widens them to TEXT under a
  live pooled connection cannot trip "cached plan must not change result
  type"
- fold the UUID-widening DDL (drop FKs/checks, ALTER COLUMN TYPE text,
  re-add checks/FKs NOT VALID) into the existing, still-unreleased
  sql/pg_durable--0.2.2--0.2.3.sql upgrade script instead of minting a
  new version; no Cargo version bump
- docs: USER_GUIDE, ARCHITECTURE, upgrade-testing (incl. an operator note
  on running VALIDATE CONSTRAINT later) and a CHANGELOG entry

Backward compatible: the 0.2.3 .so keeps emitting 8-char IDs against the
not-yet-upgraded 0.2.2 schema until ALTER EXTENSION UPDATE runs (B1);
existing 8-char rows stay valid and coexist with UUIDs (B2). Verified
locally: unit (171), e2e, and upgrade (Scenario A + B1 + B2) suites pass;
pgspot clean.
@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from 74ad427 to 12b50e4 Compare June 16, 2026 19:38

CI fix: hardened e2e test 45_connection_limit_timeout against a pre-existing status/output race

The Clippy & Tests (PG17) job failed on one e2e test (30 of 31 passing):

45_connection_limit_timeout.sql:73: ERROR: TEST FAILED: victim output is NULL

I've pushed a fix (folded into the existing commit). Details below.

Root cause — a pre-existing race, not a UUID change. A failed instance's status and its output come from two different sources:

  • df.instances.status is flipped to failed by the update-instance-status activity, which commits to Postgres immediately (a direct UPDATE df.instances SET status=...).
  • In execute_function_graph's failure path, the orchestration .awaits that activity and only afterwards returns Err(err). duroxide records the orchestration's terminal output at the moment the orchestrator returns.
  • df.instance_info() reads status from df.instances but reads output from duroxide.

So there is a brief window where wait_for_completion() already observes status = 'failed' (it polls df.instances.status every 100ms and returns the instant it sees a terminal status) while the output column has not yet been populated. The test read output exactly once, immediately after wait_for_completion() returned, and under slow CI I/O it landed inside that window → NULL.

Why this is not a UUID regression. instance_info() reads output by the very same full-UUID instance_id used for the status read — and the status assertion (victim_status = 'failed') passed. A key/ID mismatch would have broken that read too, and would have failed many other tests rather than this single one. This is also the only e2e test that reads a failed instance's output immediately after wait_for_completion; it passed on the prior CI run and passes locally.

The fix. Poll for the message to materialize instead of reading it once — the same poll-until pattern already used throughout the suite for status:

FOR i IN 1..100 LOOP
    SELECT output INTO victim_output FROM df.instance_info(victim_id);
    EXIT WHEN victim_output IS NOT NULL;
    PERFORM pg_sleep(0.1);
END LOOP;

Why up to 10s (100 × 100ms). It is a deliberately generous upper bound to absorb worst-case CI scheduling/I/O latency between the status commit and duroxide's terminal-output flush. The loop EXITs as soon as output is non-NULL, so on the normal path it resolves on the first iteration and adds no measurable delay; the 10s ceiling only ever applies if something is genuinely wrong (in which case the assertion message now says so). It stays well within the test's existing 30s wait_for_completion budget and the suite's per-test timeouts.
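
The same bounded poll-until idea, expressed in Rust for readers more at home in the extension's language than in PL/pgSQL. This is a generic sketch, not code from the repository: `poll_until` is a hypothetical helper, and the probe closure stands in for the SELECT against df.instance_info(). Unlike the PL/pgSQL loop, it skips the sleep after the final failed attempt.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `probe` up to `max_attempts` times, sleeping `interval` between
/// misses, and return the first value it produces. Returns None if every
/// attempt comes back empty. (Hypothetical helper for illustration.)
fn poll_until<T, F>(max_attempts: u32, interval: Duration, mut probe: F) -> Option<T>
where
    F: FnMut() -> Option<T>,
{
    for attempt in 0..max_attempts {
        if let Some(value) = probe() {
            return Some(value); // resolved: exit without further delay
        }
        if attempt + 1 < max_attempts {
            sleep(interval);
        }
    }
    None
}

fn main() {
    // Simulate an output that only becomes visible on the third read,
    // as if the terminal output were recorded shortly after the status.
    let mut reads = 0;
    let output = poll_until(100, Duration::from_millis(100), || {
        reads += 1;
        if reads >= 3 {
            Some(String::from("simulated failure message"))
        } else {
            None
        }
    });
    println!("output after {reads} reads: {output:?}");
}
```

As in the test fix, the attempt count only sets a ceiling: the helper returns on the first successful probe, so the normal path pays for a single read.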

The change is isolated to tests/e2e/sql/45_connection_limit_timeout.sql — no source, schema, or upgrade-script changes. Verified locally: 45_connection_limit_timeout ... PASS.

Bring the issue microsoft#129 full-UUID branch up to date with upstream main (b4579ae, v0.2.3 changelog). Resolve the CHANGELOG.md conflict by folding the Full UUID instance/node IDs entry into the dated 0.2.3 section's Changed list.
@pinodeca pinodeca mentioned this pull request Jun 17, 2026
8 tasks

Development

Successfully merging this pull request may close these issues.

Bug: 8-character hex instance IDs have 32-bit collision risk

2 participants