Reject non-future helper output in composers via GUC marker, rename to df.await_instance by Copilot · Pull Request #164 · microsoft/pg_durable

Copilot · 2026-05-23T12:15:26Z

Summary

df.wait_for_completion() (the test/inspection helper that polls df.instances until a workflow reaches a terminal state) had two distinct misuse traps. Both are fixed here, and the function is renamed to the clearer df.await_instance.

Problem 1 — silent graph corruption at composition time

The helper returns plain text ('completed' / 'failed' / 'cancelled'), but the composer DSL (df.seq, df.join, df.race, ~>, &, |, …) auto-wraps any non-Durofut input as a SQL node via Durofut::ensure. Threading the helper into a composer therefore silently baked the literal status string into the graph, e.g.

SELECT df.seq(
  df.await_instance('abcd1234', 30),
  df.sql($$INSERT INTO audit(who) VALUES ('parent-post')$$)
);

would construct a graph whose first step tried to execute the literal string 'completed' as SQL.
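The auto-wrapping behaviour described above can be modelled in a few lines. This is a minimal sketch, not the extension's code: the `Node` enum, the JSON-object test for "already a future", and the `seq` function are all hypothetical stand-ins for `Durofut::ensure` and `df.seq`.

```rust
/// Simplified stand-in for a composer input after normalization.
#[derive(Debug, PartialEq)]
enum Node {
    /// Input was already a serialized future (modelled here as any JSON object).
    Future(String),
    /// Input was treated as raw SQL text and wrapped as a SQL step.
    Sql(String),
}

/// Hypothetical model of the auto-wrapping rule: anything that is not
/// already a future is assumed to be SQL.
fn ensure(input: &str) -> Node {
    let trimmed = input.trim();
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        Node::Future(trimmed.to_string())
    } else {
        Node::Sql(trimmed.to_string())
    }
}

/// Models a sequence composer: normalizes every input and keeps the order.
fn seq(inputs: &[&str]) -> Vec<Node> {
    inputs.iter().map(|s| ensure(s)).collect()
}
```

In this model, passing the helper's return value `"completed"` as the first input yields `Node::Sql("completed")`, which is the silent corruption the PR sets out to prevent.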

Problem 2 — worker-thread blocking and self-deadlock inside workflows

df.await_instance polls df.instances in a 100 ms std::thread::sleep loop. Inside a workflow, the calling backend is a background-worker thread, so the call:

pins a BGW worker slot for up to timeout_seconds,
holds a long-running transaction (blocks VACUUM on df.instances),
and, when the instance being waited on is the workflow's own instance_id (e.g. read from a table by the workflow itself), guarantees self-deadlock — the worker is stuck in await_instance instead of advancing the instance, so the polling loop never sees a terminal status.
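To make the blocking behaviour concrete, here is a rough sketch of a sleep-based polling loop of the kind described. The function name `poll_status`, its signature, and the non-terminal status `"running"` used in the discussion are assumptions for illustration; the real helper reads df.instances rather than calling a closure.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Statuses the PR description lists as terminal.
fn is_terminal(status: &str) -> bool {
    matches!(status, "completed" | "failed" | "cancelled")
}

/// Hypothetical polling loop: calls `read_status` every `interval` until it
/// reports a terminal status or `timeout` elapses. The calling thread is
/// blocked for the whole duration.
fn poll_status<F>(mut read_status: F, timeout: Duration, interval: Duration) -> Option<String>
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    loop {
        let status = read_status();
        if is_terminal(&status) {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None; // timed out without seeing a terminal status
        }
        sleep(interval);
    }
}
```

If the only thread that could advance the instance is the one sitting in this loop, `read_status` keeps reporting a non-terminal status and the call can only end by timing out, which is the self-deadlock case.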

Fix

Three small layers, all using machinery already in the codebase:

GUC-marker rejection at composition time. Reuse the df.non_future_helper GUC mechanism introduced in fix: correctness bugs from reliability audit #220 (currently used by setvar/unsetvar/clearvars). await_instance now calls mark_non_future_helper_call("df.await_instance") immediately before returning its plain-text status. Durofut::same_statement_non_future_helper_name is widened: previously it short-circuited unless the input was exactly "OK"; now any non-JSON input triggers a marker lookup. When a composer in the same statement (matched via statement_timestamp()) sees the marked value, it raises a precise, named error instead of auto-wrapping the string as SQL.
is_in_workflow_context() guard. await_instance now refuses to run when df.in_workflow = 'true' (the GUC the BGW sets on its connections), matching the exact pattern already used by setvar/unsetvar/clearvars. Closes the worker-pinning, long-transaction, and self-deadlock holes.
Rename df.wait_for_completion → df.await_instance. The old name was misleading: it sounded like a composable wait primitive, encouraging exactly the misuse Problem 1 describes. The new name reads as "block until this instance terminates" — clearly a client-side action, not a DSL node.
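The first two layers can be sketched together without pgrx. Everything here is a simplified model under stated assumptions: the `Session` struct stands in for per-session GUC state, a plain `u64` stands in for `statement_timestamp()`, and the composer error text is illustrative (only the phrase "cannot be called inside a workflow" comes from the PR).

```rust
/// Hypothetical stand-in for per-session GUC state.
#[derive(Default)]
struct Session {
    /// Models df.non_future_helper: (helper name, statement timestamp).
    marker: Option<(String, u64)>,
    /// Models df.in_workflow = 'true' on background-worker connections.
    in_workflow: bool,
}

impl Session {
    /// Records which helper produced a plain-text value in this statement.
    fn mark_non_future_helper_call(&mut self, name: &str, stmt_ts: u64) {
        self.marker = Some((name.to_string(), stmt_ts));
    }

    /// Widened lookup: any non-JSON input consults the marker, and the
    /// marker only counts if it was written by the same statement.
    fn same_statement_non_future_helper_name(&self, input: &str, stmt_ts: u64) -> Option<&str> {
        if input.trim_start().starts_with('{') {
            return None; // looks like a serialized future, nothing to check
        }
        match &self.marker {
            Some((name, ts)) if *ts == stmt_ts => Some(name.as_str()),
            _ => None,
        }
    }

    /// Models the helper: refuses inside a workflow, otherwise marks itself
    /// and returns a plain-text status (fixed here to keep the sketch small).
    fn await_instance(&mut self, stmt_ts: u64) -> Result<String, String> {
        if self.in_workflow {
            return Err("df.await_instance cannot be called inside a workflow".to_string());
        }
        self.mark_non_future_helper_call("df.await_instance", stmt_ts);
        Ok("completed".to_string())
    }

    /// Models the composer-side check that runs before auto-wrapping as SQL.
    fn ensure(&self, input: &str, stmt_ts: u64) -> Result<String, String> {
        match self.same_statement_non_future_helper_name(input, stmt_ts) {
            Some(name) => Err(format!("{name} returns plain text and cannot be used in a composer")),
            None => Ok(input.to_string()),
        }
    }
}
```

The design point the sketch tries to capture is that the marker, not the shape of the string, is the signal: a composer in a later statement sees a stale timestamp and wraps its input as before.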

Backward compatibility

This is preserved across all axes:

The new .so still exports a df.wait_for_completion(text, integer) #[pg_extern] shim that delegates to await_instance, so the wait_for_completion_wrapper C symbol referenced by v0.2.2 catalog entries continues to resolve. The new B1 upgrade test confirms df.wait_for_completion() still works against the v0.2.2 schema with no ALTER EXTENSION UPDATE.
The 0.2.2 → 0.2.3 upgrade script (sql/pg_durable--0.2.2--0.2.3.sql) creates df.await_instance. EXECUTE on this helper is granted to PUBLIC by default (it is not among the sensitive functions whose EXECUTE is revoked from PUBLIC), so every role that can already call df.wait_for_completion can call df.await_instance immediately after the upgrade — no per-role grants need to be mirrored. df.revoke_usage already iterates pg_proc, so it picks up both names automatically.
All shipped SQL fixtures, E2E tests, examples, USER_GUIDE.md, docs, and SKILL.md have been updated to the new name.
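The shim pattern in the first point amounts to a thin forwarding function. The sketch below is plain Rust with no pgrx attributes, and the body of `await_instance` is a placeholder, so treat it as an illustration of the delegation only, not of the real `#[pg_extern]` definitions.

```rust
/// Models the new entry point; the real function polls df.instances.
fn await_instance(instance_id: &str, timeout_seconds: i32) -> String {
    // Placeholder result so the sketch stays self-contained.
    format!("waited on {instance_id} for up to {timeout_seconds}s")
}

/// Deprecated alias kept so callers of the old name keep working.
/// It adds no behaviour of its own and only forwards its arguments.
fn wait_for_completion(instance_id: &str, timeout_seconds: i32) -> String {
    await_instance(instance_id, timeout_seconds)
}
```

Because the alias forwards unchanged, any guard or marker logic added to the new function applies to callers of the old name as well.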

Test coverage

New #[pg_test] test_wait_for_completion_cannot_be_used_in_seq_composition exercises the GUC-marker path directly (the BGW isn't available under cargo pgrx test, so the test seeds the marker via pg_catalog.set_config(...) then runs the composer).
New section in tests/e2e/sql/09_graph_and_validation.sql verifies df.await_instance raises "cannot be called inside a workflow" when invoked from a workflow (the in-workflow guard requires the live BGW, so it must be tested end-to-end).
All 171 unit tests pass.
All 33 E2E test suites pass (including the renamed tests).
All 20 upgrade tests pass, including B1 backward compatibility of df.wait_for_completion against the v0.2.2 schema.

Previous approach (discarded)

The original PR added a looks_like_sql_statement(s) helper that performed a case-insensitive prefix match against an allow-list of SQL keywords (SELECT, WITH, INSERT, UPDATE, DELETE, CALL, EXPLAIN, MERGE, …) inside Durofut::ensure. Any non-Durofut input that didn't start with one of those keywords was rejected as a "shape error".

Why it was discarded:

Brittle false positives. Many legitimate SQL statements are not on any reasonable allow-list — TABLE foo, VALUES (...), SHOW ..., NOTIFY ..., LISTEN ..., BEGIN, COMMIT, SAVEPOINT ..., DO $$ ... $$, COPY ..., SET ..., FETCH ..., leading comments, leading whitespace + dollar-quoted bodies, etc. Each one would silently start failing composition and require chasing the allow-list forever.
Doesn't actually identify the offending helper. The error said only "this isn't valid SQL"; it had no way to say "you called df.wait_for_completion here". With the GUC marker we can name the helper precisely, which is what the user actually needs.
Doesn't address the inside-the-workflow case at all. The composition-time check fires only when the misuse appears in the same SQL statement as the helper. A workflow that reads its own instance_id from a table and calls df.await_instance(self_id) at execution time would still self-deadlock.
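The brittleness in the first point is easy to reproduce. The following is a hypothetical reconstruction of the discarded heuristic, using only the keywords the description names (the original list was longer and is elided above), so the exact behaviour of the real helper may have differed.

```rust
/// Keywords named in the PR description; the real list was longer.
const ALLOWED_PREFIXES: [&str; 8] = [
    "SELECT", "WITH", "INSERT", "UPDATE", "DELETE", "CALL", "EXPLAIN", "MERGE",
];

/// Hypothetical reconstruction of the discarded check: a case-insensitive
/// prefix match of the input against the allow-list.
fn looks_like_sql_statement(s: &str) -> bool {
    let upper = s.trim_start().to_ascii_uppercase();
    ALLOWED_PREFIXES.iter().any(|kw| upper.starts_with(*kw))
}
```

With this list, `"completed"` is rejected as intended, but so are valid statements such as `TABLE foo` and `VALUES (1)`, which is the maintenance burden the marker-based approach avoids.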

The current approach reuses the marker mechanism already shipped in #220 (one widened conditional + one call), adds the in-workflow guard already used by three other DSL functions, and renames the function so its non-composable nature is obvious from the name. No new heuristics; nothing to maintain.

pinodeca · 2026-06-14T21:23:54Z

Opus 4.8 reviewed Copilot's original attempt and pointed out that there's a much simpler fix:

Is it the best solution?

Probably not — there's a more targeted approach already in the codebase. PR #220 just introduced exactly this pattern, but it didn't cover wait_for_completion:

mark_non_future_helper_call(name) writes a per-statement GUC marker.
Durofut::ensure[_strict] checks the marker and rejects helper output in the same statement.
The check is wired into df.setvar/unsetvar/clearvars (which return "OK").

The reason it doesn't catch wait_for_completion is just that same_statement_non_future_helper_name in types.rs hard-codes if s != "OK" { return None; }, and wait_for_completion doesn't call mark_non_future_helper_call. The minimal, precise fix is two lines:

Have wait_for_completion call mark_non_future_helper_call("df.wait_for_completion") before returning.
Drop (or generalize) the s != "OK" short-circuit so the marker itself is the authoritative signal.

That solution: (a) reuses an existing mechanism, (b) is precise, with no false positives or negatives, (c) doesn't require any allow-list, and (d) is roughly 10 lines instead of ~217.

tjgreen42

Maybe good to do this in two steps: deprecate the old function name (emit a WARNING) but permit it, while adding the new function name. Also double-check with @AbeOmor that this doesn't break AI pipelines.

crprashant

CHANGELOG.md and docs/upgrade-testing.md should be updated as well for this change.

…to df.await_instance

Two related problems with df.wait_for_completion (now df.await_instance):

1. Its plain-text return value ("completed"/"failed"/"cancelled") could be silently auto-wrapped as a SQL statement when threaded into composers like df.seq/df.join/df.race, producing graphs that later tried to execute the literal string "completed" as SQL.
2. Called inside a workflow, it blocked a BGW thread on a polling loop -- pinning a worker slot for up to timeout_seconds and (when waiting on the current instance) deadlocking the workflow on itself.

Fixes:

* Reuse the GUC-marker mechanism from #220 (df.non_future_helper) rather than introducing a SQL-keyword allow-list heuristic. wait_for_completion now calls mark_non_future_helper_call("df.await_instance") on success, and Durofut::same_statement_non_future_helper_name was widened to look up the marker for any non-JSON input (was previously hard-coded to "OK"). Composers now reject the value by name at composition time, in the same statement.
* Add an is_in_workflow_context() guard, matching the pattern used by setvar/unsetvar/clearvars. Prevents the worker-blocking and self-deadlock scenarios.
* Rename df.wait_for_completion -> df.await_instance. The old name is kept as a deprecated Rust shim (#[pg_extern] alias) so the new .so still services the df.wait_for_completion catalog binding present in v0.2.2 installs (B1 backward compat). The 0.2.2 -> 0.2.3 upgrade script creates df.await_instance and mirrors EXECUTE grants from df.wait_for_completion so callers who used df.grant_usage() pre-0.2.3 can adopt the new name without re-running it.
* Update all E2E tests, docs, examples, and SKILL.md to use the new name.

Tests:

* New pg_test (test_wait_for_completion_cannot_be_used_in_seq_composition) exercises the GUC-marker path directly (BGW isn't available under pg_test).
* New E2E test in 09_graph_and_validation.sql verifies df.await_instance raises "cannot be called inside a workflow" when invoked from a workflow.
* All 171 unit tests, 33 E2E test suites, and 20 upgrade tests pass (including B1 backward compat for df.wait_for_completion against the v0.2.2 schema).

…mirroring

- Update expected/*.out baselines (simple, sequence, variables, parallel, conditional) to call df.await_instance, matching the renamed sql/*.sql inputs. This fixes the failing pg_regress CI jobs.
- Remove the EXECUTE grant-mirroring DO block from the 0.2.2->0.2.3 upgrade script. EXECUTE on df.await_instance is granted to PUBLIC by default (the helper is not among the sensitive functions whose EXECUTE is revoked), so no per-role grants need to be mirrored from df.wait_for_completion.

The private integration-test helper in mod tests happened to share the name of the SQL-facing df.await_instance function (crate::dsl::await_instance), but they are unrelated: the helper talks to the duroxide Client directly and has none of the GUC-marker / in-workflow guard logic. Rename it to poll_until_terminal to remove the misleading name collision.

Copilot AI assigned Copilot and pinodeca May 23, 2026

Copilot started work on behalf of pinodeca May 23, 2026 12:15

Copilot AI linked an issue May 23, 2026 that may be closed by this pull request

df.wait_for_completion returns plain text and silently corrupts composers (df.seq, df.join, …) #151

Open

Copilot AI changed the title ~~[WIP] Fix df.wait_for_completion to return future envelope instead of plain text~~ Reject plain-text helper results in composers instead of silently turning them into SQL May 23, 2026

Copilot finished work on behalf of pinodeca May 23, 2026 12:27

Copilot AI requested a review from pinodeca May 23, 2026 12:27

pinodeca marked this pull request as ready for review May 23, 2026 23:52

pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from be9ed12 to f65b814 Compare June 14, 2026 21:16

pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from f65b814 to 6ec4a96 Compare June 15, 2026 18:13

pinodeca changed the title ~~Reject plain-text helper results in composers instead of silently turning them into SQL~~ Reject non-future helper output in composers via GUC marker, rename to df.await_instance Jun 15, 2026

tjgreen42 requested changes Jun 17, 2026


crprashant reviewed Jun 17, 2026


pinodeca added 3 commits June 17, 2026 21:39

pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from 78e68ec to 257a935 Compare June 17, 2026 21:46

Copilot wants to merge 3 commits into main from copilot/fix-wait-for-completion-return-type
