Reject non-future helper output in composers via GUC marker, rename to df.await_instance #164

Open: Copilot wants to merge 3 commits into main from copilot/fix-wait-for-completion-return-type

Conversation

Copilot AI commented May 23, 2026

Summary

df.wait_for_completion() (the test/inspection helper that polls df.instances until a workflow reaches a terminal state) had two distinct misuse traps. Both are fixed here, and the function is renamed to the clearer df.await_instance.

Problem 1 — silent graph corruption at composition time

The helper returns plain text ('completed' / 'failed' / 'cancelled'), but the composer DSL (df.seq, df.join, df.race, ~>, &, |, …) auto-wraps any non-Durofut input as a SQL node via Durofut::ensure. Threading the helper into a composer therefore silently baked the literal status string into the graph, e.g.

SELECT df.seq(
  df.await_instance('abcd1234', 30),
  df.sql($$INSERT INTO audit(who) VALUES ('parent-post')$$)
);

would construct a graph whose first step tried to execute the literal string 'completed' as SQL.
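The auto-wrapping behaviour can be pictured with a small standalone model. This is only a sketch: `Node`, `ensure`, and `seq` below are hypothetical stand-ins rather than the real Durofut types, and treating "starts with `{`" as "is a future envelope" is an assumption made for illustration.

```rust
// Simplified, hypothetical model of pre-fix composer input handling.
// None of these names are the real Durofut API.
#[derive(Debug, PartialEq)]
enum Node {
    /// Input that already looks like a serialized future (assumed: JSON text).
    Future(String),
    /// Anything else is auto-wrapped as a SQL step.
    Sql(String),
}

/// Pre-fix behaviour: every input that is not a future envelope becomes SQL.
fn ensure(input: &str) -> Node {
    if input.trim_start().starts_with('{') {
        Node::Future(input.to_string())
    } else {
        Node::Sql(input.to_string())
    }
}

/// Builds a sequence by normalizing each input through `ensure`.
fn seq(inputs: &[&str]) -> Vec<Node> {
    inputs.iter().map(|s| ensure(s)).collect()
}
```

Feeding the helper's return value into `seq`, as in `seq(&["completed", "INSERT INTO audit(who) VALUES ('parent-post')"])`, yields a first node of `Node::Sql("completed")`, which mirrors the corruption described above.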

Problem 2 — worker-thread blocking and self-deadlock inside workflows

df.await_instance polls df.instances in a 100 ms std::thread::sleep loop. Inside a workflow, the calling backend is a background-worker thread, so the call:

  • pins a BGW worker slot for up to timeout_seconds,
  • holds a long-running transaction (blocks VACUUM on df.instances),
  • and, when the instance being waited on is the workflow's own instance_id (e.g. read from a table by the workflow itself), guarantees self-deadlock — the worker is stuck in await_instance instead of advancing the instance, so the polling loop never sees a terminal status.
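A rough sketch of the blocking shape, to make the deadlock concrete. The function name `block_until_terminal`, the `Status` enum, and the `None`-on-timeout return are assumptions for illustration; the real helper reads df.instances and its timeout behaviour is not shown here.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl Status {
    fn is_terminal(self) -> bool {
        !matches!(self, Status::Running)
    }
}

/// Polls `read_status` every `interval` until it reports a terminal status
/// or `timeout` elapses. The calling thread is blocked the whole time.
/// Returning None on timeout is an assumption of this sketch.
fn block_until_terminal(
    mut read_status: impl FnMut() -> Status,
    timeout: Duration,
    interval: Duration,
) -> Option<Status> {
    let deadline = Instant::now() + timeout;
    loop {
        let status = read_status();
        if status.is_terminal() {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

If the only thread that could move the instance out of `Running` is the one sitting inside this loop, `read_status` can never observe a terminal value, so the call can only end by timing out.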

Fix

Three small layers, all using machinery already in the codebase:

  1. GUC-marker rejection at composition time. Reuse the df.non_future_helper GUC mechanism introduced in #220, "fix: correctness bugs from reliability audit" (currently used by setvar/unsetvar/clearvars). await_instance now calls mark_non_future_helper_call("df.await_instance") immediately before returning its plain-text status. Durofut::same_statement_non_future_helper_name is widened: previously it short-circuited unless the input was exactly "OK"; now any non-JSON input triggers a marker lookup. When a composer in the same statement (matched via statement_timestamp()) sees the marked value, it raises a precise, named error instead of auto-wrapping the string as SQL.

  2. is_in_workflow_context() guard. await_instance now refuses to run when df.in_workflow = 'true' (the GUC the BGW sets on its connections), matching the exact pattern already used by setvar/unsetvar/clearvars. Closes the worker-pinning, long-transaction, and self-deadlock holes.

  3. Rename df.wait_for_completion → df.await_instance. The old name was misleading: it sounded like a composable wait primitive, encouraging exactly the misuse Problem 1 describes. The new name reads as "block until this instance terminates", which is clearly a client-side action, not a DSL node.
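The marker layer (item 1) can be sketched in isolation as follows. The function names echo the ones used in this description, but everything else is an assumption: `Session` and `Marker` are hypothetical, the real marker lives in the df.non_future_helper GUC with an encoding not shown here, and recording the returned value alongside the helper name is a choice made for this sketch so that sibling SQL arguments are not rejected.

```rust
/// Hypothetical record of the most recent plain-text helper call.
struct Marker {
    helper: String,
    value: String,
    statement_ts: u64,
}

/// Hypothetical stand-in for a backend session. `statement_ts` plays the
/// role of statement_timestamp(); `marker` plays the role of the GUC.
struct Session {
    statement_ts: u64,
    marker: Option<Marker>,
}

impl Session {
    fn new(statement_ts: u64) -> Self {
        Session { statement_ts, marker: None }
    }

    /// Called by a plain-text helper just before it returns.
    /// The extra `value` argument is an assumption of this sketch.
    fn mark_non_future_helper_call(&mut self, helper: &str, value: &str) {
        self.marker = Some(Marker {
            helper: helper.to_string(),
            value: value.to_string(),
            statement_ts: self.statement_ts,
        });
    }

    /// Widened check: any non-JSON input triggers a marker lookup.
    fn same_statement_non_future_helper_name(&self, input: &str) -> Option<&str> {
        if input.trim_start().starts_with('{') {
            return None; // assumed: future envelopes are JSON
        }
        let m = self.marker.as_ref()?;
        if m.statement_ts == self.statement_ts && m.value == input {
            Some(m.helper.as_str())
        } else {
            None
        }
    }

    /// Composer-side normalization: reject marked helper output by name;
    /// Ok means the input may be wrapped as before.
    fn ensure(&self, input: &str) -> Result<String, String> {
        match self.same_statement_non_future_helper_name(input) {
            Some(name) => Err(format!(
                "{name} returns plain text and cannot be used as a composer input"
            )),
            None => Ok(input.to_string()),
        }
    }
}
```

In this model a marker written in one statement has no effect once `statement_ts` changes, which is the property the statement_timestamp() match is meant to provide.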

Backward compatibility

Existing callers keep working across the shared library, the upgrade script, and the shipped docs and tests:

  • The new .so still exports a df.wait_for_completion(text, integer) #[pg_extern] shim that delegates to await_instance, so the wait_for_completion_wrapper C symbol referenced by v0.2.2 catalog entries continues to resolve. The new B1 upgrade test confirms df.wait_for_completion() still works against the v0.2.2 schema with no ALTER EXTENSION UPDATE.
  • The 0.2.2 → 0.2.3 upgrade script (sql/pg_durable--0.2.2--0.2.3.sql) creates df.await_instance. EXECUTE on this helper is granted to PUBLIC by default (it is not among the sensitive functions whose EXECUTE is revoked from PUBLIC), so every role that can already call df.wait_for_completion can call df.await_instance immediately after the upgrade — no per-role grants need to be mirrored. df.revoke_usage already iterates pg_proc, so it picks up both names automatically.
  • All shipped SQL fixtures, E2E tests, examples, USER_GUIDE.md, docs, and SKILL.md have been updated to the new name.
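The shim pattern from the first bullet, reduced to plain Rust. This is only an outline under stated assumptions: the real functions are `#[pg_extern]` exports, the parameter types are guessed from the `(text, integer)` signature, and the body of `await_instance` here is a placeholder rather than the real polling logic.

```rust
/// New name: holds the implementation. The body is a placeholder for this
/// sketch; the real function polls df.instances.
fn await_instance(_instance_id: &str, _timeout_seconds: i32) -> String {
    "completed".to_string()
}

/// Old name kept as a thin alias so existing bindings keep resolving.
/// It adds no behaviour of its own and simply delegates.
#[deprecated(note = "use await_instance")]
fn wait_for_completion(instance_id: &str, timeout_seconds: i32) -> String {
    await_instance(instance_id, timeout_seconds)
}
```

Because the alias only forwards its arguments, any guard or marker logic added to `await_instance` applies to callers of the old name as well.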

Test coverage

  • New #[pg_test] test_wait_for_completion_cannot_be_used_in_seq_composition exercises the GUC-marker path directly (the BGW isn't available under cargo pgrx test, so the test seeds the marker via pg_catalog.set_config(...) then runs the composer).
  • New section in tests/e2e/sql/09_graph_and_validation.sql verifies df.await_instance raises "cannot be called inside a workflow" when invoked from a workflow (the in-workflow guard requires the live BGW, so it must be tested end-to-end).
  • All 171 unit tests pass.
  • All 33 E2E test suites pass (including the renamed tests).
  • All 20 upgrade tests pass, including B1 backward compatibility of df.wait_for_completion against the v0.2.2 schema.
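For readers who want the in-workflow guard in miniature, here is a hedged sketch of the behaviour the E2E test checks. The GUC name and error text come from this description; the `HashMap` standing in for session settings, the function signatures, and the placeholder success value are assumptions.

```rust
use std::collections::HashMap;

/// True when the session's df.in_workflow setting is exactly "true".
/// A HashMap stands in for the session's GUCs in this sketch.
fn is_in_workflow_context(gucs: &HashMap<String, String>) -> bool {
    gucs.get("df.in_workflow")
        .map_or(false, |v| v.as_str() == "true")
}

/// Guarded entry point: refuse to block when running inside a workflow.
fn await_instance_guarded(
    gucs: &HashMap<String, String>,
    instance_id: &str,
) -> Result<String, String> {
    if is_in_workflow_context(gucs) {
        return Err("df.await_instance cannot be called inside a workflow".to_string());
    }
    // Placeholder for the real polling loop.
    Ok(format!("would poll instance {instance_id}"))
}
```

An ordinary client session (no df.in_workflow setting) passes the guard, while a session with df.in_workflow = 'true' gets the error before any polling starts.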

Previous approach (discarded)

The original PR added a looks_like_sql_statement(s) helper that performed a case-insensitive prefix match against an allow-list of SQL keywords (SELECT, WITH, INSERT, UPDATE, DELETE, CALL, EXPLAIN, MERGE, …) inside Durofut::ensure. Any non-Durofut input that didn't start with one of those keywords was rejected as a "shape error".

Why it was discarded:

  • Brittle false positives. Many legitimate SQL statements are not on any reasonable allow-list — TABLE foo, VALUES (...), SHOW ..., NOTIFY ..., LISTEN ..., BEGIN, COMMIT, SAVEPOINT ..., DO $$ ... $$, COPY ..., SET ..., FETCH ..., leading comments, leading whitespace + dollar-quoted bodies, etc. Each one would silently start failing composition and require chasing the allow-list forever.
  • Doesn't actually identify the offending helper. The error said only "this isn't valid SQL"; it had no way to say "you called df.wait_for_completion here". With the GUC marker we can name the helper precisely, which is what the user actually needs.
  • Doesn't address the inside-the-workflow case at all. The composition-time check fires only when the misuse appears in the same SQL statement as the helper. A workflow that reads its own instance_id from a table and calls df.await_instance(self_id) at execution time would still self-deadlock.

The current approach reuses the marker mechanism already shipped in #220 (one widened conditional + one call), adds the in-workflow guard already used by three other DSL functions, and renames the function so its non-composable nature is obvious from the name. No new heuristics; nothing to maintain.

Copilot AI changed the title from "[WIP] Fix df.wait_for_completion to return future envelope instead of plain text" to "Reject plain-text helper results in composers instead of silently turning them into SQL" on May 23, 2026
Copilot AI requested a review from pinodeca May 23, 2026 12:27
pinodeca marked this pull request as ready for review May 23, 2026 23:52
pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from be9ed12 to f65b814 on June 14, 2026 21:16
@pinodeca

Opus 4.8 reviewed Copilot's original attempt and pointed out that there is a much simpler fix:

Is it the best solution?

Probably not — there's a more targeted approach already in the codebase. PR #220 just introduced exactly this pattern, but it didn't cover wait_for_completion:

  • mark_non_future_helper_call(name) writes a per-statement GUC marker.
  • Durofut::ensure[_strict] checks the marker and rejects helper output in the same statement.
  • The check is wired into df.setvar/unsetvar/clearvars (which return "OK").

The reason it doesn't catch wait_for_completion is just that same_statement_non_future_helper_name in types.rs hard-codes if s != "OK" { return None; }, and wait_for_completion doesn't call mark_non_future_helper_call. The minimal, precise fix is two lines:

  1. Have wait_for_completion call mark_non_future_helper_call("df.wait_for_completion") before returning.
  2. Drop (or generalize) the s != "OK" short-circuit so the marker itself is the authoritative signal.

That solution: (a) reuses an existing mechanism, (b) is precise, with no false positives or negatives, (c) doesn't require any allow-list, and (d) is roughly 10 lines instead of about 217.

pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from f65b814 to 6ec4a96 on June 15, 2026 18:13
pinodeca changed the title from "Reject plain-text helper results in composers instead of silently turning them into SQL" to "Reject non-future helper output in composers via GUC marker, rename to df.await_instance" on Jun 15, 2026

@tjgreen42 tjgreen42 left a comment

Maybe good to do this in two steps: deprecate the old function name (emit a WARNING) but permit it, while adding the new function name. Also double-check with @AbeOmor that this doesn't break AI pipelines.

@crprashant crprashant left a comment

CHANGELOG.md and docs/upgrade-testing.md should be updated as well for this change.

pinodeca added 3 commits June 17, 2026 21:39
…to df.await_instance

Two related problems with df.wait_for_completion (now df.await_instance):

1. Its plain-text return value ("completed"/"failed"/"cancelled") could
   be silently auto-wrapped as a SQL statement when threaded into composers
   like df.seq/df.join/df.race, producing graphs that later tried to execute
   the literal string "completed" as SQL.

2. Called inside a workflow, it blocked a BGW thread on a polling loop --
   pinning a worker slot for up to timeout_seconds and (when waiting on the
   current instance) deadlocking the workflow on itself.

Fixes:

* Reuse the GUC-marker mechanism from #220 (df.non_future_helper) rather
  than introducing a SQL-keyword allow-list heuristic. wait_for_completion
  now calls mark_non_future_helper_call("df.await_instance") on success,
  and Durofut::same_statement_non_future_helper_name was widened to look
  up the marker for any non-JSON input (was previously hard-coded to "OK").
  Composers now reject the value by name at composition time, in the same
  statement.

* Add an is_in_workflow_context() guard, matching the pattern used by
  setvar/unsetvar/clearvars. Prevents the worker-blocking and self-deadlock
  scenarios.

* Rename df.wait_for_completion -> df.await_instance. The old name is kept
  as a deprecated Rust shim (#[pg_extern] alias) so the new .so still
  services the df.wait_for_completion catalog binding present in v0.2.2
  installs (B1 backward compat). The 0.2.2 -> 0.2.3 upgrade script creates
  df.await_instance and mirrors EXECUTE grants from df.wait_for_completion
  so callers who used df.grant_usage() pre-0.2.3 can adopt the new name
  without re-running it.

* Update all E2E tests, docs, examples, and SKILL.md to use the new name.

Tests:

* New pg_test (test_wait_for_completion_cannot_be_used_in_seq_composition)
  exercises the GUC-marker path directly (BGW isn't available under pg_test).
* New E2E test in 09_graph_and_validation.sql verifies df.await_instance
  raises "cannot be called inside a workflow" when invoked from a workflow.
* All 171 unit tests, 33 E2E test suites, and 20 upgrade tests pass
  (including B1 backward compat for df.wait_for_completion against the
  v0.2.2 schema).

…mirroring


- Update expected/*.out baselines (simple, sequence, variables, parallel,
  conditional) to call df.await_instance, matching the renamed sql/*.sql
  inputs. This fixes the failing pg_regress CI jobs.
- Remove the EXECUTE grant-mirroring DO block from the 0.2.2->0.2.3 upgrade
  script. EXECUTE on df.await_instance is granted to PUBLIC by default (the
  helper is not among the sensitive functions whose EXECUTE is revoked), so
  no per-role grants need to be mirrored from df.wait_for_completion.

The private integration-test helper in mod tests happened to share the
name of the SQL-facing df.await_instance function (crate::dsl::await_instance),
but they are unrelated: the helper talks to the duroxide Client directly and
has none of the GUC-marker / in-workflow guard logic. Rename it to
poll_until_terminal to remove the misleading name collision.
pinodeca force-pushed the copilot/fix-wait-for-completion-return-type branch from 78e68ec to 257a935 on June 17, 2026 21:46
Development

Successfully merging this pull request may close these issues.

df.wait_for_completion returns plain text and silently corrupts composers (df.seq, df.join, …)
