Reject non-future helper output in composers via GUC marker, rename to df.await_instance#164
Reject non-future helper output in composers via GUC marker, rename to df.await_instance#164Copilot wants to merge 3 commits into
Conversation
be9ed12 to
f65b814
Compare
|
Opus 4.8 reviewed copilot's original attempt and pointed out that there's a much simpler fix: Is it the best solution?Probably not — there's a more targeted approach already in the codebase. PR #220 just introduced exactly this pattern, but it didn't cover
The reason it doesn't catch
That solution: (a) reuses an existing mechanism, (b) is precise — no false positives or negatives, (c) doesn't require any allow-list, and (d) is roughly ~10 lines instead of ~217. |
f65b814 to
6ec4a96
Compare
crprashant
left a comment
There was a problem hiding this comment.
CHANGELOG.md and docs/upgrade-testing.md should be updated as well for this change.
…to df.await_instance
Two related problems with df.wait_for_completion (now df.await_instance):
1. Its plain-text return value ("completed"/"failed"/"cancelled") could
be silently auto-wrapped as a SQL statement when threaded into composers
like df.seq/df.join/df.race, producing graphs that later tried to execute
the literal string "completed" as SQL.
2. Called inside a workflow, it blocked a BGW thread on a polling loop --
pinning a worker slot for up to timeout_seconds and (when waiting on the
current instance) deadlocking the workflow on itself.
Fixes:
* Reuse the GUC-marker mechanism from #220 (df.non_future_helper) rather
than introducing a SQL-keyword allow-list heuristic. wait_for_completion
now calls mark_non_future_helper_call("df.await_instance") on success,
and Durofut::same_statement_non_future_helper_name was widened to look
up the marker for any non-JSON input (was previously hard-coded to "OK").
Composers now reject the value by name at composition time, in the same
statement.
* Add an is_in_workflow_context() guard, matching the pattern used by
setvar/unsetvar/clearvars. Prevents the worker-blocking and self-deadlock
scenarios.
* Rename df.wait_for_completion -> df.await_instance. The old name is kept
as a deprecated Rust shim (#[pg_extern] alias) so the new .so still
services the df.wait_for_completion catalog binding present in v0.2.2
installs (B1 backward compat). The 0.2.2 -> 0.2.3 upgrade script creates
df.await_instance and mirrors EXECUTE grants from df.wait_for_completion
so callers who used df.grant_usage() pre-0.2.3 can adopt the new name
without re-running it.
* Update all E2E tests, docs, examples, and SKILL.md to use the new name.
Tests:
* New pg_test (test_wait_for_completion_cannot_be_used_in_seq_composition)
exercises the GUC-marker path directly (BGW isn't available under pg_test).
* New E2E test in 09_graph_and_validation.sql verifies df.await_instance
raises "cannot be called inside a workflow" when invoked from a workflow.
* All 171 unit tests, 33 E2E test suites, and 20 upgrade tests pass
(including B1 backward compat for df.wait_for_completion against the
v0.2.2 schema).
…mirroring - Update expected/*.out baselines (simple, sequence, variables, parallel, conditional) to call df.await_instance, matching the renamed sql/*.sql inputs. This fixes the failing pg_regress CI jobs. - Remove the EXECUTE grant-mirroring DO block from the 0.2.2->0.2.3 upgrade script. EXECUTE on df.await_instance is granted to PUBLIC by default (the helper is not among the sensitive functions whose EXECUTE is revoked), so no per-role grants need to be mirrored from df.wait_for_completion.
The private integration-test helper in mod tests happened to share the name of the SQL-facing df.await_instance function (crate::dsl::await_instance), but they are unrelated: the helper talks to the duroxide Client directly and has none of the GUC-marker / in-workflow guard logic. Rename it to poll_until_terminal to remove the misleading name collision.
78e68ec to
257a935
Compare
Summary
df.wait_for_completion()(the test/inspection helper that pollsdf.instancesuntil a workflow reaches a terminal state) had two distinct misuse traps. Both are fixed here, and the function is renamed to the clearerdf.await_instance.Problem 1 — silent graph corruption at composition time
The helper returns plain text (
'completed'/'failed'/'cancelled'), but the composer DSL (df.seq,df.join,df.race,~>,&,|, …) auto-wraps any non-Durofutinput as a SQL node viaDurofut::ensure. Threading the helper into a composer therefore silently baked the literal status string into the graph, e.g.would construct a graph whose first step tried to execute the string
completedas SQL.Problem 2 — worker-thread blocking and self-deadlock inside workflows
df.await_instancepollsdf.instancesin a 100 msstd::thread::sleeploop. Inside a workflow, the calling backend is a background-worker thread, so the call:timeout_seconds,VACUUMondf.instances),instance_id(e.g. read from a table by the workflow itself), guarantees self-deadlock — the worker is stuck inawait_instanceinstead of advancing the instance, so the polling loop never sees a terminal status.Fix
Three small layers, all using machinery already in the codebase:
GUC-marker rejection at composition time. Reuse the
df.non_future_helperGUC mechanism introduced in fix: correctness bugs from reliability audit #220 (currently used bysetvar/unsetvar/clearvars).await_instancenow callsmark_non_future_helper_call("df.await_instance")immediately before returning its plain-text status.Durofut::same_statement_non_future_helper_nameis widened: previously it short-circuited unless the input was exactly"OK"; now any non-JSON input triggers a marker lookup. When a composer in the same statement (matched viastatement_timestamp()) sees the marked value, it raises a precise, named error instead of auto-wrapping the string as SQL.is_in_workflow_context()guard.await_instancenow refuses to run whendf.in_workflow = 'true'(the GUC the BGW sets on its connections), matching the exact pattern already used bysetvar/unsetvar/clearvars. Closes the worker-pinning, long-transaction, and self-deadlock holes.Rename
df.wait_for_completion→df.await_instance. The old name was misleading: it sounded like a composable wait primitive, encouraging exactly the misuse Problem 1 describes. The new name reads as "block until this instance terminates" — clearly a client-side action, not a DSL node.Backward compatibility
This is preserved across all axes:
.sostill exports adf.wait_for_completion(text, integer)#[pg_extern]shim that delegates toawait_instance, so thewait_for_completion_wrapperC symbol referenced by v0.2.2 catalog entries continues to resolve. The new B1 upgrade test confirmsdf.wait_for_completion()still works against the v0.2.2 schema with noALTER EXTENSION UPDATE.sql/pg_durable--0.2.2--0.2.3.sql) createsdf.await_instance. EXECUTE on this helper is granted to PUBLIC by default (it is not among the sensitive functions whose EXECUTE is revoked from PUBLIC), so every role that can already calldf.wait_for_completioncan calldf.await_instanceimmediately after the upgrade — no per-role grants need to be mirrored.df.revoke_usagealready iteratespg_proc, so it picks up both names automatically.Test coverage
#[pg_test] test_wait_for_completion_cannot_be_used_in_seq_compositionexercises the GUC-marker path directly (the BGW isn't available undercargo pgrx test, so the test seeds the marker viapg_catalog.set_config(...)then runs the composer).tests/e2e/sql/09_graph_and_validation.sqlverifiesdf.await_instanceraises "cannot be called inside a workflow" when invoked from a workflow (the in-workflow guard requires the live BGW, so it must be tested end-to-end).df.wait_for_completionagainst the v0.2.2 schema.Previous approach (discarded)
The original PR added a
looks_like_sql_statement(s)helper that performed a case-insensitive prefix match against an allow-list of SQL keywords (SELECT,WITH,INSERT,UPDATE,DELETE,CALL,EXPLAIN,MERGE, …) insideDurofut::ensure. Any non-Durofutinput that didn't start with one of those keywords was rejected as a "shape error".Why it was discarded:
TABLE foo,VALUES (...),SHOW ...,NOTIFY ...,LISTEN ...,BEGIN,COMMIT,SAVEPOINT ...,DO $$ ... $$,COPY ...,SET ...,FETCH ..., leading comments, leading whitespace + dollar-quoted bodies, etc. Each one would silently start failing composition and require chasing the allow-list forever.df.wait_for_completionhere". With the GUC marker we can name the helper precisely, which is what the user actually needs.instance_idfrom a table and callsdf.await_instance(self_id)at execution time would still self-deadlock.The current approach reuses the marker mechanism already shipped in #220 (one widened conditional + one call), adds the in-workflow guard already used by three other DSL functions, and renames the function so its non-composable nature is obvious from the name. No new heuristics; nothing to maintain.