Skip to content

Prevent parent hangs from sub-orchestration instance-id collisions#28

Open
tjgreen42 wants to merge 6 commits into
mainfrom
fix/suborch-id-squat-hang
Open

Prevent parent hangs from sub-orchestration instance-id collisions#28
tjgreen42 wants to merge 6 commits into
mainfrom
fix/suborch-id-squat-hang

Conversation

@tjgreen42

@tjgreen42 tjgreen42 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

A sub-orchestration's child instance id is auto-generated as {parent}::sub::{event_id}. Two situations could leave the parent awaiting a completion that never arrives.

If an unrelated instance already held that id in a terminal state, the dispatcher's terminal fast-ack path discarded the incoming StartOrchestration silently. It now enqueues a SubOrchFailed back to the scheduling parent when the discarded batch's parent differs from the terminal instance's own recorded parent — that mismatch check keeps ordinary at-least-once redelivery of a completed child's own start from re-failing the parent. The client start APIs also reject ids using the reserved sub:: marker; plain :: (e.g. i4::child-1) stays valid.

A parent that schedules a sub-orchestration inside a continue_as_new loop hit the same hang against itself: event ids reset each iteration, so every iteration regenerated the same child id and collided with the now-terminal child from the previous one. Auto-generated ids now include the parent execution after the first ({parent}::sub::{exec}_{event_id}), and the runtime records the parent's current execution so the child's completion routes to the right iteration. The first execution's id format is unchanged and child ids are persisted in history, so in-flight instances and mixed-version clusters are unaffected.

Both hangs are pinned by regression tests in tests/scenarios/suborch_id_collision.rs; each reverts to a parent timeout when its fix is removed (verified). A third test confirms genuine redelivery of a completed child leaves the parent untouched.

tjgreen42 and others added 2 commits June 5, 2026 23:21
A sub-orchestration's child instance id is auto-generated as
{parent}::sub::{event_id}. If an instance with that id already exists in a
terminal state, the orchestration dispatcher's terminal fast-ack path discarded
the incoming StartOrchestration without notifying the scheduling parent, leaving
the parent awaiting a completion that never arrived.

The dispatcher now enqueues a SubOrchFailed back to the scheduling parent when a
discarded terminal batch contains a StartOrchestration whose parent differs from
the terminal instance's own recorded parent, so the parent fails fast. Genuine
redelivery of a completed child's start (same parent and id) is left untouched.

Client::start_orchestration and start_orchestration_versioned also now reject
instance ids that use the reserved sub:: marker, so a caller cannot occupy a
future child id. Other uses of :: remain valid.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Event ids reset on continue-as-new, so a parent that schedules a
sub-orchestration at the same position on every iteration regenerated an
identical child id (`{parent}::sub::{event_id}`) and collided with the
now-terminal child from the previous iteration, hanging the parent.

Auto-generated child ids now include the parent execution after the first
(`{parent}::sub::{exec}_{event_id}`), and the runtime records the parent's
current execution at turn start so the child's completion routes to the
right iteration. The first execution's id format is unchanged and child ids
are persisted in history, so in-flight instances and mixed-version clusters
are unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tjgreen42 tjgreen42 changed the title Fail sub-orchestration fast when its instance id is already terminal Prevent parent hangs from sub-orchestration instance-id collisions Jun 8, 2026
@tjgreen42 tjgreen42 marked this pull request as draft June 8, 2026 02:56
@tjgreen42 tjgreen42 marked this pull request as ready for review June 8, 2026 18:36
@tjgreen42 tjgreen42 requested review from affandar and pinodeca June 8, 2026 18:36
Comment thread src/runtime/dispatchers/orchestration.rs Outdated
Comment thread src/client/mod.rs Outdated
Comment thread tests/scenarios/suborch_id_collision.rs
Comment thread src/runtime/execution.rs Outdated

pinodeca commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Two non-inline review items from the PAW review did not attach as GitHub diff threads because their target files/lines are outside the PR diff. Capturing them here so they are visible with the submitted review.

  1. Explicit sub-orchestration ids can still use the reserved sub:: namespace

Severity: Should-fix / Medium

The new top-level client validation reserves sub:: and ::sub::, but explicit sub-orchestration ids can still enter the same namespace through schedule_sub_orchestration_with_id and schedule_sub_orchestration_versioned_with_id. That leaves a public workflow API able to produce runtime-shaped ids such as target-parent::sub::2, which is the collision class this PR is defending against.

Suggestion: apply the same reserved-marker validation to explicit child ids before emitting Action::StartSubOrchestration, or document and test the exception as an intentional advanced escape hatch. If explicit ids intentionally remain unrestricted, please add docs and tests that explain why schedule_sub_orchestration_with_id("Child", "parent::sub::2", input) is allowed even though top-level starts reject the same marker.

Evidence: src/client/mod.rs rejects top-level ids with the reserved marker, and tests/instance_id_validation_tests.rs verifies that behavior. src/lib.rs forwards explicit sub-orchestration ids directly into the action without equivalent validation. src/lib.rs also treats ids starting with sub:: as auto-generated when building child ids.

  1. Release and migration notes should mention the new instance-id reservation

Severity: Should-fix / Medium

The client start APIs now reject instance ids containing the reserved sub-orchestration marker. That is an externally observable compatibility change for applications that previously used sub:: or ::sub:: in root orchestration ids, but the changelog and migration guide do not mention the new reservation or how users should audit existing id schemes.

Suggestion: add a changelog entry and migration guidance before release. The notes should call out that root orchestration ids starting with sub:: or containing ::sub:: now return ClientError::InvalidInput, ordinary :: remains valid, and applications using the reserved marker should rename those root ids before upgrading client code.

Suggested wording:

### Changed

- Reserved the `sub::` marker for runtime-generated sub-orchestration instance ids.
  `Client::start_orchestration` and `Client::start_orchestration_versioned` now
  return `ClientError::InvalidInput` for root instance ids that start with `sub::`
  or contain `::sub::`; other uses of `::` remain supported.

Evidence: src/client/mod.rs implements the reserved-marker rejection and applies it to unversioned and versioned starts. tests/instance_id_validation_tests.rs asserts reserved ids are rejected while ordinary :: remains accepted. CHANGELOG.md and docs/migration-guide.md do not currently mention the new reservation.

Route sub-orchestration completion/failure notifications using the parent's
current execution id read from durable provider state (Provider::read) instead
of a process-local cache. A restarted or different runtime previously defaulted
to execution 1 on a cache miss, so a parent on execution 2+ filtered the
notification during replay and hung forever.

Removes the current_execution_ids cache entirely (also eliminating its
unbounded per-turn growth), updates comments to cover the post-continue-as-new
child-id form, documents the explicit-id escape hatch, adds CHANGELOG/migration
notes for the reserved sub:: marker, makes the redelivery test deterministic,
and adds a cross-runtime regression test where a fresh runtime that never ran
the parent must still route the failure to execution 2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tjgreen42

Copy link
Copy Markdown
Contributor Author

Addressed the two remaining (non-inline) items in eeee0a2:

  • Explicit sub-orch ids bypass reserved-marker validation: documented the escape hatch precisely on schedule_sub_orchestration_with_id/_versioned_with_id (explicit ids are used verbatim, are not validated against the reserved sub:: marker, and collision-shaped explicit ids are allowed — the terminal-collision defense covers them; ids starting with sub:: get re-prefixed by build_child_instance_id). Added explicit_sub_orchestration_id_bypasses_reserved_marker_validation showing the client rejects tenant::sub::99 at top level but the explicit child-id API uses it verbatim.
  • Changelog/migration: added an Unreleased CHANGELOG entry (reserved sub:: marker + the parent-hang/durable-routing fix) and a migration-guide section with accept/reject examples and upgrade guidance.

Full suite (--all-features and no-features), clippy, and doctests all pass.

@tjgreen42 tjgreen42 requested a review from pinodeca June 10, 2026 03:04

@affandar affandar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two review notes from walking through the failure mode:

  1. The CAN bug is real, but the collision tests should be narrowed to that scenario. Since auto-generated child IDs include the parent instance ID, children from unrelated parents should not collide when parent/root instance IDs are unique across the provider. The convincing case is same parent across continue_as_new executions: execution 1 schedules event 2 as P::sub::2, then execution 2 schedules the same event position and would also generate P::sub::2 without the execution-scoped suffix. The "foreign terminal/squatter" tests are more of a legacy/manual/provider-bypass defense and should not be presented as the normal auto-generated collision case.

  2. Parent notification routing should use the execution that scheduled the child, not resolve the parent's current execution when the child completes. The stronger model is to stamp parent_execution_id onto the child StartOrchestration work item at schedule time, persist/carry that through the child's parent link, and use it directly for SubOrchCompleted/SubOrchFailed. That avoids the process-local cache problem, avoids a provider read, and avoids a TOCTOU race where the parent's current execution at completion time differs from the execution that scheduled the child.

@affandar

Copy link
Copy Markdown
Contributor

Expanding on my earlier review comment with a concrete fix shape.

1. Test/scenario shape

The continue_as_new bug is real, but the strongest regression test should be the same-parent, cross-execution case rather than a foreign/squatter instance. Auto-generated child instance ids already embed the parent instance id, so unrelated parents should not collide if root instance ids are unique across the provider:

parent ABC event 2 -> ABC::sub::2
parent XYZ event 2 -> XYZ::sub::2

The real CAN failure is:

parent P execution 1 event 2 -> P::sub::2
parent P execution 2 event 2 -> P::sub::2   // old behavior: event ids reset after CAN

A focused regression could look like this:

#[tokio::test]
async fn continue_as_new_suborch_child_ids_include_execution_after_first() {
    let store: Arc<dyn Provider> = Arc::new(SqliteProvider::new_in_memory().await.unwrap());
    let parent_id = "can-child-id-job";

    let parent = |ctx: OrchestrationContext, input: String| async move {
        let n: u32 = input.parse().unwrap_or(0);
        let result = ctx.schedule_sub_orchestration("Child", n.to_string()).await?;
        if n < 2 {
            return ctx.continue_as_new((n + 1).to_string()).await;
        }
        Ok(format!("done:{n}:{result}"))
    };

    let child = |_ctx: OrchestrationContext, input: String| async move {
        Ok(format!("child-done:{input}"))
    };

    let orchestrations = OrchestrationRegistry::builder()
        .register("Parent", parent)
        .register("Child", child)
        .build();
    let activities = ActivityRegistry::builder().build();
    let rt = runtime::Runtime::start_with_store(store.clone(), activities, orchestrations).await;
    let client = Client::new(store.clone());

    client.start_orchestration(parent_id, "Parent", "0").await.unwrap();

    let status = client
        .wait_for_orchestration(parent_id, Duration::from_secs(5))
        .await
        .expect("parent must not hang from reusing the same child id after continue-as-new");

    assert!(
        matches!(&status, OrchestrationStatus::Completed { output, .. } if output == "done:2:child-done:2"),
        "parent should complete after three executions; got {status:?}"
    );

    let mut scheduled_child_suffixes = Vec::new();
    for execution_id in 1..=3 {
        let history = store.read_with_execution(parent_id, execution_id).await.unwrap();
        let child_suffix = history
            .iter()
            .find_map(|event| match &event.kind {
                EventKind::SubOrchestrationScheduled { instance, .. } => Some(instance.clone()),
                _ => None,
            })
            .unwrap_or_else(|| panic!("execution {execution_id} should schedule a sub-orchestration"));
        scheduled_child_suffixes.push(child_suffix);
    }

    assert_eq!(
        scheduled_child_suffixes,
        vec!["sub::2".to_string(), "sub::2_2".to_string(), "sub::3_2".to_string()],
    );

    rt.shutdown(None).await;
}

With only the old id generation, this hangs because execution 2 tries to start P::sub::2, finds execution 1's child already terminal, and the parent never receives a completion. This test keeps the proof tied to the actual CAN issue.

The foreign/squatter tests can still be useful as legacy/manual/provider-bypass defense, but I would name and document them that way rather than treating them as the normal auto-generated collision scenario.

2. Route child responses to the scheduling execution

The response should be addressed to the parent execution that scheduled the child, not to whatever execution is current when the child completes.

The current PR fixes the cache miss by reading the parent's current execution from the provider at completion time. That works for the focused awaited CAN test because the parent cannot advance while it is awaiting the child. But it is still a weaker invariant and it keeps a TOCTOU window: the parent's current execution at completion time is not necessarily the execution that scheduled the child.

The stronger model is to stamp the parent execution id onto the child start at schedule time and carry that through to completion/failure routing.

Sketch:

pub enum WorkItem {
    StartOrchestration {
        instance: String,
        orchestration: String,
        input: String,
        version: Option<String>,
        parent_instance: Option<String>,

        #[serde(default, skip_serializing_if = "Option::is_none")]
        parent_execution_id: Option<u64>,

        parent_id: Option<u64>,
        execution_id: u64,
    },
}

When the parent schedules the sub-orchestration:

WorkItem::StartOrchestration {
    instance: child_full,
    orchestration: name.clone(),
    input: input.clone(),
    version: version.clone(),
    parent_instance: Some(instance.to_string()),
    parent_execution_id: Some(execution_id), // source parent execution
    parent_id: Some(*scheduling_event_id),   // source schedule event id
    execution_id: INITIAL_EXECUTION_ID,       // child starts at execution 1
}

Then persist that parent execution id in the child start metadata/history link as optional/defaulted for compatibility. For example:

EventKind::OrchestrationStarted {
    name,
    version,
    input,
    parent_instance,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    parent_execution_id,
    parent_id,
    carry_forward_events,
    initial_custom_status,
}

And when the child completes or fails:

if let Some((parent_instance, parent_execution_id, parent_id)) = parent_link {
    orchestrator_items.push(WorkItem::SubOrchCompleted {
        parent_instance,
        parent_execution_id,
        parent_id,
        result,
    });
}

For rolling upgrade / old history compatibility:

let parent_execution_id = match stamped_parent_execution_id {
    Some(id) => id,
    None => self.parent_execution_id_for_routing(&parent_instance).await, // legacy fallback only
};

That makes the new path exact and durable:

schedule time:   (parent_instance=P, parent_execution_id=2, parent_id=2)
completion time: SubOrchCompleted targets exactly (P, execution 2, event 2)

This avoids the old process-local cache problem, avoids loading parent history on every child completion, and avoids retargeting a child response to a later parent execution just because that later execution is current when the child completes.

Route sub-orchestration completion/failure notifications to the exact
parent execution that scheduled the child by stamping parent_execution_id
onto the child StartOrchestration work item at schedule time and persisting
it in the child's OrchestrationStarted event. All notification sites now use
the stamped value, falling back to a durable provider read only when it is
absent (children/work items from older runtimes). This removes the prior
TOCTOU window where the parent's current execution at completion time could
differ from the execution that scheduled the child, and avoids loading
parent history on every child completion.

Both new fields are Option<u64> with serde default + skip_serializing_if, so
the wire and history formats stay backward compatible and mixed-version
clusters route correctly during rolling upgrades.

Restructure the collision scenario tests: make the same-parent
continue-as-new self-collision the primary regression, asserting the exact
per-execution child suffixes (sub::2, sub::2_2, sub::3_2). Relabel and rename
the foreign-terminal "squatter" tests to legacy_provider_bypass_* and document
them as legacy/provider-bypass defenses rather than the normal auto-generated
collision case.

Update CHANGELOG and the migration guide.
Adds suborch_completion_routes_to_scheduling_execution_not_current, proving
the durable parent_execution_id stamp addresses a child's completion to the
execution that scheduled it rather than the parent's current execution at
completion time.

A child outlives execution 1 via continue-as-new (which does not cancel
outstanding children) and completes while the parent is on execution 2.
With the stamp, the SubOrchCompleted is addressed to execution 1 (terminal)
and the replay execution filter discards it, leaving execution 2 untouched.
Without it, current-execution routing addresses the notification to
execution 2, where event id 2 is not a sub-orchestration schedule, so it is
applied as a nondeterministic completion and poisons the parent. Verified
the test fails ("no matching schedule for sub-orchestration id=2") when
routing falls back to the parent's current execution.
@tjgreen42

Copy link
Copy Markdown
Contributor Author

Both points are addressed — 45849ec (durable routing + test restructure) and 091fdd7 (the routing regression you asked for), now pushed.

1. Test/scenario shape. The same-parent continue-as-new self-collision is now the primary regression: continue_as_new_suborch_child_ids_include_execution_after_first pins the exact per-execution child suffixes (sub::2, sub::2_2, sub::3_2) and hangs under the old id generation. The foreign/terminal "squatter" tests are renamed legacy_provider_bypass_* and documented as legacy/provider-bypass defenses rather than the normal auto-generated case.

2. Route to the scheduling execution. Implemented as you sketched: parent_execution_id: Option<u64> is stamped onto WorkItem::StartOrchestration at schedule time (Some(execution_id)), persisted in EventKind::OrchestrationStarted, and carried to every SubOrchCompleted/SubOrchFailed routing site via resolve_parent_execution_id, which prefers the stamp and falls back to a provider read only for children/work items produced by older runtimes. Both fields are #[serde(default, skip_serializing_if = "Option::is_none")], so wire and history formats stay backward compatible for mixed-version clusters during rolling upgrades. This removes the provider read on the hot path and the TOCTOU window.

To prove the stamp matters beyond the awaited case (where the parent is blocked and the two coincide), 091fdd7 adds suborch_completion_routes_to_scheduling_execution_not_current: a child outlives execution 1 via continue-as-new (CAN does not cancel outstanding children) and completes while the parent is on execution 2. With the stamp the completion is addressed to execution 1 (terminal) and discarded by the replay execution filter, leaving execution 2 untouched. With current-execution routing it lands on execution 2 — where event id 2 is not a sub-orchestration schedule — and poisons the parent with no matching schedule for sub-orchestration id=2. I verified the test fails without the stamp.

Full suite (--all-features and no-features), clippy, and doctests pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants