Prevent parent hangs from sub-orchestration instance-id collisions by tjgreen42 · Pull Request #28 · microsoft/duroxide

tjgreen42 · 2026-06-05T23:21:53Z

A sub-orchestration's child instance id is auto-generated as {parent}::sub::{event_id}. Two situations could leave the parent awaiting a completion that never arrives.

If an unrelated instance already held that id in a terminal state, the dispatcher's terminal fast-ack path discarded the incoming StartOrchestration silently. It now enqueues a SubOrchFailed back to the scheduling parent when the discarded batch's parent differs from the terminal instance's own recorded parent — that mismatch check keeps ordinary at-least-once redelivery of a completed child's own start from re-failing the parent. The client start APIs also reject ids using the reserved sub:: marker; plain :: (e.g. i4::child-1) stays valid.

A parent that schedules a sub-orchestration inside a continue_as_new loop hit the same hang against itself: event ids reset each iteration, so every iteration regenerated the same child id and collided with the now-terminal child from the previous one. Auto-generated ids now include the parent execution after the first ({parent}::sub::{exec}_{event_id}), and the runtime records the parent's current execution so the child's completion routes to the right iteration. The first execution's id format is unchanged and child ids are persisted in history, so in-flight instances and mixed-version clusters are unaffected.

Both hangs are pinned by regression tests in tests/scenarios/suborch_id_collision.rs; each reverts to a parent timeout when its fix is removed (verified). A third test confirms genuine redelivery of a completed child leaves the parent untouched.

A sub-orchestration's child instance id is auto-generated as {parent}::sub::{event_id}. If an instance with that id already exists in a terminal state, the orchestration dispatcher's terminal fast-ack path discarded the incoming StartOrchestration without notifying the scheduling parent, leaving the parent awaiting a completion that never arrived. The dispatcher now enqueues a SubOrchFailed back to the scheduling parent when a discarded terminal batch contains a StartOrchestration whose parent differs from the terminal instance's own recorded parent, so the parent fails fast. Genuine redelivery of a completed child's start (same parent and id) is left untouched. Client::start_orchestration and start_orchestration_versioned also now reject instance ids that use the reserved sub:: marker, so a caller cannot occupy a future child id. Other uses of :: remain valid. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Event ids reset on continue-as-new, so a parent that schedules a sub-orchestration at the same position on every iteration regenerated an identical child id (`{parent}::sub::{event_id}`) and collided with the now-terminal child from the previous iteration, hanging the parent. Auto-generated child ids now include the parent execution after the first (`{parent}::sub::{exec}_{event_id}`), and the runtime records the parent's current execution at turn start so the child's completion routes to the right iteration. The first execution's id format is unchanged and child ids are persisted in history, so in-flight instances and mixed-version clusters are unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

pinodeca · 2026-06-09T18:11:24Z

Two non-inline review items from the PAW review did not attach as GitHub diff threads because their target files/lines are outside the PR diff. Capturing them here so they are visible with the submitted review.

Explicit sub-orchestration ids can still use the reserved sub:: namespace

Severity: Should-fix / Medium

The new top-level client validation reserves sub:: and ::sub::, but explicit sub-orchestration ids can still enter the same namespace through schedule_sub_orchestration_with_id and schedule_sub_orchestration_versioned_with_id. That leaves a public workflow API able to produce runtime-shaped ids such as target-parent::sub::2, which is the collision class this PR is defending against.

Suggestion: apply the same reserved-marker validation to explicit child ids before emitting Action::StartSubOrchestration, or document and test the exception as an intentional advanced escape hatch. If explicit ids intentionally remain unrestricted, please add docs and tests that explain why schedule_sub_orchestration_with_id("Child", "parent::sub::2", input) is allowed even though top-level starts reject the same marker.

Evidence: src/client/mod.rs rejects top-level ids with the reserved marker, and tests/instance_id_validation_tests.rs verifies that behavior. src/lib.rs forwards explicit sub-orchestration ids directly into the action without equivalent validation. src/lib.rs also treats ids starting with sub:: as auto-generated when building child ids.

Release and migration notes should mention the new instance-id reservation

Severity: Should-fix / Medium

The client start APIs now reject instance ids containing the reserved sub-orchestration marker. That is an externally observable compatibility change for applications that previously used sub:: or ::sub:: in root orchestration ids, but the changelog and migration guide do not mention the new reservation or how users should audit existing id schemes.

Suggestion: add a changelog entry and migration guidance before release. The notes should call out that root orchestration ids starting with sub:: or containing ::sub:: now return ClientError::InvalidInput, ordinary :: remains valid, and applications using the reserved marker should rename those root ids before upgrading client code.

Suggested wording:

### Changed

- Reserved the `sub::` marker for runtime-generated sub-orchestration instance ids.
  `Client::start_orchestration` and `Client::start_orchestration_versioned` now
  return `ClientError::InvalidInput` for root instance ids that start with `sub::`
  or contain `::sub::`; other uses of `::` remain supported.

Evidence: src/client/mod.rs implements the reserved-marker rejection and applies it to unversioned and versioned starts. tests/instance_id_validation_tests.rs asserts reserved ids are rejected while ordinary :: remains accepted. CHANGELOG.md and docs/migration-guide.md do not currently mention the new reservation.

Route sub-orchestration completion/failure notifications using the parent's current execution id read from durable provider state (Provider::read) instead of a process-local cache. A restarted or different runtime previously defaulted to execution 1 on a cache miss, so a parent on execution 2+ filtered the notification during replay and hung forever. Removes the current_execution_ids cache entirely (also eliminating its unbounded per-turn growth), updates comments to cover the post-continue-as-new child-id form, documents the explicit-id escape hatch, adds CHANGELOG/migration notes for the reserved sub:: marker, makes the redelivery test deterministic, and adds a cross-runtime regression test where a fresh runtime that never ran the parent must still route the failure to execution 2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tjgreen42 · 2026-06-09T20:36:52Z

Addressed the two remaining (non-inline) items in eeee0a2:

Explicit sub-orch ids bypass reserved-marker validation: documented the escape hatch precisely on schedule_sub_orchestration_with_id/_versioned_with_id (explicit ids are used verbatim, are not validated against the reserved sub:: marker, and collision-shaped explicit ids are allowed — the terminal-collision defense covers them; ids starting with sub:: get re-prefixed by build_child_instance_id). Added explicit_sub_orchestration_id_bypasses_reserved_marker_validation showing the client rejects tenant::sub::99 at top level but the explicit child-id API uses it verbatim.
Changelog/migration: added an Unreleased CHANGELOG entry (reserved sub:: marker + the parent-hang/durable-routing fix) and a migration-guide section with accept/reject examples and upgrade guidance.

Full suite (--all-features and no-features), clippy, and doctests all pass.

…hang # Conflicts: # CHANGELOG.md

affandar

Two review notes from walking through the failure mode:

The CAN bug is real, but the collision tests should be narrowed to that scenario. Since auto-generated child IDs include the parent instance ID, children from unrelated parents should not collide when parent/root instance IDs are unique across the provider. The convincing case is same parent across continue_as_new executions: execution 1 schedules event 2 as P::sub::2, then execution 2 schedules the same event position and would also generate P::sub::2 without the execution-scoped suffix. The "foreign terminal/squatter" tests are more of a legacy/manual/provider-bypass defense and should not be presented as the normal auto-generated collision case.
Parent notification routing should use the execution that scheduled the child, not resolve the parent's current execution when the child completes. The stronger model is to stamp parent_execution_id onto the child StartOrchestration work item at schedule time, persist/carry that through the child's parent link, and use it directly for SubOrchCompleted/SubOrchFailed. That avoids the process-local cache problem, avoids a provider read, and avoids a TOCTOU race where the parent's current execution at completion time differs from the execution that scheduled the child.

affandar · 2026-06-17T03:20:49Z

Expanding on my earlier review comment with a concrete fix shape.

1. Test/scenario shape

The continue_as_new bug is real, but the strongest regression test should be the same-parent, cross-execution case rather than a foreign/squatter instance. Auto-generated child instance ids already embed the parent instance id, so unrelated parents should not collide if root instance ids are unique across the provider:

parent ABC event 2 -> ABC::sub::2
parent XYZ event 2 -> XYZ::sub::2

The real CAN failure is:

parent P execution 1 event 2 -> P::sub::2
parent P execution 2 event 2 -> P::sub::2   // old behavior: event ids reset after CAN

A focused regression could look like this:

#[tokio::test]
async fn continue_as_new_suborch_child_ids_include_execution_after_first() {
    let store: Arc<dyn Provider> = Arc::new(SqliteProvider::new_in_memory().await.unwrap());
    let parent_id = "can-child-id-job";

    let parent = |ctx: OrchestrationContext, input: String| async move {
        let n: u32 = input.parse().unwrap_or(0);
        let result = ctx.schedule_sub_orchestration("Child", n.to_string()).await?;
        if n < 2 {
            return ctx.continue_as_new((n + 1).to_string()).await;
        }
        Ok(format!("done:{n}:{result}"))
    };

    let child = |_ctx: OrchestrationContext, input: String| async move {
        Ok(format!("child-done:{input}"))
    };

    let orchestrations = OrchestrationRegistry::builder()
        .register("Parent", parent)
        .register("Child", child)
        .build();
    let activities = ActivityRegistry::builder().build();
    let rt = runtime::Runtime::start_with_store(store.clone(), activities, orchestrations).await;
    let client = Client::new(store.clone());

    client.start_orchestration(parent_id, "Parent", "0").await.unwrap();

    let status = client
        .wait_for_orchestration(parent_id, Duration::from_secs(5))
        .await
        .expect("parent must not hang from reusing the same child id after continue-as-new");

    assert!(
        matches!(&status, OrchestrationStatus::Completed { output, .. } if output == "done:2:child-done:2"),
        "parent should complete after three executions; got {status:?}"
    );

    let mut scheduled_child_suffixes = Vec::new();
    for execution_id in 1..=3 {
        let history = store.read_with_execution(parent_id, execution_id).await.unwrap();
        let child_suffix = history
            .iter()
            .find_map(|event| match &event.kind {
                EventKind::SubOrchestrationScheduled { instance, .. } => Some(instance.clone()),
                _ => None,
            })
            .unwrap_or_else(|| panic!("execution {execution_id} should schedule a sub-orchestration"));
        scheduled_child_suffixes.push(child_suffix);
    }

    assert_eq!(
        scheduled_child_suffixes,
        vec!["sub::2".to_string(), "sub::2_2".to_string(), "sub::3_2".to_string()],
    );

    rt.shutdown(None).await;
}

With only the old id generation, this hangs because execution 2 tries to start P::sub::2, finds execution 1's child already terminal, and the parent never receives a completion. This test keeps the proof tied to the actual CAN issue.

The foreign/squatter tests can still be useful as legacy/manual/provider-bypass defense, but I would name and document them that way rather than treating them as the normal auto-generated collision scenario.

2. Route child responses to the scheduling execution

The response should be addressed to the parent execution that scheduled the child, not to whatever execution is current when the child completes.

The current PR fixes the cache miss by reading the parent's current execution from the provider at completion time. That works for the focused awaited CAN test because the parent cannot advance while it is awaiting the child. But it is still a weaker invariant and it keeps a TOCTOU window: the parent's current execution at completion time is not necessarily the execution that scheduled the child.

The stronger model is to stamp the parent execution id onto the child start at schedule time and carry that through to completion/failure routing.

Sketch:

pub enum WorkItem {
    StartOrchestration {
        instance: String,
        orchestration: String,
        input: String,
        version: Option<String>,
        parent_instance: Option<String>,

        #[serde(default, skip_serializing_if = "Option::is_none")]
        parent_execution_id: Option<u64>,

        parent_id: Option<u64>,
        execution_id: u64,
    },
}

When the parent schedules the sub-orchestration:

WorkItem::StartOrchestration {
    instance: child_full,
    orchestration: name.clone(),
    input: input.clone(),
    version: version.clone(),
    parent_instance: Some(instance.to_string()),
    parent_execution_id: Some(execution_id), // source parent execution
    parent_id: Some(*scheduling_event_id),   // source schedule event id
    execution_id: INITIAL_EXECUTION_ID,       // child starts at execution 1
}

Then persist that parent execution id in the child start metadata/history link as optional/defaulted for compatibility. For example:

EventKind::OrchestrationStarted {
    name,
    version,
    input,
    parent_instance,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    parent_execution_id,
    parent_id,
    carry_forward_events,
    initial_custom_status,
}

And when the child completes or fails:

if let Some((parent_instance, parent_execution_id, parent_id)) = parent_link {
    orchestrator_items.push(WorkItem::SubOrchCompleted {
        parent_instance,
        parent_execution_id,
        parent_id,
        result,
    });
}

For rolling upgrade / old history compatibility:

let parent_execution_id = match stamped_parent_execution_id {
    Some(id) => id,
    None => self.parent_execution_id_for_routing(&parent_instance).await, // legacy fallback only
};

That makes the new path exact and durable:

schedule time:   (parent_instance=P, parent_execution_id=2, parent_id=2)
completion time: SubOrchCompleted targets exactly (P, execution 2, event 2)

This avoids the old process-local cache problem, avoids loading parent history on every child completion, and avoids retargeting a child response to a later parent execution just because that later execution is current when the child completes.

Route sub-orchestration completion/failure notifications to the exact parent execution that scheduled the child by stamping parent_execution_id onto the child StartOrchestration work item at schedule time and persisting it in the child's OrchestrationStarted event. All notification sites now use the stamped value, falling back to a durable provider read only when it is absent (children/work items from older runtimes). This removes the prior TOCTOU window where the parent's current execution at completion time could differ from the execution that scheduled the child, and avoids loading parent history on every child completion. Both new fields are Option<u64> with serde default + skip_serializing_if, so the wire and history formats stay backward compatible and mixed-version clusters route correctly during rolling upgrades. Restructure the collision scenario tests: make the same-parent continue-as-new self-collision the primary regression, asserting the exact per-execution child suffixes (sub::2, sub::2_2, sub::3_2). Relabel and rename the foreign-terminal "squatter" tests to legacy_provider_bypass_* and document them as legacy/provider-bypass defenses rather than the normal auto-generated collision case. Update CHANGELOG and the migration guide.

Adds suborch_completion_routes_to_scheduling_execution_not_current, proving the durable parent_execution_id stamp addresses a child's completion to the execution that scheduled it rather than the parent's current execution at completion time. A child outlives execution 1 via continue-as-new (which does not cancel outstanding children) and completes while the parent is on execution 2. With the stamp, the SubOrchCompleted is addressed to execution 1 (terminal) and the replay execution filter discards it, leaving execution 2 untouched. Without it, current-execution routing addresses the notification to execution 2, where event id 2 is not a sub-orchestration schedule, so it is applied as a nondeterministic completion and poisons the parent. Verified the test fails ("no matching schedule for sub-orchestration id=2") when routing falls back to the parent's current execution.

tjgreen42 · 2026-06-19T00:45:56Z

Both points are addressed — 45849ec (durable routing + test restructure) and 091fdd7 (the routing regression you asked for), now pushed.

1. Test/scenario shape. The same-parent continue-as-new self-collision is now the primary regression: continue_as_new_suborch_child_ids_include_execution_after_first pins the exact per-execution child suffixes (sub::2, sub::2_2, sub::3_2) and hangs under the old id generation. The foreign/terminal "squatter" tests are renamed legacy_provider_bypass_* and documented as legacy/provider-bypass defenses rather than the normal auto-generated case.

2. Route to the scheduling execution. Implemented as you sketched: parent_execution_id: Option<u64> is stamped onto WorkItem::StartOrchestration at schedule time (Some(execution_id)), persisted in EventKind::OrchestrationStarted, and carried to every SubOrchCompleted/SubOrchFailed routing site via resolve_parent_execution_id, which prefers the stamp and falls back to a provider read only for children/work items produced by older runtimes. Both fields are #[serde(default, skip_serializing_if = "Option::is_none")], so wire and history formats stay backward compatible for mixed-version clusters during rolling upgrades. This removes the provider read on the hot path and the TOCTOU window.

To prove the stamp matters beyond the awaited case (where the parent is blocked and the two coincide), 091fdd7 adds suborch_completion_routes_to_scheduling_execution_not_current: a child outlives execution 1 via continue-as-new (CAN does not cancel outstanding children) and completes while the parent is on execution 2. With the stamp the completion is addressed to execution 1 (terminal) and discarded by the replay execution filter, leaving execution 2 untouched. With current-execution routing it lands on execution 2 — where event id 2 is not a sub-orchestration schedule — and poisons the parent with no matching schedule for sub-orchestration id=2. I verified the test fails without the stamp.

Full suite (--all-features and no-features), clippy, and doctests pass.

tjgreen42 and others added 2 commits June 5, 2026 23:21

tjgreen42 changed the title ~~Fail sub-orchestration fast when its instance id is already terminal~~ Prevent parent hangs from sub-orchestration instance-id collisions Jun 8, 2026

tjgreen42 marked this pull request as draft June 8, 2026 02:56

tjgreen42 marked this pull request as ready for review June 8, 2026 18:36

tjgreen42 requested review from affandar and pinodeca June 8, 2026 18:36

pinodeca requested changes Jun 9, 2026

View reviewed changes

Comment thread src/runtime/dispatchers/orchestration.rs Outdated

Comment thread src/client/mod.rs Outdated

Comment thread tests/scenarios/suborch_id_collision.rs

Comment thread src/runtime/execution.rs Outdated

Merge remote-tracking branch 'origin/main' into fix/suborch-id-squat-…

ece82d2

…hang # Conflicts: # CHANGELOG.md

tjgreen42 requested a review from pinodeca June 10, 2026 03:04

affandar reviewed Jun 17, 2026

View reviewed changes

tjgreen42 added 2 commits June 18, 2026 03:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent parent hangs from sub-orchestration instance-id collisions#28

Prevent parent hangs from sub-orchestration instance-id collisions#28
tjgreen42 wants to merge 6 commits into
mainfrom
fix/suborch-id-squat-hang

tjgreen42 commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pinodeca commented Jun 9, 2026 •

edited

Loading

Uh oh!

tjgreen42 commented Jun 9, 2026

Uh oh!

affandar left a comment

Uh oh!

affandar commented Jun 17, 2026

Uh oh!

tjgreen42 commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tjgreen42 commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pinodeca commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tjgreen42 commented Jun 9, 2026

Uh oh!

affandar left a comment

Choose a reason for hiding this comment

Uh oh!

affandar commented Jun 17, 2026

1. Test/scenario shape

2. Route child responses to the scheduling execution

Uh oh!

tjgreen42 commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tjgreen42 commented Jun 5, 2026 •

edited

Loading

pinodeca commented Jun 9, 2026 •

edited

Loading