Prevent parent hangs from sub-orchestration instance-id collisions#28
Prevent parent hangs from sub-orchestration instance-id collisions#28tjgreen42 wants to merge 6 commits into
Conversation
A sub-orchestration's child instance id is auto-generated as
{parent}::sub::{event_id}. If an instance with that id already exists in a
terminal state, the orchestration dispatcher's terminal fast-ack path discarded
the incoming StartOrchestration without notifying the scheduling parent, leaving
the parent awaiting a completion that never arrived.
The dispatcher now enqueues a SubOrchFailed back to the scheduling parent when a
discarded terminal batch contains a StartOrchestration whose parent differs from
the terminal instance's own recorded parent, so the parent fails fast. Genuine
redelivery of a completed child's start (same parent and id) is left untouched.
Client::start_orchestration and start_orchestration_versioned also now reject
instance ids that use the reserved sub:: marker, so a caller cannot occupy a
future child id. Other uses of :: remain valid.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Event ids reset on continue-as-new, so a parent that schedules a
sub-orchestration at the same position on every iteration regenerated an
identical child id (`{parent}::sub::{event_id}`) and collided with the
now-terminal child from the previous iteration, hanging the parent.
Auto-generated child ids now include the parent execution after the first
(`{parent}::sub::{exec}_{event_id}`), and the runtime records the parent's
current execution at turn start so the child's completion routes to the
right iteration. The first execution's id format is unchanged and child ids
are persisted in history, so in-flight instances and mixed-version clusters
are unaffected.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Two non-inline review items from the PAW review did not attach as GitHub diff threads because their target files/lines are outside the PR diff. Capturing them here so they are visible with the submitted review.
Severity: Should-fix / Medium The new top-level client validation reserves Suggestion: apply the same reserved-marker validation to explicit child ids before emitting Evidence:
Severity: Should-fix / Medium The client start APIs now reject instance ids containing the reserved sub-orchestration marker. That is an externally observable compatibility change for applications that previously used Suggestion: add a changelog entry and migration guidance before release. The notes should call out that root orchestration ids starting with Suggested wording: ### Changed
- Reserved the `sub::` marker for runtime-generated sub-orchestration instance ids.
`Client::start_orchestration` and `Client::start_orchestration_versioned` now
return `ClientError::InvalidInput` for root instance ids that start with `sub::`
or contain `::sub::`; other uses of `::` remain supported.Evidence: |
Route sub-orchestration completion/failure notifications using the parent's current execution id read from durable provider state (Provider::read) instead of a process-local cache. A restarted or different runtime previously defaulted to execution 1 on a cache miss, so a parent on execution 2+ filtered the notification during replay and hung forever. Removes the current_execution_ids cache entirely (also eliminating its unbounded per-turn growth), updates comments to cover the post-continue-as-new child-id form, documents the explicit-id escape hatch, adds CHANGELOG/migration notes for the reserved sub:: marker, makes the redelivery test deterministic, and adds a cross-runtime regression test where a fresh runtime that never ran the parent must still route the failure to execution 2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed the two remaining (non-inline) items in eeee0a2:
Full suite ( |
…hang # Conflicts: # CHANGELOG.md
affandar
left a comment
There was a problem hiding this comment.
Two review notes from walking through the failure mode:
-
The CAN bug is real, but the collision tests should be narrowed to that scenario. Since auto-generated child IDs include the parent instance ID, children from unrelated parents should not collide when parent/root instance IDs are unique across the provider. The convincing case is same parent across
continue_as_newexecutions: execution 1 schedules event 2 asP::sub::2, then execution 2 schedules the same event position and would also generateP::sub::2without the execution-scoped suffix. The "foreign terminal/squatter" tests are more of a legacy/manual/provider-bypass defense and should not be presented as the normal auto-generated collision case. -
Parent notification routing should use the execution that scheduled the child, not resolve the parent's current execution when the child completes. The stronger model is to stamp
parent_execution_idonto the childStartOrchestrationwork item at schedule time, persist/carry that through the child's parent link, and use it directly forSubOrchCompleted/SubOrchFailed. That avoids the process-local cache problem, avoids a provider read, and avoids a TOCTOU race where the parent's current execution at completion time differs from the execution that scheduled the child.
|
Expanding on my earlier review comment with a concrete fix shape. 1. Test/scenario shapeThe The real CAN failure is: A focused regression could look like this: #[tokio::test]
async fn continue_as_new_suborch_child_ids_include_execution_after_first() {
let store: Arc<dyn Provider> = Arc::new(SqliteProvider::new_in_memory().await.unwrap());
let parent_id = "can-child-id-job";
let parent = |ctx: OrchestrationContext, input: String| async move {
let n: u32 = input.parse().unwrap_or(0);
let result = ctx.schedule_sub_orchestration("Child", n.to_string()).await?;
if n < 2 {
return ctx.continue_as_new((n + 1).to_string()).await;
}
Ok(format!("done:{n}:{result}"))
};
let child = |_ctx: OrchestrationContext, input: String| async move {
Ok(format!("child-done:{input}"))
};
let orchestrations = OrchestrationRegistry::builder()
.register("Parent", parent)
.register("Child", child)
.build();
let activities = ActivityRegistry::builder().build();
let rt = runtime::Runtime::start_with_store(store.clone(), activities, orchestrations).await;
let client = Client::new(store.clone());
client.start_orchestration(parent_id, "Parent", "0").await.unwrap();
let status = client
.wait_for_orchestration(parent_id, Duration::from_secs(5))
.await
.expect("parent must not hang from reusing the same child id after continue-as-new");
assert!(
matches!(&status, OrchestrationStatus::Completed { output, .. } if output == "done:2:child-done:2"),
"parent should complete after three executions; got {status:?}"
);
let mut scheduled_child_suffixes = Vec::new();
for execution_id in 1..=3 {
let history = store.read_with_execution(parent_id, execution_id).await.unwrap();
let child_suffix = history
.iter()
.find_map(|event| match &event.kind {
EventKind::SubOrchestrationScheduled { instance, .. } => Some(instance.clone()),
_ => None,
})
.unwrap_or_else(|| panic!("execution {execution_id} should schedule a sub-orchestration"));
scheduled_child_suffixes.push(child_suffix);
}
assert_eq!(
scheduled_child_suffixes,
vec!["sub::2".to_string(), "sub::2_2".to_string(), "sub::3_2".to_string()],
);
rt.shutdown(None).await;
}With only the old id generation, this hangs because execution 2 tries to start The foreign/squatter tests can still be useful as legacy/manual/provider-bypass defense, but I would name and document them that way rather than treating them as the normal auto-generated collision scenario. 2. Route child responses to the scheduling executionThe response should be addressed to the parent execution that scheduled the child, not to whatever execution is current when the child completes. The current PR fixes the cache miss by reading the parent's current execution from the provider at completion time. That works for the focused awaited CAN test because the parent cannot advance while it is awaiting the child. But it is still a weaker invariant and it keeps a TOCTOU window: the parent's current execution at completion time is not necessarily the execution that scheduled the child. The stronger model is to stamp the parent execution id onto the child start at schedule time and carry that through to completion/failure routing. Sketch: pub enum WorkItem {
StartOrchestration {
instance: String,
orchestration: String,
input: String,
version: Option<String>,
parent_instance: Option<String>,
#[serde(default, skip_serializing_if = "Option::is_none")]
parent_execution_id: Option<u64>,
parent_id: Option<u64>,
execution_id: u64,
},
}When the parent schedules the sub-orchestration: WorkItem::StartOrchestration {
instance: child_full,
orchestration: name.clone(),
input: input.clone(),
version: version.clone(),
parent_instance: Some(instance.to_string()),
parent_execution_id: Some(execution_id), // source parent execution
parent_id: Some(*scheduling_event_id), // source schedule event id
execution_id: INITIAL_EXECUTION_ID, // child starts at execution 1
}Then persist that parent execution id in the child start metadata/history link as optional/defaulted for compatibility. For example: EventKind::OrchestrationStarted {
name,
version,
input,
parent_instance,
#[serde(default, skip_serializing_if = "Option::is_none")]
parent_execution_id,
parent_id,
carry_forward_events,
initial_custom_status,
}And when the child completes or fails: if let Some((parent_instance, parent_execution_id, parent_id)) = parent_link {
orchestrator_items.push(WorkItem::SubOrchCompleted {
parent_instance,
parent_execution_id,
parent_id,
result,
});
}For rolling upgrade / old history compatibility: let parent_execution_id = match stamped_parent_execution_id {
Some(id) => id,
None => self.parent_execution_id_for_routing(&parent_instance).await, // legacy fallback only
};That makes the new path exact and durable: This avoids the old process-local cache problem, avoids loading parent history on every child completion, and avoids retargeting a child response to a later parent execution just because that later execution is current when the child completes. |
Route sub-orchestration completion/failure notifications to the exact parent execution that scheduled the child by stamping parent_execution_id onto the child StartOrchestration work item at schedule time and persisting it in the child's OrchestrationStarted event. All notification sites now use the stamped value, falling back to a durable provider read only when it is absent (children/work items from older runtimes). This removes the prior TOCTOU window where the parent's current execution at completion time could differ from the execution that scheduled the child, and avoids loading parent history on every child completion. Both new fields are Option<u64> with serde default + skip_serializing_if, so the wire and history formats stay backward compatible and mixed-version clusters route correctly during rolling upgrades. Restructure the collision scenario tests: make the same-parent continue-as-new self-collision the primary regression, asserting the exact per-execution child suffixes (sub::2, sub::2_2, sub::3_2). Relabel and rename the foreign-terminal "squatter" tests to legacy_provider_bypass_* and document them as legacy/provider-bypass defenses rather than the normal auto-generated collision case. Update CHANGELOG and the migration guide.
Adds suborch_completion_routes_to_scheduling_execution_not_current, proving
the durable parent_execution_id stamp addresses a child's completion to the
execution that scheduled it rather than the parent's current execution at
completion time.
A child outlives execution 1 via continue-as-new (which does not cancel
outstanding children) and completes while the parent is on execution 2.
With the stamp, the SubOrchCompleted is addressed to execution 1 (terminal)
and the replay execution filter discards it, leaving execution 2 untouched.
Without it, current-execution routing addresses the notification to
execution 2, where event id 2 is not a sub-orchestration schedule, so it is
applied as a nondeterministic completion and poisons the parent. Verified
the test fails ("no matching schedule for sub-orchestration id=2") when
routing falls back to the parent's current execution.
|
Both points are addressed — 45849ec (durable routing + test restructure) and 091fdd7 (the routing regression you asked for), now pushed. 1. Test/scenario shape. The same-parent continue-as-new self-collision is now the primary regression: 2. Route to the scheduling execution. Implemented as you sketched: To prove the stamp matters beyond the awaited case (where the parent is blocked and the two coincide), 091fdd7 adds Full suite ( |
A sub-orchestration's child instance id is auto-generated as
{parent}::sub::{event_id}. Two situations could leave the parent awaiting a completion that never arrives.If an unrelated instance already held that id in a terminal state, the dispatcher's terminal fast-ack path discarded the incoming StartOrchestration silently. It now enqueues a SubOrchFailed back to the scheduling parent when the discarded batch's parent differs from the terminal instance's own recorded parent — that mismatch check keeps ordinary at-least-once redelivery of a completed child's own start from re-failing the parent. The client start APIs also reject ids using the reserved
sub::marker; plain::(e.g.i4::child-1) stays valid.A parent that schedules a sub-orchestration inside a continue_as_new loop hit the same hang against itself: event ids reset each iteration, so every iteration regenerated the same child id and collided with the now-terminal child from the previous one. Auto-generated ids now include the parent execution after the first (
{parent}::sub::{exec}_{event_id}), and the runtime records the parent's current execution so the child's completion routes to the right iteration. The first execution's id format is unchanged and child ids are persisted in history, so in-flight instances and mixed-version clusters are unaffected.Both hangs are pinned by regression tests in
tests/scenarios/suborch_id_collision.rs; each reverts to a parent timeout when its fix is removed (verified). A third test confirms genuine redelivery of a completed child leaves the parent untouched.