Add run_type tag to dagrun.duration.failed timeout metric#67765

Open
Vamsi-klu wants to merge 1 commit into apache:main from Vamsi-klu:fix/64765-dagrun-duration-failed-run-type-tag

Conversation

@Vamsi-klu
Contributor

What / impact

When a Dag run timed out, the scheduler emitted the dagrun.duration.failed timing metric with only a dag_id tag. Every other dagrun.duration.* emission also carries run_type, so dashboards and alerts that segment by run_type silently dropped timed-out runs (or saw an inconsistent tag set). After this change the timeout path emits both dag_id and run_type, consistent with normal completion.

Why

The timeout branch in _schedule_dag_run hardcoded tags={"dag_id": dag_run.dag_id}, while the canonical emitter DagRun._emit_duration_stats_for_finished_state uses dag_run.stats_tags (which includes run_type). This replaces the hardcoded dict with dag_run.stats_tags. run_type is NOT NULL and low-cardinality, so the new tag set is a safe strict superset of the old one and the metric name is unchanged.
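The difference between the two tag sets can be illustrated with a minimal sketch. `FakeDagRun`, `old_timeout_tags`, and `new_timeout_tags` are hypothetical stand-ins written for this illustration, not Airflow code; the only assumption taken from the description above is that `stats_tags` carries both `dag_id` and `run_type`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for a Dag run; not Airflow's DagRun model."""

    dag_id: str
    run_type: str

    @property
    def stats_tags(self) -> dict[str, str]:
        # Assumption: mirrors the description above, where stats_tags
        # includes both dag_id and run_type.
        return {"dag_id": self.dag_id, "run_type": self.run_type}


def old_timeout_tags(dag_run: FakeDagRun) -> dict[str, str]:
    # Behaviour before the change: a hardcoded dict with dag_id only.
    return {"dag_id": dag_run.dag_id}


def new_timeout_tags(dag_run: FakeDagRun) -> dict[str, str]:
    # Behaviour after the change: reuse the shared tag set.
    return dag_run.stats_tags


def is_strict_superset(new: dict[str, str], old: dict[str, str]) -> bool:
    # dict item views are set-like, so "<" checks for a proper subset.
    return old.items() < new.items()


run = FakeDagRun(dag_id="example_dag", run_type="scheduled")
print(is_strict_superset(new_timeout_tags(run), old_timeout_tags(run)))
```

Because every key/value pair in the old dict survives unchanged, consumers that filter only on `dag_id` should keep matching, while `run_type`-segmented queries gain the timed-out runs.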

How it unblocks

Makes timeout-duration metrics consistent and usable in run_type-segmented observability. Supersedes #64768, which stalled on CI/static checks rather than design. This PR keeps the diff to a single value change and ships green static checks, including the metrics-registry sync check that blocked the prior attempt.

Tests

One new regression test in test_scheduler_job.py, verified to fail on the unfixed code, asserting the dagrun.duration.failed timing call carries tags={"dag_id": …, "run_type": …}. Full scheduler suite passes; mypy + pre-commit + manual static checks (incl. metrics-registry sync) are green.
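The shape of such a regression assertion might look roughly like the sketch below. It is not the test added in this PR: `emit_timeout_duration` is a hypothetical helper standing in for the scheduler's timeout branch, and the stats client is a plain `MagicMock` rather than Airflow's real stats object.

```python
from datetime import timedelta
from types import SimpleNamespace
from unittest import mock


def emit_timeout_duration(stats, dag_run, duration):
    # Hypothetical helper: stands in for the timeout branch, which after
    # the change passes dag_run.stats_tags instead of a hardcoded dict.
    stats.timing("dagrun.duration.failed", duration, tags=dag_run.stats_tags)


def check_timeout_metric_has_run_type():
    stats = mock.MagicMock()  # mocked stats client, an assumption
    dag_run = SimpleNamespace(
        dag_id="example_dag",
        run_type="scheduled",
        stats_tags={"dag_id": "example_dag", "run_type": "scheduled"},
    )
    duration = timedelta(seconds=90)

    emit_timeout_duration(stats, dag_run, duration)

    # Fails if the call carries only dag_id, as the unfixed code did.
    stats.timing.assert_called_once_with(
        "dagrun.duration.failed",
        duration,
        tags={"dag_id": "example_dag", "run_type": "scheduled"},
    )
    return stats.timing.call_args.kwargs["tags"]


check_timeout_metric_has_run_type()
```

Asserting on the full `tags` dict, rather than only checking that `run_type` is present, also catches accidental extra or renamed tags.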

closes: #64765


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label May 30, 2026
When a Dag run timed out, the scheduler emitted dagrun.duration.failed with
only a dag_id tag, while every other dagrun.duration.* metric (emitted by
DagRun._emit_duration_stats_for_finished_state) also carries run_type. This
made the timeout path inconsistent and dropped run_type from dashboards and
alerts for timed-out runs.

Emit the metric with dag_run.stats_tags so it includes both dag_id and
run_type, matching the canonical completion path.

closes: apache#64765
@Vamsi-klu Vamsi-klu force-pushed the fix/64765-dagrun-duration-failed-run-type-tag branch from 0b1b2d3 to 035b401 Compare May 30, 2026 04:20
@Vamsi-klu Vamsi-klu marked this pull request as ready for review May 30, 2026 04:22
@Vamsi-klu Vamsi-klu requested review from XD-DENG and ashb as code owners May 30, 2026 04:22
@Vamsi-klu
Contributor Author

Vamsi-klu commented May 30, 2026

Gentle ping when you have bandwidth, @potiuk 🙏

One-line metrics fix for #64765: the Dag-run timeout path emitted dagrun.duration.failed with only a dag_id tag, missing the run_type tag that every other dagrun.duration.* emission carries (via DagRun._emit_duration_stats_for_finished_state). This switches it to dag_run.stats_tags so the two paths are consistent.

You commented on the prior attempt (#64768), which stalled on CI/static checks rather than design, so this PR keeps the diff to a single value change and ships green static checks, including the metrics-registry sync check. A regression test is included.

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dagrun.duration.failed metric missing run_type tag when failure caused by dagrun_timeout

1 participant