Add run_type tag to dagrun.duration.failed timeout metric #67765
Open
Vamsi-klu wants to merge 1 commit into
When a Dag run timed out, the scheduler emitted dagrun.duration.failed with only a dag_id tag, while every other dagrun.duration.* metric (emitted by DagRun._emit_duration_stats_for_finished_state) also carries run_type. This made the timeout path inconsistent and dropped run_type from dashboards and alerts for timed-out runs. Emit the metric with dag_run.stats_tags so it includes both dag_id and run_type, matching the canonical completion path. closes: apache#64765
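The difference between the two tag sets can be illustrated with a minimal, self-contained sketch. It does not import Airflow: `FakeDagRun`, the body of its `stats_tags` property, and the example values are hypothetical stand-ins that only mirror what the description states (that `stats_tags` carries both `dag_id` and `run_type`).

```python
from dataclasses import dataclass


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for a Dag run; not the Airflow model."""

    dag_id: str
    run_type: str

    @property
    def stats_tags(self) -> dict:
        # Assumed shape: the PR states stats_tags includes dag_id and run_type.
        return {"dag_id": self.dag_id, "run_type": self.run_type}


dag_run = FakeDagRun(dag_id="example_dag", run_type="scheduled")

# Tag set the timeout branch hardcoded before the change.
old_tags = {"dag_id": dag_run.dag_id}
# Tag set the timeout branch emits after the change.
new_tags = dag_run.stats_tags

# Every old key/value pair is still present, plus run_type.
assert old_tags.items() < new_tags.items()
```

Because the old pairs are all retained, a dashboard that filters only on `dag_id` should keep matching, while one that groups by `run_type` can now see timed-out runs.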
Force-pushed from 0b1b2d3 to 035b401
Gentle ping when you have bandwidth, @potiuk 🙏 One-line metrics fix for #64765: the Dag-run timeout path emitted dagrun.duration.failed without the run_type tag. You commented on the prior attempt (#64768), which stalled on CI/static checks rather than design, so this keeps the diff to a single value change and ships green static checks, including the metrics-registry sync check. Regression test included.
What / impact
When a Dag run timed out, the scheduler emitted the dagrun.duration.failed timing metric with only a dag_id tag. Every other dagrun.duration.* emission also carries run_type, so dashboards and alerts that segment by run_type silently dropped timed-out runs (or saw an inconsistent tag set). After this change the timeout path emits both dag_id and run_type, consistent with normal completion.
Why
The timeout branch in _schedule_dag_run hardcoded tags={"dag_id": dag_run.dag_id}, while the canonical emitter DagRun._emit_duration_stats_for_finished_state uses dag_run.stats_tags (which includes run_type). This replaces the hardcoded dict with dag_run.stats_tags. run_type is NOT NULL and low-cardinality, so the new tag set is a safe strict superset of the old one and the metric name is unchanged.
How it unblocks
Makes timeout-duration metrics consistent and usable in run_type-segmented observability. Supersedes #64768, which stalled on CI/static checks rather than design; this keeps the diff to a single value change and ships green static checks, including the metrics-registry sync check that blocked the prior attempt.
Tests
One new regression test in test_scheduler_job.py, verified to fail on the unfixed code, asserting the dagrun.duration.failed timing call carries tags={"dag_id": …, "run_type": …}. Full scheduler suite passes; mypy + pre-commit + manual static checks (incl. metrics-registry sync) are green.
closes: #64765
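The shape of such a regression check can be sketched without Airflow, using only `unittest.mock`. This is not the test added in test_scheduler_job.py: `emit_timeout_duration` is a hypothetical stand-in for the timeout branch, the `stats` object is a mock rather than the real metrics client, and the example tag values and duration are made up.

```python
from unittest import mock


def emit_timeout_duration(stats, dag_run, duration):
    """Hypothetical stand-in for the timeout branch's metric emission."""
    # After the change, tags come from dag_run.stats_tags, not a hardcoded dict.
    stats.timing("dagrun.duration.failed", duration, tags=dag_run.stats_tags)


def check_timeout_metric_tags():
    # Mocks replace both the metrics client and the Dag run.
    stats = mock.MagicMock()
    dag_run = mock.MagicMock()
    dag_run.stats_tags = {"dag_id": "example_dag", "run_type": "scheduled"}

    emit_timeout_duration(stats, dag_run, duration=12.5)

    # The timing call must carry both dag_id and run_type.
    stats.timing.assert_called_once_with(
        "dagrun.duration.failed",
        12.5,
        tags={"dag_id": "example_dag", "run_type": "scheduled"},
    )
    return stats.timing.call_args.kwargs["tags"]


tags = check_timeout_metric_tags()
```

If the stand-in emitted only `{"dag_id": ...}`, `assert_called_once_with` would raise `AssertionError`, which is the same failure mode the description reports for the unfixed code.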
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines