Fix Stackdriver log read filter to use task_instance_id with backward compatibility#68293
23tae wants to merge 6 commits into
44b8ea2 to dbca0b3
Human Review
I'd note that there's a competing PR (#68293). The first PR that I see fully addressing the bug will be merged.
AI Review
Thanks @23tae — and nice that this follows @ashb's guidance in #68240 to key off ti.id with a backward-compat fallback. That's the right direction. The problem is that, as written, the change doesn't reach the path Bug 2 is actually about — the supervisor read path — so the reported bug stays unfixed. A few specifics:
1. The supervisor read() is never touched, and still crashes. StackdriverRemoteLogIO.read() still does ti_labels = _task_instance_to_labels(ti), and that helper evaluates ti.logical_date.isoformat(). In the supervisor, ti is a RuntimeTaskInstance, whose protocol (task-sdk RuntimeTaskInstanceProtocol) has no logical_date — that field lives on DagRunProtocol and needs a DB lookup the supervisor can't do. So this raises AttributeError before prepare_log_filter is ever called. The new OR logic and task_instance_id label never execute on this path.
2. The supervisor write path doesn't stamp task_instance_id. Even past the crash, the OR primary branch filters labels.task_instance_id=..., but the supervisor processors closure builds labels from the event dict (dag_id/task_id/run_id/try_number/map_index) and never writes a task_instance_id label — so nothing would match. The good news: ti_id is already bound into the log context at task_runner.py:1072 (bind_contextvars(ti_id=str(msg.ti.id), …)), so the write side can stamp it with event.get("ti_id") — it just isn't read today.
3. The OR fallback is logical_date-based, which is wrong for supervisor logs. Since #68292, the supervisor has been writing run_id-labelled logs (never logical_date). The legacy logical_date fallback you've added is correct for the webserver handler's old logs, but it wouldn't recover the run_id-labelled logs the supervisor has actually been emitting. A supervisor-side fallback needs to be run_id-based.
4. Minor — dead-code edit. task_instance_id is added to both the module-level _task_instance_to_labels and the StackdriverTaskHandler._task_instance_to_labels classmethod, but nothing calls the classmethod (every read/emit/url path uses the module-level function), so that half of the change has no effect.
Worth noting the green CI is a bit misleading here: TestStackdriverRemoteLogIO.test_read_logs passes only because it uses an unspec'd MagicMock() ti with a hand-set ti.logical_date, which invents the attribute a real RuntimeTaskInstance doesn't have — masking the crash above. A test using a spec'd runtime TI (no logical_date) would catch it.
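The masking effect described above can be reproduced with the standard library alone. This is a minimal sketch, not the provider's test code: `RuntimeTILike` and `labels_via_logical_date` are hypothetical stand-ins invented for illustration, and the only assumption carried over from the review is that the runtime task instance has no `logical_date` attribute.

```python
from unittest.mock import MagicMock


class RuntimeTILike:
    """Hypothetical stand-in for a runtime task instance: no logical_date."""

    id = "0190a1b2-0000-7000-8000-000000000001"
    dag_id = "example_dag"
    task_id = "example_task"
    run_id = "manual__2026-01-01"
    try_number = 1
    map_index = -1


def labels_via_logical_date(ti):
    """Hypothetical helper that reads ti.logical_date, as the review describes."""
    return {"task_id": ti.task_id, "logical_date": ti.logical_date.isoformat()}


# An unspec'd mock invents any attribute on access, so the helper appears to work.
loose_ti = MagicMock()
loose_ti.task_id = "example_task"
loose_ti.logical_date.isoformat.return_value = "2026-01-01T00:00:00+00:00"
loose_labels = labels_via_logical_date(loose_ti)

# A spec'd mock only exposes attributes that exist on the spec class,
# so reading logical_date raises AttributeError.
strict_ti = MagicMock(spec=RuntimeTILike)
strict_ti.task_id = "example_task"
try:
    labels_via_logical_date(strict_ti)
    crashed = False
except AttributeError:
    crashed = True
```

With the unspec'd mock the helper returns labels; with the spec'd mock the same call fails, which is the behaviour a regression test for the supervisor path would need to observe.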
To make this the actual Bug-2 fix (still on your ti.id approach), I think it needs:
- Supervisor write (processors closure): add task_instance_id from event.get("ti_id").
- Supervisor read (read()): build the filter from ti.id directly, not via the logical_date helper.
- Backward-compat: keep the OR, but make the supervisor fallback run_id-based (matching #68292), reserving the logical_date fallback for the legacy webserver handler.
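The three points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the provider's implementation: `stamp_labels` and `build_read_filter` are hypothetical names, the event keys are taken from the review text, and label values are assumed to contain no double quotes (no escaping is done).

```python
from typing import Any, Mapping, Optional

# Label keys the review says the supervisor already writes.
LEGACY_KEYS = ("dag_id", "task_id", "run_id", "try_number", "map_index")


def stamp_labels(event: Mapping[str, Any]) -> dict:
    """Write side (hypothetical): build labels from a log event dict."""
    labels = {key: str(event[key]) for key in LEGACY_KEYS if key in event}
    ti_id = event.get("ti_id")
    if ti_id is not None:
        # New label, so the read side can match on the task instance id.
        labels["task_instance_id"] = str(ti_id)
    return labels


def _all_of(labels: Mapping[str, str]) -> str:
    """Join label constraints with AND in Cloud Logging filter syntax."""
    return " AND ".join(f'labels.{k}="{v}"' for k, v in labels.items())


def build_read_filter(ti_id: Optional[str], fallback: Mapping[str, str]) -> str:
    """Read side (hypothetical): id match OR the run_id-based legacy labels."""
    legacy = _all_of(fallback)
    if not ti_id:
        # No id available: fall back to the plain AND filter.
        return legacy
    return f'(labels.task_instance_id="{ti_id}") OR ({legacy})'


event = {"dag_id": "d", "task_id": "t", "run_id": "r1",
         "try_number": 1, "map_index": -1, "ti_id": "abc"}
labels = stamp_labels(event)
fallback = {k: v for k, v in labels.items() if k != "task_instance_id"}
filter_with_id = build_read_filter(labels.get("task_instance_id"), fallback)
filter_without_id = build_read_filter(None, fallback)
```

Keeping the fallback keyed on run_id rather than logical_date means logs written before the id label existed would still match, while new logs match on the id alone.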
The webserver-handler precision improvement here is genuinely nice and worth keeping — it's just currently riding ahead of the supervisor fix it's filed under. Happy to look again once the supervisor read/write paths are covered.
🤖 This review was drafted by an AI-assisted tool and may contain mistakes. An Airflow maintainer has reviewed and confirmed it before posting. See the contributing docs for what a maintainer review involves.
Thanks for the detailed review, @ashb. I overlooked the supervisor read path. I have updated the PR to address these points:
9cf3784 to a9747ec
a9747ec to 845eac6
This PR updates the Stackdriver log read filter to use the unique task_instance_id (ti.id) for accurate log retrieval, while adding an OR fallback query to keep legacy logs readable.
Description
This PR addresses Bug 2 described in #68240.
With the introduction of Airflow 3, ti.id provides a guaranteed unique identifier for task runs across retries. The previous filter logic relied only on legacy labels (task_id, dag_id, etc.), which lacked precision. However, as noted by @ashb in the issue, simply replacing the filter would break reading legacy logs.
Key changes
Compatibility logic: Updated prepare_log_filter to generate an OR condition when ti.id is available.
Generated Filter Example:
If the task instance lacks ti.id (e.g., older logs), it falls back safely to the traditional AND filter.
Verification Results
I have verified the changes using prek and breeze.
related: #68240
Was generative AI tooling used to co-author this PR?
Generated-by: Antigravity following the guidelines
Note
✅ Ready for review · @23tae → @potiuk · 2026-06-12 13:30 UTC
Thanks @23tae, all checks are green and this PR is marked ready for maintainer review. The ball is with the maintainers now; a maintainer will take the next look.
Automated triage — may be imperfect.