Fix Stackdriver log read filter to use task_instance_id with backward compatibility #68293

Open
23tae wants to merge 6 commits into apache:main from 23tae:fix-gcl-read-filter


23tae commented Jun 9, 2026

Contributor

This PR updates the Stackdriver log read filter to use the unique task_instance_id (ti.id) for accurate log retrieval, with an OR fallback query that keeps legacy logs readable.

Description

This PR addresses Bug 2 described in #68240.

With the introduction of Airflow 3, ti.id provides a guaranteed unique identifier for task runs across retries. The previous filter logic relied only on legacy labels (task_id, dag_id, etc.), which are less precise. However, as noted by @ashb in the issue, simply replacing the filter would break reading legacy logs.

Key changes

  • Compatibility logic: Updated prepare_log_filter to generate an OR condition when ti.id is available.

    Generated Filter Example:

    (labels.task_instance_id="<uuid>" OR (labels.task_id="..." AND labels.dag_id="..." AND labels.try_number="..."))
    

    If the task instance lacks ti.id (e.g., older logs), it falls back safely to the traditional AND filter.

Verification Results

I have verified the changes using prek and breeze.

  • Static Checks (Prek): Passed
  • Unit Tests (Breeze): Passed

related: #68240


Was generative AI tooling used to co-author this PR?
  • Yes

Generated-by: Antigravity following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

Note

✅ Ready for review · @23tae @potiuk · 2026-06-12 13:30 UTC

Thanks @23tae — all checks are green and this PR is marked ready for maintainer review. The ball is with the maintainers now; a maintainer will take the next look.

Automated triage — may be imperfect.

shahar1 left a comment

Contributor


Human Review

I'd note that there's a competing PR (#68293). The first PR that I see fully addressing the bug will be merged.

AI Review

Thanks @23tae — and nice that this follows @ashb's guidance in #68240 to key off ti.id with a backward-compat fallback. That's the right direction. The problem is that, as written, the change doesn't reach the path Bug 2 is actually about — the supervisor read path — so the reported bug stays unfixed. A few specifics:

1. The supervisor read() is never touched, and still crashes. StackdriverRemoteLogIO.read() still does ti_labels = _task_instance_to_labels(ti), and that helper evaluates ti.logical_date.isoformat(). In the supervisor, ti is a RuntimeTaskInstance, whose protocol (task-sdk RuntimeTaskInstanceProtocol) has no logical_date — that field lives on DagRunProtocol and needs a DB lookup the supervisor can't do. So this raises AttributeError before prepare_log_filter is ever called. The new OR logic and task_instance_id label never execute on this path.

2. The supervisor write path doesn't stamp task_instance_id. Even past the crash, the OR primary branch filters labels.task_instance_id=..., but the supervisor processors closure builds labels from the event dict (dag_id/task_id/run_id/try_number/map_index) and never writes a task_instance_id label — so nothing would match. The good news: ti_id is already bound into the log context at task_runner.py:1072 (bind_contextvars(ti_id=str(msg.ti.id), …)), so the write side can stamp it with event.get("ti_id") — it just isn't read today.

3. The OR fallback is logical_date-based, which is wrong for supervisor logs. Since #68292, the supervisor has been writing run_id-labelled logs (never logical_date). The legacy logical_date fallback you've added is correct for the webserver handler's old logs, but it wouldn't recover the run_id-labelled logs the supervisor has actually been emitting. A supervisor-side fallback needs to be run_id-based.

4. Minor — dead-code edit. task_instance_id is added to both the module-level _task_instance_to_labels and the StackdriverTaskHandler._task_instance_to_labels classmethod, but nothing calls the classmethod (every read/emit/url path uses the module-level function), so that half of the change has no effect.
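To make point 2 concrete, here is a rough sketch of how the write side could stamp the label. `labels_from_event` is a hypothetical helper, not the provider's actual processors closure; the event keys are the ones named in the review, and the only assumption about `ti_id` is that it arrives in the event dict as described.

```python
from typing import Any, Dict, Mapping

# Event keys the review says the supervisor already copies into labels.
_LABEL_KEYS = ("dag_id", "task_id", "run_id", "try_number", "map_index")


def labels_from_event(event: Mapping[str, Any]) -> Dict[str, str]:
    """Build Cloud Logging labels from a log event dict (hypothetical helper)."""
    labels = {key: str(event[key]) for key in _LABEL_KEYS if event.get(key) is not None}
    # ti_id is bound into the log context upstream; stamp it only when present
    # so events emitted outside a task context are left unchanged.
    ti_id = event.get("ti_id")
    if ti_id is not None:
        labels["task_instance_id"] = str(ti_id)
    return labels
```

With this in place, a read filter on `labels.task_instance_id` would have something to match for newly written entries, while events without `ti_id` keep their existing label set.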

Worth noting the green CI is a bit misleading here: TestStackdriverRemoteLogIO.test_read_logs passes only because it uses an unspec'd MagicMock() ti with a hand-set ti.logical_date, which invents the attribute a real RuntimeTaskInstance doesn't have — masking the crash above. A test using a spec'd runtime TI (no logical_date) would catch it.
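The masking effect is easy to reproduce with the standard library alone. In the sketch below, `RuntimeTILike` is a hypothetical stand-in for the runtime TI protocol (the real one lives in task-sdk and may differ), and `labels_via_logical_date` only mimics the helper behaviour described above.

```python
from datetime import datetime, timezone
from unittest.mock import MagicMock


class RuntimeTILike:
    """Hypothetical stand-in for the runtime TI protocol; it has no logical_date."""

    id = "00000000-0000-0000-0000-000000000000"
    dag_id = "example_dag"
    task_id = "example_task"
    run_id = "manual__2026-01-01"
    try_number = 1
    map_index = -1


def labels_via_logical_date(ti) -> dict:
    """Mimics a helper that reads ti.logical_date, as the review describes."""
    return {"logical_date": ti.logical_date.isoformat()}


def crashes_on(ti) -> bool:
    """Report whether the helper raises AttributeError for this task instance."""
    try:
        labels_via_logical_date(ti)
    except AttributeError:
        return True
    return False


# Unspec'd mock with a hand-set attribute: the helper appears to work.
loose_ti = MagicMock()
loose_ti.logical_date = datetime(2026, 1, 1, tzinfo=timezone.utc)

# Spec'd mock: only attributes present on RuntimeTILike can be read.
strict_ti = MagicMock(spec=RuntimeTILike)
```

Here `crashes_on(loose_ti)` is `False` while `crashes_on(strict_ti)` is `True`, which is the difference between the passing test and the behaviour the review describes for the supervisor.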

To make this the actual Bug-2 fix (still on your ti.id approach), I think it needs:

  1. Supervisor write (processors closure): add task_instance_id from event.get("ti_id").
  2. Supervisor read (read()): build the filter from ti.id directly, not via the logical_date helper.
  3. Backward-compat: keep the OR, but make the supervisor fallback run_id-based (matching #68292), reserving the logical_date fallback for the legacy webserver handler.
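Steps 2 and 3 might look roughly like the sketch below. `RuntimeTI` and `supervisor_read_filter` are hypothetical names used for illustration; the fallback labels are the ones the review lists for supervisor-written logs, and value escaping is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeTI:
    """Hypothetical runtime task instance; deliberately has no logical_date."""

    id: str
    dag_id: str
    task_id: str
    run_id: str
    try_number: int
    map_index: int = -1


def supervisor_read_filter(ti: RuntimeTI) -> str:
    """Primary match on ti.id, with a run_id-based fallback for older supervisor logs."""
    fallback_labels = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "run_id": ti.run_id,
        "try_number": str(ti.try_number),
        "map_index": str(ti.map_index),
    }
    fallback = " AND ".join(f'labels.{key}="{value}"' for key, value in fallback_labels.items())
    return f'(labels.task_instance_id="{ti.id}" OR ({fallback}))'
```

Because the filter never touches `logical_date`, it can be built from attributes the supervisor has without a DB lookup; a `logical_date` fallback would remain a separate code path for the legacy webserver handler.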

The webserver-handler precision improvement here is genuinely nice and worth keeping — it's just currently riding ahead of the supervisor fix it's filed under. Happy to look again once the supervisor read/write paths are covered.


🤖 This review was drafted by an AI-assisted tool and may contain mistakes. An Airflow maintainer has reviewed and confirmed it before posting. See the contributing docs for what a maintainer review involves.


23tae commented Jun 17, 2026

Contributor Author

Thanks for the detailed review, @ashb. I overlooked the supervisor read path.

I have updated the PR to address these points:

  • Write Path: Extracted ti_id directly from the event dict in processors.
  • Read/Fallback Path: Changed to safely fetch LABEL_LOGICAL_DATE, and added a run_id fallback specifically for AF3 supervisor logs.
  • Tests: Updated all read() tests to strictly use spec=RuntimeTaskInstanceProtocol in AF3+, ensuring they now properly catch missing attribute crashes.

23tae force-pushed the fix-gcl-read-filter branch from a9747ec to 845eac6 on June 17, 2026 16:49

Labels

area:logging, area:providers, provider:google, ready for maintainer review


3 participants