
feat(snowflake): opt-in ACCESS_HISTORY lineage path (POC) #28149

Draft: ulixius9 wants to merge 4 commits into main from snowflake_lineage_get

Conversation

@ulixius9 (Member) commented May 15, 2026

Summary

  • New Snowflake lineage path that reads precomputed table and column lineage from ACCOUNT_USAGE.ACCESS_HISTORY instead of parsing every relevant query client-side. Targets large query windows where the current sqlglot/sqlfluff/sqlparse pipeline becomes a wall-clock bottleneck.
  • Opt-in via connectionOptions.useAccessHistory: "true" on the Snowflake service connection. Default off — existing pipelines see zero change. A runtime probe demotes silently to the legacy parser path when ACCESS_HISTORY isn't readable (Standard Edition or missing IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE).
  • Single combined SQL: MAX_BY dedup on table edges, ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(...)) on column pairs per (downstream, upstream) edge, LEFT JOIN'd so the cursor streams one row per directed edge with column lineage already attached. Constant client memory regardless of edge count.
  • COPY_HISTORY surfaced separately for external stage → Container lineage. External stages (s3://, azure://, gcs://, https://) are resolved against ingested Container entities via es_search_container_by_path; internal Snowflake stages (@~/, @%table/, @db.schema.stage/) are skipped silently.
  • Stored-procedure body lineage path (StoredProcedureLineageMixin) is unchanged and continues to run regardless of the flag.
  • POC scope: configured via existing connectionOptions Map<String,String> on the Snowflake connection — no JSON schema or generated-model changes. If validated, a follow-up PR will promote the key to a first-class field.
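The shape of the combined query described above can be sketched as a string constant in the style of `queries.py`. This is a hedged illustration only: the `MAX_BY` dedup, `ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(...))` aggregation, and `LEFT JOIN` come from this summary, but the CTE names, column names, and source views here are assumptions, not the shipped SQL.

```python
# Illustrative sketch of the combined ACCESS_HISTORY lineage query shape.
# The real SNOWFLAKE_ACCESS_HISTORY_LINEAGE constant differs; identifiers
# below are assumptions for illustration.
SNOWFLAKE_ACCESS_HISTORY_LINEAGE_SKETCH = """
WITH table_edges AS (
    SELECT
        downstream_table,
        upstream_table,
        -- keep one representative query per directed edge
        MAX_BY(query_id, query_start_time) AS query_id
    FROM snowflake.account_usage.access_history_flattened
    GROUP BY downstream_table, upstream_table
),
column_edges AS (
    SELECT
        downstream_table,
        upstream_table,
        -- all distinct column pairs for the edge, as one VARIANT array
        ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(
            'downstream_column', downstream_column,
            'upstream_column', upstream_column
        )) AS column_pairs
    FROM snowflake.account_usage.access_history_columns
    GROUP BY downstream_table, upstream_table
)
SELECT t.downstream_table, t.upstream_table, t.query_id, c.column_pairs
FROM table_edges t
LEFT JOIN column_edges c
    ON c.downstream_table = t.downstream_table
   AND c.upstream_table = t.upstream_table
"""
```

Because the column pairs are pre-aggregated per edge on the server, the cursor streams exactly one row per directed edge, which is what keeps client memory constant.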

How to enable

Customer pipeline YAML:

serviceConnection:
  config:
    type: Snowflake
    ...
    connectionOptions:
      useAccessHistory: "true"
sourceConfig:
  config:
    queryLogDuration: 180  # set as wide as needed; ACCESS_HISTORY retains ~365 days

Files changed

  • `ingestion/src/metadata/ingestion/source/database/snowflake/queries.py` — three new SQL constants: `SNOWFLAKE_ACCESS_HISTORY_PROBE`, `SNOWFLAKE_ACCESS_HISTORY_LINEAGE` (combined table+column), `SNOWFLAKE_COPY_HISTORY_LINEAGE`.
  • `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py` — `probe_access_history_available(engine, schema)` helper; failures logged at INFO (Standard Edition is a legitimate state).
  • `ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py` — `init` reads and pops the option, runs probe, demotes on failure; `yield_query_lineage` dispatches to the new path or falls through to `super()`; new private yielders for ACCESS_HISTORY and COPY_HISTORY with LRU-cached entity resolution; `_parse_column_pairs` decodes the VARIANT array (handles both Python-list and JSON-string forms).
  • `ingestion/tests/unit/topology/database/test_snowflake_access_history_lineage.py` — 25 unit tests covering SQL rendering, connectionOptions parsing and key pop, probe-failure demote, table edge yielding, column lineage attachment with single and multiple pairs, COPY edge resolution (resolved / unresolved / internal), legacy-parser-bypass regression, and `_parse_column_pairs` robustness.
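The `_parse_column_pairs` robustness mentioned above — handling both Python-list and JSON-string VARIANT forms — can be sketched roughly as follows. The helper name comes from the PR; the key names and exact tolerance behavior here are assumptions about the shipped version.

```python
import json
from typing import List, Tuple


def parse_column_pairs(raw) -> List[Tuple[str, str]]:
    """Decode the ARRAY_AGG(OBJECT_CONSTRUCT(...)) VARIANT column.

    The Snowflake driver may hand the value back as an already-deserialized
    Python list of dicts or as a JSON string; anything else yields no pairs
    rather than raising mid-stream.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []
    pairs = []
    for item in raw:
        if isinstance(item, dict):
            down = item.get("downstream_column")
            up = item.get("upstream_column")
            if down and up:
                pairs.append((down, up))
    return pairs
```

Swallowing malformed entries instead of raising matches the overall design of demoting gracefully rather than failing the lineage run.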

Test plan

  • `make py_format_check` clean (ruff lint + format)
  • 25 new unit tests pass
  • 51/51 Snowflake unit tests pass (no regression in existing `test_snowflake.py` or `test_snowflake_ordinal_position.py`)
  • Manual run on an Enterprise Snowflake account with `useAccessHistory: "true"` and `queryLogDuration: 180`; record wall time and edge count
  • Verify edge count ≥ legacy path on the same window
  • Verify column lineage attaches end-to-end (e.g., `orders.amount → revenue.total_amount` on a known graph)
  • Verify Standard Edition account: probe demotes silently and behavior is identical to today
  • Verify pipeline without `useAccessHistory` set: behavior is bit-identical to today (default off)

Notes

  • ACCESS_HISTORY has ~45-minute freshness — daily/weekly cron pipelines unaffected; not suitable for near-real-time lineage.
  • `directSources` is NULL for INSERT…VALUES, CALL (stored procs), and certain dynamic SQL. Proc body lineage is preserved via the existing `StoredProcedureLineageMixin` join.
  • COPY edges depend on the upstream Container being ingested via a Storage Service ingestion; unresolved external stages are logged at INFO so the operator can decide which storage services to ingest next.
  • No incremental checkpointing wired in this POC; each run scans the full configured `queryLogDuration` window.
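The external-vs-internal stage split that gates COPY edge resolution can be sketched as a small predicate. The URL prefixes come from the PR description; the function name is illustrative, not the one used in the code.

```python
# Hypothetical helper illustrating the stage classification described above:
# external stage locations are candidates for Container resolution, while
# internal Snowflake stages (which always start with '@') are skipped.
EXTERNAL_PREFIXES = ("s3://", "azure://", "gcs://", "https://")


def is_external_stage_location(stage_location: str) -> bool:
    """True if a COPY_HISTORY stage location points at external storage."""
    if not stage_location:
        return False
    # Internal stages: @~/ (user), @%table/ (table), @db.schema.stage/ (named)
    if stage_location.startswith("@"):
        return False
    return stage_location.lower().startswith(EXTERNAL_PREFIXES)
```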

🤖 Generated with Claude Code


Summary by Gitar

  • Enhanced lineage metadata:
    • Integrated QUERY_TEXT from ACCOUNT_USAGE.QUERY_HISTORY into the ACCESS_HISTORY lineage path to provide representative SQL context for edges.
    • Updated LineageDetails to include sqlQuery and added corresponding unit tests to verify SQL text attachment.
  • Filtering capability:
    • Added _build_filter_condition_clause to inject sourceConfig.filterCondition directly into the ACCESS_HISTORY SQL query.
    • Updated SNOWFLAKE_ACCESS_HISTORY_LINEAGE query to support filtering at the source CTE level.
  • Documentation:
    • Removed "POC" labels and references from log messages, docstrings, and unit test headers to reflect readiness.


Adds an alternative Snowflake lineage path that reads precomputed
table-to-table and column-to-column lineage directly from
ACCOUNT_USAGE.ACCESS_HISTORY, bypassing client-side SQL parsing. Opt-in
via connectionOptions.useAccessHistory="true" with a runtime probe that
silently demotes to the legacy parser path on Standard Edition or when
the role lacks the IMPORTED PRIVILEGES grant. Zero behavior change for
pipelines that do not set the flag.

The combined SQL groups table edges with MAX_BY for dedup and aggregates
column pairs per (downstream, upstream) edge via ARRAY_AGG so the client
streams one row per directed edge with column lineage already attached
— constant client memory regardless of catalog size, single round-trip
to Snowflake. COPY_HISTORY is also surfaced for external stage→table
lineage, resolving the upstream Container by stage URL; internal
Snowflake stages are skipped silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot added labels Ingestion, safe to test (run secure GitHub workflows on PRs) on May 15, 2026
Comment on lines +432 to +444
@staticmethod
def _split_snowflake_fqn(snowflake_fqn: str) -> Optional[Tuple[str, str, str]]:  # noqa: UP006, UP045
    """
    Split a Snowflake `DB.SCHEMA.TABLE` FQN into its three parts.
    Returns None for malformed inputs (quoted names with embedded dots are
    not handled in the POC and are skipped silently).
    """
    if not snowflake_fqn or '"' in snowflake_fqn:
        return None
    parts = snowflake_fqn.split(".")
    if len(parts) != 3:
        return None
    return parts[0], parts[1], parts[2]
💡 Edge Case: Quoted Snowflake identifiers silently dropped from lineage

_split_snowflake_fqn (line 439) rejects any FQN containing a double-quote character. Snowflake routinely quotes identifiers that contain spaces, mixed case, or special characters (e.g., "My DB"."My Schema"."My Table"). ACCESS_HISTORY returns quoted identifiers for such objects, so their lineage edges are silently skipped.

While documented as a POC limitation, this could lead to significant lineage gaps for customers with mixed-case or special-character naming conventions, with no visibility into what was missed (only the aggregate skip count is logged).

Strip quotes and split correctly, or at minimum log a DEBUG message per skipped FQN so operators can assess impact:

@staticmethod
def _split_snowflake_fqn(snowflake_fqn: str) -> Optional[Tuple[str, str, str]]:
    if not snowflake_fqn:
        return None
    parts = snowflake_fqn.split(".")
    # Strip surrounding quotes from each part. Quoted identifiers that
    # themselves contain dots are still mis-split here; only the simple
    # "DB"."SCHEMA"."TABLE" case is handled.
    stripped = [p.strip('"') for p in parts]
    if len(stripped) != 3:
        logger.debug(f"Skipping FQN with unexpected part count: {snowflake_fqn}")
        return None
    return stripped[0], stripped[1], stripped[2]

if not (db and schema and table and stage_location):
    return None

downstream_fqn = fqn._build(self.config.serviceName, db, schema, table)

💡 Quality: Using private fqn._build instead of public API

fqn._build is a private helper (underscore-prefixed) that simply joins components with `.`. The public fqn.build function provides entity-type-aware FQN construction and is the standard across the codebase. Using the private function bypasses any future validation or normalization added to the public API.


ulixius9 and others added 3 commits May 15, 2026 19:49
The combined ACCESS_HISTORY SQL now LEFT JOINs back to QUERY_HISTORY on
the representative query_id (already selected via MAX_BY) and returns
QUERY_TEXT alongside the edge. `_build_access_history_edge` populates
LineageDetails.sqlQuery so the OpenMetadata lineage panel shows the SQL
that produced the edge — matching the per-edge "SQL Query" surface in
the Snowflake-native lineage view.

LineageDetails is now built via the constructor (sqlQuery, columnsLineage,
source) rather than post-assignment, since Pydantic 2 skips coercion on
attribute setters and the RootModel wrapper would not get applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the "(POC path)" suffix from the dispatch log line and POC framing
from docstrings now that the path is stable enough to ship as a
permanent connector option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

The combined SQL now exposes a `{filter_condition}` placeholder inside the
`access_history_filtered` CTE so users can scope which queries contribute
to lineage — same field the legacy QUERY_HISTORY parser path already
respects via `get_filters()`. Unqualified column names resolve against
QUERY_HISTORY (alias `qh`) in the CTE, so existing filterCondition values
like `query_type = 'COPY'` or `user_name = 'etl_user'` carry over without
edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
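The placeholder mechanics this commit describes could look roughly like the sketch below. The `_build_filter_condition_clause` name comes from the PR summary; the template text, the empty-filter collapse, and the `qh` alias handling shown here are assumptions.

```python
from typing import Optional


def build_filter_condition_clause(filter_condition: Optional[str]) -> str:
    """Render sourceConfig.filterCondition into the {filter_condition} slot.

    An empty or blank filter must collapse to an empty string so the CTE
    keeps every row and the template always renders valid SQL.
    """
    if not filter_condition or not filter_condition.strip():
        return ""
    # Unqualified column names resolve against the QUERY_HISTORY alias `qh`
    return f"AND ({filter_condition.strip()})"


# Hypothetical slice of the CTE showing where the clause lands
query_template = (
    "SELECT * FROM snowflake.account_usage.query_history qh "
    "WHERE qh.start_time >= :start {filter_condition}"
)
sql = query_template.format(
    filter_condition=build_filter_condition_clause("query_type = 'COPY'")
)
```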
@gitar-bot commented May 15, 2026

Code Review: 👍 Approved with suggestions (1 resolved / 3 findings)

Introduces an opt-in Snowflake ACCESS_HISTORY lineage path for improved ingestion performance, including enhanced SQL context and filter support. Please address the silent dropping of quoted identifiers in FQN splitting and switch to the public FQN API instead of internal methods.

💡 Edge Case: Quoted Snowflake identifiers silently dropped from lineage

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:432-444

💡 Quality: Using private fqn._build instead of public API

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:376 📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:406


✅ 1 resolved
Bug: Pop of useAccessHistory happens after engine creation — ineffective

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:96-110
The docstring for _read_access_history_flag says "Popping it ensures the Snowflake driver never sees it in the URL." However, super().__init__() (line 102) creates the engine before _read_access_history_flag() runs (line 105). The engine URL is built via get_connection_url() which reads connectionOptions and appends them as URL query params. By the time .pop() executes, the key has already been baked into the engine's connection URL.

This means useAccessHistory=true is passed to Snowflake as a session parameter on every connection. Snowflake may log warnings or (in future versions) reject unknown session parameters, potentially breaking the connection.

Fix: pop the key before calling super().__init__(), or read/check it without popping (since by that point the engine already exists).
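The suggested ordering fix can be sketched as follows. Class and attribute names here are illustrative stand-ins, not the actual OpenMetadata classes: the point is only that the pop must happen before the parent constructor bakes connectionOptions into the engine URL.

```python
class BaseLineageSource:
    """Stand-in for the parent class whose __init__ builds the engine."""

    def __init__(self, connection_options: dict):
        # stands in for get_connection_url(): options become URL params
        self.engine_url_params = dict(connection_options)


class SnowflakeLineageSource(BaseLineageSource):
    def __init__(self, connection_options: dict):
        # Pop FIRST: the engine built in super().__init__() below must
        # never see this connector-only key as a session parameter.
        raw = connection_options.pop("useAccessHistory", "false")
        self.use_access_history = str(raw).lower() == "true"
        super().__init__(connection_options)
```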



