
feat(snowflake): opt-in ACCESS_HISTORY lineage path (POC) #28149

Draft: ulixius9 wants to merge 4 commits into main from snowflake_lineage_get

Conversation

@ulixius9 (Member) commented May 15, 2026

Summary

  • New Snowflake lineage path that reads precomputed table and column lineage from ACCOUNT_USAGE.ACCESS_HISTORY instead of parsing every relevant query client-side. Targets large query windows where the current sqlglot/sqlfluff/sqlparse pipeline becomes a wall-clock bottleneck.
  • Opt-in via connectionOptions.useAccessHistory: "true" on the Snowflake service connection. Default off — existing pipelines see zero change. A runtime probe demotes silently to the legacy parser path when ACCESS_HISTORY isn't readable (Standard Edition or missing IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE).
  • Single combined SQL: MAX_BY dedup on table edges, ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(...)) on column pairs per (downstream, upstream) edge, LEFT JOIN'd so the cursor streams one row per directed edge with column lineage already attached. Constant client memory regardless of edge count.
  • COPY_HISTORY surfaced separately for external stage → Container lineage. External stages (s3://, azure://, gcs://, https://) are resolved against ingested Container entities via es_search_container_by_path; internal Snowflake stages (@~/, @%table/, @db.schema.stage/) are skipped silently.
  • Stored-procedure body lineage path (StoredProcedureLineageMixin) is unchanged and continues to run regardless of the flag.
  • POC scope: configured via existing connectionOptions Map<String,String> on the Snowflake connection — no JSON schema or generated-model changes. If validated, a follow-up PR will promote the key to a first-class field.
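The shape of the combined query described above can be sketched as a string constant in the style of `queries.py`. This is a hedged illustration only: the `MAX_BY` dedup, `ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(...))` aggregation, and `LEFT JOIN` come from this summary, but the CTE names, column names, and source views here are assumptions, not the shipped SQL.

```python
# Illustrative sketch of the combined ACCESS_HISTORY lineage query shape.
# The real SNOWFLAKE_ACCESS_HISTORY_LINEAGE constant differs; identifiers
# below are assumptions for illustration.
SNOWFLAKE_ACCESS_HISTORY_LINEAGE_SKETCH = """
WITH table_edges AS (
    SELECT
        downstream_table,
        upstream_table,
        -- keep one representative query per directed edge
        MAX_BY(query_id, query_start_time) AS query_id
    FROM snowflake.account_usage.access_history_flattened
    GROUP BY downstream_table, upstream_table
),
column_edges AS (
    SELECT
        downstream_table,
        upstream_table,
        -- all distinct column pairs for the edge, as one VARIANT array
        ARRAY_AGG(DISTINCT OBJECT_CONSTRUCT(
            'downstream_column', downstream_column,
            'upstream_column', upstream_column
        )) AS column_pairs
    FROM snowflake.account_usage.access_history_columns
    GROUP BY downstream_table, upstream_table
)
SELECT t.downstream_table, t.upstream_table, t.query_id, c.column_pairs
FROM table_edges t
LEFT JOIN column_edges c
    ON c.downstream_table = t.downstream_table
   AND c.upstream_table = t.upstream_table
"""
```

Because the column pairs are pre-aggregated per edge on the server, the cursor streams exactly one row per directed edge, which is what keeps client memory constant.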

How to enable

Customer pipeline YAML:

serviceConnection:
  config:
    type: Snowflake
    ...
    connectionOptions:
      useAccessHistory: "true"
sourceConfig:
  config:
    queryLogDuration: 180  # set as wide as needed; ACCESS_HISTORY retains ~365 days

Files changed

  • `ingestion/src/metadata/ingestion/source/database/snowflake/queries.py` — three new SQL constants: `SNOWFLAKE_ACCESS_HISTORY_PROBE`, `SNOWFLAKE_ACCESS_HISTORY_LINEAGE` (combined table+column), `SNOWFLAKE_COPY_HISTORY_LINEAGE`.
  • `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py` — `probe_access_history_available(engine, schema)` helper; failures logged at INFO (Standard Edition is a legitimate state).
  • `ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py` — `init` reads and pops the option, runs probe, demotes on failure; `yield_query_lineage` dispatches to the new path or falls through to `super()`; new private yielders for ACCESS_HISTORY and COPY_HISTORY with LRU-cached entity resolution; `_parse_column_pairs` decodes the VARIANT array (handles both Python-list and JSON-string forms).
  • `ingestion/tests/unit/topology/database/test_snowflake_access_history_lineage.py` — 25 unit tests covering SQL rendering, connectionOptions parsing and key pop, probe-failure demote, table edge yielding, column lineage attachment with single and multiple pairs, COPY edge resolution (resolved / unresolved / internal), legacy-parser-bypass regression, and `_parse_column_pairs` robustness.
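The `_parse_column_pairs` robustness mentioned above — handling both Python-list and JSON-string VARIANT forms — can be sketched roughly as follows. The helper name comes from the PR; the key names and exact tolerance behavior here are assumptions about the shipped version.

```python
import json
from typing import List, Tuple


def parse_column_pairs(raw) -> List[Tuple[str, str]]:
    """Decode the ARRAY_AGG(OBJECT_CONSTRUCT(...)) VARIANT column.

    The Snowflake driver may hand the value back as an already-deserialized
    Python list of dicts or as a JSON string; anything else yields no pairs
    rather than raising mid-stream.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []
    pairs = []
    for item in raw:
        if isinstance(item, dict):
            down = item.get("downstream_column")
            up = item.get("upstream_column")
            if down and up:
                pairs.append((down, up))
    return pairs
```

Swallowing malformed entries instead of raising matches the overall design of demoting gracefully rather than failing the lineage run.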

Test plan

  • `make py_format_check` clean (ruff lint + format)
  • 25 new unit tests pass
  • 51/51 Snowflake unit tests pass (no regression in existing `test_snowflake.py` or `test_snowflake_ordinal_position.py`)
  • Manual run on an Enterprise Snowflake account with `useAccessHistory: "true"` and `queryLogDuration: 180`; record wall time and edge count
  • Verify edge count ≥ legacy path on the same window
  • Verify column lineage attaches end-to-end (e.g., `orders.amount → revenue.total_amount` on a known graph)
  • Verify Standard Edition account: probe demotes silently and behavior is identical to today
  • Verify pipeline without `useAccessHistory` set: behavior is bit-identical to today (default off)

Notes

  • ACCESS_HISTORY has ~45-minute freshness — daily/weekly cron pipelines unaffected; not suitable for near-real-time lineage.
  • `directSources` is NULL for INSERT…VALUES, CALL (stored procs), and certain dynamic SQL. Proc body lineage is preserved via the existing `StoredProcedureLineageMixin` join.
  • COPY edges depend on the upstream Container being ingested via a Storage Service ingestion; unresolved external stages are logged at INFO so the operator can decide which storage services to ingest next.
  • No incremental checkpointing wired in this POC; each run scans the full configured `queryLogDuration` window.
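The external-vs-internal stage split that gates COPY edge resolution can be sketched as a small predicate. The URL prefixes come from the PR description; the function name is illustrative, not the one used in the code.

```python
# Hypothetical helper illustrating the stage classification described above:
# external stage locations are candidates for Container resolution, while
# internal Snowflake stages (which always start with '@') are skipped.
EXTERNAL_PREFIXES = ("s3://", "azure://", "gcs://", "https://")


def is_external_stage_location(stage_location: str) -> bool:
    """True if a COPY_HISTORY stage location points at external storage."""
    if not stage_location:
        return False
    # Internal stages: @~/ (user), @%table/ (table), @db.schema.stage/ (named)
    if stage_location.startswith("@"):
        return False
    return stage_location.lower().startswith(EXTERNAL_PREFIXES)
```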

🤖 Generated with Claude Code


Summary by Gitar

  • Enhanced lineage metadata:
    • Integrated QUERY_TEXT from ACCOUNT_USAGE.QUERY_HISTORY into the ACCESS_HISTORY lineage path to provide representative SQL context for edges.
    • Updated LineageDetails to include sqlQuery and added corresponding unit tests to verify SQL text attachment.
  • Filtering capability:
    • Added _build_filter_condition_clause to inject sourceConfig.filterCondition directly into the ACCESS_HISTORY SQL query.
    • Updated SNOWFLAKE_ACCESS_HISTORY_LINEAGE query to support filtering at the source CTE level.
  • Documentation:
    • Removed "POC" labels and references from log messages, docstrings, and unit test headers to reflect readiness.


Adds an alternative Snowflake lineage path that reads precomputed
table-to-table and column-to-column lineage directly from
ACCOUNT_USAGE.ACCESS_HISTORY, bypassing client-side SQL parsing. Opt-in
via connectionOptions.useAccessHistory="true" with a runtime probe that
silently demotes to the legacy parser path on Standard Edition or when
the role lacks the IMPORTED PRIVILEGES grant. Zero behavior change for
pipelines that do not set the flag.

The combined SQL groups table edges with MAX_BY for dedup and aggregates
column pairs per (downstream, upstream) edge via ARRAY_AGG so the client
streams one row per directed edge with column lineage already attached
— constant client memory regardless of catalog size, single round-trip
to Snowflake. COPY_HISTORY is also surfaced for external stage→table
lineage, resolving the upstream Container by stage URL; internal
Snowflake stages are skipped silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot added labels Ingestion, safe to test (run secure GitHub workflows on PRs) on May 15, 2026
Comment on lines +432 to +444
@staticmethod
def _split_snowflake_fqn(snowflake_fqn: str) -> Optional[Tuple[str, str, str]]:  # noqa: UP006, UP045
    """
    Split a Snowflake `DB.SCHEMA.TABLE` FQN into its three parts.
    Returns None for malformed inputs (quoted names with embedded dots are
    not handled in the POC and are skipped silently).
    """
    if not snowflake_fqn or '"' in snowflake_fqn:
        return None
    parts = snowflake_fqn.split(".")
    if len(parts) != 3:
        return None
    return parts[0], parts[1], parts[2]
💡 Edge Case: Quoted Snowflake identifiers silently dropped from lineage

_split_snowflake_fqn (line 439) rejects any FQN containing a double-quote character. Snowflake routinely quotes identifiers that contain spaces, mixed case, or special characters (e.g., "My DB"."My Schema"."My Table"). ACCESS_HISTORY returns quoted identifiers for such objects, so their lineage edges are silently skipped.

While documented as a POC limitation, this could lead to significant lineage gaps for customers with mixed-case or special-character naming conventions, with no visibility into what was missed (only the aggregate skip count is logged).

Strip quotes and split correctly, or at minimum log a DEBUG message per skipped FQN so operators can assess impact:

@staticmethod
def _split_snowflake_fqn(snowflake_fqn: str) -> Optional[Tuple[str, str, str]]:
    if not snowflake_fqn:
        return None
    parts = snowflake_fqn.split(".")
    # Strip surrounding quotes from each part. Quoted identifiers that
    # themselves contain dots are still mis-split here; only the simple
    # "DB"."SCHEMA"."TABLE" case is handled.
    stripped = [p.strip('"') for p in parts]
    if len(stripped) != 3:
        logger.debug(f"Skipping FQN with unexpected part count: {snowflake_fqn}")
        return None
    return stripped[0], stripped[1], stripped[2]

if not (db and schema and table and stage_location):
    return None

downstream_fqn = fqn._build(self.config.serviceName, db, schema, table)

💡 Quality: Using private fqn._build instead of public API

fqn._build is a private helper (underscore-prefixed) that simply joins components with `.`. The public fqn.build function provides entity-type-aware FQN construction and is the standard across the codebase. Using the private function bypasses any future validation or normalization added to the public API.


ulixius9 and others added 3 commits May 15, 2026 19:49
The combined ACCESS_HISTORY SQL now LEFT JOINs back to QUERY_HISTORY on
the representative query_id (already selected via MAX_BY) and returns
QUERY_TEXT alongside the edge. `_build_access_history_edge` populates
LineageDetails.sqlQuery so the OpenMetadata lineage panel shows the SQL
that produced the edge — matching the per-edge "SQL Query" surface in
the Snowflake-native lineage view.

LineageDetails is now built via the constructor (sqlQuery, columnsLineage,
source) rather than post-assignment, since Pydantic 2 skips coercion on
attribute setters and the RootModel wrapper would not get applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the "(POC path)" suffix from the dispatch log line and POC framing
from docstrings now that the path is stable enough to ship as a
permanent connector option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

The combined SQL now exposes a `{filter_condition}` placeholder inside the
`access_history_filtered` CTE so users can scope which queries contribute
to lineage — same field the legacy QUERY_HISTORY parser path already
respects via `get_filters()`. Unqualified column names resolve against
QUERY_HISTORY (alias `qh`) in the CTE, so existing filterCondition values
like `query_type = 'COPY'` or `user_name = 'etl_user'` carry over without
edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
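The placeholder mechanics this commit describes could look roughly like the sketch below. The `_build_filter_condition_clause` name comes from the PR summary; the template text, the empty-filter collapse, and the `qh` alias handling shown here are assumptions.

```python
from typing import Optional


def build_filter_condition_clause(filter_condition: Optional[str]) -> str:
    """Render sourceConfig.filterCondition into the {filter_condition} slot.

    An empty or blank filter must collapse to an empty string so the CTE
    keeps every row and the template always renders valid SQL.
    """
    if not filter_condition or not filter_condition.strip():
        return ""
    # Unqualified column names resolve against the QUERY_HISTORY alias `qh`
    return f"AND ({filter_condition.strip()})"


# Hypothetical slice of the CTE showing where the clause lands
query_template = (
    "SELECT * FROM snowflake.account_usage.query_history qh "
    "WHERE qh.start_time >= :start {filter_condition}"
)
sql = query_template.format(
    filter_condition=build_filter_condition_clause("query_type = 'COPY'")
)
```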
@gitar-bot commented May 15, 2026

Code Review: 👍 Approved with suggestions (1 resolved / 3 findings)

Introduces an opt-in Snowflake ACCESS_HISTORY lineage path for improved ingestion performance, including enhanced SQL context and filter support. Please address the silent dropping of quoted identifiers in FQN splitting and switch to the public FQN API instead of internal methods.

💡 Edge Case: Quoted Snowflake identifiers silently dropped from lineage

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:432-444

💡 Quality: Using private fqn._build instead of public API

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:376 📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:406


✅ 1 resolved
Bug: Pop of useAccessHistory happens after engine creation — ineffective

📄 ingestion/src/metadata/ingestion/source/database/snowflake/lineage.py:96-110
The docstring for _read_access_history_flag says "Popping it ensures the Snowflake driver never sees it in the URL." However, super().__init__() (line 102) creates the engine before _read_access_history_flag() runs (line 105). The engine URL is built via get_connection_url() which reads connectionOptions and appends them as URL query params. By the time .pop() executes, the key has already been baked into the engine's connection URL.

This means useAccessHistory=true is passed to Snowflake as a session parameter on every connection. Snowflake may log warnings or (in future versions) reject unknown session parameters, potentially breaking the connection.

Fix: pop the key before calling super().__init__(), or read/check it without popping (since by that point the engine already exists).
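The suggested ordering fix can be sketched as follows. Class and attribute names here are illustrative stand-ins, not the actual OpenMetadata classes: the point is only that the pop must happen before the parent constructor bakes connectionOptions into the engine URL.

```python
class BaseLineageSource:
    """Stand-in for the parent class whose __init__ builds the engine."""

    def __init__(self, connection_options: dict):
        # stands in for get_connection_url(): options become URL params
        self.engine_url_params = dict(connection_options)


class SnowflakeLineageSource(BaseLineageSource):
    def __init__(self, connection_options: dict):
        # Pop FIRST: the engine built in super().__init__() below must
        # never see this connector-only key as a session parameter.
        raw = connection_options.pop("useAccessHistory", "false")
        self.use_access_history = str(raw).lower() == "true"
        super().__init__(connection_options)
```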



