
fix(snowflake): bound get_schema_columns cache, drop table_type kwarg #28136

Open

ulixius9 wants to merge 3 commits into main from snowflake_cust_oom

Conversation

@ulixius9 ulixius9 commented May 15, 2026

Summary

Two related fixes that together stop the OOM seen ingesting Snowflake
COM_US_IMDNA_ADL.AWB_INTERM (~13k wide tables) on a 4 GB pod (kernel
SIGKILL mid-table, no traceback in the workflow log).

  • metadata.py — stop forwarding table_type into
    inspector.get_columns(...). The kwarg ended up in SQLAlchemy's
    @reflection.cache key, so Regular vs View calls for the same schema
    got distinct keys and re-materialized the schema-wide column dict
    (~1.6 GB) at the first view. No dialect reads table_type from kw;
    the Stage/Stream branches above the call already consumed it.
  • utils.py — replace @reflection.cache on get_schema_columns
    with a bounded LRU (size 2 default, env
    OM_SNOWFLAKE_SCHEMA_COLUMNS_CACHE_SIZE). The LRU is stored on
    info_cache, inheriting the per-thread isolation that
    _inspector_map already provides. LRU recency keeps an
    actively-queried schema from being evicted by other threads' churn;
    on eviction the per-table get_columns cache entries for that
    schema are also cleared so the column data is actually freed
    (otherwise per-table refs pin the column lists even after the
    schema-wide dict is evicted).
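The bounded-LRU mechanics described above can be sketched as follows. This is an illustrative re-implementation, not the PR's actual utils.py code: the `_schema_columns_lru` slot name, the `loader` callback, and the per-table key layout inside `info_cache` are assumptions for the sketch; only the env var `OM_SNOWFLAKE_SCHEMA_COLUMNS_CACHE_SIZE` and the default size of 2 come from the description.

```python
import os
from collections import OrderedDict

DEFAULT_CACHE_SIZE = 2
CACHE_KEY = "_schema_columns_lru"  # hypothetical slot inside the inspector's info_cache


def _cache_size() -> int:
    # Env var name is from the PR; default 2 = current + just-finished schema.
    return int(os.environ.get("OM_SNOWFLAKE_SCHEMA_COLUMNS_CACHE_SIZE", DEFAULT_CACHE_SIZE))


def get_schema_columns_cached(info_cache: dict, schema: str, loader):
    """Return the schema-wide column dict, bounded by an LRU kept on info_cache."""
    lru: OrderedDict = info_cache.setdefault(CACHE_KEY, OrderedDict())
    if schema in lru:
        lru.move_to_end(schema)  # recency: protect an actively-queried schema
        return lru[schema]
    columns = loader(schema)  # the expensive schema-wide query
    lru[schema] = columns
    while len(lru) > _cache_size():
        evicted, _ = lru.popitem(last=False)  # drop least-recently-used schema
        # Also clear per-table get_columns entries for the evicted schema;
        # otherwise their references pin the column lists in memory.
        for key in [k for k in info_cache if isinstance(k, tuple) and evicted in k]:
            info_cache.pop(key, None)
    return columns
```

Because the LRU lives inside `info_cache`, each thread's inspector gets its own instance, so no extra locking is needed beyond what `_inspector_map` already provides.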

Root cause (from the customer log)

Memory walk:

Time          Event                                                Memory
13:34–13:45   _get_schema_columns(AWB_INTERM) runs (11m 36s)       740 → 2424 MB
13:46–14:01   5263 BASE TABLEs stream; cache hits                  flat at ~2.45 GB
14:01–14:07   _get_schema_columns(AWB_INTERM) runs AGAIN (6m 19s)  jumps to 4053 MB
14:40         log cuts off mid-table, 0 errors in log              SIGKILL

Three @reflection.cache-decorated functions all cache-missed
simultaneously at 14:01 (_current_database_schema,
_get_schema_primary_keys, _get_schema_columns). The only input that
flipped across that boundary was table_type (Regular for the last
table → View for the first view), which was being forwarded as a kwarg
into inspector.get_columns(...) and ended up in the cache key.
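To see why a stray kwarg forks the cache, here is a minimal analogue of a kwarg-keyed memoization decorator. It is a sketch, not SQLAlchemy's actual `@reflection.cache` (which builds its key from the connection and kwargs in a more involved way), but the kwarg-in-the-key behavior it demonstrates is the same; the function and variable names are illustrative.

```python
def kwarg_keyed_cache(fn):
    """Memoize on positional AND keyword args, like a reflection-style cache."""
    store = {}

    def wrapper(*args, **kw):
        key = (args, tuple(sorted(kw.items())))  # kwargs become part of the key
        if key not in store:
            store[key] = fn(*args, **kw)  # miss: re-run the expensive query
        return store[key]

    wrapper.store = store
    return wrapper


calls = []


@kwarg_keyed_cache
def get_schema_columns(schema, **kw):
    # kw is swallowed here and never read -- exactly the table_type situation
    calls.append(schema)
    return {"schema_wide_column_dict": "..."}
```

Calling `get_schema_columns("S", table_type="Regular")` and then `get_schema_columns("S", table_type="View")` produces two distinct keys and two full materializations of the same schema-wide dict; dropping the kwarg collapses them to one.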

Plus, without the LRU bound, info_cache is only cleared between
databases (common_db_source.py:_release_engine), so any multi-schema
run accumulates every schema's column metadata in RAM for the full
database — also a latent OOM risk for any database with more than one
wide schema.

Test plan

  • 7 new tests in test_snowflake_schema_columns_lru.py — same-schema cache hit, eviction over size, LRU recency protecting a long-running schema (the multi-thread "one slow schema + many fast ones" case), per-table entry cleanup on eviction, no-info_cache fallthrough, 90030 None cached, env-var override
  • 4 new tests in test_snowflake_table_type_cache_pollution.py — base-table and view kwargs (no table_type), table-vs-view kwargs identical, Stage early-return still works
  • 26 existing Snowflake unit tests green
  • make py_format_check clean
  • Smoke-tested locally against a real Snowflake account by the reporter

Tuning

  • OM_SNOWFLAKE_SCHEMA_COLUMNS_CACHE_SIZE=1 for the tightest bound (one schema at a time, no buffer slot) if memory is extremely constrained.
  • Default 2 (current + just-finished) covers the table→view-transition use case and lets long-running schemas stay resident while smaller schemas cycle through.
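For example, on a memory-constrained pod the bound can be pinned to a single schema before launching the ingestion workflow (the env var name is from the PR; where you set it depends on your deployment):

```shell
# Keep at most one schema's column dict resident at a time.
export OM_SNOWFLAKE_SCHEMA_COLUMNS_CACHE_SIZE=1
```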

🤖 Generated with Claude Code


Summary by Gitar

  • Resilience and error handling:
    • Added robust exception handling in _get_table_names_and_types to prevent FQN build failures from interrupting ingestion.
    • Added error handling in get_schema_columns to skip records with unparsable table names instead of crashing the process.


…ype cache pollution

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ulixius9 ulixius9 requested a review from a team as a code owner May 15, 2026 07:50
Copilot AI review requested due to automatic review settings May 15, 2026 07:50
github-actions bot added the Ingestion and safe to test labels May 15, 2026

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


gitar-bot Bot commented May 15, 2026

Code Review ✅ Approved

Bounds the Snowflake schema column cache with an LRU and removes the table_type kwarg to prevent memory exhaustion and cache pollution. No issues found.



@github-actions

🔴 Playwright Results — 1 failure(s), 10 flaky

✅ 4069 passed · ❌ 1 failed · 🟡 10 flaky · ⏭️ 92 skipped

Shard        Passed  Failed  Flaky  Skipped
🔴 Shard 1      297       1      1        4
🟡 Shard 2      757       0      5       14
🟡 Shard 3      780       0      1        7
🟡 Shard 4      789       0      1       18
🟡 Shard 5      708       0      1       41
🟡 Shard 6      738       0      1        8

Genuine Failures (failed on all attempts)

Pages/SearchIndexApplication.spec.ts › Search Index Application (shard 1)
Error: expect(received).toEqual(expected) // deep equality

Expected: StringMatching /success|activeError/g
Received: "failed"
🟡 10 flaky test(s) (passed on retry)
  • Features/TagsSuggestion.spec.ts › should decline suggested tags for a container column (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test infinite scroll/pagination (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

