Skip to content

fix(bigquery): GCP service account impersonation across BigQuery ADC and SQLAlchemy paths#29683

Open
harshsoni2024 wants to merge 5 commits into
mainfrom
fix/bigquery-sa-impersonation-28204
Open

fix(bigquery): GCP service account impersonation across BigQuery ADC and SQLAlchemy paths#29683
harshsoni2024 wants to merge 5 commits into
mainfrom
fix/bigquery-sa-impersonation-28204

Conversation

@harshsoni2024

@harshsoni2024 harshsoni2024 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Fixes #28204

Issue

When the BigQuery connector is configured with a Target Service Account Email
(impersonation), operations run under the base identity instead of the
impersonated one, producing 403 ... does not have bigquery.jobs.create. This
affects both ADC and JSON-key setups (issue #28204).

gcpImpersonateServiceAccount sits on the parent GCPCredentials object, so it's
valid with any gcpConfig type. But the connector only partially honored it, and
BigQuery has two independent access paths:

  • Path A — SQLAlchemy bigquery:// engine (Test Connection and all
    information_schema metadata queries): set_google_credentials() only sets the
    base credentials env var / logs for ADC — it never applies impersonation. The
    engine was built with the common connect args, so it could never impersonate,
    for any credential type
    .
  • Path B — bigquery.Client (dataset listing, policy tags):
    get_inspector_details did forward impersonation, but only inside an
    isinstance(..., GcpCredentialsValues) guard, so ADC was excluded.

Net result (reporter's table): ADC ignores impersonation everywhere; JSON-key
impersonates the Python client but still runs its actual SQL as the base SA.

Fix

Resolve impersonation independently of the gcpConfig type, on both paths —
mirroring the existing Google Drive / Airflow connectors:

  • helper.py: extract get_impersonate_client_kwargs() (type-independent;
    reads gcpImpersonateServiceAccount off the credentials) and use it in
    get_inspector_details, so ADC and JSON-key both impersonate the Python client.
  • connection.py: add a BigQuery-specific get_connection_args() that, when a
    target SA is configured, builds an impersonated bigquery.Client via the
    existing get_bigquery_client(...) and injects it through the
    sqlalchemy-bigquery dialect's documented connect_args={'client': ...} hook.
    _get_client now uses this instead of get_connection_args_common.

The impersonated client is scoped to billingProjectId when set, otherwise to the
credentials project id.

No impact on current behavior

Impersonation logic only activates when gcpImpersonateServiceAccount is set with
a non-empty email. When it isn't:

  • get_impersonate_client_kwargs() returns {} → Python client built exactly as
    before.
  • get_connection_args() returns the common args with no client injected →
    engine built exactly as before.

Testing

  • New test_bigquery_impersonation.py (12 tests) covering both paths for ADC and
    JSON-key
    : impersonation kwargs are type-independent, an impersonated client is
    injected into the engine and scoped to the right project (billing project takes
    precedence), the client is built with impersonated_credentials, empty/blank
    target email is ignored, and — the regression guard — connect args are unchanged
    and no client is injected when impersonation is absent.
  • Existing BigQuery suites pass unchanged (test_bigquery.py,
    test_bigquery_test_connection.py, test_bigquery_incremental_table_processor.py)
    plus test_credentials.py.
  • make py_format_check clean; basedpyright reports 0 errors on the changed files.

Manual reproduction (for reviewers with a GCP setup)

With a base SA granted roles/iam.serviceAccountTokenCreator on a target reader SA,
add to the connector config:

yaml

credentials:
  gcpConfig: { ... }          # ADC or JSON-key
  gcpImpersonateServiceAccount:
    impersonateServiceAccount: bq-reader-target@<project>.iam.gserviceaccount.com
    lifetime: 3600

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

High-level design:

N/A — small change.

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • For UI changes: I attached a screen recording and/or screenshots above.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above.

Greptile Summary

This PR fixes a bug where BigQuery service account impersonation was ignored by both the SQLAlchemy engine path and, for ADC configurations, the Python bigquery.Client path. The root cause was that impersonation handling was incorrectly gated on isinstance(..., GcpCredentialsValues), excluding ADC, and was never applied to the SQLAlchemy bigquery:// engine at all.

  • helper.py: Extracts a new get_impersonate_client_kwargs() helper that reads gcpImpersonateServiceAccount off the credentials object regardless of gcpConfig type, with proper stripping of whitespace-only emails; replaces the old type-guarded block in get_inspector_details.
  • connection.py: Adds a BigQuery-specific get_connection_args() that, when impersonation is configured, builds an impersonated bigquery.Client and injects it via the sqlalchemy-bigquery dialect's connect_args={'client': ...} hook, ensuring every query through the engine runs under the target identity; _get_client() is updated to use this instead of get_connection_args_common.
  • test_bigquery_impersonation.py: Adds 12 targeted unit tests covering both ADC and JSON-key paths, billing project precedence, blank email edge cases, and a regression guard ensuring no client is injected when impersonation is absent.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to impersonation handling and is a no-op when the impersonation field is absent, preserving all existing behavior.

Both changed code paths (SQLAlchemy engine and Python client) are guarded by a truthiness check on impersonation kwargs, so no impact on connectors that do not use gcpImpersonateServiceAccount. The new get_impersonate_client_kwargs helper correctly strips whitespace, and the project-ID resolution path covers all credential types (ADC, JSON key, credential path) with a fallback warning when none can be determined. The 12 new unit tests cover the regression guard and all meaningful edge cases.

No files require special attention.

Important Files Changed

Filename Overview
ingestion/src/metadata/ingestion/source/database/bigquery/helper.py Extracts get_impersonate_client_kwargs() and removes the isinstance(GcpCredentialsValues) gate; correct whitespace stripping; backward-compatible for non-impersonated configs.
ingestion/src/metadata/ingestion/source/database/bigquery/connection.py Adds get_connection_args() that injects an impersonated bigquery.Client into SQLAlchemy connect args when configured; _get_first_project_id and the warning log for unresolvable project IDs are clean additions; no-op when impersonation is absent.
ingestion/tests/unit/topology/database/test_bigquery_impersonation.py 12 well-scoped unit tests covering both credential types, billing-project precedence, blank-email edge case, and the regression guard; mocking is precise and focused.

Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/bigquery-sa..." | Re-trigger Greptile

… credential types

BigQuery service account impersonation was only applied to the BigQuery
Python client and only when using JSON-key credentials (GcpCredentialsValues).
Two gaps caused impersonation to be silently ignored:

- The SQLAlchemy engine (used by Test Connection and all information_schema
  reads) never received impersonated credentials, so those queries always ran
  under the base identity.
- The Python-client path gated impersonation behind an isinstance check on
  GcpCredentialsValues, excluding ADC.

gcpImpersonateServiceAccount lives on the parent credentials object and is
valid regardless of the selected gcpConfig type, so impersonation is now
resolved independently of the type across both paths:

- Add get_impersonate_client_kwargs() in helper.py and use it in
  get_inspector_details so ADC and JSON-key both impersonate the Python client.
- Add a BigQuery-specific get_connection_args() that injects a pre-built
  impersonated bigquery.Client into the SQLAlchemy engine via the dialect's
  documented connect_args={'client': ...} hook, only when a target service
  account is configured.

When no target service account is set, behavior is unchanged: connect args
equal the common args with no client injected.

Fixes #28204

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 05:37
@harshsoni2024 harshsoni2024 requested a review from a team as a code owner July 2, 2026 05:37
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Jul 2, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes BigQuery connector service-account impersonation so it is applied consistently across both BigQuery access paths: the SQLAlchemy bigquery:// engine (test connection + information_schema queries) and the google.cloud.bigquery.Client usage (datasets/policy tags). This addresses #28204 where ADC (and parts of JSON-key flows) were silently running under the base identity instead of the configured impersonated service account.

Changes:

  • Adds a type-independent get_impersonate_client_kwargs() helper and uses it when building the BigQuery Python client.
  • Adds BigQuery-specific get_connection_args() to inject a pre-built impersonated bigquery.Client into the sqlalchemy-bigquery dialect via connect_args={'client': ...} when impersonation is configured.
  • Introduces unit tests covering ADC + JSON-key impersonation behavior across both paths and ensuring no behavior change when impersonation is not configured.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
ingestion/src/metadata/ingestion/source/database/bigquery/helper.py Centralizes impersonation kwargs generation and applies it to the BigQuery Python client path without gating on credential type.
ingestion/src/metadata/ingestion/source/database/bigquery/connection.py Injects an impersonated BigQuery client into SQLAlchemy connect args when impersonation is configured, ensuring engine-driven queries run under the target SA.
ingestion/tests/unit/topology/database/test_bigquery_impersonation.py Adds focused unit coverage for impersonation on both SQLAlchemy and Python client paths, for ADC and JSON-key, including regression guards.

Comment thread ingestion/src/metadata/ingestion/source/database/bigquery/helper.py Outdated
…eview

- Treat whitespace-only impersonate email as unset (strip before guard)
- Warn when impersonated client project id can't be resolved from config
- Tighten get_impersonate_client_kwargs docstring; add path-scoping test
Copilot AI review requested due to automatic review settings July 2, 2026 08:22
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (25 flaky)

✅ 4478 passed · ❌ 0 failed · 🟡 25 flaky · ⏭️ 38 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 438 0 5 16
🟡 Shard 2 804 0 6 8
🟡 Shard 3 808 0 1 7
🟡 Shard 4 808 0 3 5
🟡 Shard 5 855 0 2 0
🟡 Shard 6 765 0 8 2
🟡 25 flaky test(s) (passed on retry)
  • Features/DataAssetRulesDisabled.spec.ts › Verify the Database Schema entity item action after rules disabled (shard 1, 1 retry)
  • Features/Glossary/GlossaryPagination.spec.ts › should check for nested glossary term search (shard 1, 1 retry)
  • Features/Glossary/GlossaryPagination.spec.ts › should filter by InReview status (shard 1, 1 retry)
  • Features/MetricCustomUnitFlow.spec.ts › Should create metric and test unit of measurement updates (shard 1, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: metric (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Database (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Database Schema (shard 2, 2 retries)
  • Features/ContextCenter.spec.ts › clicking a memory row opens the view-only modal (shard 2, 1 retry)
  • Features/ContextCenterPermission.spec.ts › user with view-only permission cannot see restore or delete actions on an archived document (shard 2, 2 retries)
  • Features/DataQuality/CertificationFilter.spec.ts › Certification filter narrows both table- and testCase-index queries via the flat field path (shard 2, 1 retry)
  • Features/SearchExport.spec.ts › Export queues a background job and downloads from the jobs tray (shard 3, 1 retry)
  • Flow/ServiceForm.spec.ts › Verify form selects are working properly (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Table (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Should display custom properties for apiCollection in right panel (shard 4, 1 retry)
  • Pages/ExploreBrowse.spec.ts › service type drill-down disables unrelated roots and query-panel Clear resets it (shard 5, 2 retries)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should allow Data Steward to edit glossary terms for mlmodel (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Import partial success - some terms pass, some fail (shard 6, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import rejects unknown relation type (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab is NOT visible for pipelineService in platform lineage (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab is NOT visible for apiService in platform lineage (shard 6, 1 retry)
  • Pages/TestSuite.spec.ts › Logical TestSuite (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +180 to +188
project_id = None
gcp_config = connection.credentials.gcpConfig
config_project_id = getattr(gcp_config, "projectId", None)
if config_project_id is not None:
if isinstance(config_project_id, SingleProjectId):
project_id = config_project_id.root
elif isinstance(config_project_id, MultipleProjectId) and config_project_id.root:
project_id = config_project_id.root[0]
return project_id

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshsoni2024 can you check if this is relevant?

@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Want your agent to iterate on Greptile's feedback? Try greploops.

@gitar-bot

gitar-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown
Code Review ✅ Approved 1 resolved / 1 findings

Enables service account impersonation for BigQuery across both ADC and JSON-key paths, resolving connection errors during ingestion. The implementation also addresses potential scoping issues when the project ID is missing.

✅ 1 resolved
Edge Case: Impersonated engine client may be scoped to None project

📄 ingestion/src/metadata/ingestion/source/database/bigquery/connection.py:160 📄 ingestion/src/metadata/ingestion/source/database/bigquery/connection.py:168-181
In get_connection_args, when impersonation is configured the impersonated bigquery.Client is scoped to connection.billingProjectId or _get_first_project_id(connection). _get_first_project_id returns None for credential types other than GcpCredentialsValues (e.g. GcpCredentialsPath, external account) or when projectId is absent, and billingProjectId may also be unset. In that case get_bigquery_client(project_id=None, ...) builds a client with no explicit project, so queries issued through the engine fall back to the impersonated credentials' default project, which can differ from the intended base config project. This is likely acceptable for ADC/JSON-key (covered by tests) but is untested for the path/external-account credential types the docstring of get_impersonate_client_kwargs explicitly claims to support. Consider falling back to the credentials' own project id for those types, or documenting that impersonation project-scoping only applies to value/ADC configs.

Options

Display: compact → Showing less information.

Comment with these commands to change the behavior for this request:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud

sonarqubecloud Bot commented Jul 3, 2026

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GCP Service Account Impersonation is ignored when using ADC in BigQuery connector

3 participants