feat(ingestion): Add InfluxDB 3 database connector#29682
- Add JSON Schema for InfluxDB 3 connection config (hostPort + token auth)
- Add InfluxDBClient HTTP wrapper for /api/v3/query_sql metadata introspection
- Add InfluxDBSource extending CommonNoSQLSource (non-SQLAlchemy pattern)
- Map InfluxDB 3 concepts: instance→Database, database→Schema, table→Table
- Port type mapping from existing custom connector (Influx→OpenMetadata types)
- Add 15 unit tests across 5 test classes (type mapping, client, source, schema)
- Register InfluxDB in databaseServiceType enum, javaEnums, and connection oneOf

Closes open-metadata#29681
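To make the "Port type mapping" item concrete, here is a minimal sketch of what such a lookup could look like. The Arrow-style type names on the left and the OpenMetadata-style names on the right are assumptions for illustration; the PR text does not show the actual table, and `map_influx_type` is a hypothetical helper name.

```python
from typing import Dict

# Hypothetical mapping: both the InfluxDB-side names and the
# OpenMetadata-side names are assumptions, not taken from the PR.
INFLUX_TO_OM_TYPE: Dict[str, str] = {
    "Utf8": "STRING",
    "Float64": "DOUBLE",
    "Int64": "BIGINT",
    "UInt64": "BIGINT",
    "Boolean": "BOOLEAN",
}


def map_influx_type(influx_type: str) -> str:
    """Return an OpenMetadata-style type name for an InfluxDB column type."""
    # Parameterised types such as "Timestamp(Nanosecond, None)" or
    # "Dictionary(Int32, Utf8)" are matched on their prefix.
    if influx_type.startswith("Timestamp"):
        return "TIMESTAMP"
    if influx_type.startswith("Dictionary"):
        return "STRING"
    # Anything unrecognised falls back to a sentinel instead of raising.
    return INFLUX_TO_OM_TYPE.get(influx_type, "UNKNOWN")
```

Falling back to a sentinel rather than raising keeps ingestion going when a column has a type the table does not list; whether the connector does the same is not stated here.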
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
    def fetch_sample_rows(
        self, database: str, table: str, limit: int = 50
    ) -> Tuple[List[str], List[List[Any]]]:
        sql = (
            f'SELECT * FROM "{table}" '
            f"WHERE time >= now() - INTERVAL '24 hours' "
            f"LIMIT {limit}"
        )
        data = self._query(database, sql)
        if not data:
            return [], []
        columns = list(data[0].keys())
        rows = [[row.get(col) for col in columns] for row in data]
        return columns, rows
💡 Edge Case: 24h window may yield no sample rows for stale tables
fetch_sample_rows hardcodes WHERE time >= now() - INTERVAL '24 hours'. Tables that have not received data in the last 24 hours will return no rows, so the sample-data preview will be empty even though the table has historical data. Combined with LIMIT SAMPLE_SIZE (1000), this fixed window is an implicit assumption that all tables are actively written. Consider dropping the time predicate (relying on LIMIT + ordering) or making the window configurable so preview data is available for lower-frequency measurements.
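One way to act on this suggestion is to make the lookback window an optional parameter. The sketch below is an assumption about how that could look, not the change the PR adopted: `build_sample_query` and `lookback_hours` are hypothetical names, and the `ORDER BY time DESC` clause is an added choice that may be costly on large tables.

```python
from typing import Optional


def build_sample_query(
    table: str, limit: int = 50, lookback_hours: Optional[int] = 24
) -> str:
    """Build the sample-rows SQL; pass lookback_hours=None to drop the window."""
    # Double any embedded double quotes so the table name stays one identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    sql = f"SELECT * FROM {quoted}"
    if lookback_hours is not None:
        sql += f" WHERE time >= now() - INTERVAL '{int(lookback_hours)} hours'"
    # Newest rows first, so the preview is meaningful without a time filter.
    sql += f" ORDER BY time DESC LIMIT {int(limit)}"
    return sql
```

With `lookback_hours=None`, a table that has not been written to recently would still return its most recent rows.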
…ions
- Fix create() test config_dict to match WorkflowSource model (flat, not nested)
- Fix Column.name comparisons: column.name returns ColumnName object, use .root
- Fix reject-wrong-type test: MysqlConnection needs username+authType fields
- Remove redundant __init__ patch in reject test
Code Review 👍 Approved with suggestions
2 resolved / 5 findings
Introduces an InfluxDB 3 connector using the native SQL API, but requires fixes for a Pydantic validation error in

💡 Security: Table name not escaped in fetch_sample_rows SQL
📄 ingestion/src/metadata/ingestion/source/database/influxdb/connection.py:91-104
📄 ingestion/src/metadata/ingestion/source/database/influxdb/connection.py:82-89
In Table names originate from Escape embedded double quotes in the table identifier.

💡 Edge Case: 24h window may yield no sample rows for stale tables
📄 ingestion/src/metadata/ingestion/source/database/influxdb/connection.py:91-104

💡 Quality: Unused test_connection override in source metadata.py
📄 ingestion/src/metadata/ingestion/source/database/influxdb/metadata.py:215-218

✅ 2 resolved
✅ Bug: Sample data lookup may run before table is persisted
✅ Bug: databaseName used as both OM database name and schema filter
- Fix TestConnectionResult to use StatusType enum and steps=[] (P1)
- Escape double-quotes in fetch_sample_rows table name (P1)
- Remove unused import 'requests' in connection.py (P2)
- Remove unused imports in metadata.py (P2)
- Remove duplicate get_database_names override (already in base class)
- Remove dead test_connection override on source class
- Fix databaseName: use only for schema filtering, not OM database name
- Remove inline sample data ingestion from yield_table (race condition)
Code Review 👍 Approved with suggestions
4 resolved / 5 findings
Adds the InfluxDB 3 connector, addressing runtime validation failures in test connections and SQL malformation during sample data ingestion. Consider adjusting the 24-hour lookback window in

💡 Edge Case: 24h window may yield no sample rows for stale tables
📄 ingestion/src/metadata/ingestion/source/database/influxdb/connection.py:91-104

✅ 4 resolved
✅ Bug: Sample data lookup may run before table is persisted
✅ Bug: databaseName used as both OM database name and schema filter
✅ Security: Table name not escaped in fetch_sample_rows SQL
✅ Quality: Unused test_connection override in source metadata.py
… lifecycle
- Replace bare TestConnectionResult with test_connection_steps() call
- Define structured test_fn dict (CheckAccess + GetDatabases steps)
- Use self.client instead of creating/destroying a temporary client
- Fixes P1 bug where 'Test Connection' button hung in UI indefinitely
- Matches MongoDB/Cassandra NoSQL connector pattern
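The structured test_fn dict described here maps step names to callables. The sketch below illustrates only that shape: `run_test_steps` is a hypothetical stand-in for OpenMetadata's test_connection_steps (whose real signature is not shown in this PR text), and `StubClient` is an invented placeholder for the connector's client.

```python
from typing import Callable, Dict, List, Tuple


class StubClient:
    """Hypothetical stand-in for the connector's HTTP client."""

    def test_connection(self) -> bool:
        return True

    def list_databases(self) -> List[str]:
        return []


def run_test_steps(
    test_fn: Dict[str, Callable[[], object]]
) -> List[Tuple[str, bool, str]]:
    """Run each named step and record (name, passed, error message)."""
    results: List[Tuple[str, bool, str]] = []
    for name, step in test_fn.items():
        try:
            step()
            results.append((name, True, ""))
        except Exception as exc:  # a failing step is reported, not raised
            results.append((name, False, str(exc)))
    return results


client = StubClient()
# Step names follow the commit message; the callables reuse one client.
test_fn = {
    "CheckAccess": client.test_connection,
    "GetDatabases": client.list_databases,
}
results = run_test_steps(test_fn)
```

Reporting a result per step, instead of returning a single bare object, gives a UI something to display for each stage of the check.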
- Remove hard failure when list_databases() returns no databases
- A healthy but empty instance should pass connectivity check
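The behaviour change can be pictured as a step that only fails when the call itself fails. A minimal sketch, assuming a `list_databases` callable that returns a list of names; `check_get_databases` is a hypothetical name.

```python
from typing import Callable, List


def check_get_databases(list_databases: Callable[[], List[str]]) -> int:
    """Return the number of visible databases; an empty list is not an error.

    Exceptions from the call (network, auth) propagate and fail the step.
    """
    databases = list_databases()
    return len(databases)
```

An empty result returns 0 and the step passes, which matches the intent that a healthy but empty instance should not fail the connectivity check.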
Code Review 👍 Approved with suggestions
5 resolved / 6 findings
Adds a new InfluxDB 3 connector with comprehensive unit testing, resolving previous test connection failures and SQL injection risks. Update the

💡 Edge Case: 24h window may yield no sample rows for stale tables
📄 ingestion/src/metadata/ingestion/source/database/influxdb/connection.py:91-104

✅ 5 resolved
✅ Bug: Sample data lookup may run before table is persisted
✅ Bug: databaseName used as both OM database name and schema filter
✅ Security: Table name not escaped in fetch_sample_rows SQL
✅ Quality: Unused test_connection override in source metadata.py
✅ Edge Case: Test connection fails on healthy but empty InfluxDB instance
- Replace ValueError with TypeError per ruff TRY004
- Remove unused connection_obj type annotation (F821)
- Ruff check: All checks passed
- Ruff format: all files conform to project style
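Ruff's TRY004 rule prefers TypeError when a value fails an isinstance check. The sketch below shows that pattern with placeholder classes; `InfluxDBConnectionConfig`, `MysqlConnectionConfig`, and `validate_connection` are hypothetical names used only to illustrate the rule, not the connector's actual classes.

```python
class InfluxDBConnectionConfig:
    """Placeholder for the expected connection config type."""


class MysqlConnectionConfig:
    """Placeholder for an unrelated connection config type."""


def validate_connection(connection: object) -> InfluxDBConnectionConfig:
    """Return the connection unchanged, or raise TypeError on a wrong type."""
    if not isinstance(connection, InfluxDBConnectionConfig):
        # TypeError, not ValueError: the problem is the argument's type.
        raise TypeError(
            f"Expected InfluxDBConnectionConfig, but got {type(connection).__name__}"
        )
    return connection
```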
Summary
New database connector for InfluxDB 3, the latest version of the most widely used time-series database. InfluxDB 3 uses SQL (via HTTP API /api/v3/query_sql) instead of Flux/InfluxQL.

Connector architecture
Since InfluxDB 3 has no SQLAlchemy dialect, this connector follows the same CommonNoSQLSource pattern as MongoDB and Cassandra, using the native HTTP SQL API for metadata introspection.

Files
- influxdbConnection.json: hostPort + token authentication
- connection.py: InfluxDBClient HTTP wrapper + InfluxDBConnection(BaseConnection)
- metadata.py: InfluxDBSource(CommonNoSQLSource), metadata extraction
- test_influxdb.py
- databaseService.json: $ref

Capabilities
- … information_schema)
- … databaseName config option

Dependencies
- requests (already a transitive dependency of openmetadata-ingestion)

Validation Checklist
- $ref paths resolve
- make generate succeeds (CI will verify)
- mvn clean install -pl openmetadata-spec succeeds (CI will verify)
- yarn parse-schema succeeds (CI will verify)
- test_connection passes (via test_connection_steps)
- create() validates config type
- make py_format produces no changes (All checks passed!)
- mvn spotless:apply produces no changes (CI will verify)

Closes #29681
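As a rough illustration of the HTTP SQL API the summary describes, the sketch below builds (but does not send) a query request. The `db` query parameter appears elsewhere in this PR page; the `q` and `format` parameters and the Bearer-token header are assumptions about the InfluxDB 3 API. The connector itself uses requests, while this sketch uses urllib so it stays standard-library only. `build_query_request` is a hypothetical name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(host_port: str, token: str, database: str, sql: str) -> Request:
    """Build a GET request for /api/v3/query_sql without sending it."""
    # Assumed parameter names: db (database), q (SQL text), format (response format).
    params = urlencode({"db": database, "q": sql, "format": "json"})
    url = f"{host_port.rstrip('/')}/api/v3/query_sql?{params}"
    # Assumed auth scheme: token passed as a Bearer credential.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


request = build_query_request(
    "http://localhost:8181/", "example-token", "_internal", "SELECT 1"
)
```

Building the request separately from sending it keeps URL construction easy to unit test without a running InfluxDB instance.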
Greptile Summary
This PR adds a new InfluxDB 3 database connector following the CommonNoSQLSource pattern used by MongoDB and Cassandra, using InfluxDB's HTTP SQL API (/api/v3/query_sql) for metadata introspection since no SQLAlchemy dialect exists.

- connection.py: Adds InfluxDBClient (thin HTTP wrapper) and InfluxDBConnection(BaseConnection). test_connection is correctly wired through test_connection_steps with automation_workflow and timeout_seconds forwarded properly.
- metadata.py: Implements InfluxDBSource(CommonNoSQLSource) with column type mapping, schema/database listing, and table column extraction. A fetch_sample_rows method is defined on the client but is never invoked from the source, so the sample data ingestion described in the PR is not actually connected to the topology.
- influxdbConnection.json / databaseService.json: Schema and registry entries are placed correctly in alphabetical order with required fields (hostPort, token) declared.

Confidence Score: 4/5
Safe to merge for metadata extraction; the claimed sample-data capability is not wired up and should be addressed before documenting it as a feature.

The connector correctly handles connection lifecycle, type mapping, schema filtering, and test-connection flow. The one gap is that fetch_sample_rows on InfluxDBClient has no caller in the ingestion topology, so sample data will silently produce nothing despite being listed as a supported capability in the PR description.

connection.py: fetch_sample_rows is defined but never invoked from the source class or any topology step.

Important Files Changed
test_connection is now correctly wired through test_connection_steps. fetch_sample_rows is defined but has no caller in the ingestion topology; claimed sample-data support is non-functional.

Sequence Diagram
    %%{init: {'theme': 'neutral'}}%%
    sequenceDiagram
        participant UI as OpenMetadata UI
        participant Source as InfluxDBSource
        participant Conn as InfluxDBConnection
        participant Client as InfluxDBClient
        participant API as InfluxDB 3 HTTP API
        UI->>Conn: test_connection(metadata, automation_workflow)
        Conn->>Client: test_connection() GET /health
        API-->>Client: 200 OK
        Client-->>Conn: True
        Conn->>Client: "list_databases() GET /api/v3/query_sql?db=_internal"
        API-->>Client: "[{database_name: ...}]"
        Conn-->>UI: TestConnectionResult (via test_connection_steps)
        UI->>Source: run ingestion
        Source->>Client: list_databases()
        API-->>Client: [db1, db2]
        Source->>Client: list_tables(schema_name)
        API-->>Client: [table1, table2]
        Source->>Client: get_columns(schema_name, table_name)
        API-->>Client: "[{column_name, data_type, is_nullable}]"
        Source-->>UI: Metadata (Database → Schema → Table → Columns)

Reviews (5): Last reviewed commit: "fix(review): pass py_format + use TypeEr..."
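The ingestion half of the diagram (list_databases, then list_tables, then get_columns) can be sketched as a nested walk. The method names and the column-record keys follow the diagram; `StubClient` and `walk_metadata` are hypothetical, and the stub returns canned data in place of real HTTP calls.

```python
from typing import Dict, Iterator, List, Tuple


class StubClient:
    """Hypothetical client returning canned data in place of HTTP calls."""

    def list_databases(self) -> List[str]:
        return ["db1"]

    def list_tables(self, schema_name: str) -> List[str]:
        return ["table1"]

    def get_columns(self, schema_name: str, table_name: str) -> List[Dict[str, str]]:
        return [
            {
                "column_name": "time",
                "data_type": "Timestamp(Nanosecond, None)",
                "is_nullable": "NO",
            }
        ]


def walk_metadata(client: StubClient) -> Iterator[Tuple[str, str, List[Dict[str, str]]]]:
    """Yield (schema, table, columns) in the order the diagram shows."""
    for schema_name in client.list_databases():
        for table_name in client.list_tables(schema_name):
            yield schema_name, table_name, client.get_columns(schema_name, table_name)


tables = list(walk_metadata(StubClient()))
```

Each InfluxDB database plays the role of a schema here, matching the instance→Database, database→Schema, table→Table mapping stated in the first commit message.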