What
PostgresEngineAdapter doesn't override _df_to_source_queries, so DataFrames from Python models get written as literal VALUES batches i.e.
SELECT CAST(...) ... FROM (VALUES (...), (...)) AS t(...) per select_from_values_for_batch_range, (DEFAULT_BATCH_SIZE=10K rows).
Postgres has to parse and plan each batch as one giant statement, which on my workload capped out around 2–6K rows/s (row width and network will move that number, but the per-row parse cost is structural). If you're serving Python model output to Postgres, the write phase ends up dominating the whole plan.
Proposed Solution
MSSQLEngineAdapter._df_to_source_queries (mssql.py) (along with BigQuery, Snowflake, Clickhouse) all already have the shape: create a temp table, bulk-load the DataFrame into it, return a SourceQuery selecting _casted_columns(...) from the temp table with cleanup_func=drop_table, and fall back to super()._df_to_source_queries(...) when the connection can't bulk-load.
Postgres has a native bulk mechanism: psycopg2's cursor.copy_expert("COPY ... FROM STDIN WITH (FORMAT csv, ...)", buf). So this is a port of an existing pattern. Same temp-table lifecycle, same fallback philosophy (MSSQL gates on driver capability; this gates on dtype/content safety, see below).
I've got this running as a subclass override in my project (verified against 0.228.0 and 0.234.1) and checked correctness properly: per-table row counts and content checksums are byte-identical to the literal-SQL path across all 81 tables.
Comparision on 81 Python models (~1M rows, prod state creation , local Postgres):
|
literal VALUES |
COPY (suggested improvement) |
| model batch execution (81 models) |
96s |
7s (~14x) |
Some findings and considerations
- Int columns with NULLs: pandas stores nullable ints as floats, so COPY receives 0.0 and rejects it for int columns. We convert those columns to pandas' nullable Int64 first using a hand-picked set of int types, because sqlglot's INTEGER_TYPES constant wrongly includes Postgres's bit-string type.
- Errors instead of rounding: a bad value like 1.5 in an INT column now raises an error, where the old path silently rounded it to 2. Failing fast is better, but it's a behaviour change so the tests pin it down.
- Temp table names being too long: sqlmesh's generated temp-table names exceed Postgres's 63-character limit for long model names (34 of my 81 models were too long 😅 ). We shorten the name while keeping the suffix; the proper fix belongs in sqlmesh's base class.
- Risingwave's adapter inherits from the Postgres one, so it would silently get the COPY path on a server that may not support that syntax. The PR should pin Risingwave to its current behaviour until someone verifies COPY works there.
- NULL edge case: NULLs appear to travel as unquoted \N in the CSV. A real string whose whole value is \N would be mistaken for NULL, so any DataFrame containing one just uses the old write path instead.
Happy to contribute to the feature - implementation, unit tests asserting the copy_expert call + CSV payload, and whatever integration coverage you need. Some open questions:
- Default-on like MSSQL, or behind a connection setting?
_get_temp_table/MAX_IDENTIFIER_LENGTH as a precursor PR, or keep the local clamp in the COPY method?
- Risingwave: override back to base in the same PR, or does someone have COPY-on-Risingwave knowledge that says it's fine?
What
PostgresEngineAdapter doesn't override _df_to_source_queries, so DataFrames from Python models get written as literal VALUES batches i.e.
SELECT CAST(...) ... FROM (VALUES (...), (...)) AS t(...)per select_from_values_for_batch_range, (DEFAULT_BATCH_SIZE=10K rows).Postgres has to parse and plan each batch as one giant statement, which on my workload capped out around 2–6K rows/s (row width and network will move that number, but the per-row parse cost is structural). If you're serving Python model output to Postgres, the write phase ends up dominating the whole plan.
Proposed Solution
MSSQLEngineAdapter._df_to_source_queries(mssql.py) (along with BigQuery, Snowflake, Clickhouse) all already have the shape: create a temp table, bulk-load the DataFrame into it, return aSourceQueryselecting_casted_columns(...)from the temp table withcleanup_func=drop_table, and fall back tosuper()._df_to_source_queries(...)when the connection can't bulk-load.Postgres has a native bulk mechanism: psycopg2's
cursor.copy_expert("COPY ... FROM STDIN WITH (FORMAT csv, ...)", buf). So this is a port of an existing pattern. Same temp-table lifecycle, same fallback philosophy (MSSQL gates on driver capability; this gates on dtype/content safety, see below).I've got this running as a subclass override in my project (verified against 0.228.0 and 0.234.1) and checked correctness properly: per-table row counts and content checksums are byte-identical to the literal-SQL path across all 81 tables.
Comparision on 81 Python models (~1M rows, prod state creation , local Postgres):
Some findings and considerations
Happy to contribute to the feature - implementation, unit tests asserting the
copy_expertcall + CSV payload, and whatever integration coverage you need. Some open questions:_get_temp_table/MAX_IDENTIFIER_LENGTHas a precursor PR, or keep the local clamp in the COPY method?