feat: enforce physical column ordering in Parquet files #6287

Merged
g-talbot merged 80 commits into main from gtt/parquet-column-ordering-v2
Apr 13, 2026

Conversation

@g-talbot
Contributor

Summary

  • Sort schema columns are placed first (in their configured sort order), followed by remaining data columns alphabetically
  • This layout enables a two-GET streaming merge during compaction: the first GET reads the footer, the second streams from the start of the row group — sort columns arrive first, allowing the compactor to compute the global merge order before data columns arrive
  • Clean cherry-pick of the column-ordering work from #6281 ("feat: enforce physical column ordering in Parquet files for two-GET streaming merge"), which was accidentally merged into the docs branch instead of main
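
To make the ordering rule in the first bullet concrete, here is a minimal sketch of the reordering over plain column names. It is not the PR's `reorder_columns` implementation: the function name `physical_column_order`, its signature, and the choice to skip sort columns that are absent from the input are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Returns column names in physical write order: sort-schema columns first
/// (in their configured order), then every remaining column alphabetically.
/// Assumption of this sketch: sort columns missing from `columns` are skipped.
fn physical_column_order(columns: &[&str], sort_columns: &[&str]) -> Vec<String> {
    let present: HashSet<&str> = columns.iter().copied().collect();
    let mut placed: HashSet<&str> = HashSet::new();
    let mut ordered: Vec<String> = Vec::with_capacity(columns.len());

    // 1. Sort-schema columns, in the order the sort schema lists them.
    for &name in sort_columns {
        if present.contains(name) && placed.insert(name) {
            ordered.push(name.to_string());
        }
    }

    // 2. Everything else, alphabetically.
    let mut rest: Vec<&str> = columns
        .iter()
        .copied()
        .filter(|name| !placed.contains(*name))
        .collect();
    rest.sort_unstable();
    rest.dedup();
    ordered.extend(rest.into_iter().map(String::from));
    ordered
}
```

With a sort schema of `service, host, timestamp` and input columns `value, host, timestamp, service, env`, the sketch places `service, host, timestamp` first and then `env, value`.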

Test plan

  • reorder_columns unit test verifies sort columns first, then alphabetical
  • Round-trip test verifies column order preserved through Parquet write/read
  • All 154 quickwit-parquet-engine tests pass
  • Clippy clean

🤖 Generated with Claude Code

mattmkim and others added 30 commits March 30, 2026 14:21
…, window, TableConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… model, field lookup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire TableConfig-driven sort order into ParquetWriter and add
self-describing Parquet file metadata for compaction:

- ParquetWriter::new() takes &TableConfig, resolves sort fields at
  construction via parse_sort_fields() + ParquetField::from_name()
- sort_batch() uses resolved fields with per-column direction (ASC/DESC)
- SS-1 debug_assert verification: re-sort and check identity permutation
- build_compaction_key_value_metadata(): embeds sort_fields, window_start,
  window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata
- SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct
- write_to_file_with_metadata() replaces write_to_file()
- prepare_write() shared method for bytes and file paths
- ParquetWriterConfig gains to_writer_properties_with_metadata()
- ParquetSplitWriter passes TableConfig through
- All callers in quickwit-indexing updated with TableConfig::default()
- 23 storage tests pass including META-07 self-describing roundtrip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
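
The per-column direction and the SS-1 "re-sort and check identity permutation" idea from the commit above can be sketched as follows. This is a simplified illustration over rows of `i64` values, not the Arrow-based `sort_batch()` code; `Direction`, `SortField`, `sort_permutation`, and `is_sorted_by_fields` are hypothetical names.

```rust
use std::cmp::Ordering;

#[derive(Clone, Copy)]
enum Direction {
    Asc,
    Desc,
}

/// One resolved sort key: a column index plus its direction (hypothetical type).
struct SortField {
    column: usize,
    direction: Direction,
}

/// Compares two rows field by field, reversing the ordering for DESC fields.
fn compare_rows(a: &[i64], b: &[i64], fields: &[SortField]) -> Ordering {
    for field in fields {
        let ord = a[field.column].cmp(&b[field.column]);
        let ord = match field.direction {
            Direction::Asc => ord,
            Direction::Desc => ord.reverse(),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

/// Returns the permutation of row indices that sorts `rows` (stable sort).
fn sort_permutation(rows: &[Vec<i64>], fields: &[SortField]) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..rows.len()).collect();
    perm.sort_by(|&i, &j| compare_rows(&rows[i], &rows[j], fields));
    perm
}

/// Reorders rows according to a permutation of row indices.
fn apply_permutation(rows: &[Vec<i64>], perm: &[usize]) -> Vec<Vec<i64>> {
    perm.iter().map(|&i| rows[i].clone()).collect()
}

/// SS-1-style check: re-sorting already sorted rows must give the identity
/// permutation. This relies on the sort being stable.
fn is_sorted_by_fields(rows: &[Vec<i64>], fields: &[SortField]) -> bool {
    sort_permutation(rows, fields)
        .iter()
        .enumerate()
        .all(|(position, &index)| position == index)
}
```

Checking sortedness through a second sort costs more than a single linear pass, which is presumably why the commit keeps it behind `debug_assert`.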
…publish

Add compaction metadata to the PostgreSQL metastore:

Migration 27:
- 6 new columns: window_start, window_duration_secs, sort_fields,
  num_merge_ops, row_keys, zonemap_regexes
- Partial index idx_metrics_splits_compaction_scope on
  (index_uid, sort_fields, window_start) WHERE split_state = 'Published'

stage_metrics_splits:
- INSERT extended from 15 to 21 bind parameters for compaction columns
- ON CONFLICT SET updates all compaction columns

list_metrics_splits:
- PgMetricsSplit construction includes compaction fields (defaults from JSON)

Also fixes pre-existing compilation errors on upstream-10b-parquet-actors:
- Missing StageMetricsSplitsRequestExt import
- index_id vs index_uid type mismatches in publish/mark/delete
- IndexUid binding (to_string() for sqlx)
- ListMetricsSplitsResponseExt trait disambiguation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ublish validation

Close critical gaps identified during port review:

split_writer.rs:
- Store table_config on ParquetSplitWriter (not just pass-through)
- Compute window_start from batch time range using table_config.window_duration_secs
- Populate sort_fields, window_duration_secs, parquet_files on metadata before write
- Call write_to_file_with_metadata(Some(&metadata)) to embed KV metadata in Parquet
- Update size_bytes after write completes

metastore/mod.rs:
- Add window_start and sort_fields fields to ListMetricsSplitsQuery
- Add with_compaction_scope() builder method

metastore/postgres/metastore.rs:
- Add compaction scope filters (AND window_start = $N, AND sort_fields = $N) to list query
- Add replaced_split_ids count verification in publish_metrics_splits
- Bind compaction scope query parameters

ingest/config.rs:
- Add table_config: TableConfig field to ParquetIngestConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
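
The commit above computes `window_start` from the batch time range and `window_duration_secs`. A minimal sketch of that alignment, assuming timestamps in seconds and windows aligned to multiples of the duration since the epoch, might look like this. The function name `align_to_window_start` is hypothetical and is not the shared helper the PR stack later adds.

```rust
/// Aligns a timestamp (in seconds) down to the start of its window.
/// Returns None when the duration is not positive.
/// Assumption of this sketch: windows start at multiples of the duration.
fn align_to_window_start(timestamp_secs: i64, window_duration_secs: i64) -> Option<i64> {
    if window_duration_secs <= 0 {
        return None;
    }
    // rem_euclid keeps the remainder non-negative, so timestamps before the
    // epoch still round down rather than toward zero.
    Some(timestamp_secs - timestamp_secs.rem_euclid(window_duration_secs))
}
```

For example, with a one-hour window a timestamp of 3725 maps to 3600, and a timestamp of -1 maps to -3600.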
…ad code removal

- file_backed_index/mod.rs: Add window_start and sort_fields filtering
  to metrics_split_matches_query() for compaction scope queries
- writer.rs: Add test_meta07_self_describing_parquet_roundtrip test
  (writes compaction metadata to Parquet, reads back from cold file,
  verifies all fields roundtrip correctly)
- fields.rs: Remove dead sort_order() method (replaced by TableConfig)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…regexes

Gap 1: Change window_duration_secs from i32 to Option<i32> in both
PgMetricsSplit and InsertableMetricsSplit. Pre-Phase-31 splits now
correctly map 0 → NULL in PostgreSQL, enabling Phase 32 compaction
queries to use `WHERE window_duration_secs IS NOT NULL` instead of
the fragile `WHERE window_duration_secs > 0`.

Gap 2: Change zonemap_regexes from String to serde_json::Value in
both structs. This maps directly to JSONB in sqlx, avoiding ambiguity
when PostgreSQL JSONB operators are used in Phase 34/35 zonemap pruning.

Gap 3: Add two missing tests:
- test_insertable_from_metadata_with_compaction_fields: verifies all 6
  compaction fields round-trip through InsertableMetricsSplit
- test_insertable_from_metadata_pre_phase31_defaults: verifies pre-Phase-31
  metadata produces window_duration_secs: None, zonemap_regexes: json!({})

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…! macro

11 tests covering the full metrics split lifecycle:
- stage (happy path + non-existent index error)
- stage upsert (ON CONFLICT update)
- list by state, time range, metric name, compaction scope
- publish (happy path + non-existent split error)
- mark for deletion
- delete (happy path + idempotent non-existent)

Tests are generic and run against both file-backed and PostgreSQL backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- publish_metrics_splits: return NotFound (not FailedPrecondition) when
  staged splits don't exist
- delete_metrics_splits: succeed silently (idempotent) for non-existent
  splits instead of returning FailedPrecondition
- Tests now assert the correct error types on both backends

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration 27: add maturity_timestamp, delete_opstamp, node_id columns
  and publish_timestamp trigger to match the splits table (Paul's review)
- ListMetricsSplitsQuery: adopt FilterRange<i64> for time_range (matching
  log-side pattern), single time_range field for both read and compaction
  paths, add node_id/delete_opstamp/update_timestamp/create_timestamp/
  mature filters to close gaps with ListSplitsQuery
- Use SplitState enum instead of stringly-typed Vec<String> for split_states
- StoredMetricsSplit: add create_timestamp, node_id, delete_opstamp,
  maturity_timestamp so file-backed metastore can filter on them locally
- File-backed filter: use FilterRange::overlaps_with() for time range and
  window intersection, apply all new filters matching log-side predicate
- Postgres: intersection semantics for window queries, FilterRange-based
  SQL generation for all range filters
- Fix InsertableMetricsSplit.window_duration_secs from Option<i32> to i32
- Rename two-letter variables (ws, sf, dt) throughout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
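
The "intersection semantics" mentioned above for time range and window queries amount to an interval overlap test. The sketch below uses a hypothetical `QueryRange` type with optional inclusive bounds; it is a stand-in for illustration and does not reproduce the real `FilterRange` API.

```rust
/// Hypothetical stand-in for a range filter: inclusive bounds, None = unbounded.
struct QueryRange {
    start: Option<i64>,
    end: Option<i64>,
}

impl QueryRange {
    /// True if this range intersects the closed interval [lo, hi].
    fn overlaps(&self, lo: i64, hi: i64) -> bool {
        let starts_before_hi = self.start.map_or(true, |start| start <= hi);
        let ends_after_lo = self.end.map_or(true, |end| end >= lo);
        starts_before_hi && ends_after_lo
    }
}
```

A split whose time range is [100, 200] matches a query for [150, 300] because the intervals share points, even though neither contains the other.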
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract duplicated invariant logic into a shared `invariants/` module
within `quickwit-dst`. This is the "single source of truth" layer in
the verification pyramid — used by stateright models, production
debug_assert checks, and (future) Datadog metrics emission.

Key changes:
- `invariants/registry.rs`: InvariantId enum (20 variants) with Display
- `invariants/window.rs`: shared window_start_secs(), is_valid_window_duration()
- `invariants/sort.rs`: generic compare_with_null_ordering() for SS-2
- `invariants/check.rs`: check_invariant! macro wrapping debug_assert
- stateright gated behind `model-checking` feature (optional dep)
- quickwit-parquet-engine uses shared functions and check_invariant!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The check_invariant! macro now always evaluates the condition — not just
in debug builds. This implements Layer 4 (Production) of the verification
stack: invariant checks run in release, with results forwarded to a
pluggable InvariantRecorder for Datadog metrics emission.

- Debug builds: panic on violation (debug_assert, Layer 3)
- All builds: evaluate condition, call recorder (Layer 4)
- set_invariant_recorder() wires up statsd at process startup
- No recorder registered = no-op (single OnceLock load)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
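
The behavior described above (always evaluate, forward to a recorder, panic only in debug builds) could be sketched roughly as below. The names `check_invariant!`, `set_invariant_recorder`, and `InvariantRecorder` come from the commit message, but their shapes here are assumptions: the real macro takes an `InvariantId` rather than a string, and the real recorder may not be a plain function pointer.

```rust
use std::sync::OnceLock;

/// Assumed recorder shape: (invariant id, whether the check passed).
type InvariantRecorder = fn(&'static str, bool);

static RECORDER: OnceLock<InvariantRecorder> = OnceLock::new();

/// Registers the recorder once at startup; later calls are ignored.
fn set_invariant_recorder(recorder: InvariantRecorder) {
    let _ = RECORDER.set(recorder);
}

/// Forwards the result to the recorder, or does nothing if none is registered.
fn record_invariant(id: &'static str, passed: bool) {
    if let Some(recorder) = RECORDER.get() {
        recorder(id, passed);
    }
}

/// Evaluates the condition in all builds, records the result, and panics on
/// violation in debug builds only (via debug_assert!).
macro_rules! check_invariant {
    ($id:expr, $cond:expr) => {{
        let passed: bool = $cond;
        record_invariant($id, passed);
        debug_assert!(passed, "invariant {} violated", $id);
        passed
    }};
}
```

Recording before the `debug_assert!` means a violation is still counted in debug builds before the panic unwinds.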
Emit cloudprem.pomsky.invariant.checked and .violated counters with
invariant label via the metrics crate / DogStatsD exporter at process
startup, completing Layer 4 of the verification stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
g-talbot and others added 14 commits April 8, 2026 10:54
- Resolve postgres.rs conflict: keep check_invariant! macros, add
  window_duration_secs consistency check
- Group setup_dogstatsd_exporter + setup_invariant_recorder into
  single setup_metrics() function (fulmicoton-dd review)
- Rename `id` to `invariant_id` in invariant_recorder (fulmicoton-dd review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts c8bf8d7, cafcac5, a088f53 — these are code changes
(delete_metrics_splits error handling, doc comment tweaks) that
don't belong in a docs-only PR. They will land in a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This ADR contains company-specific information and should live
in the private fork, not in the upstream quickwit-oss repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rewrite CLAUDE.md as generic Quickwit AI development guide
- Replace Quickhouse-Pomsky -> Quickwit branding across all docs
- Replace "Datadog" observability references with generic
  "production observability" language
- Remove "Husky (Datadog)" qualifier from gap docs (keep Husky
  citations — the blog post is public)
- Generalize internal knowledge (query rate numbers, product-specific
  lateness guarantees)
- Remove PomChi reference, private Google Doc link
- Add docs/internals/UPSTREAM-CANDIDATES.md for pomsky tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove all ClickHouse/ClickStack references from gap docs and ADRs
  (keep Prometheus, Mimir, InfluxDB, Husky as prior art)
- Restore gap-005 Option C (compaction-time dedup) without ClickHouse citation
- Mark /sesh-mode reference in CLAUDE.md as aspirational
- Add aspirational items section to UPSTREAM-CANDIDATES.md tracking
  items described in docs but not yet implemented (TLA+ specs, DST,
  Kani, Bloodhound, performance baselines, benchmark binaries)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UPSTREAM-CANDIDATES.md incorrectly stated TLA+ specs and Stateright
models don't exist. They do (contributed in #6246): ParquetDataModel.tla,
SortSchema.tla, TimeWindowedCompaction.tla, plus quickwit-dst invariants
and Stateright model tests. Updated to accurately reflect that the
remaining aspirational piece is the simulation infrastructure (SimClock,
FaultInjector, etc.).

Also removed the /sesh-mode aspirational entry — it's actively being
used and the underlying specs/models are real.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents GSD planning artifacts from being committed to the repository.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts test env vars (CP_ENABLE_REVERSE_CONNECTION) and
load-cloudprem-ui target — these are pomsky-specific and
don't belong in upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot requested a review from mattmkim April 10, 2026 12:30
@g-talbot g-talbot changed the base branch from main to gtt/docs-claude-md April 10, 2026 14:15
@g-talbot g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from 9e5c6ef to cc4492e Compare April 10, 2026 14:18
@g-talbot g-talbot force-pushed the gtt/docs-claude-md branch from cc4492e to 4006b20 Compare April 10, 2026 14:18
g-talbot and others added 2 commits April 10, 2026 10:41
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…treaming merge (#6281)

* feat: enforce physical column ordering in Parquet files

Sort schema columns are written first (in their configured sort order),
followed by all remaining data columns in alphabetical order. This
physical layout enables a two-GET streaming merge during compaction:
the footer GET provides the schema and offsets, then a single streaming
GET from the start of the row group delivers sort columns first —
allowing the compactor to compute the global merge order before data
columns arrive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
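
To illustrate why the physical order matters for the streaming GET, the sketch below tracks which column chunks are fully received after a given number of bytes from the start of the row group. It assumes column chunks are laid out contiguously in schema order, which is typical of Parquet writers but is an assumption here; `ColumnChunk` and `columns_available` are hypothetical names, and the sizes would in practice come from the footer read by the first GET.

```rust
/// Hypothetical description of one column chunk, listed in physical order.
struct ColumnChunk {
    name: &'static str,
    compressed_size: u64,
}

/// Returns the names of the column chunks that are completely contained in
/// the first `bytes_received` bytes of the row group.
fn columns_available(chunks: &[ColumnChunk], bytes_received: u64) -> Vec<&'static str> {
    let mut chunk_end = 0u64;
    let mut ready = Vec::new();
    for chunk in chunks {
        chunk_end += chunk.compressed_size;
        if chunk_end > bytes_received {
            break;
        }
        ready.push(chunk.name);
    }
    ready
}
```

With sort columns written first, all of them become available after a prefix whose length is the sum of their chunk sizes, so merge-order computation can start while the data columns are still in flight.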

* test: verify input column order is actually scrambled

The sanity check only asserted presence, not ordering. Now it
verifies that host appears before service in the input (scrambled)
which is the opposite of the sort-schema order (service before host).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: rustfmt test code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: collapse nested if to satisfy clippy::collapsible_if

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from cc4492e to 946c229 Compare April 10, 2026 14:42
g-talbot added a commit that referenced this pull request Apr 13, 2026
Mark completed items in GAP-002 (sort schema parser, configurable
directions, timeseries_id, schema-driven sort, metadata storage) and
GAP-004 (MetricsSplitMetadata fields, RowKeys, zonemap regexes,
sorting_columns, KV metadata). Update ADR-002 Implementation Status
to reflect the full PR stack (#6287-#6295).

Remaining open items: per-index metastore storage (Phase 32), null
ordering fix, Parquet column/offset index enabling, PostgreSQL
migration for row_keys + zonemap columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base automatically changed from gtt/docs-claude-md to main April 13, 2026 23:16
@g-talbot g-talbot merged commit b469625 into main Apr 13, 2026
8 checks passed
@g-talbot g-talbot deleted the gtt/parquet-column-ordering-v2 branch April 13, 2026 23:41

5 participants