
feat: compute deterministic timeseries_id column at ingest #6286

Open
g-talbot wants to merge 4 commits into gtt/parquet-column-ordering-v2 from gtt/sorted-series-column

Conversation


@g-talbot commented Apr 9, 2026

Summary

  • Adds a timeseries_id column (Int64) to the metrics Arrow batch, computed as a deterministic SipHash-2-4 of the series identity columns
  • Hash includes metric_name, metric_type, and all tags — excludes temporal columns (timestamp_secs, start_timestamp_secs, timestamp) and value columns (value, plus the DDSketch components from #6257, "[metrics] Support DDSketch in the parquet pipeline": count, sum, min, max, flags, keys, counts)
  • Column is already declared in the metrics default sort schema (metric_name|service|env|datacenter|region|host|timeseries_id|timestamp_secs/V2), so the writer automatically sorts by it and places it in the correct physical position
  • Adds TimeseriesId variant to ParquetField enum and updates SORT_ORDER
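
The hashing scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: the PR uses keyed SipHash-2-4, while this sketch substitutes std's `DefaultHasher` (a deterministic, unkeyed SipHash variant) so it compiles without external crates, and the function and constant names merely mirror the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

// Temporal and value columns excluded from the series identity,
// copied from the summary above.
const EXCLUDED: &[&str] = &[
    "timestamp", "timestamp_secs", "start_timestamp_secs",
    "value", "count", "sum", "min", "max", "flags", "keys", "counts",
];

// Length-prefix each field so adjacent strings cannot collide:
// ("ab", "c") and ("a", "bc") feed different byte streams to the hasher.
fn put(h: &mut DefaultHasher, s: &str) {
    h.write_u64(s.len() as u64);
    h.write(s.as_bytes());
}

fn compute_timeseries_id(
    metric_name: &str,
    metric_type: &str,
    tags: &BTreeMap<String, String>,
) -> i64 {
    // The PR uses SipHash-2-4 with fixed keys for cross-process
    // determinism; DefaultHasher::new() stands in here.
    let mut h = DefaultHasher::new();
    put(&mut h, metric_name);
    put(&mut h, metric_type);
    // BTreeMap iterates in sorted key order, so the hash does not
    // depend on the order in which tags arrived.
    for (k, v) in tags {
        if EXCLUDED.contains(&k.as_str()) {
            continue;
        }
        put(&mut h, k);
        put(&mut h, v);
    }
    h.finish() as i64
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("host".to_string(), "h1".to_string());
    tags.insert("timestamp".to_string(), "123".to_string()); // excluded
    println!("timeseries_id = {}", compute_timeseries_id("cpu.usage", "gauge", &tags));
}
```

The length prefix is what makes the boundary-ambiguity test ({"ab":"c"} vs {"a":"bc"}) pass, and sorted iteration is what buys order independence.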

Design reference

Sorted Series Column for QW Parquet Pipeline — this PR implements the Timeseries ID component; the full Sorted Series composite key is a follow-up.

Test plan

  • 8 unit tests for compute_timeseries_id (determinism, exclusions, order independence, key/value non-interchangeability)
  • All 200 existing tests in quickwit-parquet-engine and quickwit-opentelemetry pass with updated column counts
  • Clippy clean (pre-existing warning in reorder_columns not introduced by this PR)

🤖 Generated with Claude Code

@g-talbot force-pushed the gtt/sorted-series-column branch from 1fba6db to 9ac8674 on April 10, 2026 10:44
Base automatically changed from gtt/parquet-column-ordering to gtt/docs-claude-md on April 10, 2026 10:57
@g-talbot changed the base branch from gtt/docs-claude-md to gtt/parquet-column-ordering on April 10, 2026 10:59
@g-talbot changed the base branch from gtt/parquet-column-ordering to main on April 10, 2026 11:00
@g-talbot force-pushed the gtt/sorted-series-column branch from 9ac8674 to 60d859c on April 10, 2026 11:11
@g-talbot changed the base branch from main to gtt/parquet-column-ordering-v2 on April 10, 2026 11:11
@g-talbot force-pushed the gtt/sorted-series-column branch from 60d859c to 9522326 on April 10, 2026 14:17
@g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from 9e5c6ef to cc4492e on April 10, 2026 14:18
@g-talbot requested a review from mattmkim on April 10, 2026 14:20
g-talbot and others added 4 commits April 10, 2026 10:41
Add a timeseries_id column (Int64) to the metrics Arrow batch,
computed as a SipHash-2-4 of the series identity columns (metric_name,
metric_type, and all tags excluding temporal/value columns). The hash
uses fixed keys for cross-process determinism.

The column is already declared in the metrics default sort schema
(between host and timestamp_secs), so the parquet writer now
automatically sorts by it and places it in the correct physical
position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The timeseries_id hash is persisted to Parquet files — any change
silently corrupts compaction and queries. Add:

- 3 pinned stability tests with hardcoded expected hash values
- 3 proptest properties (order independence, excluded tag immunity,
  extra-tag discrimination) each running 256 random cases
- Boundary ambiguity test ({"ab":"c"} vs {"a":"bc"})
- Same-series-different-timestamp invariant test
- All-excluded-tags coverage (every EXCLUDED_TAGS entry verified)
- Edge cases: empty strings, unicode, 100-tag cardinality
- Module-level doc explaining the stability contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
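
The order-independence and extra-tag-discrimination properties described above can be sketched without proptest, using a hand-rolled xorshift PRNG to shuffle tag arrival order. The helper names (`series_hash`) are illustrative, not the PR's actual API; only the 256-case count mirrors the commit message.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash a series from an *unsorted* list of tag pairs by sorting a copy
// first, so the result is independent of arrival order.
fn series_hash(metric: &str, pairs: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<_> = pairs.to_vec();
    sorted.sort(); // canonical order => order-independent hash
    let mut h = DefaultHasher::new();
    h.write_u64(metric.len() as u64);
    h.write(metric.as_bytes());
    for (k, v) in sorted {
        h.write_u64(k.len() as u64);
        h.write(k.as_bytes());
        h.write_u64(v.len() as u64);
        h.write(v.as_bytes());
    }
    h.finish()
}

// Minimal xorshift64 PRNG so the sketch needs no external crates.
fn xorshift(s: &mut u64) -> u64 {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    *s
}

fn main() {
    let base = [("host", "h1"), ("env", "prod"), ("dc", "us1"), ("region", "r9")];
    let expected = series_hash("cpu.usage", &base);
    let mut seed = 0x9e3779b97f4a7c15u64;
    for _ in 0..256 {
        // Fisher-Yates shuffle: a random arrival order for the same tags.
        let mut shuffled = base.to_vec();
        for i in (1..shuffled.len()).rev() {
            let j = (xorshift(&mut seed) % (i as u64 + 1)) as usize;
            shuffled.swap(i, j);
        }
        assert_eq!(series_hash("cpu.usage", &shuffled), expected);
        // Extra-tag discrimination: one more tag must change the hash.
        let mut extended = shuffled.clone();
        extended.push(("extra", "tag"));
        assert_ne!(series_hash("cpu.usage", &extended), expected);
    }
    println!("256 shuffled cases passed");
}
```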
Mirror the CTE + FOR UPDATE pattern from delete_splits to prevent
stale-state races. Without row locking, a concurrent
mark_metrics_splits_for_deletion can commit between the state read
and the DELETE, causing spurious FailedPrecondition errors and retry
churn.

The new query locks the target rows before reading their state,
reports not-deletable (Staged/Published) and not-found splits
separately, and only deletes when all requested splits are in
MarkedForDeletion state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
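
A hypothetical sketch of the query shape this commit describes (table and column names are assumptions, not the actual metastore schema; the real query additionally refuses to delete anything unless every locked row is in MarkedForDeletion):

```sql
-- Lock the target rows first so a concurrent state change cannot slip in
-- between reading split_state and issuing the DELETE.
WITH locked AS (
    SELECT split_id, split_state
    FROM metrics_splits
    WHERE split_id = ANY($1)
    FOR UPDATE
)
DELETE FROM metrics_splits
WHERE split_id IN (
    SELECT split_id FROM locked
    WHERE split_state = 'MarkedForDeletion'
)
RETURNING split_id;
-- The caller diffs RETURNING against $1: ids that were locked but not
-- deleted are not-deletable (Staged/Published); ids never locked are
-- not-found.
```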
@g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from cc4492e to 946c229 on April 10, 2026 14:42
@g-talbot force-pushed the gtt/sorted-series-column branch from 9522326 to b0344ba on April 10, 2026 14:42