
feat: compute deterministic timeseries_id column at ingest #6286

Open
g-talbot wants to merge 4 commits into gtt/parquet-column-ordering-v2 from gtt/sorted-series-column

Conversation


@g-talbot commented Apr 9, 2026

Summary

  • Adds a timeseries_id column (Int64) to the metrics Arrow batch, computed as a deterministic SipHash-2-4 of the series identity columns
  • Hash includes metric_name, metric_type, and all tags — excludes temporal columns (timestamp_secs, start_timestamp_secs, timestamp) and value columns (value, plus the DDSketch components from #6257, "[metrics] Support DDSketch in the parquet pipeline": count, sum, min, max, flags, keys, counts)
  • Column is already declared in the metrics default sort schema (metric_name|service|env|datacenter|region|host|timeseries_id|timestamp_secs/V2), so the writer automatically sorts by it and places it in the correct physical position
  • Adds TimeseriesId variant to ParquetField enum and updates SORT_ORDER
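
The hashing scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: the PR uses keyed SipHash-2-4, while this sketch substitutes std's `DefaultHasher` (a deterministic, unkeyed SipHash variant) so it compiles without external crates, and the function and constant names merely mirror the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

// Temporal and value columns excluded from the series identity,
// copied from the summary above.
const EXCLUDED: &[&str] = &[
    "timestamp", "timestamp_secs", "start_timestamp_secs",
    "value", "count", "sum", "min", "max", "flags", "keys", "counts",
];

// Length-prefix each field so adjacent strings cannot collide:
// ("ab", "c") and ("a", "bc") feed different byte streams to the hasher.
fn put(h: &mut DefaultHasher, s: &str) {
    h.write_u64(s.len() as u64);
    h.write(s.as_bytes());
}

fn compute_timeseries_id(
    metric_name: &str,
    metric_type: &str,
    tags: &BTreeMap<String, String>,
) -> i64 {
    // The PR uses SipHash-2-4 with fixed keys for cross-process
    // determinism; DefaultHasher::new() stands in here.
    let mut h = DefaultHasher::new();
    put(&mut h, metric_name);
    put(&mut h, metric_type);
    // BTreeMap iterates in sorted key order, so the hash does not
    // depend on the order in which tags arrived.
    for (k, v) in tags {
        if EXCLUDED.contains(&k.as_str()) {
            continue;
        }
        put(&mut h, k);
        put(&mut h, v);
    }
    h.finish() as i64
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("host".to_string(), "h1".to_string());
    tags.insert("timestamp".to_string(), "123".to_string()); // excluded
    println!("timeseries_id = {}", compute_timeseries_id("cpu.usage", "gauge", &tags));
}
```

The length prefix is what makes the boundary-ambiguity test ({"ab":"c"} vs {"a":"bc"}) pass, and sorted iteration is what buys order independence.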

Design reference

Sorted Series Column for QW Parquet Pipeline — this PR implements the Timeseries ID component; the full Sorted Series composite key is a follow-up.

Test plan

  • 8 unit tests for compute_timeseries_id (determinism, exclusions, order independence, key/value non-interchangeability)
  • All 200 existing tests in quickwit-parquet-engine and quickwit-opentelemetry pass with updated column counts
  • Clippy clean (pre-existing warning in reorder_columns not introduced by this PR)

🤖 Generated with Claude Code

@g-talbot force-pushed the gtt/sorted-series-column branch from 1fba6db to 9ac8674 on April 10, 2026 10:44
Base automatically changed from gtt/parquet-column-ordering to gtt/docs-claude-md on April 10, 2026 10:57
@g-talbot changed the base branch from gtt/docs-claude-md to gtt/parquet-column-ordering on April 10, 2026 10:59
@g-talbot changed the base branch from gtt/parquet-column-ordering to main on April 10, 2026 11:00
@g-talbot force-pushed the gtt/sorted-series-column branch from 9ac8674 to 60d859c on April 10, 2026 11:11
@g-talbot changed the base branch from main to gtt/parquet-column-ordering-v2 on April 10, 2026 11:11
@g-talbot force-pushed the gtt/sorted-series-column branch from 60d859c to 9522326 on April 10, 2026 14:17
@g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from 9e5c6ef to cc4492e on April 10, 2026 14:18
@g-talbot requested a review from mattmkim on April 10, 2026 14:20
g-talbot and others added 4 commits April 10, 2026 10:41
Add a timeseries_id column (Int64) to the metrics Arrow batch,
computed as a SipHash-2-4 of the series identity columns (metric_name,
metric_type, and all tags excluding temporal/value columns). The hash
uses fixed keys for cross-process determinism.

The column is already declared in the metrics default sort schema
(between host and timestamp_secs), so the parquet writer now
automatically sorts by it and places it in the correct physical
position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The timeseries_id hash is persisted to Parquet files — any change
silently corrupts compaction and queries. Add:

- 3 pinned stability tests with hardcoded expected hash values
- 3 proptest properties (order independence, excluded tag immunity,
  extra-tag discrimination) each running 256 random cases
- Boundary ambiguity test ({"ab":"c"} vs {"a":"bc"})
- Same-series-different-timestamp invariant test
- All-excluded-tags coverage (every EXCLUDED_TAGS entry verified)
- Edge cases: empty strings, unicode, 100-tag cardinality
- Module-level doc explaining the stability contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
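
The order-independence and extra-tag-discrimination properties described above can be sketched without proptest, using a hand-rolled xorshift PRNG to shuffle tag arrival order. The helper names (`series_hash`) are illustrative, not the PR's actual API; only the 256-case count mirrors the commit message.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash a series from an *unsorted* list of tag pairs by sorting a copy
// first, so the result is independent of arrival order.
fn series_hash(metric: &str, pairs: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<_> = pairs.to_vec();
    sorted.sort(); // canonical order => order-independent hash
    let mut h = DefaultHasher::new();
    h.write_u64(metric.len() as u64);
    h.write(metric.as_bytes());
    for (k, v) in sorted {
        h.write_u64(k.len() as u64);
        h.write(k.as_bytes());
        h.write_u64(v.len() as u64);
        h.write(v.as_bytes());
    }
    h.finish()
}

// Minimal xorshift64 PRNG so the sketch needs no external crates.
fn xorshift(s: &mut u64) -> u64 {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    *s
}

fn main() {
    let base = [("host", "h1"), ("env", "prod"), ("dc", "us1"), ("region", "r9")];
    let expected = series_hash("cpu.usage", &base);
    let mut seed = 0x9e3779b97f4a7c15u64;
    for _ in 0..256 {
        // Fisher-Yates shuffle: a random arrival order for the same tags.
        let mut shuffled = base.to_vec();
        for i in (1..shuffled.len()).rev() {
            let j = (xorshift(&mut seed) % (i as u64 + 1)) as usize;
            shuffled.swap(i, j);
        }
        assert_eq!(series_hash("cpu.usage", &shuffled), expected);
        // Extra-tag discrimination: one more tag must change the hash.
        let mut extended = shuffled.clone();
        extended.push(("extra", "tag"));
        assert_ne!(series_hash("cpu.usage", &extended), expected);
    }
    println!("256 shuffled cases passed");
}
```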
Mirror the CTE + FOR UPDATE pattern from delete_splits to prevent
stale-state races. Without row locking, a concurrent
mark_metrics_splits_for_deletion can commit between the state read
and the DELETE, causing spurious FailedPrecondition errors and retry
churn.

The new query locks the target rows before reading their state,
reports not-deletable (Staged/Published) and not-found splits
separately, and only deletes when all requested splits are in
MarkedForDeletion state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
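
A hypothetical sketch of the query shape this commit describes (table and column names are assumptions, not the actual metastore schema; the real query additionally refuses to delete anything unless every locked row is in MarkedForDeletion):

```sql
-- Lock the target rows first so a concurrent state change cannot slip in
-- between reading split_state and issuing the DELETE.
WITH locked AS (
    SELECT split_id, split_state
    FROM metrics_splits
    WHERE split_id = ANY($1)
    FOR UPDATE
)
DELETE FROM metrics_splits
WHERE split_id IN (
    SELECT split_id FROM locked
    WHERE split_state = 'MarkedForDeletion'
)
RETURNING split_id;
-- The caller diffs RETURNING against $1: ids that were locked but not
-- deleted are not-deletable (Staged/Published); ids never locked are
-- not-found.
```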
@g-talbot force-pushed the gtt/parquet-column-ordering-v2 branch from cc4492e to 946c229 on April 10, 2026 14:42
@g-talbot force-pushed the gtt/sorted-series-column branch from 9522326 to b0344ba on April 10, 2026 14:42