statpopgen: store GT genotypes as u8 (schema + builder + regression test) #8230
joseph-isaacs wants to merge 4 commits into
Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
The "Use Uint8 for GT field" change set the GT field schema to list(UInt8), but the Arrow builder still produced list(UInt64), causing data generation to fail at write time:

    column types must match schema types, expected List(UInt8) but found List(UInt64)

Switch GT_builder to ListBuilder<UInt8Builder> and have parse_genotype return Option<u8> (genotype dosage is only ever NULL/0/1/2), so the produced arrays match the declared schema.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Claude <noreply@anthropic.com>
Add a zero-row GnomADBuilder test that exercises RecordBatch::try_new and fails unless GT is emitted as list(u8), locking in the schema/builder type match. Also convert genotype dosage with a checked u8::try_from to satisfy clippy's cast_possible_truncation lint.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Claude <noreply@anthropic.com>
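The narrowing described in these two commits can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the real `parse_genotype` signature and input format are not shown here, so the VCF-style input strings (`"0/1"`, `"1|1"`, `"./."`) and the helper `parse_sample_genotypes` are assumptions made for the example.

```rust
use std::convert::TryFrom;

/// Hypothetical sketch: turn a VCF-style GT string into an alternate-allele
/// dosage. Returns `None` (NULL) when any allele is missing or malformed.
fn parse_genotype(gt: &str) -> Option<u8> {
    let mut dosage: u64 = 0;
    for allele in gt.split(|c: char| c == '/' || c == '|') {
        // "." (missing allele) fails to parse, so the whole call becomes None.
        let index: u64 = allele.parse().ok()?;
        if index > 0 {
            dosage += 1;
        }
    }
    // Checked narrowing instead of `dosage as u8`: an out-of-range value
    // yields None rather than silently truncating, which is the pattern
    // clippy's `cast_possible_truncation` lint pushes towards.
    u8::try_from(dosage).ok()
}

/// Hypothetical stand-in for appending one row of samples to a list builder:
/// every element is an `Option<u8>`, matching a `list(u8)` column.
fn parse_sample_genotypes(samples: &[&str]) -> Vec<Option<u8>> {
    samples.iter().map(|gt| parse_genotype(gt)).collect()
}
```

Under these assumptions, `parse_sample_genotypes(&["0/0", "0|1", "1/1", "./."])` yields `[Some(0), Some(1), Some(2), None]`, which covers the four values the commit message says a GT dosage can take.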
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.802x ✅, 6↑ 1↓)
datafusion / vortex-compact (0.937x ➖, 1↑ 0↓)
datafusion / parquet (0.942x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.780x ✅, 7↑ 0↓)
duckdb / vortex-compact (0.907x ➖, 5↑ 0↓)
duckdb / parquet (0.935x ➖, 2↑ 0↓)
unknown / unknown (no group data, 0↑ 0↓)
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.014x ➖
datafusion / vortex-file-compressed (1.014x ➖, 0↑ 1↓)
unknown / unknown (no group data, 0↑ 0↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.346x ❌, 0↑ 5↓)
datafusion / vortex-compact (1.130x ➖, 0↑ 2↓)
datafusion / parquet (1.258x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (1.263x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.009x ➖, 0↑ 0↓)
duckdb / parquet (1.126x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (environment too noisy confidence)
duckdb / vortex-file-compressed (0.677x ✅, 8↑ 0↓)
duckdb / vortex-compact (0.659x ✅, 7↑ 0↓)
duckdb / parquet (0.790x ✅, 8↑ 0↓)
unknown / unknown (no group data, 0↑ 0↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.203x ➖, 0↑ 6↓)
datafusion / vortex-compact (1.086x ➖, 0↑ 4↓)
datafusion / parquet (1.039x ➖, 2↑ 6↓)
duckdb / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.029x ➖, 0↑ 0↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 1.007x ➖
unknown / unknown (1.023x ➖, 3↑ 11↓)
Benchmarks: Random Access
Vortex (geomean): 1.164x ❌
unknown / unknown (0.973x ➖, 11↑ 4↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.013x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.975x ➖, 1↑ 0↓)
datafusion / parquet (1.111x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (1.025x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.045x ➖, 0↑ 1↓)
duckdb / parquet (1.079x ➖, 0↑ 0↓)
ping me again once the benchmarks pass
They did all pass
Merging this PR will improve performance by 10.81%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 273.3 µs | 308.3 µs | -11.34% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.7 µs | 161.4 µs | +22.5% |
| ⚡ | Simulation | encode_varbin[(1000, 4)] | 162.3 µs | 141.6 µs | +14.62% |
| ⚡ | Simulation | encode_varbin[(1000, 2)] | 161.3 µs | 141 µs | +14.39% |
| ⚡ | Simulation | encode_varbin[(1000, 8)] | 163 µs | 142.8 µs | +14.16% |
| ⚡ | Simulation | encode_varbin[(1000, 32)] | 168.3 µs | 147.8 µs | +13.87% |
Comparing claude/statpopgen-layout-compression-SAaRE (aef5ba3) with develop (66335d4)
What
Stores the `statpopgen` `GT` (genotype) field as `list(u8)` instead of `list(u64)`. GT dosage is only ever `NULL`, `0`, `1`, or `2`, so `u8` is the natural width and 8× narrower than `u64`.
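The 8× figure follows directly from the element sizes. As a rough sketch, the helper below (`values_bytes`, a name invented for this example) computes the size of the raw values buffer only. It ignores list offsets, validity bitmaps, and any compression applied on write, so the end-to-end file-size difference may well be smaller.

```rust
use std::mem::size_of;

/// Bytes needed for the raw values of `n` genotypes stored as `T`.
/// Illustrative only: offsets, validity bitmaps, and compression are ignored.
fn values_bytes<T>(n: usize) -> usize {
    n * size_of::<T>()
}
```

For example, `values_bytes::<u64>(1_000_000)` is 8,000,000 bytes, while `values_bytes::<u8>(1_000_000)` is 1,000,000 bytes.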