Multi-row table headers + .xls support + DECO benchmark (and excel_parser rename) by arnav2 · Pull Request #14 · knowledgestack/excel-parser

arnav2 · 2026-06-10T08:05:51Z

Scope (large — see breakdown)

This branch bundles a repo-wide rename with three parser improvements. The rename dominates the file count; the substantive logic changes are small and isolated.

1. Package rename: `ks_xlsx_parser` → `excel_parser` (and `ks_xlsx_core` → `excel_core`)

Mechanical rename across source, docs, scripts, site, and the Rust crate. This accounts for most of the +/- churn.

2. Multi-row table headers (the headline fix)

`find_header_span` previously returned a single row by construction (`HeaderSpan(top=r, bottom=r)`), so it could never cover a 2–4 row header, and such headers make up 36% of real headers. It now extends the band downward over contiguous, multi-column, styled label rows, stopping at the first data row, blank row, or single-cell divider, and is bounded by `MAX_HEADER_ROWS`.

The styling gate is the key design choice: genuine stacked headers are uniformly bold/filled; the first data row under a one-row header is not — so single-row headers stay one row.
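The extension rule and styling gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Cell` model, the `extend_header_span` helper, the "all labels are strings" data-row test, and the value `MAX_HEADER_ROWS = 4` are all assumptions made for the example; only the names `HeaderSpan` and `MAX_HEADER_ROWS` come from the description.

```python
from dataclasses import dataclass
from typing import Any, List

MAX_HEADER_ROWS = 4  # assumed value; the PR only says the band is bounded


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell model: a value plus the two style flags the gate uses."""
    value: Any = None
    bold: bool = False
    filled: bool = False


@dataclass(frozen=True)
class HeaderSpan:
    top: int
    bottom: int


def _is_styled_label_row(row: List[Cell]) -> bool:
    """True if the row looks like another header row of the same band."""
    labels = [c for c in row if c.value not in (None, "")]
    if len(labels) < 2:
        return False  # blank row or single-cell divider
    if not all(isinstance(c.value, str) for c in labels):
        return False  # numbers or dates suggest a data row
    # Styling gate: every label must be bold or filled.
    return all(c.bold or c.filled for c in labels)


def extend_header_span(rows: List[List[Cell]], top: int) -> HeaderSpan:
    """Grow a one-row header downward while the next row passes the gate."""
    bottom = top
    while (
        bottom + 1 < len(rows)
        and (bottom - top + 1) < MAX_HEADER_ROWS
        and _is_styled_label_row(rows[bottom + 1])
    ):
        bottom += 1
    return HeaderSpan(top=top, bottom=bottom)
```

Under these assumptions, an unstyled row of text labels directly below a bold header row fails the gate, so a single-row header stays at one row even when its first data row contains only strings.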

Measured on DECO (852 annotated .xlsx, 1,480 ground-truth tables; scripts/eval_deco.py):

| cohort | metric | before | after |
| --- | --- | --- | --- |
| multi-row headers (36%) | F1 / exact | 0.37 / 0% | 0.50 / 24% |
| multi-row headers | recall | 0.23 | 0.33 |
| single-row headers (64%) | exact | 84% | 79% |
| all headers | F1 / exact | 0.58 / 54% | 0.63 / 59% |
| table-boundary IoU | mean | 0.507 | 0.507 (unchanged) |

A more aggressive un-gated variant reached multi-row F1 0.71 but collapsed single-row precision 0.92→0.43 (it glues data rows into headers) — rejected. New regression tests in tests/test_header_detector.py lock both behaviours.

3. Legacy `.xls` support

Adds a `convert_xls_to_xlsx` backend (`xls_converter.py`) so that `.xls` workbooks parse via `workbook_parser`.
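For context, legacy `.xls` files are OLE2 compound documents while `.xlsx` files are ZIP archives, so the two containers can be told apart by their leading bytes. The sketch below shows one way a caller might decide whether a conversion step is needed. `sniff_workbook_format` is a hypothetical helper written for this example; the PR does not say how (or whether) `xls_converter.py` detects the format, and the signature of `convert_xls_to_xlsx` is not shown here.

```python
# Container signatures: OLE2 compound file (used by .xls) and ZIP (used by .xlsx).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_workbook_format(head: bytes) -> str:
    """Classify a workbook by its first bytes (hypothetical helper).

    Returns "xls" for an OLE2 container, "xlsx" for a ZIP container,
    and "unknown" otherwise. This identifies the container only; other
    formats share these signatures, so it is a hint, not validation.
    """
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown"
```

A caller could read the first eight bytes of a file, pass them to this function, and route `"xls"` inputs through the converter before parsing.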

4. DECO structural benchmark (`scripts/eval_deco.py`)

Scores table-boundary IoU + header-row P/R/F1 for ks vs Docling — the structural ground truth SpreadsheetBench lacks (it only has answer positions). Wired into download_corpora.sh + benchmarks README. Notable cross-parser finding: Docling has no A1 coordinates for xlsx and over-segments 80% of sheets into tiny spurious tables (median 5/sheet, up to 624), where ks produces real boundaries at mean IoU 0.51.
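As a reference for the two metric families, here is a minimal sketch of table-boundary IoU over inclusive cell rectangles and header-row precision/recall/F1 over sets of row indices. The box convention `(top, left, bottom, right)` and the function names are assumptions for illustration; `scripts/eval_deco.py` may define matching and aggregation differently.

```python
from typing import Set, Tuple

# Assumed convention: inclusive, zero-based (top, left, bottom, right).
Box = Tuple[int, int, int, int]


def box_area(b: Box) -> int:
    """Number of cells covered by an inclusive rectangle."""
    return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two cell rectangles."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    if bottom < top or right < left:
        return 0.0  # rectangles do not overlap
    inter = (bottom - top + 1) * (right - left + 1)
    return inter / (box_area(a) + box_area(b) - inter)


def header_prf(pred: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over predicted vs ground-truth header rows."""
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under this definition, predicting only the first row of a two-row header gives precision 1.0 but recall 0.5, which is why a detector that always returns a single row keeps high precision while losing recall on the multi-row cohort.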

Testing

Full suite: 1137 passed.
Ruff clean on changed files.
DECO benchmark reproducible via make/download_corpora.sh (corpus is download-on-demand, gitignored).

🤖 Generated with Claude Code

…o excel_parser

Renames the package ks_xlsx_parser → excel_parser (and Rust ks_xlsx_core → excel_core) across source, docs, scripts, and site, plus parser improvements:

- Header detection: extend find_header_span to multi-row header bands, gated on styling continuity so single-row headers stay one row. Measured on DECO (852 files, 1,480 GT tables): multi-row header F1 0.37→0.50 (exact 0%→24%, recall 0.23→0.33), single-row exact 84%→79%, table IoU unchanged. New unit tests in tests/test_header_detector.py.
- .xls support: convert_xls_to_xlsx backend so legacy workbooks parse.
- DECO structural benchmark (scripts/eval_deco.py): scores table-boundary IoU + header-row precision/recall/F1 vs Docling, the structural ground truth SpreadsheetBench lacks. Wired into download_corpora.sh + benchmarks README.

Full test suite: 1137 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

arnav2 wants to merge 1 commit into `main` from `arnav2/table-header-split-fix`
