
Multi-row table headers + .xls support + DECO benchmark (and excel_parser rename)#14

Open
arnav2 wants to merge 1 commit into main from arnav2/table-header-split-fix

@arnav2 arnav2 commented Jun 10, 2026

Collaborator

Scope (large — see breakdown)

This branch bundles a repo-wide rename with three parser improvements. The rename dominates the file count; the substantive logic changes are small and isolated.

1. Package rename: ks_xlsx_parser → excel_parser (and ks_xlsx_core → excel_core)

Mechanical rename across source, docs, scripts, site, and the Rust crate. This accounts for most of the +/- churn.

2. Multi-row table headers (the headline fix)

find_header_span previously returned a single row by construction (HeaderSpan(top=r, bottom=r)), so it could never cover a 2–4 row header — 36% of real headers. It now extends the band downward over contiguous multi-column, styled label rows, stopping at the first data row / blank / single-cell divider, bounded by MAX_HEADER_ROWS.

The styling gate is the key design choice: genuine stacked headers are uniformly bold/filled; the first data row under a one-row header is not — so single-row headers stay one row.
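
The extension rule described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this branch: the `Cell` model, the `_is_styled_label_row` helper, and the value `MAX_HEADER_ROWS = 4` are assumptions made for the example (the PR names the constant but not its value).

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_HEADER_ROWS = 4  # assumed value; the PR only names the constant


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell model: a value plus the two style flags the gate checks."""
    value: Optional[object] = None
    bold: bool = False
    filled: bool = False


@dataclass(frozen=True)
class HeaderSpan:
    top: int
    bottom: int


def _is_styled_label_row(row: List[Cell]) -> bool:
    """True for a multi-column row whose non-empty cells are all styled text labels."""
    non_empty = [c for c in row if c.value not in (None, "")]
    if len(non_empty) < 2:  # blank row or single-cell divider
        return False
    return all(isinstance(c.value, str) and (c.bold or c.filled) for c in non_empty)


def find_header_span(rows: List[List[Cell]], top: int) -> HeaderSpan:
    """Extend the header band downward from `top` while the next row passes the gate."""
    bottom = top
    while (
        bottom + 1 < len(rows)
        and (bottom - top + 1) < MAX_HEADER_ROWS
        and _is_styled_label_row(rows[bottom + 1])
    ):
        bottom += 1
    return HeaderSpan(top=top, bottom=bottom)


# Two stacked bold label rows followed by an unstyled data row.
stacked = [
    [Cell("Region", bold=True), Cell("Sales", bold=True)],
    [Cell("Name", bold=True), Cell("USD", bold=True)],
    [Cell("North"), Cell(1200.0)],
]
# One bold header row followed directly by data.
single = [
    [Cell("Name", bold=True), Cell("USD", bold=True)],
    [Cell("North"), Cell(1200.0)],
]
```

In this sketch `find_header_span(stacked, 0)` covers rows 0 to 1, while `find_header_span(single, 0)` stays on row 0 because the unstyled data row fails the gate.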

Measured on DECO (852 annotated .xlsx, 1,480 ground-truth tables; scripts/eval_deco.py):

| cohort | metric | before | after |
| --- | --- | --- | --- |
| multi-row headers (36%) | F1 / exact | 0.37 / 0% | 0.50 / 24% |
| multi-row headers | recall | 0.23 | 0.33 |
| single-row headers (64%) | exact | 84% | 79% |
| all headers | F1 / exact | 0.58 / 54% | 0.63 / 59% |
| table-boundary IoU | mean | 0.507 | 0.507 (unchanged) |
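
As a quick sanity check, the "all headers" exact-match figures are consistent with a cohort-weighted average of the two per-cohort rows. The snippet below assumes the 36% / 64% cohort shares can be used directly as weights and that all reported numbers are rounded.

```python
# Cohort shares and exact-match rates as reported in the table above.
multi_share, single_share = 0.36, 0.64

exact_before = multi_share * 0.00 + single_share * 0.84
exact_after = multi_share * 0.24 + single_share * 0.79

print(f"all-headers exact before: {exact_before:.2f}")  # 0.54
print(f"all-headers exact after:  {exact_after:.2f}")   # 0.59
```

The 5-point loss on single-row headers is outweighed by the 24-point gain on multi-row headers, which is how the overall exact rate moves from 54% to 59%.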

A more aggressive un-gated variant reached multi-row F1 0.71 but collapsed single-row precision 0.92→0.43 (it glues data rows into headers) — rejected. New regression tests in tests/test_header_detector.py lock both behaviours.

3. Legacy .xls support

Adds a convert_xls_to_xlsx backend (xls_converter.py) so that legacy .xls workbooks are converted and then parsed via workbook_parser.

4. DECO structural benchmark (scripts/eval_deco.py)

Scores table-boundary IoU + header-row P/R/F1 for ks vs Docling — the structural ground truth SpreadsheetBench lacks (it only has answer positions). Wired into download_corpora.sh + benchmarks README. Notable cross-parser finding: Docling has no A1 coordinates for xlsx and over-segments 80% of sheets into tiny spurious tables (median 5/sheet, up to 624), where ks produces real boundaries at mean IoU 0.51.
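
For readers unfamiliar with the two metrics, here is a minimal sketch of how table-boundary IoU and header-row precision/recall/F1 are conventionally computed. It is not taken from scripts/eval_deco.py; the function names and the `(top, left, bottom, right)` box layout are assumptions for illustration.

```python
from typing import Set, Tuple

# Inclusive cell-coordinate box: (top, left, bottom, right). Layout is an assumption.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive cell rectangles."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, bottom - top + 1) * max(0, right - left + 1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0


def header_prf(pred: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sets of header row indices."""
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A predicted table covering half of a 10x4 ground-truth table.
iou = box_iou((0, 0, 9, 3), (0, 0, 4, 3))

# A single-row prediction against a two-row ground-truth header.
p, r, f1 = header_prf({0}, {0, 1})
```

The second example shows why a detector that can only return one row scores poorly on multi-row headers: precision is perfect but recall is capped at 0.5, and an exact match is impossible.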

Testing

  • Full suite: 1137 passed.
  • Ruff clean on changed files.
  • DECO benchmark reproducible via make/download_corpora.sh (corpus is download-on-demand, gitignored).

🤖 Generated with Claude Code

…o excel_parser

Renames the package ks_xlsx_parser → excel_parser (and rust ks_xlsx_core →
excel_core) across source, docs, scripts, and site, plus parser improvements:

- Header detection: extend find_header_span to multi-row header bands, gated on
  styling continuity so single-row headers stay one row. Measured on DECO (852
  files, 1,480 GT tables): multi-row header F1 0.37→0.50 (exact 0%→24%, recall
  0.23→0.33), single-row exact 84%→79%, table IoU unchanged. New unit tests in
  tests/test_header_detector.py.
- .xls support: convert_xls_to_xlsx backend so legacy workbooks parse.
- DECO structural benchmark (scripts/eval_deco.py): scores table-boundary IoU +
  header-row precision/recall/F1 vs Docling — the structural ground truth
  SpreadsheetBench lacks. Wired into download_corpora.sh + benchmarks README.

Full test suite: 1137 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>