Multi-row table headers + .xls support + DECO benchmark (and excel_parser rename) #14
Open
arnav2 wants to merge 1 commit into
…o excel_parser
Renames the package ks_xlsx_parser → excel_parser (and Rust ks_xlsx_core → excel_core) across source, docs, scripts, and site, plus parser improvements:
- Header detection: extend find_header_span to multi-row header bands, gated on styling continuity so single-row headers stay one row. Measured on DECO (852 files, 1,480 GT tables): multi-row header F1 0.37→0.50 (exact 0%→24%, recall 0.23→0.33), single-row exact 84%→79%, table IoU unchanged. New unit tests in tests/test_header_detector.py.
- .xls support: convert_xls_to_xlsx backend so legacy workbooks parse.
- DECO structural benchmark (scripts/eval_deco.py): scores table-boundary IoU + header-row precision/recall/F1 vs Docling; DECO supplies the structural ground truth SpreadsheetBench lacks. Wired into download_corpora.sh + benchmarks README.
Full test suite: 1137 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Scope (large — see breakdown)
This branch bundles a repo-wide rename with three parser improvements. The rename dominates the file count; the substantive logic changes are small and isolated.
1. Package rename: `ks_xlsx_parser` → `excel_parser` (and `ks_xlsx_core` → `excel_core`)
Mechanical rename across source, docs, scripts, site, and the Rust crate. This accounts for most of the +/- churn.
2. Multi-row table headers (the headline fix)
`find_header_span` previously returned a single row by construction (`HeaderSpan(top=r, bottom=r)`), so it could never cover a 2–4 row header, which is 36% of real headers. It now extends the band downward over contiguous multi-column, styled label rows, stopping at the first data row / blank / single-cell divider, bounded by `MAX_HEADER_ROWS`.
The styling gate is the key design choice: genuine stacked headers are uniformly bold/filled; the first data row under a one-row header is not, so single-row headers stay one row.
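The band extension described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this branch: `Row`, `is_label_row`, and `extend_header_span` are hypothetical names, the value of `MAX_HEADER_ROWS` is assumed, and the real detector's styling and label checks are likely richer than a single boolean flag.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_HEADER_ROWS = 4  # assumed value; the PR names the constant but not its value


@dataclass(frozen=True)
class HeaderSpan:
    top: int
    bottom: int  # inclusive


@dataclass(frozen=True)
class Row:
    """Hypothetical row summary: cell texts plus one header-styling flag."""
    cells: List[Optional[str]]
    styled: bool  # e.g. bold or filled, as in the styling gate above


def is_label_row(row: Row) -> bool:
    """Header candidate: styled and spanning at least two non-empty cells."""
    filled = [c for c in row.cells if c not in (None, "")]
    if len(filled) < 2:   # blank row or single-cell divider
        return False
    return row.styled     # styling gate: unstyled rows are treated as data


def extend_header_span(rows: List[Row], top: int) -> HeaderSpan:
    """Grow the header band downward from `top` while rows keep qualifying."""
    bottom = top
    while (
        bottom + 1 < len(rows)
        and bottom + 1 - top < MAX_HEADER_ROWS  # keep the band within the cap
        and is_label_row(rows[bottom + 1])
    ):
        bottom += 1
    return HeaderSpan(top=top, bottom=bottom)
```

With a styled two-row header followed by an unstyled data row, the sketch returns a two-row span; with a one-row header followed directly by data, it stays at one row, which is the behaviour the styling gate is meant to preserve.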
Measured on DECO (852 annotated `.xlsx`, 1,480 ground-truth tables; `scripts/eval_deco.py`): multi-row header F1 0.37→0.50 (exact 0%→24%, recall 0.23→0.33), single-row exact 84%→79%, table IoU unchanged.
A more aggressive un-gated variant reached multi-row F1 0.71 but collapsed single-row precision 0.92→0.43 (it glues data rows into headers), so it was rejected. New regression tests in `tests/test_header_detector.py` lock both behaviours.
3. Legacy `.xls` support
`convert_xls_to_xlsx` backend (`xls_converter.py`) so `.xls` workbooks parse via `workbook_parser`.
4. DECO structural benchmark (`scripts/eval_deco.py`)
Scores table-boundary IoU + header-row P/R/F1 for ks vs Docling. DECO supplies the structural ground truth SpreadsheetBench lacks (it only has answer positions). Wired into `download_corpora.sh` + benchmarks README. Notable cross-parser finding: Docling has no A1 coordinates for xlsx and over-segments 80% of sheets into tiny spurious tables (median 5/sheet, up to 624), where ks produces real boundaries at mean IoU 0.51.
Testing
make/download_corpora.sh (corpus is download-on-demand, gitignored).
🤖 Generated with Claude Code
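For reviewers unfamiliar with the two metric families the benchmark reports, here is a minimal sketch of table-boundary IoU and header-row precision/recall/F1. It is not taken from `scripts/eval_deco.py`: the `Box` layout, the function names, and the convention of comparing header rows as sets of row indices are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Assumed box layout: (top, left, bottom, right), inclusive cell indices.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    """Number of cells in an inclusive box; 0 if the box is empty."""
    height = max(0, box[2] - box[0] + 1)
    width = max(0, box[3] - box[1] + 1)
    return height * width


def table_iou(pred: Box, gt: Box) -> float:
    """Intersection over union of two cell ranges."""
    inter = (
        max(pred[0], gt[0]),
        max(pred[1], gt[1]),
        min(pred[2], gt[2]),
        min(pred[3], gt[3]),
    )
    inter_area = box_area(inter)
    union = box_area(pred) + box_area(gt) - inter_area
    return inter_area / union if union else 0.0


def header_prf(pred_rows: Set[int], gt_rows: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over predicted vs. ground-truth header rows."""
    true_pos = len(pred_rows & gt_rows)
    precision = true_pos / len(pred_rows) if pred_rows else 0.0
    recall = true_pos / len(gt_rows) if gt_rows else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under these definitions, a parser that reports only the first row of a two-row header scores precision 1.0 but recall 0.5 on that table, which is the failure mode the old single-row `find_header_span` hit by construction.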