10 changes: 10 additions & 0 deletions .github/workflows/ci.yml
@@ -35,6 +35,16 @@ jobs:
with:
python-version: ${{ matrix.python-version }}

# Headless LibreOffice powers the full-fidelity legacy .xls → .xlsx path
# (formula text, charts). On Linux it's a cheap apt install, so the
# full-fidelity tests run here instead of being skipped. macOS runners
# skip it (the cask install is heavyweight); those tests self-skip.
- name: Install LibreOffice (Linux)
if: runner.os == 'Linux'
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends libreoffice-calc-nogui

- name: Install
run: uv pip install --system -e ".[dev,api]"
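
The workflow comment says the full-fidelity tests "self-skip" when LibreOffice is missing. The diff does not show how the test suite does this, so the following is only a minimal sketch of one way such a guard could look. The helper name `find_soffice`, the candidate binary names, and the test class are hypothetical, not taken from the repository.

```python
import shutil
import unittest


def find_soffice(which=shutil.which):
    """Return the path of a LibreOffice launcher, or None if none is on PATH.

    `which` is injectable so the lookup can be exercised without LibreOffice.
    The candidate names are an assumption; the Dockerfile comment only
    mentions `soffice`.
    """
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None


# Decorator that skips a test when no LibreOffice binary is found
# (for example on the macOS runners, which do not install it).
requires_libreoffice = unittest.skipUnless(
    find_soffice(), "LibreOffice (soffice) is not installed"
)


class FullFidelityXlsTest(unittest.TestCase):
    @requires_libreoffice
    def test_formula_text_survives_conversion(self):
        # Placeholder body: a real test would convert a legacy .xls file
        # and assert that formula text is preserved.
        self.assertIsNotNone(find_soffice())
```

With a guard like this, the Linux jobs run the test because the apt step puts a binary on `PATH`, while the macOS jobs report it as skipped rather than failed.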

6 changes: 4 additions & 2 deletions .gitignore
@@ -57,8 +57,10 @@ examples/stress_test/stress_results.json
examples/stress_test/built_reference.json
examples/stress_test/STRESS_TEST_RESULTS.md

# Local benchmark harness (private, not pushed)
tests/benchmarks/reports/
# Local benchmark harness (private, not pushed) — run outputs stay private,
# except the curated comparison the README links to.
tests/benchmarks/reports/*
!tests/benchmarks/reports/COMPARISON.md
tests/benchmarks/hucre_node/node_modules/
tests/benchmarks/hucre_node/.pnpm-store/

38 changes: 19 additions & 19 deletions CHANGELOG.md
@@ -1,6 +1,6 @@
# Changelog

All notable changes to **ks-xlsx-parser** are documented here.
All notable changes to **excel-parser** are documented here.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
@@ -47,18 +47,18 @@ Template for a new release (copy this block, fill in, move Unreleased items in):

## [0.2.1] — 2026-05-19

### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report)
### ⚠️ BREAKING (Fixed — see also #excel-parser channel report)
- Repository layout flattened on `src/` was leaking 13 generic top-level
packages (`models`, `utils`, `parsers`, …) into installed wheels and
silently dropping `pipeline.py` and `api.py` (setuptools `packages.find`
only finds *packages*, not top-level modules). Users hitting
`from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got
`from excel_parser.pipeline import ...` on 0.2.0 from PyPI got
`ModuleNotFoundError`. **All modules now live under
`src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only
`ks_xlsx_parser`. Imports inside the package switched from
`from pipeline import` to `from ks_xlsx_parser.pipeline import`.
`src/excel_parser/`**; the wheel's `top_level.txt` contains only
`excel_parser`. Imports inside the package switched from
`from pipeline import` to `from excel_parser.pipeline import`.
Downstream code that imported the leaked generics
(`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`.
(`from models import …`) MUST migrate to `from excel_parser.models …`.
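
The packaging bug described in this entry comes down to which names sit at the root of the wheel archive. The changelog mentions `scripts/verify_wheel.py`, but its contents are not part of this diff, so the sketch below is an independent illustration rather than that script. It builds two tiny fake wheels in memory (hypothetical file lists) and reports their top-level names.

```python
import io
import zipfile


def top_level_names(wheel_bytes: bytes) -> set[str]:
    """Return the first path component of every file in a wheel archive.

    Metadata directories (*.dist-info, *.data) are ignored, and a root-level
    module such as pipeline.py is reported without its .py suffix.
    """
    names = set()
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        for member in wheel.namelist():
            head = member.split("/", 1)[0]
            if head.endswith((".dist-info", ".data")):
                continue
            names.add(head[:-3] if head.endswith(".py") else head)
    return names


def build_fake_wheel(paths):
    """Create an in-memory zip with empty files at the given paths."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        for path in paths:
            wheel.writestr(path, "")
    return buffer.getvalue()


# Hypothetical layout resembling the leak: generic packages at the root.
leaky = build_fake_wheel([
    "models/__init__.py",
    "utils/__init__.py",
    "excel_parser-0.2.0.dist-info/METADATA",
])

# Hypothetical layout after the fix: everything under one package.
fixed = build_fake_wheel([
    "excel_parser/__init__.py",
    "excel_parser/pipeline.py",
    "excel_parser/api.py",
    "excel_parser-0.2.1.dist-info/METADATA",
])

print(sorted(top_level_names(leaky)))
print(sorted(top_level_names(fixed)))
```

A check of this kind fails fast when anything other than the intended package name appears at the archive root, which is the property the entry describes for `top_level.txt`.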

### Added
- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh
@@ -89,9 +89,9 @@ Template for a new release (copy this block, fill in, move Unreleased items in):
### Changed
- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the
package is now properly installable so callers don't need it.
- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`,
`py.typed` declared as package data, `xlsx-parser-api` console script
updated to `ks_xlsx_parser.api:main`.
- `pyproject.toml`: `packages.find` constrained to `excel_parser*`,
`py.typed` declared as package data, `excel-parser-api` console script
updated to `excel_parser.api:main`.

### ⚠️ BREAKING
- Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset
@@ -121,7 +121,7 @@ Template for a new release (copy this block, fill in, move Unreleased items in):
**Benchmark + retrievability release.** Adds a head-to-head benchmark against
[Docling](https://github.com/DS4SD/docling) on the [SpreadsheetBench](https://github.com/RUCKBReasoning/SpreadsheetBench)
corpus (912 instances, 5,458 xlsx files) and fixes three rendering bugs that
were silently torpedoing RAG retrieval. ks-xlsx-parser parses **99.945%** of
were silently torpedoing RAG retrieval. excel-parser parses **99.945%** of
SpreadsheetBench and **ties Docling at recall@1 / wins at recall@3 (+2.7 pp)
and recall@5 (+1.8 pp)**, plus 36.9% citation-grade geometric recall (Docling
0%, structurally — no A1 anchors).
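
For readers unfamiliar with the recall@k figures quoted above, here is a minimal sketch of the usual definition: the share of queries whose gold chunk appears among the top k retrieved results. It assumes one gold chunk per query. The benchmark's exact scoring lives in `tests/benchmarks/` and may differ, and the toy data is invented purely for illustration.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold chunk is among the first k ranked ids, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(results, k):
    """Average hit_at_k over (ranked_ids, gold_id) pairs, one pair per query."""
    hits = [hit_at_k(ranked, gold, k) for ranked, gold in results]
    return sum(hits) / len(hits)


# Invented toy data: three queries with their ranked chunk ids and gold id.
toy_results = [
    (["a", "b", "c"], "a"),  # gold at rank 1
    (["x", "y", "z"], "z"),  # gold at rank 3
    (["p", "q", "r"], "m"),  # gold never retrieved
]

recall_at_1 = mean_recall_at_k(toy_results, 1)
recall_at_3 = mean_recall_at_k(toy_results, 3)
```

Because a hit at rank 3 counts for recall@3 but not recall@1, two systems can tie at k=1 and still differ at larger k, which is the pattern the release summary reports.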
@@ -190,14 +190,14 @@ and recall@5 (+1.8 pp)**, plus 36.9% citation-grade geometric recall (Docling
text-match and geometric recall metrics.

### Performance
- ks-xlsx-parser is now ~5% faster on average parse time on SpreadsheetBench
- excel-parser is now ~5% faster on average parse time on SpreadsheetBench
than Docling (251 ms vs 265 ms mean), while producing a richer output
(formulas, dependency graph, charts, named ranges, etc.).

### Docs
- `tests/benchmarks/README.md` — new — methodology + adapter design.
- `tests/benchmarks/reports/COMPARISON.md` — new — head-to-head report.
- README — new "Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench"
- README — new "Benchmark — excel-parser vs Docling on SpreadsheetBench"
section near the top with the headline table.

### Internal
@@ -215,8 +215,8 @@ and recall@5 (+1.8 pp)**, plus 36.9% citation-grade geometric recall (Docling
announcement: [`docs/launch/RELEASE_NOTES_v0.1.1.md`](docs/launch/RELEASE_NOTES_v0.1.1.md).

### Added
- Public Python package **`ks-xlsx-parser`** on PyPI; import as
`xlsx_parser` or the alias `ks_xlsx_parser`.
- Public Python package **`excel-parser`** on PyPI; import as
`excel_parser` or the alias `excel_parser`.
- `parse_workbook()` returning a `ParseResult` with `.workbook`,
`.chunks`, and `.serializer` — full workbook graph (cells, formulas,
merges, tables, charts, CF, DV, named ranges, dependency edges).
@@ -233,7 +233,7 @@ announcement: [`docs/launch/RELEASE_NOTES_v0.1.1.md`](docs/launch/RELEASE_NOTES_
combo: 400, adversarial: 300).
- `tests/test_testbench_roundtrip.py` — parallel round-trip gate;
1054/1054 passing in ~70 s.
- FastAPI web server (`xlsx-parser-api`) in the `[api]` extra.
- FastAPI web server (`excel-parser-api`) in the `[api]` extra.
- GitHub Actions: `ci.yml` (test matrix on py3.10/3.11/3.12 × ubuntu/macos
+ dedicated testBench job) and `release.yml` (wheel + sdist + testBench
zip, PyPI Trusted Publishing).
@@ -278,7 +278,7 @@ announcement: [`docs/launch/RELEASE_NOTES_v0.1.1.md`](docs/launch/RELEASE_NOTES_
- Removed internal-only tooling: Ralph loop scripts, Cursor / Serena
agent configs, iteration logs, Knowledge-Stack-internal framing in
DESIGN.md.
- Rebranded from `arnav2/XLSXParser` to `knowledgestack/ks-xlsx-parser`;
- Rebranded from `arnav2/XLSXParser` to `knowledgestack/excel-parser`;
transferred the repo into the `knowledgestack` org and made it public.
- `uv.lock` regenerated after dropping the `[ralph]` extra and adding
`pytest-timeout` / `ruff` / `mypy` to `[dev]`.
@@ -289,5 +289,5 @@ Private-beta release used inside the Knowledge Stack ecosystem. Not
published to PyPI. Superseded by 0.1.1.

<!-- Compare links -->
[Unreleased]: https://github.com/knowledgestack/ks-xlsx-parser/compare/v0.1.1...HEAD
[0.1.1]: https://github.com/knowledgestack/ks-xlsx-parser/releases/tag/v0.1.1
[Unreleased]: https://github.com/knowledgestack/excel-parser/compare/v0.1.1...HEAD
[0.1.1]: https://github.com/knowledgestack/excel-parser/releases/tag/v0.1.1
24 changes: 12 additions & 12 deletions CONTRIBUTING.md
@@ -1,9 +1,9 @@
# Contributing to ks-xlsx-parser
# Contributing to excel-parser

**First: welcome.** 👋 If you got here and aren't sure what to do:

- Jump into our [**Discord**](https://discord.gg/4uaGhJcx) — real-time help, roadmap chat, and the fastest way to pair on an idea with a maintainer.
- Or open a [Discussion](https://github.com/knowledgestack/ks-xlsx-parser/discussions) if async is your thing.
- Or open a [Discussion](https://github.com/knowledgestack/excel-parser/discussions) if async is your thing.

We'd rather talk than have you leave. Every good-first-issue, every weird
`.xlsx` fixture, every three-line doc patch is welcome.
@@ -15,21 +15,21 @@ bug or send a small PR. If that's you, thank you.

1. **Run `make bench-robust` on SpreadsheetBench and report a file that
breaks.** We actively want edge-case `.xlsx` fixtures — use the
[Parser edge case issue template](https://github.com/knowledgestack/ks-xlsx-parser/issues/new?template=parser_edge_case.yml).
[Parser edge case issue template](https://github.com/knowledgestack/excel-parser/issues/new?template=parser_edge_case.yml).
2. **Submit an adversarial workbook.** Attach a `.xlsx` (or a generator
that builds one) to a Parser edge case issue. If the parser crashes
on it, even better.
3. **Fix one of the flagged issues** in [`docs/PARSER_KNOWN_ISSUES.md`](docs/PARSER_KNOWN_ISSUES.md).
4. **Improve docs.** The README, the architecture diagram, the examples —
if something confused you, it confuses everyone.
5. **Open a [Show & Tell](https://github.com/knowledgestack/ks-xlsx-parser/discussions/new?category=show-and-tell)**
5. **Open a [Show & Tell](https://github.com/knowledgestack/excel-parser/discussions/new?category=show-and-tell)**
if you shipped something with the parser. Seriously, it helps us prioritise.

## Development setup

```bash
git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
git clone https://github.com/knowledgestack/excel-parser.git
cd excel-parser
make install # pip install -e ".[dev,api]"
make test # fast, default suite
make corpus-download # fetch SpreadsheetBench (5,458 real-world xlsx)
@@ -58,14 +58,14 @@ fix with a one-paragraph explanation is almost always mergeable.

## Reporting issues

Use the [issue templates](https://github.com/knowledgestack/ks-xlsx-parser/issues/new/choose).
Use the [issue templates](https://github.com/knowledgestack/excel-parser/issues/new/choose).
For security issues, please use the
[private advisory flow](https://github.com/knowledgestack/ks-xlsx-parser/security/advisories/new)
[private advisory flow](https://github.com/knowledgestack/excel-parser/security/advisories/new)
— not a public issue.

Helpful things to include:

- Output of `python -c "import xlsx_parser; print(xlsx_parser.__version__)"`
- Output of `python -c "import excel_parser; print(excel_parser.__version__)"`
- Python version (`python --version`)
- OS
- Minimal `.xlsx` that reproduces the bug (or a generator that builds one)
@@ -83,9 +83,9 @@ Helpful things to include:
## Community

- **Discord**: <https://discord.gg/4uaGhJcx> — come hang out, the maintainers and regulars are active here.
- Discussions: <https://github.com/knowledgestack/ks-xlsx-parser/discussions>
- Issues: <https://github.com/knowledgestack/ks-xlsx-parser/issues>
- Security: <https://github.com/knowledgestack/ks-xlsx-parser/security/advisories>
- Discussions: <https://github.com/knowledgestack/excel-parser/discussions>
- Issues: <https://github.com/knowledgestack/excel-parser/issues>
- Security: <https://github.com/knowledgestack/excel-parser/security/advisories>
- Knowledge Stack org: <https://github.com/knowledgestack>

By participating you agree to follow our [Code of Conduct](CODE_OF_CONDUCT.md).
14 changes: 9 additions & 5 deletions Dockerfile.bench
@@ -1,19 +1,19 @@
# Benchmark image for ks-xlsx-parser.
# Benchmark image for excel-parser.
#
# Builds once, then on each run downloads SpreadsheetBench (if not cached),
# parses the corpus, embeds chunks with a small sentence-transformer, and
# emits a recall@k report + failure-bucket triage. The output lands in
# tests/benchmarks/reports/ — mount that path as a volume to persist results.
#
# Usage:
# docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
# docker build -f Dockerfile.bench -t excel-parser-bench .
# docker run --rm \
# -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
# -v "$PWD/data:/app/data" \
# ks-xlsx-parser-bench
# excel-parser-bench
#
# # Quick sanity run on 20 instances:
# docker run --rm -e BENCH_SAMPLE=20 ks-xlsx-parser-bench
# docker run --rm -e BENCH_SAMPLE=20 excel-parser-bench

FROM python:3.12-slim

@@ -24,8 +24,12 @@ ENV PYTHONDONTWRITEBYTECODE=1 \

WORKDIR /app

# libreoffice-calc-nogui gives a headless `soffice` for full-fidelity legacy
# .xls → .xlsx conversion (preserves formula text, charts, shapes). Without it
# the parser falls back to the pure-Python xlrd path (values only). --no-install-
# recommends keeps the image lean (skips the X11/Java recommends).
RUN apt-get update && apt-get install -y --no-install-recommends \
curl unzip ca-certificates git \
curl unzip ca-certificates git libreoffice-calc-nogui \
&& rm -rf /var/lib/apt/lists/*
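
The Dockerfile comment describes two conversion paths: headless `soffice` when it is installed, otherwise the values-only xlrd path. The parser's real selection logic is not shown in this diff, so the sketch below only illustrates the idea. The function names are hypothetical; the `--headless`, `--convert-to` and `--outdir` options are standard LibreOffice command-line flags.

```python
import shutil
from pathlib import Path


def choose_xls_backend(which=shutil.which):
    """Return ("libreoffice", path) if soffice is on PATH, else ("xlrd", None).

    `which` is injectable so the decision can be checked without LibreOffice.
    """
    path = which("soffice")
    return ("libreoffice", path) if path else ("xlrd", None)


def build_convert_command(soffice, source, outdir):
    """Build the argv for a headless .xls to .xlsx conversion.

    The command is only constructed here, not executed; a caller would pass
    it to subprocess.run and then read the converted file from outdir.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "xlsx",
        "--outdir", str(Path(outdir)),
        str(Path(source)),
    ]


backend, soffice_path = choose_xls_backend()
if backend == "libreoffice":
    command = build_convert_command(soffice_path, "legacy.xls", "converted")
else:
    command = None  # fall back to the pure-Python path (values only)
```

Keeping the detection separate from the command construction makes both pieces testable on machines without LibreOffice, which matters here because the macOS CI runners do not install it.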

# Install deps first to keep layers cacheable across code edits.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2025 XLSX Parser Contributors
Copyright (c) 2025 Excel Parser Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
8 changes: 4 additions & 4 deletions Makefile
@@ -5,7 +5,7 @@ PYTHON ?= python
PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])")

help:
@echo "ks-xlsx-parser — common targets"
@echo "excel-parser — common targets"
@echo ""
@echo " make install Install package and dev deps (editable)"
@echo " make install-dev Alias for install (matches ks-backend)"
@@ -44,7 +44,7 @@ format:
$(PYTHON) -m ruff format src/ tests/ scripts/

typecheck:
$(PYTHON) -m mypy src/ks_xlsx_parser
$(PYTHON) -m mypy src/excel_parser

# Build the wheel and prove it imports outside the editable source tree.
# This is the regression guard for the v0.2.0 packaging bug (pipeline.py
@@ -84,5 +84,5 @@ bench-track:
$(PYTHON) scripts/triage_recall.py tests/benchmarks/reports/retrieval

docker-bench:
docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" ks-xlsx-parser-bench
docker build -f Dockerfile.bench -t excel-parser-bench .
docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" excel-parser-bench