From 9517001cf52e9599973b8c3c7bb23444cf3d6088 Mon Sep 17 00:00:00 2001 From: Na'aman Hirschfeld Date: Wed, 10 Jun 2026 15:23:48 +0200 Subject: [PATCH] feat: add Kreuzberg plugins --- .factory-plugin/marketplace.json | 18 + .../.factory-plugin/plugin.json | 23 + plugins/kreuzberg-cloud/README.md | 112 ++ plugins/kreuzberg-cloud/assets/icon.svg | 9 + plugins/kreuzberg-cloud/assets/logo.png | Bin 0 -> 11204 bytes .../skills/kreuzberg-cloud/SKILL.md | 345 ++++ .../skills/managing-cloud-usage/SKILL.md | 101 ++ .../skills/offloading-extraction/SKILL.md | 200 +++ .../skills/presigned-uploads/SKILL.md | 158 ++ .../skills/sandbox-keys/SKILL.md | 116 ++ .../skills/tracking-cloud-jobs/SKILL.md | 185 +++ plugins/kreuzberg/.factory-plugin/plugin.json | 24 + plugins/kreuzberg/README.md | 110 ++ plugins/kreuzberg/assets/icon.svg | 8 + plugins/kreuzberg/assets/logo.png | Bin 0 -> 9547 bytes .../skills/extracting-tables/SKILL.md | 148 ++ .../skills/extracting-with-ocr/SKILL.md | 123 ++ plugins/kreuzberg/skills/kreuzberg/SKILL.md | 419 +++++ .../kreuzberg/references/advanced-features.md | 967 +++++++++++ .../kreuzberg/references/cli-reference.md | 449 +++++ .../kreuzberg/references/configuration.md | 417 +++++ .../skills/kreuzberg/references/nodejs-api.md | 1380 ++++++++++++++++ .../kreuzberg/references/other-bindings.md | 212 +++ .../skills/kreuzberg/references/python-api.md | 1440 +++++++++++++++++ .../skills/kreuzberg/references/rust-api.md | 866 ++++++++++ .../kreuzberg/references/supported-formats.md | 223 +++ .../skills/picking-a-format/SKILL.md | 94 ++ .../kreuzcrawl/.factory-plugin/plugin.json | 23 + plugins/kreuzcrawl/README.md | 102 ++ plugins/kreuzcrawl/assets/icon.svg | 14 + plugins/kreuzcrawl/assets/logo.png | Bin 0 -> 12773 bytes .../skills/crawling-a-site/SKILL.md | 142 ++ .../skills/headless-fallback/SKILL.md | 140 ++ plugins/kreuzcrawl/skills/kreuzcrawl/SKILL.md | 191 +++ .../skills/scraping-html-to-markdown/SKILL.md | 124 ++ skills/crawling-a-site | 1 + 
skills/extracting-tables | 1 + skills/extracting-with-ocr | 1 + skills/headless-fallback | 1 + skills/kreuzberg | 1 + skills/kreuzberg-cloud | 1 + skills/kreuzcrawl | 1 + skills/managing-cloud-usage | 1 + skills/offloading-extraction | 1 + skills/picking-a-format | 1 + skills/presigned-uploads | 1 + skills/sandbox-keys | 1 + skills/scraping-html-to-markdown | 1 + skills/tracking-cloud-jobs | 1 + 49 files changed, 8897 insertions(+) create mode 100644 plugins/kreuzberg-cloud/.factory-plugin/plugin.json create mode 100644 plugins/kreuzberg-cloud/README.md create mode 100644 plugins/kreuzberg-cloud/assets/icon.svg create mode 100644 plugins/kreuzberg-cloud/assets/logo.png create mode 100644 plugins/kreuzberg-cloud/skills/kreuzberg-cloud/SKILL.md create mode 100644 plugins/kreuzberg-cloud/skills/managing-cloud-usage/SKILL.md create mode 100644 plugins/kreuzberg-cloud/skills/offloading-extraction/SKILL.md create mode 100644 plugins/kreuzberg-cloud/skills/presigned-uploads/SKILL.md create mode 100644 plugins/kreuzberg-cloud/skills/sandbox-keys/SKILL.md create mode 100644 plugins/kreuzberg-cloud/skills/tracking-cloud-jobs/SKILL.md create mode 100644 plugins/kreuzberg/.factory-plugin/plugin.json create mode 100644 plugins/kreuzberg/README.md create mode 100644 plugins/kreuzberg/assets/icon.svg create mode 100644 plugins/kreuzberg/assets/logo.png create mode 100644 plugins/kreuzberg/skills/extracting-tables/SKILL.md create mode 100644 plugins/kreuzberg/skills/extracting-with-ocr/SKILL.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/SKILL.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/advanced-features.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/cli-reference.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/configuration.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/nodejs-api.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/other-bindings.md create mode 100644 
plugins/kreuzberg/skills/kreuzberg/references/python-api.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/rust-api.md create mode 100644 plugins/kreuzberg/skills/kreuzberg/references/supported-formats.md create mode 100644 plugins/kreuzberg/skills/picking-a-format/SKILL.md create mode 100644 plugins/kreuzcrawl/.factory-plugin/plugin.json create mode 100644 plugins/kreuzcrawl/README.md create mode 100644 plugins/kreuzcrawl/assets/icon.svg create mode 100644 plugins/kreuzcrawl/assets/logo.png create mode 100644 plugins/kreuzcrawl/skills/crawling-a-site/SKILL.md create mode 100644 plugins/kreuzcrawl/skills/headless-fallback/SKILL.md create mode 100644 plugins/kreuzcrawl/skills/kreuzcrawl/SKILL.md create mode 100644 plugins/kreuzcrawl/skills/scraping-html-to-markdown/SKILL.md create mode 120000 skills/crawling-a-site create mode 120000 skills/extracting-tables create mode 120000 skills/extracting-with-ocr create mode 120000 skills/headless-fallback create mode 120000 skills/kreuzberg create mode 120000 skills/kreuzberg-cloud create mode 120000 skills/kreuzcrawl create mode 120000 skills/managing-cloud-usage create mode 120000 skills/offloading-extraction create mode 120000 skills/picking-a-format create mode 120000 skills/presigned-uploads create mode 120000 skills/sandbox-keys create mode 120000 skills/scraping-html-to-markdown create mode 120000 skills/tracking-cloud-jobs diff --git a/.factory-plugin/marketplace.json b/.factory-plugin/marketplace.json index 4282159..98338f6 100644 --- a/.factory-plugin/marketplace.json +++ b/.factory-plugin/marketplace.json @@ -53,6 +53,24 @@ "description": "Pull request lifecycle skills: create PRs with consistent conventions and follow up on them until merge-ready", "source": "./plugins/code-review", "category": "productivity" + }, + { + "name": "kreuzberg", + "description": "Local document extraction: text, tables, metadata, images from 91+ formats with optional OCR.", + "source": "./plugins/kreuzberg", + 
"category": "document-intelligence" + }, + { + "name": "kreuzcrawl", + "description": "Web crawling and scraping with HTML→Markdown and headless-Chrome fallback.", + "source": "./plugins/kreuzcrawl", + "category": "web-scraping" + }, + { + "name": "kreuzberg-cloud", + "description": "Managed extraction via api.kreuzberg.dev with webhooks, uploads, and usage tracking.", + "source": "./plugins/kreuzberg-cloud", + "category": "document-intelligence" } ] } diff --git a/plugins/kreuzberg-cloud/.factory-plugin/plugin.json b/plugins/kreuzberg-cloud/.factory-plugin/plugin.json new file mode 100644 index 0000000..03b8e43 --- /dev/null +++ b/plugins/kreuzberg-cloud/.factory-plugin/plugin.json @@ -0,0 +1,23 @@ +{ + "name": "kreuzberg-cloud", + "version": "0.1.0", + "description": "Offload document extraction to api.kreuzberg.dev — managed extraction with webhooks, presigned uploads, and usage tracking.", + "author": { + "name": "Kreuzberg, Inc.", + "email": "support@kreuzberg.dev", + "url": "https://kreuzberg.dev" + }, + "homepage": "https://kreuzberg.dev", + "repository": "https://github.com/kreuzberg-dev/plugins", + "license": "MIT", + "category": "document-intelligence", + "keywords": [ + "cloud", + "document-intelligence", + "extraction", + "webhooks" + ], + "brandColor": "#0EA5E9", + "icon": "./assets/icon.svg", + "logo": "./assets/logo.png" +} diff --git a/plugins/kreuzberg-cloud/README.md b/plugins/kreuzberg-cloud/README.md new file mode 100644 index 0000000..cadb366 --- /dev/null +++ b/plugins/kreuzberg-cloud/README.md @@ -0,0 +1,112 @@ +# kreuzberg-cloud + +Offload document extraction to `api.kreuzberg.dev` — managed extraction with webhook delivery, presigned uploads for large files, sandbox keys, and per-project usage tracking. + + + +## Install + +### From the marketplace (recommended) + +Pending review for official Claude marketplace. 
+ +Self-host: + +```text +/plugin marketplace add kreuzberg-dev/plugins +/plugin install kreuzberg-cloud@kreuzberg +``` + +### v0.1.0 — skills only + +The plugin v0.1.0 ships **skills and documentation only; no MCP server**. The `kreuzberg-cloud` CLI binary with MCP wiring lands in plugin v0.2.0. Agents call the HTTP REST API directly via curl or one of the official SDKs: + +- **TypeScript/Node.js**: `@kreuzberg/cloud` ([npm](https://www.npmjs.com/package/@kreuzberg/cloud)) +- **Python**: `kreuzberg-cloud-sdk` ([PyPI](https://pypi.org/project/kreuzberg-cloud-sdk/)) + +### API key requirement + +Set the `KREUZBERG_API_KEY` environment variable or write `~/.kreuzberg/cloud.toml`: + +```toml +api_key = "sk_live_..." +``` + +If neither is set, the plugin's SessionStart hook displays a reminder. For evaluation without signup, use sandbox keys (see the `sandbox-keys` skill). + +## Skills shipped + +| Skill | Trigger | +|-------|---------| +| **kreuzberg-cloud** | Offload document extraction to api.kreuzberg.dev. Use when the user wants managed extraction with webhook delivery, presigned uploads for large files, sandbox keys, or per-project usage tracking — instead of running the local kreuzberg CLI. Covers authentication, the 12 REST endpoints, request/response shapes, error model, and SDK options. | +| **offloading-extraction** | Use when the user wants to extract a document via the cloud rather than the local kreuzberg CLI. Covers POST /v1/extract — JSON vs multipart bodies, URL crawls, options block, webhook attachment, and the async response shape. | +| **tracking-cloud-jobs** | Use when an extraction job has been submitted and the result needs to be retrieved. Covers GET /v1/jobs/{id}, polling cadence with exponential backoff, terminal status detection, and webhook delivery (signature verification, retry semantics). 
| +| **presigned-uploads** | Use when the user has files larger than ~50 MB to extract via the cloud, or when base64-encoding the body would be wasteful. Covers the three-step presign / PUT / confirm flow against POST /v1/uploads/presign and POST /v1/uploads/confirm. | +| **managing-cloud-usage** | Use when the user asks about quota, billing visibility, or processed-page counts. Covers GET /v1/usage — query params, response shape, when to report usage proactively to the user. | +| **sandbox-keys** | Use when the user wants to try Kreuzberg Cloud without signing up, or needs an ephemeral key for evaluation, demos, or CI integration tests. Covers POST /v1/sandbox/key — the no-auth endpoint, quota, TTL, and cleanup expectations. | + +## MCP tools + +MCP wiring lands in v0.2.0. Until then, the v0.1.0 skills document the REST API directly with curl, TypeScript SDK, and Python SDK examples. + +## Configuration + +### Environment variable + +```bash +export KREUZBERG_API_KEY="sk_live_..." +``` + +### Config file + +Create `~/.kreuzberg/cloud.toml`: + +```toml +api_key = "sk_live_..." +base_url = "https://api.kreuzberg.dev" # optional +``` + +Precedence: CLI argument > environment variable > config file. 
+ +## Examples + +Submit a document for extraction via curl: + +```text +curl -X POST https://api.kreuzberg.dev/v1/extract \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"urls":[{"url":"https://example.com/document.pdf"}]}' +``` + +Fetch a job's current status with the TypeScript SDK: + +```text +import { CloudClient } from "@kreuzberg/cloud"; +const client = new CloudClient({ apiKey: process.env.KREUZBERG_API_KEY }); +const job = await client.getJob(jobId); +console.log(job.status); // e.g. "pending", "processing", "completed", "failed" +``` + +Check quota with the Python SDK: + +```text +import os +from kreuzberg_cloud_sdk import Client +client = Client(api_key=os.getenv("KREUZBERG_API_KEY")) +usage = client.get_usage() +print(f"Pages processed: {usage.total_pages}, Quota: {usage.quota_limit}") +``` + +## Versioning + +The plugin version tracks the marketplace `VERSION` file. See [CHANGELOG.md](../../CHANGELOG.md) for release notes. + +## License + +MIT. + +## See also + +- **Marketplace**: [kreuzberg-dev/plugins](https://github.com/kreuzberg-dev/plugins) +- **Upstream**: [kreuzberg-dev/kreuzberg-cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) +- **Sibling plugins**: [kreuzberg](../kreuzberg/README.md), [kreuzcrawl](../kreuzcrawl/README.md) diff --git a/plugins/kreuzberg-cloud/assets/icon.svg b/plugins/kreuzberg-cloud/assets/icon.svg new file mode 100644 index 0000000..23eeacc --- /dev/null +++ b/plugins/kreuzberg-cloud/assets/icon.svg @@ -0,0 +1,9 @@ + + + + + + + + + diff --git a/plugins/kreuzberg-cloud/assets/logo.png b/plugins/kreuzberg-cloud/assets/logo.png new file mode 100644 index 0000000000000000000000000000000000000000..8cb2e11d0ff8a0d5560d8bfb6368b1d972c5a354 GIT binary patch literal 11204 zcmd6NcT^M3v+o9i;8#izkzSOdfCZ%sfd~i!B1MXHl@3xAX+a<$N|7QUAYG)lP^5Q7 zDbl-itaOkPN@#cE?|0t0@BVTBdGEgS4(E_$cXoDm=QFc2yA!N=Tbbq`)_(v1G%7b0 zwE-Z5U&#Q~G5E3RF-U|T#~!FDD}p1^-`o1^XaKl@isE%0kEG>EuOwqv@1wQt?M?@G 
zH#dn|6}RXveqPK4Ub5q71@#?rC(g#@mGB}%;?{G@gDQv{vl=0qP6TT%?LNZ=jp8$i zP`y*nzGqS(oCqf|)Ox+3x_Gq9!QI+gq;linfZqQMZ(@CC{Hm0TX-}iGh|6#DtJ1S# z(=NY5>FwTmxtev{?C+LnFt{h<cj674F!MCFY9-*K9Y257RFNsCAlUhj_0et z^lrpPMY{XP+!igp949(JumLEB13m0S+y0j?M=`T}oE*274$5PDI9$Zz)0D3Ee4vBre0Yaydm$J?`?AVc+Ob&*Y4P z#drY3bYFG)s%!}b>nyO<#<|nhaJH3yJ=WCedQkmf)bv7p<;^X^0|PGb{oVQW)wW7A z$1OrwRf-OCEohrun|NSeud9R&%(y+W0YDOKX}?KIoiCfQ)CLniFgM!+*;Lcjwft*S zP{`!syv4{50OaSHTE>iyb);9eetYWof)+PXGi$~FCAiXX)Sj16l|l(n$I10-$`?FS z1&qaA5*P3L_rF32J1keE=ZUX1hV{inB^dXRgCI_xFvEE@m7?`Elen(YC2WeJeGX_V zTv@z&e8ZHd-e_qD0m`*X`{VLb+!c5VsYDe)S>gCZ?qi}t)S)V=7a_cN>CXP;()b8k zd{V_YLFNwQGN_v7AFm`d4ykg{$9<;&gRPg^rz8gl4y5Y4b&Tmm3?gZ8TVms8PGPba zZ6g;A09Z2*<%}EhI;QXDJ59I{O<(<2i1m}d??mZUr%(W(VzzeY&}=rYhtnC|rl=?g z#zwXpN534t%wWM=Bft!(4cqklz-VIj>dG)d_y>J3usPi3aM5mGVB$+&#GVF#wqv0| z^`n>3DcOrCBCs)ef1qu?Nue@e(nAKCB$pzW%xDUeT;mq6x15{;cPp!_3zLWe7WhYK zXpYg&Ui_9*QpI&btJqfx$1TCnl`*$EeL@3&{10Q&_)`zd8n&&Q^Q{ge7O&I(uI8M% zS6I1i@>7>V&T5tp*npRzvquHq>PLyjbdQVw6o7*F!NZS_l*2ZCp4(mmAgkNZT<%KJ zQp4>I^&x_=_D#34$^1P-C5EZwDDP*a9ij>y@}7TJ`LuQ1inf;XvnQ}{;(C+3x$mTk zeg3yUK@2PU>s;J&%~Rs(I2TD8+{8jdoY!%eq72VVvjDv9PFXATy?(ep%4R|{a{-L4 zvP*s@9(sS@H~=`0F&_IKu9~OA8pbkJ(He-K!&|AA-bZquySV_$Y`~&(du(-X&zyzrXSS>EO|$bEgXCWAmCQF;YL?Q`S_ww?p_Ti8L3;wV9!-09xDqMc1B-RkBP=~ z>R=L!zBfts*0pPEAgs@ww|sE%3IYWE!xN?#=I-P3xsz*fe$9Wx&3ZcjXX2`gU194F z1VCvilpabc#Pu8+^wOR305V5fjg)+@$b-88h~i8d2tCw99)G`MO&d(RyCoUv{!iHf zVC@_jdhOZOwvKOj%gJ4=f!LfVy_mSUU-U9r4gdx*IjxGtTK#Pbw<%*`L~%F&diBpZ zzGT+6Ed;Q6TEwjWrkq$zV`0Y8rZg^ez!kkz%3VPNFrmg1Zc9yJ*B@5>o&{z2<=U@p z6C9r^;1bT*hfc{#Z|ac zvXUHc`=jx5*zDsRWHx5C+0A)789QIO0JvYOmfQa^?U##t*fuSLA);8&^qkuvgq(V| zK&G5t*hi3`3KXw;INSw{K<*Wk{`_it;_Jryzc3hnup!Lc_y8Wg#hE_wd*Atlz@Bb zDx=(PV~a~O4<7u~+jiwU`P(-32mn>Y!59c32waB zAJ$@usB*Y-$LQK#S;h)04cs385xLibT_fw&>4CK$-Uxx^`6b2&&^urTpOouq3$@mK zc!Ri?nQt#m2S5&`dFxalm-t^hz0uz;mP!S{pr4fGSn`1y-F9{$4Vd1Zrtic+%jSGK z-LUNJ>(&+xAh63r`8&l`9Q@QtGm5|I(D$k^nY<&pLq}UpCY&G|7eX=vtr6M&@Acm_ zeVykNz;GR3?;`*(-X4Kd*^NI@O=aV58XE(dvXAD?pgnzK}#fY=Z 
zdYVguZ}0hpyqI-0TJkzx5W3n}A`1^MTum`{;Da1??F`O4?r>5-z^DkGj+hyx7uQ~7 zE&4zId?4r1_YcPCnpeeN7B+@s+hr*}Gjz1qkoUTaW<}w0Za;OI@;lmi3yNotLp$fx zxQvZH`X~iPr75vd4UOf83=mJ)oUG2J$X_Pj{Rkl-VBg~`aT01VLBF)f^IPNgQDFJC zt}aQ^-pFWNPNMQ?k^+bGfi+hEu%RA4JOj3p$%`FbT#Tg9=LLS*od*C_)vuY`i_bs3 zMBvRWOp>`y(xuIl`K>38C7b(afPH7M<17LI!c0dBH<}ByEFdf4L5#x+Qy#;KaSB0u z`P3ZTe?kbjMUF#vjY+-`%R5kCQ)DT6e;uCPDQ zgHOTwD&5<65Y>HiWC?@ElNQy2X{wJdqmu_7O%q}=y;Zta;us}xC#u1f%-Nt zJ!x>zIqAvhCF+r9P#R*%RS_AXXcsgDOHRIC&rOz*JFxj$sj630KOUX}z}4dQ8a12+ zGhrd427jX@STwXh5>COb!kxI!J%S~kmm98GXCpz5&`>_F5;>&w`sVQrN@mExt)633TMk`sG zvF6^TiScF>)Jxa3E&gf0XKJQFqf6NPmgkdNwBX*%NE2JIc;tmKp$AR(c|)J$ZNyxy5;ZDVF5v#6Ekgtih-!+uL$F6~r}+0f}lP^~!KCfMuZ{%Q zm87zMdLkurq2c7gEiH+othZrag7{Vhe;B|N-0JbDS@>#I|HHB?VQ4~<%YXO4S+^1` zmN(SMHk5)&?Ti)fPYk!~uMO{6_!MrF>Tkj4ZEde5;c#KDmD~~rybMZ!`?Qsq7q;i! zX%o+f(FqKFo|RHsyQM=+sjNed`&n-r`u?!L7mvCWAJ=yzbtU_`gR<BDEsg@)eqx%I@KKX}IGcYc;b^>60hF?7|o zE}LG9XoZ`)@T`%2cv#-!`w2;QGC(7+%tBf);V`Z;;QK8f{TCg2BspDgLxZu+GZ73?@!G>#pwfz49rE-XKTaMgmqPqc3wjYA!Y3zk1aF zND3ktg&|+iM*XV%_Rd)By z`RUvC1ODu*#0ZO=hlGYyx<$OYmwnCTRBXU-(kMRjisZ1I7TspdW}F;r|5kkx8}1J` zK-sGBn7C`-9$`-n`)R#vLi{q|MMz1dv9CJDRPH%hQ06F|uFf&Pb7_~Wy~4k!gNya& zD$-pN@PYudGd%I^Evw8xn4-KIc`m+MBs?HM-`gHlk-CcHrx7gub}#lU9R) z&DdRas8}kX*u!EX;JuhcOR(@H4$G_c&sB(w5gIjY;(HIKT#fZ4E_FTn_zRgSGaoxJ z_GI~8!J-z*rOyAX_NAq+x%{HUtcp{$$%a>>WS!+Zs^ZqCJ&pCa&{9l0#$jJ>*4%Gz zaM1_)aF!M$Ch^5Bk#Td^8;#i7{W<RKS$)V|UZs-NwlP5VB)y}0KR#`B~FNUs^Iil;sEPVr~ zWExd^!iWo<9@k2q#l7Tq74sxclWtfxrrqlg;Z~cO4P;T9 zIr`z&1Yc@auVf^3Fa74x3aT_WWWX{p9hYyS0s$?yCISaEm$5kBdGD9+6UIM|81YCV zfqdFU%gyH_&Mk7qW-k2@^wozH{%~7Q$(D*{n@%l*%a7aZLc71eVkVOZP1OUn+R!Mh zM$AFllY^*WZ#5oaQzS@&9t>_ao0-4~_7;JGJJ_yUvZsT_c)-lykndo@bB7BDZ@Xvut|Wdyl+9(BB5xVdo(8UE=eG4V%`<>2X} z2ImKQ@3ZqyKmN`oZ0hT#qf{u+9eaHl1)RU5Um6Z}1|Ir7vR7p``I`6$eT3+uh+T-ADY`E>OXh5RzGm^QO>lQ2ImvF5nh7sJNLExH_zqI>ex;PZTy%0ENB?bvJ zUC;)m4Ohfuo}sFuuSw}0A2f`08vJyM9#H9wiCoYnL)e@F!Z(nBti1SQT3PD}MooJu zGH^!`aBH?!tgUcS$zQqw?vX@T$T7h+vw{HT?M}`~RA`Tr%MaEb%z|H^@ 
zzG(tgkR)7{>KZjUS&41c0v4&rez}M0ls*FKzc#lLvPh451m_)UjI&hbrebrW|JN=M zi)K~v%7C@|=H~?_=`Z+J%qO_EK{+`s&A!=(mTzMl$}@Ac@W=pxvjCMdO|7@Nk_yv% za+~v&`)-2-DjSgHut9++b0dDOK_~Sp%q5-`h9-eKBS~!)_#*<*CjSAe1 z(oQKOcDi--lx!%Mp(iwu;Ky}!{3H#Z}=y6KQu;V@oVFQ;0Lj9;P=@36NR6#)WEQC-Cso@L&_mW`4N!Y-TgALrU z!5=n|ks2VyAj|=TRTROyPzX8!ZZZ_KFK~;V9(VpMOd5^_0Yw@(tC&2JTa}b`oH+}$ zuEW$Jj2;n0VnK#MaxcLSDk^%MFb}-0#wdc1A&?c09zn$e2^Qd_U*SSD&%urW*l`Um z_%xhjkOhu_(ljB?0z4FAhKe4v0Nh3n3Fq$Npq|SotBm%M^Fvtp5L4O_jHWqdSq0WMG zdAN#CC`1mN17nei{((hW2Z39R2*6!xcxR0U*qR~hNTCgL(0LLCM~qq(=Z9)aqG|#1 zo;G9>>H0vqpFow?qzd?9a)vUncHiiBZ-fMjfFCXtWiZ>-p2Nqox&dfEeFL|UB+jKO z>5R$})to#!lI*y(vj8ZLQQo!*KFCJUwy1&@5+7TUt0^$18745NHLHSf5+_gJ+N$U7 zb%AiI7ASU5-RpCcK328AT}*IsGa-JYz!jq~*XcoAA6-rumq)UPHX|fY!Nz7u{9ohg zWtm)BqHXyG*rAqtHLr=JtoWGJ9y&f!D`>nhr+N2ylOJ3W&ADY37z)1hX(+g(&oSQl z*`KzFk-Q`oCNh)o>TiV_Kld0-hN_bPydXWUN=jcMmhG8=mT=p$Wju|HR z<{L3@I@6#+j%n`T54)5c*9U_;yz=pwEoq`j1PFx3|NoWD6G8$&{ryieBPF-DblNKo z8>@4?DqnBTt&~=re#$5~Ot_nq{;=QX_V9oThrq?e#J=bCwj>B$;D*Rh{DMw=S>vc+ zsxh07SN2nY@h{A6(Na1?`l+M&Fn@G3P8hi1B7#>YKl@Ap-ZbUK$d;Wc(xsy*-Dp*N z_(i{bA?>>HL^v2)*89wwhA5r_a*D43dOKaJXB(&_nh=_k7x9W^@wf$Occaa4j?i7Wcp+3#aF z3}TP`9&y8m4~Yxy^OK4odm6En9tX=I`!~XdG(y7o(b6w{@ zkF(s9hg*O9OEf9x!!60>m%4q6q)}&wW)HAqNb)>xIKkmlmcj@}X;L z3koUBo`<-iPoL#Y6itXGJkeHG?f7l=DS#=os@4P^?sz*kvmnPyjRM{W}NjcUok;!rkzq7(9 z3a7Q7KsdtqDNS#^@X&{RwN2BEYv0n8FP!4-2NI$Pi7&@KUGW_p71_;$5z$X9HMb@N zuLJt>N9UiWwB-My`WNorK-|&M!0y5%7BXF>s<&|n4yamUI5@sc>%dFPZP zY)=2d60T{~9}nHq13Q5pv?xi4ictIa&5;~W-)J}QQ9F??q1tr2IE!8?gp8gGz*T00j+O=i?jg22~^4fBjs{{k{J;%FI1GT{b zx%6+;-43FJ8$QD_-;ya+h|W)TA=7kGuH=VV>vGL~mGGV;9`PtVCE$Nx;G{IWW$|P+ z_T9>)_8?E~modXXV%pyl$Ax`hfFWg|0jC;s2~%m2pPl|-qN4CGoaPo|}dNKLUsVVSjY1jRkF|LX06%i-bBo|c!wp0|+S zvoz#yEDz0c{#`Z8NIHV~94&(TWS$-orkn7ppsiajK>Sy}A))sz1Z`Z;18R~>JNu2O zyk4^#?7w)$8JX(&?kyW<7mS#jBvDW|UgAunyUK0qY$psGGv~0>KpE<0P_5~$u}(Cm*R z(%?1JAW_MCgy&vV1>;U7`A2wKqNpxWQw6a78+D?2q~@?`&496}?$?X-!YXRJmE?cD zDD-ygk*@bIAGu-P{BN-1xA|YN6Fvs<8M~gr7W>mU!tXfoYr9kfCo`3l#>qGXg&Y%6 
ziFEt$9vGGWfOi-dI*u?qx@ew9xncoRqV0;0065`5ss(^f4J$#9!j4?}vOtW!TY@xxxVJu`eXq{-tT_L@lIAkA3bZb8tKoz$axk)78r=>9Kk%vJb`Mh^{1U`y?v8DF zQ~39$nqY2G;W4}oW1+vqPU4=!BLn~LK|4i{oGl@PNBh&S!{7SfPi+ZVh230mN7KLv z^ksvC{`M&umD2tn|GIcztK#*Hvas;^<#xJCdwusL2X}bL{z_}?73?jnEGT$N#+~(| zqmBwZ64m+cE{A{WJDqswEd+*Tf4i|)t>0>Z$sVAU>00%9$aj*m08nQNMW_i%CsW;) zsi^%<>e~;QKBy909+h4X*nA3Xb|=R`rZ@YNlg2M|o^Yc`|@YMp$(>}EgKe#V^X()ryhb$YlXigW8ZXwp@RlaY_ zo07wC1|CwMJ9f4^Q9XeFy>5q@sTWjfc%u3|)?M7vvz`bJ0y*pWMTO`{xdbsnfUL|p9%5`@$jojCx zO@FgOy24jMk7>yIlHzrE*XXEWXziNq`oLOSxtw<}MO@WO5iz6MUdejF04<_DWL_&s z{IC_(P+OQm#k4=g+cU{$>#D8<4e&$h*2`x7treO6uJAfCZUL=AANiAU zIC8)}vBe)l{ORE+i3eZ3Z}hy6I`}wf1xFzE_VFFrPAok%t&AW_XN`d?`bk^ zE47V-{}gy8l)W6=M_@>VZ&=9oyL)Mq9nzHN9;XHuf}5oTFAaX_upvX5-vz#aYKj~+SPv~x^1HOE zP~dn+Rd-oc4x;PfGwPct4xcZ+M;5Cx^T6Zq>N8)7(LHj2%PsbeHk0+bSGp~J$9tzj zY`pPe`bJg_2TY_*tqyv>;3aNZo`^O@d@Wj-zswu9_mLdt+^Wa&1nSN6*y109(1N)s zN6Vs=1vi-e(i_+>jVr~v6VzQ5?m+dkno2FQ=@BapAHwHu_V9y=KOqLw1)pQ>PC{)J zj%}-s-OXQy+LbHrH&@7A&*&k_eSNJ#CZjn%cKAH%AU|U21;I*}}9El#T)W2C;J^_Jn#j0?~$i8|b&W;^KiPqT-qldq03}S1cyN()?qc)rJ zb}y7L+5oRIzFBO$UadJ7JU165z#S`XPWVJ|ePurrM~;bFkL#J5Yu^vfKwp=G^*!_w7BZ{*-xx)F7&gf% zI6+jIwdDxU@WAt^b%KWsw1J$Nj=KAQO5^2u5hvBOQ;GC|E)y{s#mI! 
zY%E>;@V8P06LswJ>LGr1Z#Fmvn#Gn7bN)+8Sf13FLdQCn4QR6o;fErj<=d)uE^9mR zuHn<$@*3xz^%bt+J-=imj<>vs!{d33Zz`-$vVvMnn_OSoRfy{%=DwsXT0Dazr@<%f zo|qhnOiPM;ytHuX{a8z?NpvzwF_B-)w&z@J^1r32DCTvl7cs8hLGdZLeV!V7x?9bIZ zghqbb5>j!|n+rl{tEU?+X}TsF)hWO3eUn5wzxpPZx;vhxGanOm5Gu(FPw%Yp_qtw& z%$|)6WSD(7SCe{jCslqa5>5p61#cemZvZTn*ePlv;#G!%+%4-%4%<`>H*N zk|~h*KjrelLcximaRJ7S{{y$<(38cU;k(7B{3pio|loC6xKvsA8Ed&tAK3ssdnw@617})LeH}biPgdiyk3;59#MN}Th{ol z*oM?d^Ps8%MUzjmz1x=FtzYC(_>(Xp!?m+SmRH8+xBIf!KW@>3M1~5-ntV7G-OBYl zG8=_k>Us>j9---AenI5(p3?gwL1#8vJ3tV=ZaW7&J~$*M7>abovE}FUxr8jj`0u`j zkJYc~s}oo1l%PYeN#VCyqr#E@9G0nUV_yA-r{1Lh$-hbr%lhLOQg%nsajp)|=v(N| zm%!FywWX1G>c*w3Ef*^FX4UK-qZt6VhY~9x)5)vlsi&~Pe%9am%IFg%G&pis{OG9O znjjOjA46C%DK#R30Cs{-p$*pFdqYpItaWiXB}y$}xkspQ&nBIs3wvnl42Lg?>^+#V zSldPdtxd0Xyu{UM_9bTQdD>ROIe(6EdQy2!e1@rIf#>s#0lwrjv`hf=RfiWXILgK9 zNvEHGNVl%v34=Tx7(O$&j%Q_&iZh&+f?LMo?5Ex7FT{$DuUSAJG|ubrdh<=pZp>_A z>oK9-uo}A$rX$a897kvhEs9IbsqeQEJu1T&p`QvwTi4aXtOB`qG^Jk0WwUI<}dv8;zRK PKLC{*w-xd7_x%3@6wJ34 literal 0 HcmV?d00001 diff --git a/plugins/kreuzberg-cloud/skills/kreuzberg-cloud/SKILL.md b/plugins/kreuzberg-cloud/skills/kreuzberg-cloud/SKILL.md new file mode 100644 index 0000000..b1975d5 --- /dev/null +++ b/plugins/kreuzberg-cloud/skills/kreuzberg-cloud/SKILL.md @@ -0,0 +1,345 @@ +--- +name: kreuzberg-cloud +description: >- + Offload document extraction to api.kreuzberg.dev. Use when the user wants + managed extraction with webhook delivery, presigned uploads for large + files, sandbox keys, or per-project usage tracking — instead of running + the local kreuzberg CLI. Covers authentication, the 12 REST endpoints, + request/response shapes, error model, and SDK options. +license: MIT +metadata: + author: kreuzberg-dev + version: "0.1.0" + repository: https://github.com/kreuzberg-dev/kreuzberg-cloud +--- + +# Kreuzberg Cloud + +Kreuzberg Cloud is the managed extraction API hosted at +`https://api.kreuzberg.dev`. 
It exposes the same Rust extraction engine as +the local `kreuzberg` CLI, with two extras: jobs are asynchronous (webhook +or polling delivery) and large files go through presigned uploads instead +of in-band base64. + +Use this skill when writing code that: + +- Hits `api.kreuzberg.dev` directly via HTTP. +- Uses the `@kreuzberg/cloud` (npm) or `kreuzberg-cloud-sdk` (PyPI) SDKs. +- Configures webhooks, sandbox keys, or usage queries. + +## v0.1.0 limitation + +The `kreuzberg-cloud` plugin v0.1.0 ships **skills only — no MCP server**. +The `kreuzberg-cloud` CLI binary that hosts the MCP server lands in plugin +v0.2.0. Until then, prefer one of: + +1. The TypeScript SDK (`@kreuzberg/cloud`) — ESM, tree-shakable, generated + from the OpenAPI 3.1 spec. +2. The Python SDK (`kreuzberg-cloud-sdk`) — sync + async, `from_sandbox()` + helper for evaluation. +3. Raw `curl` — every example below shows the curl form first. + +## When cloud vs local + +| Situation | Use | +|---|---| +| You already have the `kreuzberg` CLI installed and the file is on disk | Local (`kreuzberg` plugin) | +| File is on a remote URL or in S3 / GCS | Cloud | +| Need OCR for languages the local Tesseract install doesn't have | Cloud | +| File is larger than ~50 MB | Cloud (presigned uploads) | +| Want webhook delivery rather than blocking the caller | Cloud | +| Batch of mixed documents with shared options | Either; cloud parallelizes server-side | +| No network access, air-gapped environment | Local | +| Evaluating before committing to install | Cloud sandbox key | + +## Getting an API key + +Three options, in order of preference for production: + +1. **Production key** — sign up at , mint a key + from the dashboard. Format: `sk_live_*`. +2. **Sandbox key** — no signup, 24-hour TTL, 50-page quota, rate-limited to + 10 keys per IP per 24 hours. Format: `sk_sandbox_*`. See the + `sandbox-keys` skill. +3. 
**Local `~/.kreuzberg/cloud.toml`** — for shell sessions, put the key in: + + ```toml + # ~/.kreuzberg/cloud.toml + api_key = "sk_live_..." + ``` + +The plugin's `SessionStart` hook checks `KREUZBERG_API_KEY` env var first, +then `~/.kreuzberg/cloud.toml`, and emits a setup reminder if neither is +present. + +## Authentication + +Every request — except `POST /v1/sandbox/key` — uses a Bearer token: + +```bash +curl https://api.kreuzberg.dev/v1/usage \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" +``` + +Both `sk_live_*` and `sk_sandbox_*` go in the same header. The server +resolves project context from the key. + +## Base URL and versioning + +- Base: `https://api.kreuzberg.dev` +- Path prefix: `/v1/` for all extraction, jobs, sandbox, uploads, usage + endpoints. Health endpoints (`/healthz`, `/readyz`) are unversioned. +- The OpenAPI 3.1 spec is published at + ; full reference at + . + +## The 12 endpoints + +Twelve operations across seven tag groups: + +### health (2) + +| Method | Path | Purpose | +|---|---|---| +| GET | `/healthz` | Liveness — returns 200 if the process is up. | +| GET | `/readyz` | Readiness — returns 200 only when downstream deps are healthy. | + +Neither requires auth. Use `/readyz` for uptime monitors and `/healthz` +for load-balancer health checks. + +### extract (1) + +| Method | Path | Purpose | +|---|---|---| +| POST | `/v1/extract` | Submit one or more documents (or URLs) for extraction. | + +Accepts `application/json` (base64 documents) or `multipart/form-data` +(binary file parts). Returns `202 Accepted` with `job_ids` (extraction +jobs) and `crawl_job_ids` (URL-crawl jobs). Pair with `GET /v1/jobs/{id}` +to retrieve results — or supply a `webhook` block to receive them +asynchronously. See the `offloading-extraction` skill. + +### jobs (1) + +| Method | Path | Purpose | +|---|---|---| +| GET | `/v1/jobs/{id}` | Get the current status and (if terminal) result of a job. | + +Accepts both extraction job IDs and crawl job IDs. 
Response shape varies: +extraction jobs return `JobResponse`, crawl jobs return `CrawlJobResponse`. +See the `tracking-cloud-jobs` skill. + +### documents (4) + +| Method | Path | Purpose | +|---|---|---| +| GET | `/v1/documents/{document_id}` | Latest version of a document with its extraction result. | +| POST | `/v1/documents/{document_id}/diff` | Compute a diff between two versions (sync, with async fallback). | +| GET | `/v1/documents/{document_id}/diff/{diff_job_id}` | Poll the status of an async diff job. | +| GET | `/v1/documents/{document_id}/versions` | List all versions of a document (paginated). | + +For applications that re-process the same document over time. Each +extraction returns a `document_id` that's stable across versions. + +### uploads (2) + +| Method | Path | Purpose | +|---|---|---| +| POST | `/v1/uploads/presign` | Generate per-file presigned PUT URLs. | +| POST | `/v1/uploads/confirm` | Confirm the uploads and start processing. | + +Three-step flow for files larger than ~50 MB: presign → PUT to storage → +confirm. See the `presigned-uploads` skill. + +### sandbox (1) + +| Method | Path | Purpose | +|---|---|---| +| POST | `/v1/sandbox/key` | Mint an ephemeral sandbox API key. | + +24-hour TTL, 50-page quota, 10 keys per IP per 24 hours. No auth required. +See the `sandbox-keys` skill. + +### usage (1) + +| Method | Path | Purpose | +|---|---|---| +| GET | `/v1/usage` | Per-project usage statistics and remaining quota. | + +Accepts optional `start` and `end` ISO-8601 query params. Defaults to the +current calendar month. See the `managing-cloud-usage` skill. 
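The auth rule for the endpoints above (a Bearer token everywhere except the two health checks and `POST /v1/sandbox/key`) can be sketched with the Python standard library. The helper name `build_request` and the `NO_AUTH_PATHS` set are hypothetical, and the function only builds a request object without sending it.

```python
import urllib.request

BASE_URL = "https://api.kreuzberg.dev"
# Paths the skill describes as not requiring an Authorization header.
NO_AUTH_PATHS = {"/healthz", "/readyz", "/v1/sandbox/key"}


def build_request(method, path, api_key=None, body=None):
    """Hypothetical helper: attach the Bearer header only where it is needed."""
    headers = {}
    if path not in NO_AUTH_PATHS:
        if not api_key:
            raise ValueError(f"{method} {path} needs an API key")
        headers["Authorization"] = f"Bearer {api_key}"
    if body is not None:
        # `body` is expected to be JSON already encoded as bytes.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method=method
    )
```

A caller would pass the result to `urllib.request.urlopen`; keeping construction separate from sending makes the header logic checkable offline.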
+ +## Key request / response shapes + +### `ExtractJsonRequest` + +```json +{ + "documents": [ + { + "filename": "invoice.pdf", + "mime_type": "application/pdf", + "data": "" + } + ], + "urls": [ + { "url": "https://example.com/docs" } + ], + "options": { + "extraction_config": { + "output_format": "markdown", + "ocr": { "backend": "tesseract", "language": "eng" } + } + }, + "crawl_config": { + "max_depth": 2, + "max_pages": 50, + "stay_on_domain": true + }, + "webhook": { + "url": "https://example.com/webhook", + "secret": "shared-hmac-secret", + "metadata": { "request_id": "abc123" } + } +} +``` + +Either `documents` or `urls` is required (or both). `webhook`, `options`, +and `crawl_config` are optional. + +### `ExtractResponse` (202) + +```json +{ + "job_ids": ["550e8400-e29b-41d4-a716-446655440000"], + "crawl_job_ids": ["660e9400-f39c-51e5-b827-557766551111"], + "status": "pending" +} +``` + +### `JobResponse` (200) + +```json +{ + "id": "550e8400-e29b-41d4-a716-446655440000", + "filename": "invoice.pdf", + "status": "completed", + "created_at": "2025-12-21T10:00:00Z", + "processing_time_ms": 1234, + "result": { + "content": "Invoice total: $1,234.56", + "mime_type": "text/markdown", + "tables": [], + "images": [], + "metadata": { "title": "Invoice #12345" } + } +} +``` + +### `JobStatus` enum + +```text +awaiting_upload | pending | processing | chunking | aggregating + | completed | partial_success | failed | cancelled +``` + +Terminal states: `completed`, `partial_success`, `failed`, `cancelled`. +Stop polling when any of those appears. + +### `UsageResponse` (200) + +```json +{ + "period_start": "2026-05-01", + "period_end": "2026-06-01", + "total_pages": 5432, + "total_documents": 87, + "total_failed": 2, + "quota_limit": 100000, + "quota_remaining": 94568, + "by_mime_type": { + "application/pdf": { "documents": 65, "pages": 3200, "failed": 1 } + } +} +``` + +## Error model + +All errors are JSON with at least an `error` string field. 
Status codes +follow REST conventions: + +| Status | Meaning | Typical cause | +|---|---|---| +| `400` | Bad request | Missing required field, malformed body, invalid UUID. | +| `401` | Unauthorized | Missing or invalid `Authorization` header. | +| `404` | Not found | Job / document ID doesn't exist in this project. | +| `429` | Rate limited | Sandbox-key IP throttle or per-key quota. | +| `500` | Server error | Database failure, worker crash — retry with backoff. | +| `503` | Service unavailable | Downstream dep unhealthy — retry. | + +The SDKs surface these as typed exceptions: `AuthError`, `ValidationError`, +`NotFoundError`, `RateLimitError` (carries `retry_after`), `ServerError`, +`TimeoutError`, all extending `KreuzbergCloudError` (Python) / +`KreuzbergError` (TypeScript). + +## Concrete examples + +### Sandbox onboarding (no signup) + +```bash +# Mint an ephemeral key. +curl -X POST https://api.kreuzberg.dev/v1/sandbox/key +# → { "api_key": "sk_sandbox_...", "expires_at": "...", "pages_remaining": 50 } +``` + +```ts +import { KreuzbergCloud } from "@kreuzberg/cloud"; +const client = await KreuzbergCloud.fromSandbox(); +const result = await client.extractAndWait({ + file: new Blob(["Hello world"], { type: "text/plain" }), +}); +console.log(result.result?.content); +``` + +```python +import asyncio +from kreuzberg_cloud import AsyncKreuzbergCloud + +async def main(): + # `async with` is only valid inside a coroutine, so the calls live in main(). + async with await AsyncKreuzbergCloud.from_sandbox() as client: + job = await client.extract_and_wait(file=b"hello world") + print(job.status, job.result and job.result.content) + +asyncio.run(main()) +``` + +### Single-file extract → poll + +```bash +# 1. Submit. +JOB_ID=$(curl -sX POST https://api.kreuzberg.dev/v1/extract \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "documents": [ + {"filename": "invoice.pdf", "mime_type": "application/pdf", + "data": "'"$(base64 -w0 invoice.pdf)"'"} + ], + "options": {"extraction_config": {"output_format": "markdown"}} + }' | jq -r '.job_ids[0]') + +# 2. 
Poll until terminal. +curl -s https://api.kreuzberg.dev/v1/jobs/$JOB_ID \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" +``` + +## Other skills + +- `offloading-extraction` — full `POST /v1/extract` workflow with options. +- `tracking-cloud-jobs` — polling cadence, webhook signatures. +- `presigned-uploads` — three-step flow for files >50 MB. +- `managing-cloud-usage` — quota and per-MIME breakdown. +- `sandbox-keys` — when to recommend sandbox over production keys. + +## References + +- API docs: +- OpenAPI spec: +- TypeScript SDK: +- Python SDK: +- Pricing and signup: diff --git a/plugins/kreuzberg-cloud/skills/managing-cloud-usage/SKILL.md b/plugins/kreuzberg-cloud/skills/managing-cloud-usage/SKILL.md new file mode 100644 index 0000000..804b26f --- /dev/null +++ b/plugins/kreuzberg-cloud/skills/managing-cloud-usage/SKILL.md @@ -0,0 +1,101 @@ +--- +name: managing-cloud-usage +description: Use when the user asks about quota, billing visibility, or processed-page counts. Covers GET /v1/usage — query params, response shape, when to report usage proactively to the user. +--- + +# Managing cloud usage + +`GET /v1/usage` is the only endpoint for quota and billing visibility. +It returns aggregate counters for the queried period plus the remaining +quota for the project. + +## Endpoint + +```text +GET https://api.kreuzberg.dev/v1/usage +Authorization: Bearer $KREUZBERG_API_KEY +``` + +### Query parameters + +| Param | Format | Default | +|---|---|---| +| `start` | ISO-8601 date (e.g. `2026-03-01`) | First day of current month. | +| `end` | ISO-8601 date (e.g. `2026-04-01`) | First day of next month. | + +Both are optional. Omit both for the current calendar month. 
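To make the defaults concrete, the sketch below computes the period the table describes and builds the query string. It is an illustration only: `default_period` and `usage_url` are hypothetical names, and the server applies the defaults itself when both parameters are omitted.

```python
from datetime import date
from urllib.parse import urlencode

USAGE_URL = "https://api.kreuzberg.dev/v1/usage"


def default_period(today):
    """Return (start, end) for the calendar month containing `today`."""
    start = today.replace(day=1)
    # Roll over to January of the next year when the month is December.
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def usage_url(start=None, end=None):
    """Build the request URL, sending only the bounds the caller supplied."""
    if start is not None and end is not None and end <= start:
        # The API documents a 400 for this case, so fail before the round trip.
        raise ValueError("end must be later than start")
    params = {}
    if start is not None:
        params["start"] = start.isoformat()
    if end is not None:
        params["end"] = end.isoformat()
    return USAGE_URL + ("?" + urlencode(params) if params else "")
```

Calling `usage_url()` with no arguments yields the bare endpoint, which is the simplest way to ask for the current month.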
+ +## Response (200) + +```json +{ + "period_start": "2026-05-01", + "period_end": "2026-06-01", + "total_pages": 5432, + "total_documents": 87, + "total_failed": 2, + "quota_limit": 100000, + "quota_remaining": 94568, + "by_mime_type": { + "application/pdf": { "documents": 65, "pages": 3200, "failed": 1 }, + "image/png": { "documents": 15, "pages": 1800, "failed": 0 }, + "text/plain": { "documents": 7, "pages": 432, "failed": 1 } + } +} +``` + +### Reading the response + +- `total_pages` — pages billed in the period. The unit of cost. +- `total_documents` — files submitted, regardless of page count. +- `total_failed` — extractions that ended in `failed` status. Failed + jobs do not consume quota. +- `quota_limit` / `quota_remaining` — total and remaining pages on the + current plan. +- `by_mime_type` — per-MIME breakdown. Useful for identifying which + document types drive cost. + +## Examples + +### Current-month usage + +```bash +curl -s https://api.kreuzberg.dev/v1/usage \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" | jq . +``` + +### Specific date range + +```bash +curl -s "https://api.kreuzberg.dev/v1/usage?start=2026-01-01&end=2026-02-01" \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" | jq . +``` + +### Quota remaining as a percentage + +```bash +curl -s https://api.kreuzberg.dev/v1/usage \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + | jq '.quota_remaining * 100 / .quota_limit' +``` + +## When to report usage to the user + +Pull usage proactively when: + +- A batch job submits more than ~100 documents — report `quota_remaining` + after submit so the user can see the impact. +- The user asks "how much have I used?" or any quota-shaped question. +- A `429` response includes a quota-exhausted error — surface the usage + shape so the user can decide whether to upgrade. +- After a long-running crawl finishes, since page count is hard to + estimate up front. + +Don't report usage on every routine extraction — it's noise. 
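
When usage is worth reporting, the response can be reduced to the two figures a user usually cares about: how much quota is left and which document type drives cost. The sketch below assumes the response shape shown in this skill; the function names are hypothetical and the sample dict is a shortened copy of the example response.

```python
from typing import Any, Dict, Optional


def quota_percent_remaining(usage: Dict[str, Any]) -> float:
    """Percentage of the plan's page quota that is still available."""
    limit = usage["quota_limit"]
    if limit <= 0:
        return 0.0  # avoid dividing by zero on a plan without quota
    return usage["quota_remaining"] * 100 / limit


def top_mime_by_pages(usage: Dict[str, Any]) -> Optional[str]:
    """MIME type with the most billed pages, or None if there is no breakdown."""
    by_mime = usage.get("by_mime_type") or {}
    if not by_mime:
        return None
    return max(by_mime, key=lambda mime: by_mime[mime]["pages"])


# Shortened copy of the example response from this skill.
sample_usage = {
    "total_pages": 5432,
    "quota_limit": 100000,
    "quota_remaining": 94568,
    "by_mime_type": {
        "application/pdf": {"documents": 65, "pages": 3200, "failed": 1},
        "image/png": {"documents": 15, "pages": 1800, "failed": 0},
        "text/plain": {"documents": 7, "pages": 432, "failed": 1},
    },
}
```

Both helpers are pure functions over the parsed JSON, so they can be applied to the output of any of the `curl` examples in this skill.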
+ +## Errors + +| Status | Cause | +|---|---| +| `400` | `start` or `end` not ISO-8601, or `end <= start`. | +| `401` | Bad API key. | diff --git a/plugins/kreuzberg-cloud/skills/offloading-extraction/SKILL.md b/plugins/kreuzberg-cloud/skills/offloading-extraction/SKILL.md new file mode 100644 index 0000000..9fb3027 --- /dev/null +++ b/plugins/kreuzberg-cloud/skills/offloading-extraction/SKILL.md @@ -0,0 +1,200 @@ +--- +name: offloading-extraction +description: Use when the user wants to extract a document via the cloud rather than the local kreuzberg CLI. Covers POST /v1/extract — JSON vs multipart bodies, URL crawls, options block, webhook attachment, and the async response shape. +--- + +# Offloading extraction + +`POST /v1/extract` is the single submit endpoint. It returns `202 Accepted` +with `job_ids` (extraction) and `crawl_job_ids` (URL crawls) — never the +extraction result inline. Pair every submit with either a poll loop +(`tracking-cloud-jobs` skill) or a webhook. + +## When to reach for this + +- File is on a remote URL. +- File is on disk but the local `kreuzberg` CLI is not installed. +- You want server-side parallelism for a batch. +- The user wants webhook-delivered results to skip blocking. +- File is larger than ~50 MB → use `presigned-uploads` instead — the + base64 JSON body is too big. + +## Endpoint + +```text +POST https://api.kreuzberg.dev/v1/extract +Authorization: Bearer $KREUZBERG_API_KEY +Content-Type: application/json | multipart/form-data +``` + +Returns `202 Accepted` with `ExtractResponse`. + +## Three submission shapes + +### 1. Base64 JSON (small files, <5 MB recommended) + +```bash +curl -X POST https://api.kreuzberg.dev/v1/extract \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d @- < 50 MB. +- Batch with aggregate body size > 100 MB. +- Bandwidth-constrained environments where double-encoding (base64 + TLS + - worker) wastes throughput. 
+- File already lives in S3 / GCS and you can stream rather than buffer. + +## The three steps + +```text +1. POST /v1/uploads/presign → batch_id + per-file presigned PUT URLs +2. PUT → upload each file's bytes directly +3. POST /v1/uploads/confirm → start extraction, returns job_ids +``` + +Step 1 returns one `upload_url` per document. Step 3 cannot run until +every PUT in step 2 succeeds. + +## Step 1 — presign + +```bash +curl -X POST https://api.kreuzberg.dev/v1/uploads/presign \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "documents": [ + {"filename": "scan.pdf", "mime_type": "application/pdf"}, + {"filename": "report.docx", "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"} + ], + "config": {"output_format": "markdown"}, + "webhook": {"url": "https://hooks.example.com/x"} + }' +``` + +### Response + +```json +{ + "batch_id": "batch_550e8400-e29b-41d4-a716", + "uploads": [ + { + "job_id": "550e8400-...", + "upload_url": "https://storage.googleapis.com/kreuzberg-dev-uploads/...", + "object_key": "projects/abc123/uploads/550e8400-...", + "method": "PUT", + "expires_in_secs": 3600 + }, + { + "job_id": "660e9400-...", + "upload_url": "https://storage.googleapis.com/kreuzberg-dev-uploads/...", + "object_key": "projects/abc123/uploads/660e9400-...", + "method": "PUT", + "expires_in_secs": 3600 + } + ] +} +``` + +Keep the `batch_id` — you need it for step 3. URLs expire in 3600 seconds +(1 hour); upload before then. + +## Step 2 — PUT to each upload URL + +The presigned URL is signed by Google Cloud Storage; PUT directly to it, +**without** an `Authorization` header. Set `Content-Type` to match the +`mime_type` declared in step 1: + +```bash +curl -X PUT "" \ + -H "Content-Type: application/pdf" \ + --data-binary @scan.pdf +``` + +A successful upload returns `200 OK` with no body. Do this for every +entry in `uploads` before moving on. 
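
The same PUT can be prepared from Python with only the standard library. This is a sketch under stated assumptions rather than SDK code: `build_upload_request` is a hypothetical name, and the function only constructs the request so that its headers can be inspected before anything is sent.

```python
import urllib.request


def build_upload_request(upload_url: str, payload: bytes, mime_type: str) -> urllib.request.Request:
    """Prepare the step 2 PUT for one entry of `uploads` (hypothetical helper)."""
    # Only Content-Type is set: the presigned URL carries its own signature,
    # so no Authorization header is attached.
    return urllib.request.Request(
        upload_url,
        data=payload,
        method="PUT",
        headers={"Content-Type": mime_type},  # must match the presign mime_type
    )
```

Passing the returned object to `urllib.request.urlopen` performs the upload; an HTTP error status surfaces there as `urllib.error.HTTPError`.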
+ +## Step 3 — confirm + +```bash +curl -X POST https://api.kreuzberg.dev/v1/uploads/confirm \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"batch_id": "batch_550e8400-e29b-41d4-a716"}' +``` + +### Response (202) + +```json +{ + "job_ids": ["550e8400-...", "660e9400-..."], + "status": "processing" +} +``` + +These are the same `job_id` values returned in step 1's `uploads` array. +From here, the flow is identical to `offloading-extraction` — poll +`GET /v1/jobs/{id}` or wait for the webhook. + +## End-to-end curl example + +```bash +#!/usr/bin/env bash +set -euo pipefail +API="https://api.kreuzberg.dev" +KEY="$KREUZBERG_API_KEY" +FILE="scan.pdf" + +# 1. Presign +resp=$(curl -fsS -X POST "$API/v1/uploads/presign" \ + -H "Authorization: Bearer $KEY" \ + -H "Content-Type: application/json" \ + -d '{"documents":[{"filename":"'"$FILE"'","mime_type":"application/pdf"}]}') + +batch_id=$(echo "$resp" | jq -r .batch_id) +upload_url=$(echo "$resp" | jq -r '.uploads[0].upload_url') + +# 2. PUT +curl -fsS -X PUT "$upload_url" \ + -H "Content-Type: application/pdf" \ + --data-binary "@$FILE" + +# 3. Confirm +curl -fsS -X POST "$API/v1/uploads/confirm" \ + -H "Authorization: Bearer $KEY" \ + -H "Content-Type: application/json" \ + -d '{"batch_id":"'"$batch_id"'"}' | jq . +``` + +## Errors + +| Status | Where | Cause | +|---|---|---| +| `400` | presign | Empty `documents`, bad MIME, missing `filename`. | +| `403` | PUT | URL expired (>1h since presign) or `Content-Type` mismatch. | +| `400` | confirm | One or more uploads missing in storage. | +| `401` | presign/confirm | Bad Bearer token. | + +If `confirm` returns `400` complaining about a missing upload, retry the +PUT for that specific `object_key` — confirmation requires every file to +be present in storage first. + +## When not to use this + +For files under ~5 MB, the JSON `data` field is simpler and lower-latency +(one round trip instead of three). 
See the `offloading-extraction` skill. diff --git a/plugins/kreuzberg-cloud/skills/sandbox-keys/SKILL.md b/plugins/kreuzberg-cloud/skills/sandbox-keys/SKILL.md new file mode 100644 index 0000000..edb4510 --- /dev/null +++ b/plugins/kreuzberg-cloud/skills/sandbox-keys/SKILL.md @@ -0,0 +1,116 @@ +--- +name: sandbox-keys +description: Use when the user wants to try Kreuzberg Cloud without signing up, or needs an ephemeral key for evaluation, demos, or CI integration tests. Covers POST /v1/sandbox/key — the no-auth endpoint, quota, TTL, and cleanup expectations. +--- + +# Sandbox keys + +`POST /v1/sandbox/key` issues ephemeral, anonymous API keys. Use these +for evaluation, demos, and integration smoke tests — never for production +workloads. + +## What you get + +| Property | Value | +|---|---| +| Format | `sk_sandbox_*` | +| TTL | 24 hours from issue | +| Quota | 50 pages, hard cap | +| Auth required to mint | None | +| IP throttle | 10 keys per IP per 24 hours | + +The 50-page quota is per key, not per IP. The IP throttle prevents abuse +of the no-auth mint endpoint. + +## Endpoint + +```text +POST https://api.kreuzberg.dev/v1/sandbox/key +``` + +Send no `Authorization` header: this is the only endpoint in the API that +requires no authentication. + +## Response (200) + +```json +{ + "api_key": "sk_sandbox_ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234", + "expires_at": "2025-12-21T10:00:00Z", + "pages_remaining": 50 +} +``` + +After 24 hours or 50 pages — whichever comes first — the key returns +`401` on every endpoint. The key is not renewable; mint a fresh one. 
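
A caller that caches the mint response can check locally whether the key is still worth using before it runs into the `401`. The helper below is an illustrative sketch: `sandbox_key_usable` is a hypothetical name, and `pages_remaining` is only the value returned at mint time, so a real caller would have to decrement its own copy as pages are consumed.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def sandbox_key_usable(key_info: Dict[str, Any], now: Optional[datetime] = None) -> bool:
    """True while the cached key is unexpired and has pages left (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Swap the trailing "Z" for an explicit offset so fromisoformat also
    # accepts the timestamp on Python versions before 3.11.
    expires_at = datetime.fromisoformat(key_info["expires_at"].replace("Z", "+00:00"))
    return now < expires_at and key_info["pages_remaining"] > 0
```

When the function returns `False`, the only option is to mint a fresh key, since sandbox keys cannot be renewed.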
+ +## Examples + +### Mint and use + +```bash +KREUZBERG_API_KEY=$(curl -sX POST https://api.kreuzberg.dev/v1/sandbox/key \ + | jq -r .api_key) + +curl -sX POST https://api.kreuzberg.dev/v1/extract \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "documents":[{"filename":"hi.txt","mime_type":"text/plain","data":"aGVsbG8="}], + "options":{"extraction_config":{"output_format":"markdown"}} + }' +``` + +### TypeScript SDK + +```ts +import { KreuzbergCloud } from "@kreuzberg/cloud"; + +const client = await KreuzbergCloud.fromSandbox(); +const result = await client.extractAndWait({ + file: new Blob(["Hello world"], { type: "text/plain" }), +}); +console.log(result.result?.content); +``` + +`fromSandbox()` mints a key under the hood and configures the client. + +### Python SDK + +```python +import asyncio +from kreuzberg_cloud import AsyncKreuzbergCloud + +async def main() -> None: + async with await AsyncKreuzbergCloud.from_sandbox() as client: + job = await client.extract_and_wait(file=b"hello world") + print(job.status, job.result and job.result.content) + +asyncio.run(main()) +``` + +## When to recommend sandbox vs production keys + +| Use sandbox | Use production | +|---|---| +| First-time evaluation, no signup yet | Anything user-facing or business-critical | +| Local smoke tests, demos | CI on the main branch | +| One-off doc to test the API surface | Recurring batch pipelines | +| Onboarding flow that bootstraps an SDK | Any workload >50 pages | + +If the user already has a production key, do not silently switch to a +sandbox key — production keys carry the right quota, billing, and +project-scoped resources. + +## Cleanup + +Sandbox keys self-expire after 24 hours. No revocation endpoint exists — +nothing to clean up. Do not commit sandbox keys to version control even +though they're short-lived; treat them like any other credential. 
+ +## Errors + +| Status | Cause | +|---|---| +| `429` | IP has minted 10 keys in the last 24 hours. Wait or use a production key. | +| `500` | Server-side mint failure; retry with backoff. | diff --git a/plugins/kreuzberg-cloud/skills/tracking-cloud-jobs/SKILL.md b/plugins/kreuzberg-cloud/skills/tracking-cloud-jobs/SKILL.md new file mode 100644 index 0000000..7f3cb98 --- /dev/null +++ b/plugins/kreuzberg-cloud/skills/tracking-cloud-jobs/SKILL.md @@ -0,0 +1,185 @@ +--- +name: tracking-cloud-jobs +description: Use when an extraction job has been submitted and the result needs to be retrieved. Covers GET /v1/jobs/{id}, polling cadence with exponential backoff, terminal status detection, and webhook delivery (signature verification, retry semantics). +--- + +# Tracking cloud jobs + +Every `POST /v1/extract` returns a job ID. The actual result arrives one +of two ways: + +1. **Polling** — `GET /v1/jobs/{id}` until status is terminal. +2. **Webhook** — a callback you registered at submit time fires when the + job is done. + +Pick polling when latency tolerance is short and you control the caller. +Pick webhooks when you can't block, or when the job runs minutes long. + +## Endpoint + +```text +GET https://api.kreuzberg.dev/v1/jobs/{id} +Authorization: Bearer $KREUZBERG_API_KEY +``` + +Accepts both extraction job IDs (from `job_ids`) and crawl job IDs (from +`crawl_job_ids`). The response schema is `JobLookupResponse`, a union of +`JobResponse` (extraction) and `CrawlJobResponse` (crawl). + +## Response (200) + +```json +{ + "id": "550e8400-e29b-41d4-a716-446655440000", + "filename": "invoice.pdf", + "status": "completed", + "created_at": "2025-12-21T10:00:00Z", + "processing_time_ms": 1234, + "result": { + "content": "Invoice total: $1,234.56", + "mime_type": "text/markdown", + "tables": [], + "images": [], + "metadata": { "title": "Invoice #12345" } + } +} +``` + +`result` is `null` until the job reaches a terminal state. 
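
Because `result` may be `null`, code that reads the raw JSON should guard the lookup in the same way the SDK examples guard `job.result`. A minimal sketch, assuming the response shape shown above; `extracted_content` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def extracted_content(job: Dict[str, Any]) -> Optional[str]:
    """Return the extracted text, or None while `result` is still null."""
    result = job.get("result")
    if result is None:
        return None  # job has not produced a result (yet)
    return result.get("content")
```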
+ +## Status lifecycle + +```text +awaiting_upload → pending → processing → chunking → aggregating → completed + → partial_success + → failed + (any time) → cancelled +``` + +**Terminal statuses** — stop polling when status is one of: + +- `completed` — `result` is populated. +- `partial_success` — `result` is populated; check + `result.metadata.warnings` for the partial cause. +- `failed` — `result` is `null`; an error was logged server-side. +- `cancelled` — `result` is `null`; the job was cancelled before + completion. + +## Polling cadence + +Use exponential backoff capped at 30 seconds. Most extractions finish in +under 5 seconds; large PDFs with OCR may take minutes. + +```bash +#!/usr/bin/env bash +set -euo pipefail +JOB_ID="$1" +delay=1 +while true; do + body=$(curl -fsS \ + -H "Authorization: Bearer $KREUZBERG_API_KEY" \ + "https://api.kreuzberg.dev/v1/jobs/$JOB_ID") + status=$(echo "$body" | jq -r .status) + case "$status" in + completed|partial_success|failed|cancelled) + echo "$body" | jq .; exit 0;; + esac + sleep "$delay" + delay=$(( delay * 2 > 30 ? 30 : delay * 2 )) +done +``` + +### TypeScript SDK + +The SDK does the backoff for you: + +```ts +import { KreuzbergCloud } from "@kreuzberg/cloud"; +const client = new KreuzbergCloud({ apiKey: process.env.KREUZBERG_API_KEY! }); + +const result = await client.waitForJob(jobId, { + timeoutMs: 5 * 60_000, + pollIntervalMs: 1000, // starting interval; backs off internally +}); +console.log(result.status, result.result?.content); +``` + +### Python SDK + +```python +from kreuzberg_cloud import KreuzbergCloud + +with KreuzbergCloud(api_key=...) 
as client: + job = client.wait_for_job(job_id, timeout=300) + print(job.status, job.result and job.result.content) +``` + +## Webhooks + +Register a webhook at submit time by including a `webhook` block in the +`POST /v1/extract` body: + +```json +{ + "webhook": { + "url": "https://hooks.example.com/kreuzberg", + "secret": "32-byte-shared-secret", + "metadata": { "request_id": "abc123" } + } +} +``` + +When the job reaches a terminal status, the server POSTs the full +`JobResponse` (or `CrawlJobResponse`) to `url`. The `metadata` you +supplied is echoed back inside the payload. + +### Signature verification + +The server signs each webhook delivery with an HMAC computed over the raw +JSON body using `secret`. The signature header name and exact algorithm +(SHA-256, hex-encoded) are documented at ; +treat them as the source of truth — do not hard-code header names from +this skill. + +Verification pattern (Python, illustrative): + +```python +import hmac, hashlib +def verify(body: bytes, signature_hex: str, secret: str) -> bool: + expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest() + return hmac.compare_digest(expected, signature_hex) +``` + +Reject any delivery whose signature does not match. Always pass `secret` +to `POST /v1/extract` for production — unsigned webhooks can be forged. + +### Retry semantics + +Webhook deliveries retry on non-2xx responses with exponential backoff +over several hours. Keep your handler idempotent — the same `job_id` may +be delivered more than once on transient failures. 
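
Signature verification and idempotency can live in the same handler. The sketch below is illustrative only: it assumes a hex-encoded HMAC-SHA256 over the raw body, as described above, and assumes the payload carries the job identifier in its `id` field, as in the `JobResponse` example. The class name, the return strings, and the in-memory set are hypothetical; a real receiver would persist seen IDs and read the signature from whatever header the official documentation names.

```python
import hashlib
import hmac
import json
from typing import Set


class WebhookReceiver:
    """Illustrative in-memory receiver that tolerates repeated deliveries."""

    def __init__(self, secret: str) -> None:
        self._secret = secret.encode()
        self._seen: Set[str] = set()

    def handle(self, raw_body: bytes, signature_hex: str) -> str:
        # Verify first, over the raw bytes, before trusting any field.
        expected = hmac.new(self._secret, raw_body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, signature_hex):
            return "rejected"
        job_id = json.loads(raw_body)["id"]
        # A retried delivery carries the same job ID; skip the side effects.
        if job_id in self._seen:
            return "duplicate"
        self._seen.add(job_id)
        return "processed"
```

A handler built this way can still answer a repeated delivery with a 2xx status, which stops further retries without processing the job twice.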
+ +### When to prefer webhooks vs polling + +| Prefer webhooks | Prefer polling | +|---|---| +| You can run an HTTP server | CLI / one-shot scripts | +| Jobs run minutes long | Jobs finish in seconds | +| Batch of many jobs | A single foreground job | +| Caller can't block | Caller is already blocking | +| You want a push notification when the job turns terminal (handler must tolerate repeat deliveries) | You want strict consistency in your own loop | + +## Crawl jobs + +`GET /v1/jobs/{crawl_job_id}` returns `CrawlJobResponse` (different shape +from `JobResponse`). The crawl job lists each per-document `job_id` that +was spawned; iterate through those to fetch individual extraction results. + +## Errors + +| Status | Meaning | Action | +|---|---|---| +| `400` | Malformed UUID | Verify the ID came from `job_ids` / `crawl_job_ids`. | +| `401` | Bad API key | Check `Authorization` header. | +| `404` | Job not found | Wrong project key, or job purged. | +| `503` | DB unavailable | Retry with backoff. | diff --git a/plugins/kreuzberg/.factory-plugin/plugin.json b/plugins/kreuzberg/.factory-plugin/plugin.json new file mode 100644 index 0000000..1ca34e0 --- /dev/null +++ b/plugins/kreuzberg/.factory-plugin/plugin.json @@ -0,0 +1,24 @@ +{ + "name": "kreuzberg", + "version": "0.1.0", + "description": "Local document extraction: text, tables, metadata, images from 91+ formats with optional OCR.", + "author": { + "name": "Kreuzberg, Inc.", + "email": "support@kreuzberg.dev", + "url": "https://kreuzberg.dev" + }, + "homepage": "https://kreuzberg.dev", + "repository": "https://github.com/kreuzberg-dev/plugins", + "license": "MIT", + "category": "document-intelligence", + "keywords": [ + "document-intelligence", + "extraction", + "ocr", + "pdf", + "tables" + ], + "brandColor": "#1F6FEB", + "icon": "./assets/icon.svg", + "logo": "./assets/logo.png" +} diff --git a/plugins/kreuzberg/README.md b/plugins/kreuzberg/README.md new file mode 100644 index 0000000..0c89320 --- /dev/null +++ b/plugins/kreuzberg/README.md +@@ -0,0 +1,110 @@ 
+# kreuzberg + +Extract text, tables, metadata, and images from 91+ document formats — PDF, Office, images with OCR, HTML, email, archives, academic — using the local `kreuzberg` CLI in your agent. + + + +## Install + +### From the marketplace (recommended) + +Pending review for official Claude marketplace. + +Self-host: + +```text +/plugin marketplace add kreuzberg-dev/plugins +/plugin install kreuzberg@kreuzberg +``` + +### Binary requirement + +Install the `kreuzberg` CLI: + +```bash +brew install kreuzberg-dev/tap/kreuzberg +# or +cargo install kreuzberg-cli +``` + +OCR ships with Tesseract by default. Install language packs for non-English documents: + +```bash +brew install tesseract-lang # macOS +sudo apt install tesseract-ocr-* # Debian/Ubuntu +``` + +## Skills shipped + +| Skill | Trigger | +|-------|---------| +| **kreuzberg** | Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins. | +| **extracting-with-ocr** | Use when extracting text from scanned PDFs, photographed pages, or images that have no embedded text layer. Covers OCR backends, language packs, force-OCR, and performance tuning. | +| **extracting-tables** | Use when extracting tabular data from PDFs, spreadsheets, or images. Covers layout-aware table detection, table model selection, output formats (markdown / JSON cells), and known limits. | +| **picking-a-format** | Use when choosing an output format for extracted documents — text, markdown, djot, html, or JSON. Maps consumer (LLM, parser, archive) to the right `--format` / `--content-format` pair. 
| + +**Reference materials** (linked from the `kreuzberg` skill): + +| Reference | Content | +|-----------|---------| +| **CLI Reference** | All commands, flags, config precedence, exit codes | +| **Configuration Reference** | TOML/YAML/JSON formats, auto-discovery, env vars, full schema | +| **Supported Formats** | All 91+ formats with file extensions and MIME types | +| **Python API Reference** | All functions, config classes, plugin protocols, exact signatures | +| **Node.js API Reference** | All functions, TypeScript interfaces, worker pool APIs | +| **Rust API Reference** | All functions with feature gates, structs, Cargo.toml examples | +| **Advanced Features** | Plugins, embeddings, MCP server, API server, security limits | +| **Other Language Bindings** | Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker | + +## MCP tools + +The `kreuzberg` MCP server exposes: + +- `extract` — single file extraction with config. +- `extract_batch` — batch extraction from multiple files. +- `detect_mime` — MIME type detection from bytes or path. +- `cache_clear` — clear the extraction cache. + +## Configuration + +Kreuzberg auto-discovers `kreuzberg.toml` from the current directory upward. Set config via: + +1. **Environment variable**: `KREUZBERG_CONFIG_JSON='{"output_format":"markdown"}'` +2. **Config file** (TOML): `kreuzberg.toml` in cwd or a parent directory. +3. **CLI flag**: `kreuzberg extract doc.pdf --content-format markdown` + +See `skills/kreuzberg/references/configuration.md` for the full schema and precedence rules. + +## Examples + +Extract a PDF to plain text and print it: + +```text +kreuzberg extract document.pdf +``` + +Extract with markdown formatting for LLM context: + +```text +kreuzberg extract report.pdf --content-format markdown +``` + +Extract tables from a spreadsheet as JSON: + +```text +kreuzberg extract data.xlsx --format json +``` + +## Versioning + +The plugin version tracks the marketplace `VERSION` file. 
See [CHANGELOG.md](../../CHANGELOG.md) for release notes. + +## License + +MIT. The skill content uses Elastic-2.0 references to the upstream [kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) repository. + +## See also + +- **Marketplace**: [kreuzberg-dev/plugins](https://github.com/kreuzberg-dev/plugins) +- **Upstream**: [kreuzberg-dev/kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) +- **Sibling plugins**: [kreuzcrawl](../kreuzcrawl/README.md), [kreuzberg-cloud](../kreuzberg-cloud/README.md) diff --git a/plugins/kreuzberg/assets/icon.svg b/plugins/kreuzberg/assets/icon.svg new file mode 100644 index 0000000..c096294 --- /dev/null +++ b/plugins/kreuzberg/assets/icon.svg @@ -0,0 +1,8 @@ + + + + + + + + diff --git a/plugins/kreuzberg/assets/logo.png b/plugins/kreuzberg/assets/logo.png new file mode 100644 index 0000000000000000000000000000000000000000..67f8959612388c5365d8c21c6e34d76f69da6db3 GIT binary patch literal 9547
diff --git a/plugins/kreuzberg/skills/extracting-tables/SKILL.md b/plugins/kreuzberg/skills/extracting-tables/SKILL.md new file mode 100644 index 0000000..d550090 --- /dev/null +++ b/plugins/kreuzberg/skills/extracting-tables/SKILL.md @@ -0,0 +1,148 @@ +--- +name: extracting-tables +description: Use when extracting tabular data from PDFs, spreadsheets, or images. Covers layout-aware table detection, table model selection, output formats (markdown / JSON cells), and known limits. +--- + +# Extracting tables + +Use this when the user wants structured tabular data — financial +statements, scientific tables, invoices, spreadsheet-style PDFs. Kreuzberg +detects tables via a layout model (RT-DETR v2) and reconstructs cell +structure with a configurable table model. + +## Basic usage + +```bash +# Markdown tables embedded in the content stream +kreuzberg extract report.pdf --layout --content-format markdown + +# Structured JSON output, tables appear under result.tables +kreuzberg extract report.pdf --layout --format json +``` + +`--layout` turns on layout-aware extraction; without it, tables fall back +to plain text reflow and you lose cell boundaries. + +## Output shapes + +Two surfaces, picked via `--format` (CLI shape) and `--content-format` +(content rendering): + +- **Markdown tables in `content`** — `--content-format markdown`. Tables + appear inline as `| col | col |` blocks. Good for LLM ingestion. +- **Structured `tables` array** — `--format json`. 
Each entry has + `cells[][]` (rows × cols), `markdown` (pre-rendered), `page_index`, + `bbox`. Use this when downstream code needs exact cell access. + +Both are populated at once when `--layout` is on. The `tables` array is +always structured; the `content` stream switches representation. + +```bash +kreuzberg extract financials.pdf --layout --format json \ + | jq '.tables[] | {page: .page_index, rows: (.cells | length)}' +``` + +## Table models + +`--layout-table-model` picks the reconstruction backend: + +| Model | Best for | Notes | +| ------------------ | ----------------------------------------------------- | ------------------------------------------- | +| `tatr` | dense complex tables (academic, financial) | **Default.** Heaviest, highest accuracy. | +| `slanet_auto` | dispatches per-table to wired/wireless | Good when table styles are mixed. | +| `slanet_wired` | tables with visible borders | Faster than tatr. | +| `slanet_wireless` | tables without borders (whitespace-separated) | For invoices, simple grids. | +| `slanet_plus` | hybrid wired / wireless | Lighter than `slanet_auto`. | +| `disabled` | layout detection only, no table structure | Use to skip table model cost. | + +```bash +kreuzberg extract bank-statement.pdf \ + --layout --layout-table-model tatr --content-format markdown +``` + +Lower `--layout-confidence` when the layout model misses tables (default +threshold ~0.5): + +```bash +kreuzberg extract noisy-scan.pdf --layout --layout-confidence 0.3 +``` + +## Spreadsheets + +`.xlsx`, `.ods`, `.csv`, `.tsv` are extracted by dedicated parsers — no +layout model needed. Each sheet becomes a markdown table (or structured +table) automatically: + +```bash +kreuzberg extract workbook.xlsx --content-format markdown +kreuzberg extract data.csv --format json +``` + +Pass `--no-cache=true` only when iterating on the same file with different +configs. 
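
For downstream code that consumes the `--format json` output, the structured `tables` array can be summarized in a few lines of Python. This is a sketch that assumes the JSON has already been parsed (for example with `json.load`) and that it exposes a top-level `tables` list with `cells` and `page_index`, as the `jq` example in this skill implies; `summarize_tables` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def summarize_tables(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """One summary dict per detected table: page index, row count, column count."""
    summary = []
    for table in result.get("tables", []):
        cells = table.get("cells", [])
        summary.append({
            "page": table.get("page_index"),
            "rows": len(cells),
            # Rows may be ragged, so report the widest one.
            "cols": max((len(row) for row in cells), default=0),
        })
    return summary
```

An empty list from this helper means the same thing as an empty `tables` array: no table was detected or parsed.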
+ +## Config file alternative + +```toml +# `output_format` in config files equals `--content-format` on the CLI. +output_format = "markdown" + +[layout_detection] +enabled = true +confidence_threshold = 0.5 +table_model = "tatr" +``` + +Then: + +```bash +kreuzberg extract report.pdf --format json +``` + +## Programmatic access + +From Python, structured tables live on `result.tables`: + +```python +from kreuzberg import extract_file_sync, ExtractionConfig, LayoutDetectionConfig + +config = ExtractionConfig( + layout_detection=LayoutDetectionConfig(enabled=True, table_model="tatr"), + output_format="markdown", +) +result = extract_file_sync("report.pdf", config=config) +for table in result.tables: + print(table.markdown) # rendered markdown + print(table.cells[0][0]) # cell access +``` + +Node.js mirrors this (`extractFile`, `result.tables`, camelCase fields). +See `references/python-api.md` and `references/nodejs-api.md` in the +sibling `kreuzberg` skill for full type signatures. + +## Known limitations + +- **Merged cells** — reconstructed as repeated values across the spanned + region; the merge is not preserved as metadata in v0.1. +- **Rotated tables** — enable `--ocr-auto-rotate true` for image-based + PDFs before extraction. +- **Nested tables** — flattened. Detection succeeds; structural nesting is + lost. +- **Multi-page tables** — each page yields a separate `tables[]` entry. + Stitch by matching column headers if needed. +- **ONNX Runtime required** — layout and table models are unavailable in + WASM builds and on the Android x86_64 emulator; native targets ship + full support. + +## Common failure modes + +- **Empty `tables` with `--layout` on** — confidence threshold too high or + table model mismatched. Drop `--layout-confidence` to 0.3, try + `--layout-table-model tatr`. +- **Markdown tables look ragged** — switch `--layout-table-model` to + `slanet_wired` for bordered grids or `slanet_wireless` for invoices. +- **Slow extraction** — `tatr` is heavy. 
Use `slanet_auto` or + `slanet_plus` as a default; reach for `tatr` only when accuracy matters. + +See `references/cli-reference.md` for the full layout flag set and +`references/advanced-features.md` for the layout pipeline internals. diff --git a/plugins/kreuzberg/skills/extracting-with-ocr/SKILL.md b/plugins/kreuzberg/skills/extracting-with-ocr/SKILL.md new file mode 100644 index 0000000..d8bc1f3 --- /dev/null +++ b/plugins/kreuzberg/skills/extracting-with-ocr/SKILL.md @@ -0,0 +1,123 @@ +--- +name: extracting-with-ocr +description: Use when extracting text from scanned PDFs, photographed pages, or images that have no embedded text layer. Covers OCR backends, language packs, force-OCR, and performance tuning. +--- + +# Extracting with OCR + +Use this when a document is image-based: scanned PDFs, photographed pages, +screenshots, JPEG/PNG/TIFF with text. Kreuzberg auto-OCRs raster images and +auto-detects PDFs that lack a text layer. Force it on when extraction +returned empty/garbled text from a PDF that "looks" textual. + +## When to force OCR + +- Extraction returned an empty `content` field, but the file opens visually. +- The PDF text layer is junk (copy-paste from a viewer produces gibberish). +- You want consistent output across mixed scanned + digital PDFs. + +```bash +kreuzberg extract scan.pdf --force-ocr=true +kreuzberg extract scan.pdf --ocr=true --ocr-language eng +``` + +If a page has an unreliable text layer, `--force-ocr=true` re-rasterizes +and runs OCR on every page. + +## Backends + +Tesseract is the default and ships with the CLI — no extra install. Other +backends are opt-in: + +| Backend | Flag | Install | Notes | +| ------------- | ------------------------------------- | ------------------------------------------------ | -------------------------------------------------------------- | +| Tesseract | `--ocr-backend tesseract` (default) | bundled | Best general-purpose, 100+ languages via tessdata. 
| +| PaddleOCR | `--ocr-backend paddle-ocr` | bundled (ONNX Runtime) | Strong on Asian scripts. Not available on WASM or Windows. | +| EasyOCR | `--ocr-backend easyocr` | Python binding (`pip install kreuzberg[easyocr]`)| Heavier model. CUDA accel via `easyocr_kwargs={"gpu": True}`. | +| VLM (vision) | layout + a multimodal LLM via config | configured per backend | Use when OCR fails on dense or handwritten layouts. | + +Pick Tesseract first. Switch only when accuracy is unacceptable. + +## Language packs + +Tesseract uses ISO 639-2 codes. Default is `eng`. Combine with `+`: + +```bash +kreuzberg extract menu.jpg --ocr=true --ocr-language "eng+deu" +kreuzberg extract bilingual.pdf --ocr-language "eng+jpn" +kreuzberg extract any.pdf --ocr-language all # all installed packs +``` + +Install missing packs at the OS level: + +```bash +# macOS +brew install tesseract-lang + +# Debian/Ubuntu +sudo apt install tesseract-ocr-deu tesseract-ocr-jpn tesseract-ocr-fra + +# Specific lang only +sudo apt install tesseract-ocr- +``` + +Kreuzberg fails fast with a helpful error if you request a language pack +that is not installed. Read the error — it names the missing file. + +## Useful flags + +- `--ocr=true` — enable OCR (auto-enabled for images and scanned PDFs). +- `--force-ocr=true` — OCR every page even if a text layer exists. +- `--disable-ocr=true` — never OCR (extract embedded text only or fail). +- `--ocr-language ` — single code or `+`-joined list, or `all`. +- `--ocr-backend ` — pick backend. +- `--ocr-auto-rotate=true` — pre-rotate via the auto-rotate model. +- `--acceleration ` — ONNX accelerator for + paddle-ocr / auto-rotate / layout models. + +## Performance tips + +- Cache is on by default. Repeated extraction of the same file + config is + instant. Do not pass `--no-cache=true` unless you have a reason. +- For batch OCR, use `kreuzberg batch *.pdf --ocr=true` — internal worker + pool parallelizes across CPU cores. 
Cap with `--max-concurrent N` if + memory is tight. +- Raise `--target-dpi` (default 300) only for low-resolution scans. Higher + DPI is slower; 200 is usually enough for printed text. +- Enable `--ocr-auto-rotate=true` only when pages may be rotated; the + classifier adds latency. +- On Apple Silicon, `--acceleration coreml` typically beats CPU for + paddle-ocr and layout detection. + +## Config file alternative + +Long flag chains belong in `kreuzberg.toml` — auto-discovered from cwd +upward. + +```toml +force_ocr = true +output_format = "markdown" + +[ocr] +backend = "tesseract" +language = "eng+deu" +auto_rotate = true +``` + +Then just run: + +```bash +kreuzberg extract document.pdf +``` + +## Common failure modes + +- **"missing tessdata"** — install the language pack at OS level (see above). +- **Empty content on a scanned PDF without `--force-ocr`** — the file has a + bogus zero-width text layer. Re-run with `--force-ocr=true`. +- **OCR on a rotated page** — add `--ocr-auto-rotate=true` or pre-rotate. +- **Garbled CJK output** — ensure the right language pack is installed and + passed via `--ocr-language`; consider `paddle-ocr` for Chinese/Japanese. + +See `references/cli-reference.md` and `references/configuration.md` in the +sibling `kreuzberg` skill for the full flag and config schema. diff --git a/plugins/kreuzberg/skills/kreuzberg/SKILL.md b/plugins/kreuzberg/skills/kreuzberg/SKILL.md new file mode 100644 index 0000000..845e752 --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/SKILL.md @@ -0,0 +1,419 @@ +--- +name: kreuzberg +description: >- + Extract text, tables, metadata, and images from 91+ document formats + (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. + Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, + Rust, or CLI. Covers installation, extraction (sync/async), configuration + (OCR, chunking, output format), batch processing, error handling, and plugins. 
+license: Elastic-2.0 +metadata: + author: kreuzberg-dev + version: "0.1.0" + repository: https://github.com/kreuzberg-dev/kreuzberg +--- + +# Kreuzberg Document Extraction + +Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats. + +Use this skill when writing code that: + +- Extracts text or metadata from documents +- Performs OCR on scanned documents or images +- Batch-processes multiple files +- Configures extraction options (output format, chunking, OCR, language detection) +- Implements custom plugins (post-processors, validators, OCR backends) + +> If the `kreuzberg` MCP server is registered in this session, prefer its tools over shelling out to the CLI — they expose the same extraction surface with structured arguments and results. 
+ +## Installation + +### Python + +```bash +pip install kreuzberg +# Optional OCR backends: +pip install kreuzberg[easyocr] # EasyOCR +``` + +### Node.js + +```bash +npm install @kreuzberg/node +``` + +### Rust + +```toml +# Cargo.toml +[dependencies] +kreuzberg = { version = "4", features = ["tokio-runtime"] } +# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking, +# embeddings, language-detection, keywords-yake, keywords-rake +``` + +### CLI + +```bash +# Download from GitHub releases, or: +cargo install kreuzberg-cli +``` + +## Quick Start + +### Python (Async) + +```python +from kreuzberg import extract_file + +result = await extract_file("document.pdf") +print(result.content) # extracted text +print(result.metadata) # document metadata +print(result.tables) # extracted tables +``` + +### Python (Sync) + +```python +from kreuzberg import extract_file_sync + +result = extract_file_sync("document.pdf") +print(result.content) +``` + +### Node.js + +```typescript +import { extractFile } from "@kreuzberg/node"; + +const result = await extractFile("document.pdf"); +console.log(result.content); +console.log(result.metadata); +console.log(result.tables); +``` + +### Node.js (Sync) + +```typescript +import { extractFileSync } from "@kreuzberg/node"; + +const result = extractFileSync("document.pdf"); +``` + +### Rust (Async) + +```rust +use kreuzberg::{extract_file, ExtractionConfig}; + +#[tokio::main] +async fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let result = extract_file("document.pdf", None, &config).await?; + println!("{}", result.content); + Ok(()) +} +``` + +### Rust (Sync) — requires `tokio-runtime` feature + +```rust +use kreuzberg::{extract_file_sync, ExtractionConfig}; + +fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let result = extract_file_sync("document.pdf", None, &config)?; + println!("{}", result.content); + Ok(()) +} +``` + +### CLI + +```bash 
+kreuzberg extract document.pdf +kreuzberg extract document.pdf --format json +kreuzberg extract document.pdf --content-format markdown +``` + +## Configuration + +All languages use the same configuration structure with language-appropriate naming conventions. + +### Python (snake_case) + +```python +from kreuzberg import ( + ExtractionConfig, OcrConfig, TesseractConfig, + PdfConfig, ChunkingConfig, +) + +config = ExtractionConfig( + ocr=OcrConfig( + backend="tesseract", + language="eng", + tesseract_config=TesseractConfig(psm=6, enable_table_detection=True), + ), + pdf_options=PdfConfig(passwords=["secret123"]), + chunking=ChunkingConfig(max_chars=1000, max_overlap=200), + output_format="markdown", +) + +result = await extract_file("document.pdf", config=config) +``` + +### Node.js (camelCase) + +```typescript +import { extractFile, type ExtractionConfig } from "@kreuzberg/node"; + +const config: ExtractionConfig = { + ocr: { backend: "tesseract", language: "eng" }, + pdfOptions: { passwords: ["secret123"] }, + chunking: { maxChars: 1000, maxOverlap: 200 }, + outputFormat: "markdown", +}; + +const result = await extractFile("document.pdf", null, config); +``` + +### Rust (snake_case) + +```rust +use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat}; + +let config = ExtractionConfig { + ocr: Some(OcrConfig { + backend: "tesseract".into(), + language: "eng".into(), + ..Default::default() + }), + chunking: Some(ChunkingConfig { + max_characters: 1000, + overlap: 200, + ..Default::default() + }), + output_format: OutputFormat::Markdown, + ..Default::default() +}; + +let result = extract_file("document.pdf", None, &config).await?; +``` + +### Config File (TOML) + +```toml +output_format = "markdown" + +[ocr] +backend = "tesseract" +language = "eng" + +[chunking] +max_chars = 1000 +max_overlap = 200 + +[pdf_options] +passwords = ["secret123"] +``` + +```bash +# CLI: auto-discovers kreuzberg.toml in current/parent directories +kreuzberg extract 
doc.pdf +# or explicit: +kreuzberg extract doc.pdf --config kreuzberg.toml +kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}' +``` + +## Batch Processing + +### Python + +```python +from kreuzberg import batch_extract_files, batch_extract_files_sync + +# Async +results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"]) + +# Sync +results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"]) + +for result in results: + print(f"{len(result.content)} chars extracted") +``` + +### Node.js + +```typescript +import { batchExtractFiles } from "@kreuzberg/node"; + +const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]); +``` + +### Rust — requires `tokio-runtime` feature + +```rust +use kreuzberg::{batch_extract_file, ExtractionConfig}; + +let config = ExtractionConfig::default(); +let paths = vec!["doc1.pdf", "doc2.docx"]; +let results = batch_extract_file(paths, &config).await?; +``` + +### CLI + +```bash +kreuzberg batch *.pdf --format json +kreuzberg batch docs/*.docx --content-format markdown +``` + +## OCR + +OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required). + +### Backends + +- **Tesseract** (default): Built-in native binding. All Tesseract languages supported. +- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`. +- **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`. +- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`. 
+ +### Language Codes + +```python +config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English +config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple +config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed +``` + +### Force OCR + +```python +config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable +``` + +## ExtractionResult Fields + +| Field | Python | Node.js | Rust | Description | +| ------------ | --------------------------- | -------------------------- | --------------------------- | --------------------------------------------- | +| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) | +| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type | +| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) | +| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown | +| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) | +| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) | +| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) | +| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) | +| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) | +| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) | + +## Error Handling + +### Python + +```python +from kreuzberg import ( + extract_file_sync, KreuzbergError, ParsingError, + OCRError, ValidationError, MissingDependencyError, +) + +try: + result = extract_file_sync("file.pdf") +except ParsingError as e: + print(f"Failed to parse: 
{e}") +except OCRError as e: + print(f"OCR failed: {e}") +except ValidationError as e: + print(f"Invalid input: {e}") +except MissingDependencyError as e: + print(f"Missing dependency: {e}") +except KreuzbergError as e: + print(f"Extraction failed: {e}") +``` + +### Node.js + +```typescript +import { + extractFile, + KreuzbergError, + ParsingError, + OcrError, + ValidationError, + MissingDependencyError, +} from "@kreuzberg/node"; + +try { + const result = await extractFile("file.pdf"); +} catch (e) { + if (e instanceof ParsingError) { + /* ... */ + } else if (e instanceof OcrError) { + /* ... */ + } else if (e instanceof ValidationError) { + /* ... */ + } else if (e instanceof KreuzbergError) { + /* ... */ + } +} +``` + +### Rust + +```rust +use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError}; + +let config = ExtractionConfig::default(); +match extract_file("file.pdf", None, &config).await { + Ok(result) => println!("{}", result.content), + Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"), + Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"), + Err(e) => eprintln!("Error: {e}"), +} +``` + +## Common Pitfalls + +1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`. +2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults. +3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml. +4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context. +5. **CLI --format vs --content-format**: `--format` controls CLI output (text/json). `--content-format` controls content format (plain/markdown/djot/html). The older `--output-format` is a deprecated alias that still works but prints a warning — prefer `--content-format`. +6. 
**Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip). +7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`. +8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`). + +## Supported Formats (Summary) + +| Category | Extensions | +| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **PDF** | `.pdf` | +| **Word** | `.docx`, `.odt` | +| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | +| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | +| **eBooks** | `.epub`, `.fb2` | +| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` | +| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` | +| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | +| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | +| **Email** | `.eml`, `.msg` | +| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | +| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` | + +See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types. 
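
When routing files before extraction, the table above can be turned into a simple extension lookup. This is a minimal sketch that copies only a subset of the rows; the `CATEGORIES` mapping and `category_for` helper are hypothetical names for illustration and are not part of the Kreuzberg API.

```python
from pathlib import Path
from typing import Optional

# A subset of the table above; extend with the remaining rows as needed.
CATEGORIES = {
    "PDF": {".pdf"},
    "Word": {".docx", ".odt"},
    "Spreadsheets": {".xlsx", ".xls", ".ods"},
    "Email": {".eml", ".msg"},
    "Archives": {".zip", ".tar", ".tgz", ".gz", ".7z"},
}


def category_for(path: str) -> Optional[str]:
    """Return the format category for a path, or None if the extension is not listed."""
    # Only the final suffix is checked, so "backup.tar.gz" is looked up as ".gz".
    suffix = Path(path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```

A lookup like this only pre-filters by file name; content-based MIME detection is a separate concern covered by the library itself.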
+ +## Additional Resources + +Detailed reference files for specific topics: + +- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures +- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs +- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples +- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes +- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema +- **[Supported Formats](references/supported-formats.md)** — All 91+ formats with file extensions and MIME types +- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits +- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker + +Full documentation: +GitHub: diff --git a/plugins/kreuzberg/skills/kreuzberg/references/advanced-features.md b/plugins/kreuzberg/skills/kreuzberg/references/advanced-features.md new file mode 100644 index 0000000..283846d --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/advanced-features.md @@ -0,0 +1,967 @@ +# Advanced Features Reference + +Kreuzberg provides powerful advanced features for customization, semantic processing, and integration with external systems. + +## Plugin System + +The plugin system allows you to extend Kreuzberg's extraction pipeline with custom post-processors, validators, and OCR backends. Plugins run within the extraction pipeline and have direct access to extraction results. + +### Custom Post-Processors + +Post-processors enrich extraction results after document parsing. They run non-destructively—if a post-processor fails, the extraction succeeds anyway (errors are logged). 
+ +=== "Python" + + ```python + from kreuzberg import register_post_processor, ExtractionResult + + class MetadataEnricher: + def name(self) -> str: + return "metadata_enricher" + + def process(self, result: ExtractionResult) -> ExtractionResult: + result.metadata["processed_by"] = "metadata_enricher" + result.metadata["char_count"] = len(result.content) + return result + + def processing_stage(self) -> str: + # "early", "middle", or "late" + return "middle" + + def initialize(self) -> None: + print("Initializing metadata enricher") + + def shutdown(self) -> None: + print("Shutting down metadata enricher") + + register_post_processor(MetadataEnricher()) + + # Now use extraction with the registered processor + from kreuzberg import extract_file_sync + result = extract_file_sync("document.pdf") + print(result.metadata["char_count"]) + ``` + +=== "TypeScript" + + ```typescript + import { registerPostProcessor, extractFile, ExtractionResult } from '@kreuzberg/node'; + + const enricher = { + name(): string { + return "metadata_enricher"; + }, + + async process(result: ExtractionResult): Promise<ExtractionResult> { + result.metadata.processed_by = "metadata_enricher"; + result.metadata.char_count = result.content.length; + return result; + }, + + processingStage(): "early" | "middle" | "late" { + return "middle"; + }, + + async initialize(): Promise<void> { + console.log("Initializing metadata enricher"); + }, + + async shutdown(): Promise<void> { + console.log("Shutting down metadata enricher"); + } + }; + + registerPostProcessor(enricher); + + // Now use extraction with the registered processor + const result = await extractFile("document.pdf"); + console.log(result.metadata.char_count); + ``` + +### Custom Validators + +Validators perform quality checks on extraction results. Unlike post-processors, validator failures cause the entire extraction to fail. Use validators to enforce quality standards.
+ +=== "Python" + + ```python + from kreuzberg import ( + register_validator, extract_file_sync, + ExtractionResult, ValidationError, + ) + + class MinimumContentValidator: + def name(self) -> str: + return "min_content_validator" + + def validate(self, result: ExtractionResult) -> None: + if len(result.content) < 100: + raise ValidationError("Extracted content too short (< 100 chars)") + + def priority(self) -> int: + # Higher priority runs first (0-1000, default 50) + return 100 + + def should_validate(self, result: ExtractionResult) -> bool: + # Only validate PDFs + return "pdf" in result.mime_type.lower() + + def initialize(self) -> None: + pass + + def shutdown(self) -> None: + pass + + register_validator(MinimumContentValidator()) + + # Extraction will fail if content < 100 chars + result = extract_file_sync("document.pdf") + ``` + +=== "TypeScript" + + ```typescript + import { registerValidator, extractFile, ExtractionResult } from '@kreuzberg/node'; + + const validator = { + name(): string { + return "min_content_validator"; + }, + + async validate(result: ExtractionResult): Promise<void> { + if (result.content.length < 100) { + throw new Error("Extracted content too short (< 100 chars)"); + } + }, + + priority(): number { + return 100; + }, + + shouldValidate(result: ExtractionResult): boolean { + return result.mimeType.toLowerCase().includes("pdf"); + }, + + async initialize(): Promise<void> {}, + + async shutdown(): Promise<void> {} + }; + + registerValidator(validator); + + // Extraction will fail if content < 100 chars + const result = await extractFile("document.pdf"); + ``` + +### Custom OCR Backends + +Implement custom OCR engines by registering an OCR backend. This allows integration with proprietary or specialized OCR solutions.
+ +=== "Python" + + ```python + from kreuzberg import register_ocr_backend + + class CustomOcrBackend: + def name(self) -> str: + return "custom_ocr" + + def supported_languages(self) -> list[str]: + return ["eng", "deu", "fra", "spa"] + + def process_image(self, image_bytes: bytes, language: str) -> dict: + # image_bytes: raw image data + # language: ISO 639-3 code (e.g., "eng", "deu") + + # Call your OCR engine here + # text = my_ocr_engine.recognize(image_bytes, language) + + return { + "content": "Extracted text from image", + "metadata": {"confidence": 0.95, "language": language}, + "tables": [] + } + + def process_file(self, path: str, language: str) -> dict: + # Optional: custom file processing + # Called when extracting OCR from a file path + with open(path, "rb") as f: + image_bytes = f.read() + return self.process_image(image_bytes, language) + + def initialize(self) -> None: + # Load models, initialize engine + pass + + def shutdown(self) -> None: + # Clean up resources + pass + + def version(self) -> str: + return "1.0.0" + + register_ocr_backend(CustomOcrBackend()) + + # Use in extraction config + from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync + + config = ExtractionConfig( + ocr=OcrConfig(backend="custom_ocr", language="eng") + ) + result = extract_file_sync("scanned.pdf", config=config) + ``` + +=== "TypeScript" + + ```typescript + import { registerOcrBackend, ExtractionConfig, extractFile } from '@kreuzberg/node'; + + const backend = { + name(): string { + return "custom_ocr"; + }, + + supportedLanguages(): string[] { + return ["eng", "deu", "fra", "spa"]; + }, + + async processImage( + imageBytes: Uint8Array | string, + language: string + ): Promise<{ + content: string; + mime_type: string; + metadata: Record<string, unknown>; + tables: unknown[]; + }> { + const buffer = typeof imageBytes === "string" + ?
Buffer.from(imageBytes, "base64") + : Buffer.from(imageBytes); + + // Call your OCR engine + // const text = await myOcrEngine.recognize(buffer, language); + + return { + content: "Extracted text from image", + mime_type: "text/plain", + metadata: { confidence: 0.95, language }, + tables: [] + }; + }, + + async initialize(): Promise<void> { + // Load models, initialize engine + }, + + async shutdown(): Promise<void> { + // Clean up resources + } + }; + + registerOcrBackend(backend); + + // Use in extraction config + const config: ExtractionConfig = { + ocr: { backend: "custom_ocr", language: "eng" } + }; + const result = await extractFile("scanned.pdf", null, config); + ``` + +## Per-File Configuration in Batch Operations + +Use `FileExtractionConfig` to override extraction settings for individual files within a batch. This is useful for mixed-format batches where different documents need different OCR, output, or processing settings. + +=== "Python" + + ```python + from kreuzberg import ( + batch_extract_files_sync, + ExtractionConfig, FileExtractionConfig, OcrConfig, + ) + + config = ExtractionConfig(output_format="markdown") + paths = ["report.pdf", "scan.tiff"] + file_configs = [ + None, # use batch defaults + FileExtractionConfig( + force_ocr=True, + ocr=OcrConfig(backend="tesseract", language="deu"), + ), + ] + results = batch_extract_files_sync(paths, config, file_configs=file_configs) + ``` + +=== "TypeScript" + + ```typescript + import { batchExtractFilesSync } from '@kreuzberg/node'; + + const results = batchExtractFilesSync( + ['report.pdf', 'scan.tiff'], + { outputFormat: 'markdown' }, + [null, { forceOcr: true, ocr: { backend: 'tesseract', language: 'deu' } }], + ); + ``` + +All `ExtractionConfig` fields except batch-level concerns (`max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`) can be overridden. `None`/`null` fields inherit from the batch default.
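
The inheritance rule in the last paragraph can be pictured with plain dictionaries. This is only a sketch of the described semantics, not Kreuzberg's implementation: `effective_config` and `BATCH_LEVEL_FIELDS` are hypothetical names, and it assumes a nested override such as `ocr` replaces the batch value wholesale, which the text above does not specify.

```python
from typing import Optional

# Fields the text above lists as batch-level, so per-file overrides are ignored.
BATCH_LEVEL_FIELDS = {
    "max_concurrent_extractions",
    "use_cache",
    "acceleration",
    "security_limits",
}


def effective_config(batch_defaults: dict, file_override: Optional[dict]) -> dict:
    """Merge one file's overrides onto the batch defaults."""
    merged = dict(batch_defaults)
    if file_override is None:
        # A `None` entry in the per-file list means "use batch defaults".
        return merged
    for key, value in file_override.items():
        if key in BATCH_LEVEL_FIELDS or value is None:
            # Batch-level fields and unset (None) fields inherit from the batch.
            continue
        merged[key] = value
    return merged


batch = {"output_format": "markdown", "force_ocr": False, "use_cache": True}
override = {"force_ocr": True, "output_format": None, "use_cache": False}
print(effective_config(batch, override))
```

Here `force_ocr` is taken from the override, `output_format` falls back to the batch default because it is `None`, and `use_cache` stays at the batch value because it is a batch-level field.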
+ +## Embeddings + +Generate vector embeddings for text chunks using ONNX-based models. Embeddings enable semantic search, clustering, and similarity operations on extracted content. + +**Requirements:** ONNX Runtime 1.22.x or later + +=== "Python" + + ```python + from kreuzberg import ( + ExtractionConfig, ChunkingConfig, EmbeddingConfig, + EmbeddingModelType, list_embedding_presets, + get_embedding_preset, extract_file_sync + ) + + # List available embedding presets + presets = list_embedding_presets() + print(f"Available presets: {presets}") # ['balanced', 'compact', 'large'] + + # Get details about a preset + preset_info = get_embedding_preset("balanced") + print(f"Model: {preset_info.model_name}") + print(f"Dimensions: {preset_info.dimensions}") + print(f"Recommended chunk size: {preset_info.chunk_size}") + + # Method 1: Use preset (recommended) + config = ExtractionConfig( + chunking=ChunkingConfig( + max_chars=512, + max_overlap=100, + embedding=EmbeddingConfig( + model=EmbeddingModelType.preset("balanced"), + normalize=True, + batch_size=32 + ) + ) + ) + + # Method 2: Use specific fastembed model + config = ExtractionConfig( + chunking=ChunkingConfig( + embedding=EmbeddingConfig( + model=EmbeddingModelType.fastembed( + model="BAAI/bge-small-en-v1.5", + dimensions=384 + ), + normalize=True + ) + ) + ) + + # Method 3: Use custom ONNX model from HuggingFace + config = ExtractionConfig( + chunking=ChunkingConfig( + embedding=EmbeddingConfig( + model=EmbeddingModelType.custom( + model_id="sentence-transformers/all-MiniLM-L6-v2", + dimensions=384 + ), + cache_dir="/path/to/model/cache" + ) + ) + ) + + result = extract_file_sync("document.pdf", config=config) + + # Access embeddings in chunks + for chunk in result.chunks: + embedding = chunk.embedding # list[float] or None + print(f"Chunk: {chunk.content[:50]}...") + print(f"Embedding dimensions: {len(embedding) if embedding else 0}") + ``` + +=== "TypeScript" + + ```typescript + import { + ExtractionConfig, 
ChunkingConfig, + listEmbeddingPresets, getEmbeddingPreset, + extractFile + } from '@kreuzberg/node'; + + // List available embedding presets + const presets = listEmbeddingPresets(); + console.log(`Available presets: ${presets}`); // ['balanced', 'compact', 'large'] + + // Get details about a preset + const preset = getEmbeddingPreset("balanced"); + console.log(`Model: ${preset.modelName}`); + console.log(`Dimensions: ${preset.dimensions}`); + console.log(`Recommended chunk size: ${preset.chunkSize}`); + + // Method 1: Use preset (recommended) + const config: ExtractionConfig = { + chunking: { + maxChars: 512, + maxOverlap: 100, + embedding: { + model: { type: 'preset', name: 'balanced' }, + normalize: true, + batchSize: 32 + } + } + }; + + // Method 2: Use specific fastembed model + const config2: ExtractionConfig = { + chunking: { + embedding: { + model: { + type: 'fastembed', + model: 'BAAI/bge-small-en-v1.5', + dimensions: 384 + }, + normalize: true + } + } + }; + + // Method 3: Use custom ONNX model + const config3: ExtractionConfig = { + chunking: { + embedding: { + model: { + type: 'custom', + modelId: 'sentence-transformers/all-MiniLM-L6-v2', + dimensions: 384 + }, + cacheDir: '/path/to/model/cache' + } + } + }; + + const result = await extractFile("document.pdf", null, config); + + // Access embeddings in chunks + if (result.chunks) { + for (const chunk of result.chunks) { + const embedding = chunk.embedding; // number[] | null + console.log(`Chunk: ${chunk.content.substring(0, 50)}...`); + console.log(`Embedding dimensions: ${embedding?.length ?? 0}`); + } + } + ``` + +## Keyword Extraction + +Extract important keywords and phrases from documents using YAKE (Yet Another Keyword Extractor) or RAKE (Rapid Automatic Keyword Extraction) algorithms. 
+ +=== "Python" + + ```python + from kreuzberg import ( + ExtractionConfig, KeywordConfig, KeywordAlgorithm, + YakeParams, RakeParams, extract_file_sync + ) + + # YAKE algorithm (unsupervised, good for general use) + config = ExtractionConfig( + keywords=KeywordConfig( + algorithm=KeywordAlgorithm.Yake, + max_keywords=15, + min_score=0.1, + ngram_range=(1, 3), + language="en", + yake_params=YakeParams(window_size=2) + ) + ) + + # RAKE algorithm (co-occurrence based) + config = ExtractionConfig( + keywords=KeywordConfig( + algorithm=KeywordAlgorithm.Rake, + max_keywords=10, + min_score=0.0, + language="en", + rake_params=RakeParams( + min_word_length=3, + max_words_per_phrase=3 + ) + ) + ) + + result = extract_file_sync("document.pdf", config=config) + + # Access extracted keywords + if result.keywords: + for keyword in result.keywords: + print(f"Text: {keyword.text}") + print(f"Score: {keyword.score}") + print(f"Algorithm: {keyword.algorithm}") + ``` + +=== "TypeScript" + + ```typescript + import { + ExtractionConfig, KeywordConfig, + extractFile + } from '@kreuzberg/node'; + + // YAKE algorithm + const config: ExtractionConfig = { + keywords: { + algorithm: "yake", + maxKeywords: 15, + minScore: 0.1, + ngramRange: [1, 3], + language: "en", + yakeParams: { + windowSize: 2 + } + } + }; + + // RAKE algorithm + const config2: ExtractionConfig = { + keywords: { + algorithm: "rake", + maxKeywords: 10, + minScore: 0.0, + language: "en", + rakeParams: { + minWordLength: 3, + maxWordsPerPhrase: 3 + } + } + }; + + const result = await extractFile("document.pdf", null, config); + + // Access extracted keywords + if (result.keywords) { + for (const keyword of result.keywords) { + console.log(`Text: ${keyword.text}`); + console.log(`Score: ${keyword.score}`); + console.log(`Algorithm: ${keyword.algorithm}`); + } + } + ``` + +## Language Detection + +Automatically detect the language(s) in documents using ISO 639-1 language codes. 
+ +=== "Python" + + ```python + from kreuzberg import ( + ExtractionConfig, LanguageDetectionConfig, + extract_file_sync + ) + + # Enable language detection + config = ExtractionConfig( + language_detection=LanguageDetectionConfig( + enabled=True, + min_confidence=0.8, + detect_multiple=False + ) + ) + + result = extract_file_sync("multilingual.pdf", config=config) + + # Access detected languages + if result.detected_languages: + for lang_code in result.detected_languages: + print(f"Detected language: {lang_code}") # e.g., "en", "de", "fr" + ``` + +=== "TypeScript" + + ```typescript + import { + ExtractionConfig, LanguageDetectionConfig, + extractFile + } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + languageDetection: { + enabled: true, + minConfidence: 0.8, + detectMultiple: false + } + }; + + const result = await extractFile("multilingual.pdf", null, config); + + // Access detected languages + if (result.detectedLanguages) { + for (const langCode of result.detectedLanguages) { + console.log(`Detected language: ${langCode}`); // e.g., "en", "de", "fr" + } + } + ``` + +## Token Reduction + +Reduce the number of tokens in extracted content for cost optimization when working with LLM APIs. Higher modes are more aggressive but may lose more information. 
+ +=== "Python" + + ```python + from kreuzberg import ( + ExtractionConfig, TokenReductionConfig, + extract_file_sync + ) + + # Light token reduction + config = ExtractionConfig( + token_reduction=TokenReductionConfig( + mode="light", + preserve_important_words=True + ) + ) + + # Moderate reduction + config = ExtractionConfig( + token_reduction=TokenReductionConfig( + mode="moderate", + preserve_important_words=True + ) + ) + + # Aggressive reduction + config = ExtractionConfig( + token_reduction=TokenReductionConfig( + mode="aggressive", + preserve_important_words=True + ) + ) + + # Maximum reduction + config = ExtractionConfig( + token_reduction=TokenReductionConfig( + mode="maximum", + preserve_important_words=True + ) + ) + + result = extract_file_sync("document.pdf", config=config) + print(f"Reduced content length: {len(result.content)}") + ``` + +=== "TypeScript" + + ```typescript + import { + ExtractionConfig, TokenReductionConfig, + extractFile + } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + tokenReduction: { + mode: "moderate", + preserveImportantWords: true + } + }; + + const result = await extractFile("document.pdf", null, config); + console.log(`Reduced content length: ${result.content.length}`); + ``` + +**Token Reduction Modes:** + +- `off`: No reduction (default) +- `light`: Remove extra whitespace and redundant punctuation +- `moderate`: Also remove common filler words and some formatting +- `aggressive`: Also remove longer stopwords and collapse similar phrases +- `maximum`: Maximum reduction while preserving semantic content + +## Page Extraction + +Extract and track per-page content separately. Useful for multi-page documents where you need page-level granularity. 
+ +=== "Python" + + ```python + from kreuzberg import ( + ExtractionConfig, PageConfig, + extract_file_sync + ) + + config = ExtractionConfig( + pages=PageConfig( + extract_pages=True, + insert_page_markers=True, + marker_format="\n\n\n\n" + ) + ) + + result = extract_file_sync("multi_page.pdf", config=config) + + # Access per-page content + if result.pages: + for page in result.pages: + print(f"Page {page.page_number}:") + print(f"Content: {page.content[:100]}...") + print(f"Tables: {len(page.tables)}") + print(f"Images: {len(page.images)}") + ``` + +=== "TypeScript" + + ```typescript + import { + ExtractionConfig, PageExtractionConfig, + extractFile + } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + pages: { + extractPages: true, + insertPageMarkers: true, + markerFormat: "\n\n\n\n" + } + }; + + const result = await extractFile("multi_page.pdf", null, config); + + // Access per-page content + if (result.pages) { + for (const page of result.pages) { + console.log(`Page ${page.pageNumber}:`); + console.log(`Content: ${page.content.substring(0, 100)}...`); + console.log(`Tables: ${page.tables.length}`); + console.log(`Images: ${page.images.length}`); + } + } + ``` + +## Element-Based Output + +Extract semantic elements instead of unified content. This format is compatible with the Unstructured library and provides structured access to different content types (titles, headings, text, tables, images, etc.). + +=== "Python" + + ```python + from kreuzberg import ExtractionConfig, ResultFormat, extract_file_sync + + config = ExtractionConfig( + result_format="element_based" + ) + + result = extract_file_sync("document.pdf", config=config) + + # Access semantic elements + if result.elements: + for element in result.elements: + print(f"Type: {element.element_type}") # title, heading, narrative_text, etc. 
+ print(f"Text: {element.text}") + if element.metadata.get("page_number"): + print(f"Page: {element.metadata['page_number']}") + ``` + +=== "TypeScript" + + ```typescript + import { ExtractionConfig, extractFile } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + resultFormat: "element_based" + }; + + const result = await extractFile("document.pdf", null, config); + + // Access semantic elements + if (result.elements) { + for (const element of result.elements) { + console.log(`Type: ${element.elementType}`); + console.log(`Text: ${element.text}`); + if (element.metadata.pageNumber) { + console.log(`Page: ${element.metadata.pageNumber}`); + } + } + } + ``` + +**Element Types:** + +- `title`: Document or section title +- `heading`: Section headings +- `narrative_text`: Regular paragraph text +- `list_item`: Items in bullet/numbered lists +- `table`: Table structures +- `image`: Images or figures +- `page_break`: Page boundaries +- `code_block`: Code snippets +- `block_quote`: Quoted text +- `footer`: Footer content +- `header`: Header content + +## Djot Content + +Output extracted content in Djot markup format (a lighter alternative to Markdown with enhanced structure). 
+ +=== "Python" + + ```python + from kreuzberg import ExtractionConfig, OutputFormat, extract_file_sync + + config = ExtractionConfig( + output_format="djot" + ) + + result = extract_file_sync("document.pdf", config=config) + print(result.content) # Djot-formatted content + + # Access structured Djot content + if result.djot_content: + print(f"Plain text: {result.djot_content['plain_text']}") + print(f"Blocks: {result.djot_content['blocks']}") + print(f"Links: {result.djot_content['links']}") + print(f"Images: {result.djot_content['images']}") + print(f"Footnotes: {result.djot_content['footnotes']}") + ``` + +=== "TypeScript" + + ```typescript + import { ExtractionConfig, extractFile } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + outputFormat: "djot" + }; + + const result = await extractFile("document.pdf", null, config); + console.log(result.content); // Djot-formatted content + + // Access structured Djot content (if available) + if (result.djotContent) { + console.log(`Plain text: ${result.djotContent.plain_text}`); + console.log(`Blocks: ${result.djotContent.blocks}`); + console.log(`Links: ${result.djotContent.links}`); + console.log(`Images: ${result.djotContent.images}`); + console.log(`Footnotes: ${result.djotContent.footnotes}`); + } + ``` + +## API Server + +Run Kreuzberg as an HTTP API server for integration with external services. + + # Start server on default port 8000 + kreuzberg serve + + # Custom host and port + kreuzberg serve --host 0.0.0.0 --port 9000 + + # Bind explicitly to localhost on the default port + kreuzberg serve --host localhost --port 8000 + +**API Endpoints:** + +- `POST /extract` - Extract from uploaded file +- `POST /batch` - Batch extraction +- `POST /detect` - Detect MIME type + +**Example:** + + curl -X POST -F "file=@document.pdf" http://localhost:8000/extract + +## MCP Server + +Run Kreuzberg as a Model Context Protocol server for integration with Claude and other AI models. 
+ + # Start MCP server with stdio transport + kreuzberg mcp --transport stdio + + # Start MCP server with HTTP transport + kreuzberg mcp --transport http --host 127.0.0.1 --port 8001 + +The MCP server exposes extraction functions to AI models, allowing them to process documents directly. + +## Security Limits + +Set resource limits to prevent abuse and control memory/file size consumption. + +=== "Python" + + ```python + from kreuzberg import ExtractionConfig, extract_file_sync + + config = ExtractionConfig( + security_limits={ + "max_file_size": 100_000_000, # 100 MB + "max_archive_files": 1000, + "max_text_length": 10_000_000, # 10 MB of text + "max_pages": 10000, + "max_concurrent_extractions": 4 + } + ) + + result = extract_file_sync("document.pdf", config=config) + ``` + +=== "TypeScript" + + ```typescript + import { ExtractionConfig, extractFile } from '@kreuzberg/node'; + + const config: ExtractionConfig = { + securityLimits: { + max_file_size: 100_000_000, // 100 MB + max_archive_files: 1000, + max_text_length: 10_000_000, // 10 MB of text + max_pages: 10000, + max_concurrent_extractions: 4 + } + }; + + const result = await extractFile("document.pdf", null, config); + ``` + +**Common Limits:** + +- `max_file_size`: Maximum input file size in bytes +- `max_archive_files`: Maximum files in archives (zip, tar, etc.) +- `max_text_length`: Maximum extracted text length +- `max_pages`: Maximum number of pages to process +- `max_concurrent_extractions`: Maximum concurrent extraction operations + +## Caching + +Extraction results are cached by default to improve performance on repeated extractions of identical documents. Control caching behavior through configuration. 
+ +=== "Python" + + ```python + from kreuzberg import ExtractionConfig, extract_file_sync + + # Enable caching (default) + config = ExtractionConfig(use_cache=True) + result = extract_file_sync("document.pdf", config=config) + + # Disable caching for a specific extraction + config = ExtractionConfig(use_cache=False) + result = extract_file_sync("document.pdf", config=config) + ``` + +=== "TypeScript" + + ```typescript + import { ExtractionConfig, extractFile } from '@kreuzberg/node'; + + // Enable caching (default) + const config: ExtractionConfig = { useCache: true }; + const result = await extractFile("document.pdf", null, config); + + // Disable caching + const config2: ExtractionConfig = { useCache: false }; + const result2 = await extractFile("document.pdf", null, config2); + ``` + +**CLI Cache Management:** + + # View cache statistics + kreuzberg cache stats + + # Clear all cached results + kreuzberg cache clear + +Caching is transparent and automatic—same input produces cached output instantly on subsequent extractions. diff --git a/plugins/kreuzberg/skills/kreuzberg/references/cli-reference.md b/plugins/kreuzberg/skills/kreuzberg/references/cli-reference.md new file mode 100644 index 0000000..bf5b6cc --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/cli-reference.md @@ -0,0 +1,449 @@ +# Kreuzberg CLI Reference + +Comprehensive command-line interface for the Kreuzberg document intelligence library. + +## Installation + +Install from crates.io: + +```bash +cargo install kreuzberg-cli +``` + +Or download pre-built binaries from [GitHub Releases](https://github.com/kreuzberg-dev/kreuzberg/releases). + +## Commands + +### extract + +Extract text and structure from a single document. + +```bash +kreuzberg extract [FLAGS] +``` + +## Positional Arguments + +- `` — Path to the document file + +## Flags + +- `-c, --config ` — Path to config file (TOML, YAML, or JSON). 
Auto-discovers `kreuzberg.{toml,yaml,json}` in current and parent directories if omitted. +- `--config-json ` — Inline JSON configuration (merged after config file, before CLI flags). +- `--config-json-base64 ` — Base64-encoded JSON configuration. +- `-m, --mime-type ` — MIME type hint (auto-detected if not provided). +- `-f, --format ` — CLI output format (default: `text`). Controls how results display, not extraction content format. +- `--content-format ` — Extraction content format (default: `plain`). Controls format of extracted content. (Note: `--output-format` is a deprecated alias.) +- `--ocr ` — Enable OCR processing. +- `--ocr-backend ` — OCR backend: `tesseract`, `paddle-ocr`, `easyocr`. +- `--ocr-language ` — OCR language code. +- `--ocr-auto-rotate ` — Auto-rotate images before OCR. +- `--force-ocr ` — Force OCR even if text extraction succeeds. +- `--disable-ocr ` — Disable OCR entirely (even for images). +- `--no-cache ` — Disable caching. +- `--chunk ` — Enable text chunking. +- `--chunk-size ` — Chunk size in characters. +- `--chunk-overlap ` — Chunk overlap in characters. +- `--chunking-tokenizer ` — Tokenizer model for token-based sizing. +- `--include-structure ` — Include hierarchical document structure. +- `--quality ` — Enable quality processing. +- `--detect-language ` — Enable language detection. +- `--layout` — Enable layout detection (RT-DETR v2). Use `--layout false` to disable. +- `--layout-confidence ` — Layout confidence threshold (0.0-1.0). +- `--layout-table-model ` — Table structure model: `tatr`, `slanet_wired`, `slanet_wireless`, `slanet_plus`, `slanet_auto`, `disabled`. +- `--acceleration ` — ONNX execution provider: `auto`, `cpu`, `coreml`, `cuda`, `tensorrt`. +- `--extract-pages ` — Extract pages as separate array. +- `--page-markers ` — Insert page marker comments. +- `--extract-images ` — Enable image extraction. +- `--target-dpi ` — Target DPI for images (36-2400). 
+- `--pdf-password ` — Password for encrypted PDFs (repeatable). +- `--pdf-extract-images ` — Extract images from PDF pages. +- `--pdf-extract-metadata ` — Extract PDF metadata. +- `--token-reduction ` — Token reduction: `off`, `light`, `moderate`, `aggressive`, `maximum`. +- `--msg-codepage ` — Windows codepage fallback for MSG files. +- `--max-concurrent ` — Max parallel extractions in batch mode. +- `--max-threads ` — Cap all internal thread pools. +- `--cache-namespace ` — Cache namespace for tenant isolation. +- `--cache-ttl-secs ` — Per-request cache TTL in seconds. + +## Examples + +```bash +# Extract with default settings +kreuzberg extract document.pdf + +# Extract with OCR enabled +kreuzberg extract scanned.pdf --ocr=true + +# Extract with specific content format +kreuzberg extract doc.docx --content-format markdown + +# Extract with inline JSON config +kreuzberg extract file.pdf --config-json '{"ocr":{"backend":"tesseract"}}' + +# Extract with base64-encoded config +kreuzberg extract file.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ== + +# Extract and output as JSON +kreuzberg extract doc.pdf --format json + +# Extract with chunking +kreuzberg extract large-doc.pdf --chunk true --chunk-size 2000 --chunk-overlap 200 + +# Layout-aware markdown extraction +kreuzberg extract document.pdf --layout --content-format markdown + +# With custom confidence threshold +kreuzberg extract document.pdf --layout-confidence 0.7 --content-format markdown +``` + +### batch + +Batch extract from multiple documents in parallel. + +```bash +kreuzberg batch [FLAGS] +``` + +## Positional Arguments + +- `` — One or more document file paths + +## Flags + +- `-c, --config ` — Path to config file (TOML, YAML, or JSON). Auto-discovers `kreuzberg.{toml,yaml,json}` in current and parent directories if omitted. +- `--config-json ` — Inline JSON configuration (merged after config file, before CLI flags). 
+- `--config-json-base64 ` — Base64-encoded JSON configuration. +- `-f, --format ` — CLI output format (default: `json`). Controls how results display, not extraction content format. +- All extraction override flags from `extract` are also supported (e.g., `--content-format`, `--ocr`, `--layout`, `--force-ocr`, `--no-cache`, `--quality`, `--acceleration`, etc.). See the `extract` command flags for the full list. + +## Notes + +- Batch command defaults to JSON output format (unlike `extract` which defaults to text). +- Does not support `--mime-type` or `--detect-language` flags. + +## Examples + +```bash +# Batch extract multiple PDFs +kreuzberg batch document1.pdf document2.pdf document3.pdf + +# Batch extract with glob patterns (shell expansion) +kreuzberg batch *.pdf + +# Batch extract with custom output format +kreuzberg batch doc1.pdf doc2.pdf --content-format markdown + +# Batch extract with OCR +kreuzberg batch scanned*.pdf --ocr=true + +# Batch extract with text output format +kreuzberg batch files*.docx --format text +``` + +### detect + +Identify MIME type of a file. + +```bash +kreuzberg detect [FLAGS] +``` + +## Positional Arguments + +- `` — Path to the file + +## Flags + +- `-f, --format ` — Output format (default: `text`) + +## Examples + +```bash +# Detect MIME type (text output) +kreuzberg detect unknown-file.bin + +# Detect MIME type (JSON output) +kreuzberg detect file.xyz --format json +``` + +### version + +Display version information. + +```bash +kreuzberg version [FLAGS] +``` + +## Flags + +- `-f, --format ` — Output format (default: `text`) + +## Examples + +```bash +# Show version as text +kreuzberg version + +# Show version as JSON +kreuzberg version --format json +``` + +### cache + +Manage extraction cache. + +#### cache stats + +Display cache statistics. 
+ +```bash +kreuzberg cache stats [FLAGS] +``` + +## Flags + +- `--cache-dir ` — Cache directory (default: `.kreuzberg` in current directory) +- `-f, --format ` — Output format (default: `text`) + +## Examples + +```bash +# Show cache stats +kreuzberg cache stats + +# Show cache stats as JSON +kreuzberg cache stats --format json + +# Show stats for specific cache directory +kreuzberg cache stats --cache-dir /tmp/my-cache +``` + +#### cache clear + +Clear all cached extractions. + +```bash +kreuzberg cache clear [FLAGS] +``` + +## Flags + +- `--cache-dir ` — Cache directory (default: `.kreuzberg` in current directory) +- `-f, --format ` — Output format (default: `text`) + +## Examples + +```bash +# Clear cache +kreuzberg cache clear + +# Clear specific cache directory +kreuzberg cache clear --cache-dir /tmp/my-cache +``` + +### serve + +Start the API server (requires `api` feature). + +```bash +kreuzberg serve [FLAGS] +``` + +## Flags + +- `-H, --host ` — Host to bind to (e.g., `127.0.0.1` or `0.0.0.0`). CLI arg overrides config file and environment variables. +- `-p, --port ` — Port to bind to. CLI arg overrides config file and environment variables. +- `-c, --config ` — Path to config file (TOML, YAML, or JSON). Auto-discovers `kreuzberg.{toml,yaml,json}` in current and parent directories if omitted. + +## Configuration Precedence + +1. CLI arguments (`--host`, `--port`) +2. Environment variables (`KREUZBERG_HOST`, `KREUZBERG_PORT`) +3. Config file (`[server]` section) +4. 
Built-in defaults (`127.0.0.1:8000`) + +## Examples + +```bash +# Start server with defaults +kreuzberg serve + +# Start server on specific host and port +kreuzberg serve --host 0.0.0.0 --port 3000 + +# Start server with config file +kreuzberg serve --config kreuzberg.toml + +# Start server (environment variables override defaults) +KREUZBERG_HOST=192.168.1.100 KREUZBERG_PORT=8080 kreuzberg serve +``` + +### mcp + +Start the Model Context Protocol (MCP) server (requires `mcp` feature). 
+ +```bash +kreuzberg mcp [FLAGS] +``` + +## Flags + +- `-c, --config ` — Path to config file (TOML, YAML, or JSON). Auto-discovers `kreuzberg.{toml,yaml,json}` in current and parent directories if omitted. +- `--transport ` — Transport mode (default: `stdio`) +- `--host ` — HTTP host for http transport (default: `127.0.0.1`) +- `--port ` — HTTP port for http transport (default: `8001`) + +## Examples + +```bash +# Start MCP server with stdio transport +kreuzberg mcp + +# Start MCP server with HTTP transport +kreuzberg mcp --transport http + +# Start MCP server on custom HTTP host/port +kreuzberg mcp --transport http --host 0.0.0.0 --port 9000 + +# Start MCP server with config file +kreuzberg mcp --config kreuzberg.toml +``` + +## Configuration + +### File Format + +Configuration files support three formats with automatic detection: + +- **TOML** — `.toml` extension (recommended) +- **YAML** — `.yaml` or `.yml` extension +- **JSON** — `.json` extension + +### Configuration Precedence + +Settings are applied in order from highest to lowest priority: + +1. **Individual CLI flags** (e.g., `--ocr=true`, `--content-format markdown`) +2. **Inline JSON config** (`--config-json` or `--config-json-base64`) +3. **Config file** (explicit `--config path.toml` or auto-discovered) +4. **Default values** (built-in library defaults) + +### Auto-Discovery + +When no config file is specified, Kreuzberg searches for configuration in this order: + +1. `kreuzberg.toml` in current directory +2. `kreuzberg.yaml` in current directory +3. `kreuzberg.json` in current directory +4. 
Parent directories (same search pattern, up to filesystem root) + +### Example Configuration + +```toml +# Top-level extraction options +use_cache = true +enable_quality_processing = true +force_ocr = false +output_format = "markdown" + +# OCR settings +[ocr] +backend = "tesseract" +language = "eng" + +# Chunking settings +[chunking] +max_chars = 2000 +max_overlap = 200 + +# Language detection +[language_detection] +enabled = true + +# Server configuration (for serve command) +[server] +host = "127.0.0.1" +port = 8000 +``` + +## Exit Codes + +- `0` — Success +- Non-zero — Error (see stderr for details) + +## Error Handling + +The CLI validates input and provides clear error messages: + +- **File not found** — Verify path exists and is readable +- **Invalid MIME type** — Ensure file is accessible and format is supported +- **Invalid JSON** — Check `--config-json` syntax +- **Invalid config file** — Verify TOML/YAML/JSON format +- **Invalid chunk parameters** — Ensure chunk-size > 0 and overlap < chunk-size + +## Environment Variables + +- `RUST_LOG` — Set logging level (e.g., `RUST_LOG=debug`) +- `KREUZBERG_HOST` — Server bind host (used by `serve` command) +- `KREUZBERG_PORT` — Server bind port (used by `serve` command) + +## Common Patterns + +### Extract with Custom Configuration + +```bash +kreuzberg extract document.pdf \ + --content-format markdown \ + --ocr=true \ + --quality true +``` + +### Batch Process with Config File + +```bash +kreuzberg batch *.pdf --config extraction-config.toml +``` + +### CI/CD Integration + +```bash +# Extract to JSON for downstream processing +kreuzberg extract file.pdf --format json | jq '.content' + +# Batch process with error handling +kreuzberg batch docs/*.pdf --format json || exit 1 +``` + +### Performance Tuning + +```bash +# Disable cache for temporary processing +kreuzberg extract file.pdf --no-cache=true + +# Enable chunking for large documents +kreuzberg extract large-file.pdf \ + --chunk true \ + --chunk-size 5000 \ + 
--chunk-overlap 500 +``` + +## Debugging + +Enable detailed logging: + +```bash +RUST_LOG=debug kreuzberg extract document.pdf +``` + +Check cache statistics: + +```bash +kreuzberg cache stats --format json +``` + +Detect file MIME type: + +```bash +kreuzberg detect unknown-file --format json +``` diff --git a/plugins/kreuzberg/skills/kreuzberg/references/configuration.md b/plugins/kreuzberg/skills/kreuzberg/references/configuration.md new file mode 100644 index 0000000..3e34845 --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/configuration.md @@ -0,0 +1,417 @@ +# Configuration Reference + +Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies. + +## Supported Formats + +Kreuzberg configurations can be defined in three formats: + +- **TOML** (recommended): `kreuzberg.toml` +- **YAML**: `kreuzberg.yaml` +- **JSON**: `kreuzberg.json` + +All formats support the same schema and configuration options. + +## Auto-Discovery + +When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order: + +1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json` +2. Parent directories (recursively up the tree, same file name pattern) + +The first matching configuration file is loaded. 
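
The search order described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Kreuzberg's actual implementation: `find_config` and `CANDIDATES` are hypothetical names introduced here, and the sketch assumes that the nearest directory wins before the file-name priority is applied.

```python
from pathlib import Path
from typing import Optional

# File names checked in each directory, in the priority order listed above.
# (Hypothetical constant, not part of the Kreuzberg API.)
CANDIDATES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config(start: Path) -> Optional[Path]:
    """Return the first config file found in `start` or any of its ancestors.

    Hypothetical helper: it mirrors the documented search order only.
    """
    start = start.resolve()
    # Walk from the starting directory up to the filesystem root.
    for directory in (start, *start.parents):
        for name in CANDIDATES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

Calling `find_config(Path.cwd())` would then return the path that an auto-discovering command is expected to pick up, or `None` when no configuration file exists anywhere up the tree.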
+ +## Programmatic Loading + +### Python + +```python +from kreuzberg import ExtractionConfig + +# Load from explicit path +config = ExtractionConfig.from_file("kreuzberg.toml") + +# Auto-discover configuration +config = ExtractionConfig.discover() +``` + +### Node.js / TypeScript + +```typescript +import { ExtractionConfig } from "@kreuzberg/node"; + +// Load from explicit path +const config = ExtractionConfig.fromFile("kreuzberg.toml"); + +// Auto-discover configuration (separate name: `const` cannot be redeclared) +const discovered = ExtractionConfig.discover(); +``` + +### CLI + +```bash +# Explicit configuration file +kreuzberg extract --config kreuzberg.toml document.pdf + +# Auto-discovery (searches default locations) +kreuzberg extract document.pdf +``` + +## Configuration Schema + +The complete TOML schema with all available sections and options: + +### Top-Level Options + +```toml +use_cache = true +enable_quality_processing = true +force_ocr = false +output_format = "markdown" +result_format = "text" +max_concurrent_extractions = 4 +``` + +| Option | Type | Default | Description | +| ---------------------------- | ------- | ------------ | ----------------------------------------------------------------------------------- | +| `use_cache` | boolean | `true` | Enable caching of extraction results | +| `enable_quality_processing` | boolean | `true` | Enable post-processing for output quality | +| `force_ocr` | boolean | `false` | Force OCR processing even for searchable PDFs | +| `disable_ocr` | boolean | `false` | Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) | +| `output_format` | string | `"markdown"` | Output format (markdown, html, text) | +| `result_format` | string | `"text"` | Result format for structured output | +| `max_concurrent_extractions` | integer | `4` | Maximum concurrent document extractions | + +### OCR Configuration + +```toml +[ocr] +backend = "tesseract" +language = "eng" +``` + +| Option | Type | Default 
| Description | +| ---------- | ------ | 
------------- | --------------------------------------------- | +| `backend` | string | `"tesseract"` | OCR backend (currently tesseract) | +| `language` | string | `"eng"` | ISO 639-3 language code (eng, deu, fra, etc.) | + +#### Tesseract Configuration + +```toml +[ocr.tesseract_config] +psm = 3 +oem = 3 +min_confidence = 0.0 +output_format = "text" +enable_table_detection = false +table_min_confidence = 0.5 +table_column_threshold = 50 +table_row_threshold_ratio = 0.5 +use_cache = true +``` + +| Option | Type | Default | Description | +| --------------------------- | ------- | -------- | ------------------------------------------ | +| `psm` | integer | `3` | Page Segmentation Mode (0-13) | +| `oem` | integer | `3` | OCR Engine Mode (0-3) | +| `min_confidence` | float | `0.0` | Minimum OCR confidence threshold (0.0-1.0) | +| `output_format` | string | `"text"` | Output format from OCR | +| `enable_table_detection` | boolean | `false` | Enable table detection during OCR | +| `table_min_confidence` | float | `0.5` | Minimum confidence for table cells | +| `table_column_threshold` | integer | `50` | Pixel threshold for column detection | +| `table_row_threshold_ratio` | float | `0.5` | Row height ratio threshold | +| `use_cache` | boolean | `true` | Cache OCR results | + +#### Tesseract Preprocessing + +```toml +[ocr.tesseract_config.preprocessing] +target_dpi = 300 +auto_rotate = true +deskew = true +denoise = true +contrast_enhance = true +binarization_method = "otsu" +invert_colors = false +``` + +| Option | Type | Default | Description | +| --------------------- | ------- | -------- | ---------------------------------------------- | +| `target_dpi` | integer | `300` | Target DPI for preprocessing | +| `auto_rotate` | boolean | `true` | Automatically detect and correct page rotation | +| `deskew` | boolean | `true` | Correct skewed pages | +| `denoise` | boolean | `true` | Remove noise from images | +| `contrast_enhance` | boolean | `true` | Enhance image 
contrast | +| `binarization_method` | string | `"otsu"` | Method for image binarization | +| `invert_colors` | boolean | `false` | Invert image colors if needed | + +### PDF Options + +```toml +[pdf_options] +extract_images = true +extract_metadata = true + +[pdf_options.hierarchy] +enabled = true +k_clusters = 6 +include_bbox = true +ocr_coverage_threshold = 0.5 +``` + +| Option | Type | Default | Description | +| ---------------------------------- | ------- | ------- | ---------------------------------------------- | +| `extract_images` | boolean | `true` | Extract images from PDF documents | +| `extract_metadata` | boolean | `true` | Extract PDF metadata | +| `hierarchy.enabled` | boolean | `true` | Enable PDF hierarchy extraction (v4.0.0+) | +| `hierarchy.k_clusters` | integer | `6` | Number of clusters for hierarchy detection | +| `hierarchy.include_bbox` | boolean | `true` | Include bounding boxes in hierarchy | +| `hierarchy.ocr_coverage_threshold` | float | `0.5` | OCR coverage threshold for hierarchy (0.0-1.0) | + +### Image Processing + +```toml +[images] +extract_images = true +target_dpi = 300 +max_image_dimension = 4096 +auto_adjust_dpi = true +min_dpi = 72 +max_dpi = 600 +``` + +| Option | Type | Default | Description | +| --------------------- | ------- | ------- | -------------------------------------------- | +| `extract_images` | boolean | `true` | Extract images from documents | +| `target_dpi` | integer | `300` | Target DPI for image processing | +| `max_image_dimension` | integer | `4096` | Maximum image dimension in pixels | +| `auto_adjust_dpi` | boolean | `true` | Automatically adjust DPI based on image size | +| `min_dpi` | integer | `72` | Minimum DPI threshold | +| `max_dpi` | integer | `600` | Maximum DPI threshold | + +### Chunking Configuration + +```toml +[chunking] +max_chars = 1000 +max_overlap = 200 + +[chunking.embedding] +batch_size = 32 +normalize = true +show_download_progress = true +cache_dir = "~/.cache/kreuzberg/embeddings" 
+ +[chunking.embedding.model] +type = "preset" +name = "balanced" +``` + +| Option | Type | Default | Description | +| ---------------------------------- | ------- | --------------------------------- | ---------------------------------------------------------- | +| `max_chars` | integer | `1000` | Maximum characters per chunk | +| `max_overlap` | integer | `200` | Overlap between consecutive chunks | +| `embedding.batch_size` | integer | `32` | Batch size for embedding generation | +| `embedding.normalize` | boolean | `true` | Normalize embeddings to unit length | +| `embedding.show_download_progress` | boolean | `true` | Show progress when downloading models | +| `embedding.cache_dir` | string | `"~/.cache/kreuzberg/embeddings"` | Directory for caching embeddings | +| `embedding.model.type` | string | `"preset"` | Model type: preset, fastembed, or custom | +| `embedding.model.name` | string | `"balanced"` | Preset model name (balanced, fast, accurate, multilingual) | +| `embedding.model.model` | string | | FastEmbed model identifier | +| `embedding.model.model_id` | string | | Custom HuggingFace model ID | +| `embedding.model.dimensions` | integer | | Embedding dimensions | + +### Keywords Configuration + +```toml +[keywords] +algorithm = "yake" +max_keywords = 10 +min_score = 0.0 +ngram_range = [1, 3] +language = "en" +``` + +| Option | Type | Default | Description | +| -------------- | ------- | -------- | ------------------------------------------- | +| `algorithm` | string | `"yake"` | Keyword extraction algorithm (yake or rake) | +| `max_keywords` | integer | `10` | Maximum keywords to extract | +| `min_score` | float | `0.0` | Minimum relevance score for keywords | +| `ngram_range` | array | `[1, 3]` | N-gram size range [min, max] | +| `language` | string | `"en"` | Language code for keyword extraction | + +### Token Reduction + +```toml +[token_reduction] +mode = "off" +preserve_important_words = true +``` + +| Option | Type | Default | Description | +| 
-------------------------- | ------- | ------- | ----------------------------------------- | +| `mode` | string | `"off"` | Mode: off, aggressive, moderate, minimal | +| `preserve_important_words` | boolean | `true` | Preserve important words during reduction | + +### Language Detection + +```toml +[language_detection] +enabled = true +min_confidence = 0.8 +detect_multiple = false +``` + +| Option | Type | Default | Description | +| ----------------- | ------- | ------- | ------------------------------------------ | +| `enabled` | boolean | `true` | Enable automatic language detection | +| `min_confidence` | float | `0.8` | Minimum confidence threshold for detection | +| `detect_multiple` | boolean | `false` | Detect multiple languages in document | + +### Post-Processor + +```toml +[postprocessor] +enabled = true +``` + +| Option | Type | Default | Description | +| --------- | ------- | ------- | ------------------------------------------- | +| `enabled` | boolean | `true` | Enable post-processing of extracted content | + +## FileExtractionConfig (Per-File Overrides) + +Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0. + +**Overridable fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`. + +**Batch-level only (not overridable):** `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`. + +**Merge semantics:** For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file. 
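
The overlay rule described above can be sketched in TypeScript, with `undefined` (or a `null` config) playing the role of `None`. The `BatchConfig` and `FileOverrides` types and the `mergeFileConfig` helper are illustrative names covering only a few fields; they are not part of the Kreuzberg API. The sketch assumes whole-field replacement rather than a deep merge.

```typescript
// Illustrative types only: a small subset of fields, not the real Kreuzberg config types.
interface BatchConfig {
  forceOcr: boolean;
  outputFormat: "plain" | "markdown";
  chunking?: { maxChars: number; maxOverlap: number };
  useCache: boolean; // batch-level only, so it is absent from FileOverrides
}

// Every overridable field is optional; undefined stands in for `None`.
type FileOverrides = Partial<Omit<BatchConfig, "useCache">>;

// Overlay one file's overrides on the batch config (hypothetical helper).
function mergeFileConfig(batch: BatchConfig, file: FileOverrides | null): BatchConfig {
  const merged: BatchConfig = { ...batch };
  if (file === null) return merged;
  if (file.forceOcr !== undefined) merged.forceOcr = file.forceOcr;
  if (file.outputFormat !== undefined) merged.outputFormat = file.outputFormat;
  if (file.chunking !== undefined) merged.chunking = file.chunking;
  return merged;
}

const batch: BatchConfig = { forceOcr: false, outputFormat: "plain", useCache: true };
const perFile: (FileOverrides | null)[] = [null, { forceOcr: true }];

// One effective config per file: the first keeps batch defaults, the second forces OCR.
const effective = perFile.map((f) => mergeFileConfig(batch, f));
```

Batch-level fields such as `useCache` are copied through unchanged because the override type cannot express them.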
+ +```toml +# FileExtractionConfig cannot be specified in config files; +# it is a programmatic API for per-file overrides at runtime. +``` + +## Naming Conventions + +Kreuzberg uses consistent naming conventions across different contexts: + +| Context | Convention | Example | +| -------------------- | ---------- | --------------------------------------------- | +| Python | snake_case | `max_chars`, `pdf_options`, `use_cache` | +| Node.js / TypeScript | camelCase | `maxChars`, `pdfOptions`, `useCache` | +| Rust | snake_case | `max_chars`, `pdf_options`, `use_cache` | +| TOML / YAML / JSON | snake_case | `max_chars`, `pdf_options`, `use_cache` | +| CLI flags | kebab-case | `--max-chars`, `--pdf-options`, `--use-cache` | + +When switching between languages, apply the appropriate conversion: + +- Python → Node.js: `snake_case` to `camelCase` +- CLI → Python: `kebab-case` to `snake_case` +- TOML → Python: No conversion needed (both use `snake_case`) + +## Environment Variables + +The following environment variables can override configuration: + +| Variable | Purpose | Example | +| ---------------- | ----------------------------------- | ----------- | +| `KREUZBERG_HOST` | Server bind address (serve command) | `127.0.0.1` | +| `KREUZBERG_PORT` | Server port (serve command) | `8080` | + +## Configuration Merging + +Configuration sources are merged in priority order (highest to lowest): + +1. **CLI flags** (highest priority) +2. **Inline JSON configuration** (programmatic) +3. **Configuration file** (lowest priority) + +Sources higher in this list override those below them. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.
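
The precedence rule can be sketched as a fold over the sources, applied from lowest to highest priority. `mergeSources` and the flat `Settings` shape are hypothetical and for illustration only. The sketch assumes a shallow, key-by-key merge with keys already normalized to `snake_case`, and says nothing about how Kreuzberg itself treats nested tables.

```typescript
// Hypothetical flat settings shape; real configuration is nested.
type Settings = Record<string, string | number | boolean>;

// Sources are passed highest priority first, matching the numbered list above.
function mergeSources(...highestFirst: Settings[]): Settings {
  // Apply the lowest-priority source first so higher-priority values overwrite it.
  return [...highestFirst]
    .reverse()
    .reduce<Settings>((acc, src) => ({ ...acc, ...src }), {});
}

const cliFlags: Settings = { max_chars: 2000 };
const inlineJson: Settings = { max_overlap: 100 };
const configFile: Settings = { max_chars: 1000, max_overlap: 200, use_cache: true };

// max_chars comes from the CLI, max_overlap from inline JSON, use_cache from the file.
const resolved = mergeSources(cliFlags, inlineJson, configFile);
console.log(resolved);
```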
+ +## Example Configurations + +### Minimal Configuration + +```toml +use_cache = true +enable_quality_processing = true + +[ocr] +backend = "tesseract" +language = "eng" +``` + +### High-Quality PDF Extraction + +```toml +use_cache = true +enable_quality_processing = true +force_ocr = false + +[ocr] +backend = "tesseract" +language = "eng" + +[ocr.tesseract_config] +psm = 3 +oem = 3 +enable_table_detection = true +table_min_confidence = 0.7 + +[pdf_options] +extract_images = true +extract_metadata = true + +[pdf_options.hierarchy] +enabled = true +k_clusters = 6 + +[images] +extract_images = true +target_dpi = 300 +``` + +### Semantic Search Configuration + +```toml +[chunking] +max_chars = 800 +max_overlap = 150 + +[chunking.embedding] +batch_size = 32 +normalize = true +cache_dir = "~/.cache/kreuzberg/embeddings" + +[chunking.embedding.model] +type = "preset" +name = "accurate" + +[keywords] +algorithm = "yake" +max_keywords = 15 +``` + +## Field Name Reference + +Critical field names to use in configuration files: + +- `max_chars` (NOT `max_characters`) +- `max_overlap` (NOT `overlap`) +- `table_min_confidence` +- `table_column_threshold` +- `table_row_threshold_ratio` +- `ocr_coverage_threshold` +- `k_clusters` +- `include_bbox` +- `enable_table_detection` +- `auto_rotate` +- `auto_adjust_dpi` +- `show_download_progress` +- `min_confidence` +- `detect_multiple` + +Always verify field names against the source configuration file when adding new options. diff --git a/plugins/kreuzberg/skills/kreuzberg/references/nodejs-api.md b/plugins/kreuzberg/skills/kreuzberg/references/nodejs-api.md new file mode 100644 index 0000000..586456c --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/nodejs-api.md @@ -0,0 +1,1380 @@ +# Node.js/TypeScript API Reference + +## Overview + +**Package**: `@kreuzberg/node` — A high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction. 
+ +Supports both **ESM** (`import`) and **CommonJS** (`require`): + +```typescript +// ESM +import { extractFile, batchExtractFiles } from "@kreuzberg/node"; + +// CommonJS +const { extractFile, batchExtractFiles } = require("@kreuzberg/node"); +``` + +**Current Version**: 4.2.14 + +--- + +## Core Extraction Functions + +All extraction functions return an `ExtractionResult` (wrapped in a `Promise` for async variants, or an array for batch variants) containing the extracted content, metadata, tables, and optional chunks/images. + +### Single File Extraction + +#### `extractFile(filePath, mimeType?, config?): Promise<ExtractionResult>` + +Extract content from a single file asynchronously. + +```typescript +import { extractFile } from "@kreuzberg/node"; + +// Auto-detect MIME type from file extension +const result = await extractFile("document.pdf"); +console.log(result.content); + +// Explicit MIME type +const result2 = await extractFile("document.pdf", "application/pdf"); + +// With configuration +const result3 = await extractFile("document.pdf", null, { + chunking: { + maxChars: 1000, + maxOverlap: 200, + }, +}); +``` + +**Parameters**: + +- `filePath: string` — Path to the file to extract +- `mimeType?: string | null` — Optional MIME type hint (auto-detected if null) +- `config?: ExtractionConfig` — Optional extraction configuration + +**Returns**: `Promise<ExtractionResult>` + +**Throws**: `ParsingError`, `OcrError`, `ValidationError`, `KreuzbergError` + +#### `extractFileSync(filePath, mimeType?, config?): ExtractionResult` + +Extract content from a single file synchronously. + +```typescript +import { extractFileSync } from "@kreuzberg/node"; + +const result = extractFileSync("document.pdf"); +console.log(result.content); +``` + +**Parameters**: Same as `extractFile()` + +**Returns**: `ExtractionResult` + +--- + +### Raw Bytes Extraction + +#### `extractBytes(data, mimeType, config?): Promise<ExtractionResult>` + +Extract content from raw bytes (Buffer or Uint8Array) asynchronously.
+ +```typescript +import { extractBytes } from "@kreuzberg/node"; +import { readFile } from "fs/promises"; + +const data = await readFile("document.pdf"); +const result = await extractBytes(data, "application/pdf"); +console.log(result.content); +``` + +**Parameters**: + +- `data: Buffer | Uint8Array` — Raw file content +- `mimeType: string` — MIME type (required) +- `config?: ExtractionConfig` — Optional configuration + +**Returns**: `Promise<ExtractionResult>` + +#### `extractBytesSync(data, mimeType, config?): ExtractionResult` + +Extract content from raw bytes synchronously. + +```typescript +import { extractBytesSync } from "@kreuzberg/node"; +import { readFileSync } from "fs"; + +const data = readFileSync("document.pdf"); +const result = extractBytesSync(data, "application/pdf"); +``` + +**Parameters**: Same as `extractBytes()` + +**Returns**: `ExtractionResult` + +--- + +### Batch Extraction (Recommended) + +For processing multiple documents, prefer the batch APIs, which extract files in parallel and generally offer better throughput and memory management than repeated single-file calls. + +#### `batchExtractFiles(paths, config?): Promise<ExtractionResult[]>` + +Extract content from multiple files in parallel (asynchronous). + +```typescript +import { batchExtractFiles } from "@kreuzberg/node"; + +const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; +const results = await batchExtractFiles(files); + +results.forEach((result, i) => { + console.log(`${files[i]}: ${result.content.substring(0, 100)}...`); +}); +``` + +**Parameters**: + +- `paths: string[]` — Array of file paths +- `config?: ExtractionConfig` — Configuration (applied to all files) + +**Returns**: `Promise<ExtractionResult[]>` — Results in same order as input + +#### `batchExtractFilesSync(paths, config?): ExtractionResult[]` + +Extract content from multiple files synchronously.
+ +```typescript +import { batchExtractFilesSync } from "@kreuzberg/node"; + +const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; +const results = batchExtractFilesSync(files); +``` + +**Parameters**: Same as `batchExtractFiles()` + +**Returns**: `ExtractionResult[]` + +#### `batchExtractBytes(dataList, mimeTypes, config?): Promise<ExtractionResult[]>` + +Extract content from multiple byte arrays in parallel (asynchronous). + +```typescript +import { batchExtractBytes } from "@kreuzberg/node"; +import { readFile } from "fs/promises"; + +const files = ["doc1.pdf", "doc2.docx"]; +const dataList = await Promise.all(files.map((f) => readFile(f))); +const mimeTypes = [ + "application/pdf", + "application/vnd.openxmlformats-officedocument.wordprocessingml.document", +]; + +const results = await batchExtractBytes(dataList, mimeTypes); +``` + +**Parameters**: + +- `dataList: Uint8Array[]` — Array of file contents +- `mimeTypes: string[]` — MIME types (one per item, must match length) +- `config?: ExtractionConfig` — Configuration (applied to all items) + +**Returns**: `Promise<ExtractionResult[]>` + +#### `batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]` + +Extract content from multiple byte arrays synchronously. + +```typescript +import { batchExtractBytesSync } from "@kreuzberg/node"; +import { readFileSync } from "fs"; + +const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f)); +const mimeTypes = [ + "application/pdf", + "application/vnd.openxmlformats-officedocument.wordprocessingml.document", +]; + +const results = batchExtractBytesSync(dataList, mimeTypes); +``` + +**Parameters**: Same as `batchExtractBytes()` + +**Returns**: `ExtractionResult[]` + +#### `batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise<ExtractionResult[]>` + +Extract multiple files with per-file configuration overrides (asynchronous).
+ +```typescript +import { batchExtractFilesWithConfigs } from "@kreuzberg/node"; + +// The first file uses the batch defaults; the second forces German OCR. +const results = await batchExtractFilesWithConfigs( + ["report.pdf", "scanned.pdf"], + [null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }], +); +``` + +**Parameters**: + +- `paths: string[]` — File paths +- `fileConfigs: (FileExtractionConfig | null)[]` — Per-file configs (null = use batch defaults) +- `config?: ExtractionConfig` — Batch-level configuration + +#### `batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]` + +Synchronous variant of `batchExtractFilesWithConfigs`. + +#### `batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise<ExtractionResult[]>` + +Extract multiple byte arrays with per-file overrides (asynchronous). + +#### `batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]` + +Synchronous variant of `batchExtractBytesWithConfigs`. + +--- + +## Worker Pool APIs + +Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing. + +### `createWorkerPool(size?): WorkerPool` + +Create a worker pool for concurrent extraction. + +```typescript +import { createWorkerPool } from "@kreuzberg/node"; + +// Create pool with default size (number of CPU cores) +const pool = createWorkerPool(); + +// Create pool with specific size +const pool4 = createWorkerPool(4); +``` + +**Parameters**: + +- `size?: number` — Number of workers (defaults to CPU core count) + +**Returns**: `WorkerPool` — Opaque handle for use with worker extraction functions + +### `extractFileInWorker(pool, filePath, mimeType?, config?): Promise<ExtractionResult>` + +Extract a single file using a worker from the pool.
+ +```typescript +import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node"; + +const pool = createWorkerPool(4); + +try { + const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; + const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f))); + + results.forEach((r, i) => { + console.log(`${files[i]}: ${r.content.substring(0, 100)}...`); + }); +} finally { + await closeWorkerPool(pool); +} +``` + +**Parameters**: + +- `pool: WorkerPool` — Worker pool instance +- `filePath: string` — File path +- `mimeType?: string | null` — Optional MIME type +- `config?: ExtractionConfig` — Optional configuration + +**Returns**: `Promise<ExtractionResult>` + +### `batchExtractFilesInWorker(pool, paths, config?): Promise<ExtractionResult[]>` + +Extract multiple files using the worker pool for concurrent processing. + +```typescript +import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node"; + +const pool = createWorkerPool(4); + +try { + const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]; + const results = await batchExtractFilesInWorker(pool, files, { + ocr: { backend: "tesseract", language: "eng" }, + }); + + // extractAmount is a user-supplied helper (not part of @kreuzberg/node) that parses a number from text + const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0); + console.log(`Total: $${total}`); +} finally { + await closeWorkerPool(pool); +} +``` + +**Parameters**: + +- `pool: WorkerPool` — Worker pool instance +- `paths: string[]` — File paths +- `config?: ExtractionConfig` — Configuration (applied to all files) + +**Returns**: `Promise<ExtractionResult[]>` + +### `getWorkerPoolStats(pool): WorkerPoolStats` + +Get statistics about a worker pool.
+ +```typescript +import { createWorkerPool, getWorkerPoolStats } from "@kreuzberg/node"; + +const pool = createWorkerPool(4); +const stats = getWorkerPoolStats(pool); + +console.log(`Pool size: ${stats.size}`); +console.log(`Active workers: ${stats.activeWorkers}`); +console.log(`Queued tasks: ${stats.queuedTasks}`); +``` + +**Parameters**: + +- `pool: WorkerPool` — Worker pool instance + +**Returns**: `WorkerPoolStats` + +### `closeWorkerPool(pool): Promise` + +Close a worker pool and shut down all worker threads. + +```typescript +import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node"; + +const pool = createWorkerPool(4); + +try { + // Use pool +} finally { + await closeWorkerPool(pool); +} +``` + +**Parameters**: + +- `pool: WorkerPool` — Worker pool instance to close + +**Returns**: `Promise` + +--- + +## Configuration Interface + +### `ExtractionConfig` + +Main configuration object controlling extraction behavior. + +```typescript +interface ExtractionConfig { + // Caching and processing + useCache?: boolean; // Default: true + enableQualityProcessing?: boolean; // Default: false + + // OCR configuration + ocr?: OcrConfig; // OCR settings + forceOcr?: boolean; // Default: false + + // Document processing + chunking?: ChunkingConfig; // Break into chunks + images?: ImageExtractionConfig; // Image extraction + pdfOptions?: PdfConfig; // PDF-specific options + tokenReduction?: TokenReductionConfig; // Token optimization + languageDetection?: LanguageDetectionConfig; // Language detection + postprocessor?: PostProcessorConfig; // Post-processing + htmlOptions?: HtmlConversionOptions; // HTML conversion + keywords?: KeywordConfig; // Keyword extraction + pages?: PageExtractionConfig; // Page extraction + + // Output control + maxConcurrentExtractions?: number; // Default: 4 + outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain' + resultFormat?: "unified" | "element_based"; // Default: 'unified' +} +``` + +### 
`FileExtractionConfig` + +Per-file overrides for batch operations. All fields optional (omitted = use batch default). + +```typescript +interface FileExtractionConfig { + enableQualityProcessing?: boolean; + ocr?: OcrConfig; + forceOcr?: boolean; + chunking?: ChunkingConfig; + images?: ImageExtractionConfig; + pdfOptions?: PdfConfig; + tokenReduction?: TokenReductionConfig; + languageDetection?: LanguageDetectionConfig; + pages?: PageExtractionConfig; + keywords?: KeywordConfig; + postprocessor?: PostProcessorConfig; + outputFormat?: "plain" | "markdown" | "djot" | "html"; + resultFormat?: "unified" | "element_based"; + includeDocumentStructure?: boolean; +} +``` + +Excluded (batch-level only): `maxConcurrentExtractions`, `useCache`, `securityLimits`. + +### `ChunkingConfig` + +Configuration for breaking documents into chunks (useful for RAG and vector databases). + +```typescript +interface ChunkingConfig { + maxChars?: number; // Max characters per chunk (default: 4096) + maxOverlap?: number; // Overlap between chunks (default: 512) + chunkSize?: number; // Alternative unit (mutually exclusive with maxChars) + chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap) + preset?: string; // Named preset ('default', 'aggressive', 'minimal') + embedding?: Record; // Embedding config + enabled?: boolean; // Enable chunking (default: true when config provided) +} +``` + +**Key Point**: Use `maxChars` and `maxOverlap`, NOT `maxCharacters` or `overlap`. + +### `OcrConfig` + +Configuration for optical character recognition. 
+ +```typescript +interface OcrConfig { + backend: string; // OCR backend name (e.g., 'tesseract') + language?: string; // Language code (e.g., 'eng', 'deu') + tesseractConfig?: TesseractConfig; +} + +interface TesseractConfig { + psm?: number; // Page Segmentation Mode (0-13) + enableTableDetection?: boolean; + tesseditCharWhitelist?: string; // Character whitelist +} +``` + +### `ImageExtractionConfig` + +Configuration for extracting and optimizing images. + +```typescript +interface ImageExtractionConfig { + extractImages?: boolean; // Default: true + targetDpi?: number; // Target DPI (default: 150) + maxImageDimension?: number; // Max width/height in pixels (default: 2000) + autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true) + minDpi?: number; // Minimum DPI (default: 72) + maxDpi?: number; // Maximum DPI (default: 300) +} +``` + +### `PdfConfig` + +PDF-specific extraction options. + +```typescript +interface PdfConfig { + extractImages?: boolean; // Default: true + passwords?: string[]; // Passwords for encrypted PDFs + extractMetadata?: boolean; // Default: true + hierarchy?: HierarchyConfig; // Hierarchy extraction +} +``` + +### `LanguageDetectionConfig` + +Configuration for automatic language detection. + +```typescript +interface LanguageDetectionConfig { + enabled?: boolean; // Default: true + minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5) + detectMultiple?: boolean; // Detect multiple languages (default: false) +} +``` + +### `TokenReductionConfig` + +Configuration for optimizing token usage. + +```typescript +interface TokenReductionConfig { + mode?: string; // 'aggressive' or 'conservative' (default: 'conservative') + preserveImportantWords?: boolean; // Default: true +} +``` + +### `KeywordConfig` + +Configuration for keyword extraction. 
+ +```typescript +interface KeywordConfig { + algorithm?: "yake" | "rake"; // Default: 'yake' + maxKeywords?: number; // Maximum keywords (default: 10) + minScore?: number; // Minimum relevance score (default: 0.1) + ngramRange?: [number, number]; // N-gram range (default: [1, 3]) + language?: string; // Language code (default: 'en') + yakeParams?: YakeParams; + rakeParams?: RakeParams; +} +``` + +### `PageExtractionConfig` + +Configuration for page-level content tracking. + +```typescript +interface PageExtractionConfig { + extractPages?: boolean; // Extract as separate pages array + insertPageMarkers?: boolean; // Insert page markers in content + markerFormat?: string; // Marker format with {page_num} placeholder +} +``` + +### `HtmlConversionOptions` + +Configuration for HTML to Markdown conversion. + +```typescript +interface HtmlConversionOptions { + headingStyle?: "atx" | "underlined" | "atx_closed"; + listIndentType?: "spaces" | "tabs"; + listIndentWidth?: number; + bullets?: string; + strongEmSymbol?: string; + escapeAsterisks?: boolean; + escapeUnderscores?: boolean; + escapeMisc?: boolean; + escapeAscii?: boolean; + codeLanguage?: string; + autolinks?: boolean; + defaultTitle?: boolean; + brInTables?: boolean; + hocrSpatialTables?: boolean; + highlightStyle?: "double_equal" | "html" | "bold" | "none"; + extractMetadata?: boolean; + whitespaceMode?: "normalized" | "strict"; + stripNewlines?: boolean; + wrap?: boolean; + wrapWidth?: number; + convertAsInline?: boolean; + subSymbol?: string; + supSymbol?: string; + newlineStyle?: "spaces" | "backslash"; + codeBlockStyle?: "indented" | "backticks" | "tildes"; + keepInlineImagesIn?: string[]; + encoding?: string; + debug?: boolean; + stripTags?: string[]; + preserveTags?: string[]; + preprocessing?: HtmlPreprocessingOptions; +} +``` + +--- + +## Result Types + +### `ExtractionResult` + +Complete extraction result from document processing. 
+ +```typescript +interface ExtractionResult { + // Main content + content: string; + + // Document type + mimeType: string; + + // Metadata (format-specific) + metadata: Metadata; + + // Extracted structures + tables: Table[]; + + // Optional processed data + detectedLanguages: string[] | null; + chunks: Chunk[] | null; // From chunking config + images: ExtractedImage[] | null; // From image extraction + elements?: Element[] | null; // From element_based result format + pages?: PageContent[] | null; // From page extraction + extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores + qualityScore?: number | null; // Overall extraction quality (0.0-1.0) + processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline +} +``` + +### `Table` + +Extracted table data with cell structure. + +```typescript +interface Table { + cells: string[][]; // 2D array of cell contents (rows × columns) + markdown: string; // Markdown representation + pageNumber: number; // 1-indexed page number +} +``` + +### `Chunk` + +Text chunk for RAG or vector database indexing. + +```typescript +interface Chunk { + content: string; + embedding?: number[] | null; // Vector embedding if computed + metadata: ChunkMetadata; +} + +interface ChunkMetadata { + byteStart: number; // UTF-8 byte offset in original text + byteEnd: number; // UTF-8 byte offset + tokenCount?: number | null; + chunkIndex: number; // Zero-based index + totalChunks: number; // Total number of chunks + firstPage?: number | null; // 1-indexed, if page tracking enabled + lastPage?: number | null; +} +``` + +### `ExtractedImage` + +Image extracted from document. 
+ +```typescript +interface ExtractedImage { + data: Uint8Array; // Raw image bytes + format: string; // Format (e.g., 'png', 'jpeg', 'tiff') + imageIndex: number; // Sequential index (0-indexed) + pageNumber?: number | null; + width?: number | null; + height?: number | null; + colorspace?: string | null; + bitsPerComponent?: number | null; + isMask: boolean; + description?: string | null; + ocrResult?: ExtractionResult | null; // OCR result if processed +} +``` + +### `PageContent` + +Per-page content when page extraction is enabled. + +```typescript +interface PageContent { + pageNumber: number; // 1-indexed + content: string; // Page text content + tables: Table[]; // Tables on this page + images: ExtractedImage[]; // Images on this page +} +``` + +### `ExtractedKeyword` + +Extracted keyword with relevance score and position information. + +```typescript +interface ExtractedKeyword { + text: string; // Keyword text + score: number; // Relevance score (0.0-1.0) + algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.) + positions?: number[] | null; // Character positions in content (if available) +} +``` + +### `ProcessingWarning` + +Non-fatal warning encountered during document processing. + +```typescript +interface ProcessingWarning { + source: string; // Component that generated the warning + message: string; // Warning message describing the issue +} +``` + +### `Metadata` + +Extraction result metadata (format-specific). 
+ +```typescript +interface Metadata { + // Common fields + language?: string | null; + date?: string | null; + subject?: string | null; + format_type?: + | "pdf" + | "excel" + | "email" + | "pptx" + | "archive" + | "image" + | "xml" + | "text" + | "html" + | "ocr"; + + // PDF metadata + title?: string | null; + author?: string | null; + creator?: string | null; + producer?: string | null; + creation_date?: string | null; + modification_date?: string | null; + page_count?: number; + + // Excel metadata + sheet_count?: number; + sheet_names?: string[]; + + // Email metadata + from_email?: string | null; + from_name?: string | null; + to_emails?: string[]; + cc_emails?: string[]; + bcc_emails?: string[]; + message_id?: string | null; + attachments?: string[]; + + // Image metadata + width?: number; + height?: number; + exif?: Record; + + // OCR metadata + psm?: number; + output_format?: string; + table_count?: number; + + // HTML metadata + canonical_url?: string | null; + html_language?: string | null; + text_direction?: "ltr" | "rtl" | "auto" | null; + open_graph?: Record; + twitter_card?: Record; + meta_tags?: Record; + html_headers?: HeaderMetadata[]; + html_links?: LinkMetadata[]; + html_images?: HtmlImageMetadata[]; + structured_data?: StructuredData[]; + + // Text metadata + line_count?: number; + word_count?: number; + character_count?: number; + headers?: string[] | null; + links?: [string, string][] | null; + code_blocks?: [string, string][] | null; + + // Page structure + page_structure?: PageStructure | null; + + // Additional typed fields + category?: string | null; + tags?: string[]; + document_version?: string | null; + abstract_text?: string | null; + + // Custom fields from postprocessors + [key: string]: unknown; +} +``` + +--- + +## Error Handling + +### Error Classes + +```typescript +import { + KreuzbergError, + ParsingError, + OcrError, // Note: camelCase, not "OCRError" + ValidationError, + MissingDependencyError, + CacheError, + 
ImageProcessingError, + PluginError, + ErrorCode, +} from "@kreuzberg/node"; +``` + +**Error Hierarchy**: + +- `KreuzbergError` — Base class for all Kreuzberg errors + - `ParsingError` — Document format invalid or corrupted + - `OcrError` — OCR processing failed + - `ValidationError` — Extraction validation failed + - `MissingDependencyError` — Required dependency unavailable + - `CacheError` — Cache operation failed + - `ImageProcessingError` — Image extraction or processing failed + - `PluginError` — Plugin registration or execution failed + +### Error Diagnostics + +```typescript +import { + classifyError, + getErrorCodeDescription, + getErrorCodeName, + getLastErrorCode, + getLastPanicContext, +} from "@kreuzberg/node"; + +try { + const result = await extractFile("document.pdf"); +} catch (error) { + const classification = classifyError(error.message); + console.log(`Error code: ${getErrorCodeName(classification.code)}`); + console.log(`Description: ${getErrorCodeDescription(classification.code)}`); + console.log(`Confidence: ${classification.confidence}`); +} +``` + +### `ErrorCode` Enum + +```typescript +enum ErrorCode { + Success = 0, + GenericError = 1, + Panic = 2, + InvalidArgument = 3, + IoError = 4, + ParsingError = 5, + OcrError = 6, + MissingDependency = 7, +} +``` + +--- + +## Plugin System + +### Post-Processors + +Custom post-processors can enrich extraction results without failing the extraction if they encounter errors. + +#### `registerPostProcessor(processor): void` + +Register a custom post-processor. 
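
The error isolation described above, together with the `early` / `middle` / `late` stages used in the registration example that follows, can be sketched without the library. Everything here is an assumption for illustration: `SketchProcessor`, `SketchResult`, `runPostProcessors`, and the stage order are hypothetical stand-ins, and the real pipeline runs inside the Rust core.

```typescript
// Hypothetical shapes for illustration; not the @kreuzberg/node types.
type Stage = "early" | "middle" | "late";

interface SketchResult {
  content: string;
  metadata: Record<string, unknown>;
}

interface SketchProcessor {
  name(): string;
  processingStage(): Stage;
  process(result: SketchResult): SketchResult | Promise<SketchResult>;
}

const STAGE_ORDER: Stage[] = ["early", "middle", "late"];

// Run processors stage by stage; a failing processor is skipped, not fatal.
async function runPostProcessors(
  result: SketchResult,
  processors: SketchProcessor[],
): Promise<SketchResult> {
  const ordered = [...processors].sort(
    (a, b) => STAGE_ORDER.indexOf(a.processingStage()) - STAGE_ORDER.indexOf(b.processingStage()),
  );
  let current = result;
  for (const p of ordered) {
    try {
      current = await p.process(current);
    } catch (err) {
      // Keep the last good result so extraction as a whole still succeeds.
      console.warn(`post-processor ${p.name()} failed; keeping previous result`, err);
    }
  }
  return current;
}

const tagger: SketchProcessor = {
  name: () => "tagger",
  processingStage: () => "late",
  process: (r) => ({ ...r, metadata: { ...r.metadata, tagged: true } }),
};

const broken: SketchProcessor = {
  name: () => "broken",
  processingStage: () => "early",
  process: () => {
    throw new Error("boom");
  },
};

// `broken` runs first and fails; `tagger` still runs and enriches the metadata.
const processed = runPostProcessors({ content: "hello", metadata: {} }, [tagger, broken]);
```

The contrast with validators is the `catch` branch: a post-processor failure is logged and skipped, whereas a validator failure is meant to abort the extraction.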
+ +```typescript +import { registerPostProcessor, extractFile } from "@kreuzberg/node"; + +const processor = { + name() { + return "my_processor"; + }, + + async process(result) { + // Enrich result with custom metadata + result.metadata["custom_field"] = "value"; + return result; + }, + + processingStage() { + return "late"; // 'early', 'middle', or 'late' + }, + + async initialize() { + // Called once when registered + }, + + async shutdown() { + // Called when unregistered + }, +}; + +registerPostProcessor(processor); +const result = await extractFile("document.pdf"); +``` + +#### `unregisterPostProcessor(name): void` + +Remove a registered post-processor. + +```typescript +import { unregisterPostProcessor } from "@kreuzberg/node"; + +unregisterPostProcessor("my_processor"); +``` + +#### `listPostProcessors(): string[]` + +List all registered post-processor names. + +```typescript +import { listPostProcessors } from "@kreuzberg/node"; + +const processors = listPostProcessors(); +console.log("Registered processors:", processors); +``` + +#### `clearPostProcessors(): void` + +Unregister all post-processors. + +```typescript +import { clearPostProcessors } from "@kreuzberg/node"; + +clearPostProcessors(); +``` + +### Validators + +Custom validators check extraction results and fail the extraction if validation fails (unlike post-processors). + +#### `registerValidator(validator): void` + +Register a custom validator. 
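
How a validator's `priority()` and `shouldValidate()` interact can be sketched with a small stand-alone runner. `SketchValidator`, `SketchResult`, `runValidators`, and `passes` are hypothetical names, and the ordering shown (higher priority first, first throw aborts) is an assumption drawn from the comments in the registration example below rather than the library's actual implementation.

```typescript
// Hypothetical shapes for illustration; not the @kreuzberg/node types.
interface SketchResult {
  content: string;
  mimeType: string;
}

interface SketchValidator {
  name(): string;
  priority(): number;
  shouldValidate(result: SketchResult): boolean;
  validate(result: SketchResult): void; // throws to reject the result
}

// Apply validators from highest to lowest priority; the first throw aborts.
function runValidators(result: SketchResult, validators: SketchValidator[]): void {
  const ordered = [...validators].sort((a, b) => b.priority() - a.priority());
  for (const v of ordered) {
    if (v.shouldValidate(result)) v.validate(result);
  }
}

const minLength: SketchValidator = {
  name: () => "content_length_validator",
  priority: () => 100,
  shouldValidate: (r) => r.mimeType === "application/pdf",
  validate: (r) => {
    if (r.content.length < 10) throw new Error("Content too short");
  },
};

// Demo helper: report whether a result passes all validators.
function passes(result: SketchResult): boolean {
  try {
    runValidators(result, [minLength]);
    return true;
  } catch {
    return false;
  }
}

// A short PDF is rejected; a short text file is skipped by shouldValidate.
const shortPdf = passes({ content: "tiny", mimeType: "application/pdf" });
const shortText = passes({ content: "tiny", mimeType: "text/plain" });
```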
+ +```typescript +import { registerValidator, extractFile } from "@kreuzberg/node"; + +const validator = { + name() { + return "content_length_validator"; + }, + + validate(result) { + if (result.content.length < 10) { + throw new Error("Content too short"); + } + }, + + priority() { + return 100; // Higher = runs first + }, + + shouldValidate(result) { + return result.mimeType === "application/pdf"; // Conditional validation + }, + + async initialize() { + // Called once when registered + }, + + async shutdown() { + // Called when unregistered + }, +}; + +registerValidator(validator); +const result = await extractFile("document.pdf"); +``` + +#### `unregisterValidator(name): void` + +Remove a registered validator. + +```typescript +import { unregisterValidator } from "@kreuzberg/node"; + +unregisterValidator("content_length_validator"); +``` + +#### `listValidators(): string[]` + +List all registered validator names. + +```typescript +import { listValidators } from "@kreuzberg/node"; + +const validators = listValidators(); +``` + +#### `clearValidators(): void` + +Unregister all validators. + +```typescript +import { clearValidators } from "@kreuzberg/node"; + +clearValidators(); +``` + +### OCR Backends + +Custom OCR backends can be registered to handle image text extraction. + +#### `registerOcrBackend(backend): void` + +Register a custom OCR backend. + +```typescript +import { registerOcrBackend, extractFile } from "@kreuzberg/node"; + +const backend = { + name() { + return "my-ocr"; + }, + + supportedLanguages() { + return ["eng", "deu", "fra"]; + }, + + async processImage(imageBytes, language) { + // imageBytes: Uint8Array or Base64 string + const buffer = + typeof imageBytes === "string" ? 
Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes); + + // Process and extract text + return { + content: "extracted text", + mime_type: "text/plain", + metadata: { confidence: 0.95, language }, + tables: [], + }; + }, + + async initialize() { + // Load models, setup resources + }, + + async shutdown() { + // Cleanup resources + }, +}; + +registerOcrBackend(backend); +``` + +#### `GutenOcrBackend` + +Built-in OCR backend implementation using Guten-OCR. + +```typescript +import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node"; + +const backend = new GutenOcrBackend(); +await backend.initialize(); +registerOcrBackend(backend); + +const result = await extractFile("scanned.pdf", null, { + ocr: { backend: "guten-ocr", language: "eng" }, +}); +``` + +#### `unregisterOcrBackend(name): void` + +Remove a registered OCR backend. + +```typescript +import { unregisterOcrBackend } from "@kreuzberg/node"; + +unregisterOcrBackend("my-ocr"); +``` + +#### `listOcrBackends(): string[]` + +List all registered OCR backend names. + +```typescript +import { listOcrBackends } from "@kreuzberg/node"; + +const backends = listOcrBackends(); +``` + +#### `clearOcrBackends(): void` + +Unregister all OCR backends. + +```typescript +import { clearOcrBackends } from "@kreuzberg/node"; + +clearOcrBackends(); +``` + +--- + +## MIME Type Utilities + +### `detectMimeType(data): string | null` + +Detect MIME type from file content (magic bytes). + +```typescript +import { detectMimeType } from "@kreuzberg/node"; +import { readFileSync } from "fs"; + +const data = readFileSync("document"); +const mimeType = detectMimeType(data); +console.log(`Detected MIME type: ${mimeType}`); +``` + +### `detectMimeTypeFromPath(filePath): string | null` + +Detect MIME type from file extension. 
+ +```typescript +import { detectMimeTypeFromPath } from "@kreuzberg/node"; + +const mimeType = detectMimeTypeFromPath("document.pdf"); +console.log(`MIME type: ${mimeType}`); // 'application/pdf' +``` + +### `getExtensionsForMime(mimeType): string[]` + +Get file extensions for a MIME type. + +```typescript +import { getExtensionsForMime } from "@kreuzberg/node"; + +const extensions = getExtensionsForMime("application/pdf"); +console.log(`Extensions: ${extensions}`); // ['.pdf'] +``` + +### `validateMimeType(mimeType): boolean` + +Check if a MIME type is valid. + +```typescript +import { validateMimeType } from "@kreuzberg/node"; + +if (validateMimeType("application/pdf")) { + console.log("Valid MIME type"); +} +``` + +--- + +## Configuration Loading + +### `ExtractionConfig.fromFile(filePath): ExtractionConfig` + +Load extraction configuration from a file (JSON, YAML, or TOML). + +```typescript +import { ExtractionConfig, extractFile } from "@kreuzberg/node"; + +const config = ExtractionConfig.fromFile("./kreuzberg.toml"); +const result = await extractFile("document.pdf", null, config); +``` + +### `ExtractionConfig.discover(): ExtractionConfig | null` + +Auto-discover extraction configuration file in current and parent directories. + +```typescript +import { ExtractionConfig, extractFile } from "@kreuzberg/node"; + +// Searches for kreuzberg.{toml,yaml,json} in current directory and parents +const config = ExtractionConfig.discover(); +if (config) { + const result = await extractFile("document.pdf", null, config); +} +``` + +--- + +## Embeddings + +### `getEmbeddingPreset(name): EmbeddingPreset | null` + +Get a named embedding model preset. + +```typescript +import { getEmbeddingPreset } from "@kreuzberg/node"; + +const preset = getEmbeddingPreset("default"); +if (preset) { + console.log(`Model: ${preset.modelName}`); + console.log(`Dimensions: ${preset.dimensions}`); +} +``` + +### `listEmbeddingPresets(): string[]` + +List all available embedding presets. 
+ +```typescript +import { listEmbeddingPresets } from "@kreuzberg/node"; + +const presets = listEmbeddingPresets(); +console.log("Available presets:", presets); +``` + +### `EmbeddingPreset` + +Type definition for embedding model presets. + +```typescript +interface EmbeddingPreset { + name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual") + chunkSize: number; // Recommended chunk size in characters + overlap: number; // Recommended overlap in characters + modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15") + dimensions: number; // Embedding vector dimensions + description: string; // Human-readable description +} +``` + +--- + +## Plugin Protocols + +### `PostProcessorProtocol` + +Interface for custom post-processors. + +```typescript +interface PostProcessorProtocol { + name(): string; + + process(result: ExtractionResult): ExtractionResult | Promise<ExtractionResult>; + + processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late' + + initialize?(): void | Promise<void>; + + shutdown?(): void | Promise<void>; +} +``` + +### `ValidatorProtocol` + +Interface for custom validators. + +```typescript +interface ValidatorProtocol { + name(): string; + + validate(result: ExtractionResult): void | Promise<void>; + + priority?(): number; // Higher = runs first + + shouldValidate?(result: ExtractionResult): boolean; + + initialize?(): void | Promise<void>; + + shutdown?(): void | Promise<void>; +} +``` + +### `OcrBackendProtocol` + +Interface for custom OCR backends. 
+ +```typescript +interface OcrBackendProtocol { + name(): string; + + supportedLanguages(): string[]; + + processImage( + imageBytes: Uint8Array | string, + language: string, + ): Promise<{ + content: string; + mime_type: string; + metadata: Record<string, unknown>; + tables: unknown[]; + }>; + + initialize?(): void | Promise<void>; + + shutdown?(): void | Promise<void>; +} +``` + +--- + +## Supported Document Formats + +- **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT +- **Text**: Markdown, Plain Text, XML, JSON, YAML, TOML +- **Web**: HTML (converted to Markdown) +- **Email**: EML, MSG +- **Images**: PNG, JPEG, TIFF (with OCR support) +- **Archives**: ZIP, TAR, GZIP (file listing) + +--- + +## Registry Functions + +### Document Extractors + +```typescript +import { + listDocumentExtractors, + unregisterDocumentExtractor, + clearDocumentExtractors, +} from "@kreuzberg/node"; + +// List registered extractors +const extractors = listDocumentExtractors(); + +// Unregister a specific extractor +unregisterDocumentExtractor("pdf"); + +// Clear all extractors +clearDocumentExtractors(); +``` + +--- + +## Type Exports + +All types are exported from `@kreuzberg/node`: + +```typescript +export type { + Chunk, + ChunkingConfig, + ExtractionConfig, + ExtractionResult, + ExtractedImage, + KeywordConfig, + LanguageDetectionConfig, + OcrBackendProtocol, + OcrConfig, + PageContent, + PageExtractionConfig, + PdfConfig, + PostProcessorProtocol, + Table, + TokenReductionConfig, + ValidatorProtocol, + WorkerPool, + WorkerPoolStats, + EmbeddingPreset, + // ... and many more +}; +``` + +--- + +## Best Practices + +1. **Use batch APIs for multiple documents**: `batchExtractFiles()` provides superior performance vs. calling `extractFile()` in a loop. + +2. **Enable chunking for RAG/vector DB**: Set `chunking` config to automatically break documents into overlapping chunks. + +3. **Use worker pools for high-concurrency scenarios**: Distribute CPU-bound work across multiple threads for 4+ concurrent extractions. 
+ +4. **Configure language detection**: Enable automatic language detection for multilingual documents. + +5. **Register validators early**: Set up validators before calling extraction functions to catch quality issues immediately. + +6. **Use specific MIME types**: Provide explicit MIME types when available to avoid detection overhead. + +7. **Clean up resources**: Always call `closeWorkerPool()` when done to prevent resource leaks. + +8. **Handle extraction errors gracefully**: Catch specific error types (`ParsingError`, `OcrError`, etc.) for appropriate error handling. + +--- + +## Version + +**Package Version**: 4.2.14 diff --git a/plugins/kreuzberg/skills/kreuzberg/references/other-bindings.md b/plugins/kreuzberg/skills/kreuzberg/references/other-bindings.md new file mode 100644 index 0000000..515b8c8 --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/other-bindings.md @@ -0,0 +1,212 @@ +# Language Bindings Reference + +Kreuzberg provides native bindings for multiple programming languages, each with precompiled binaries for x86_64 and aarch64 on Linux and macOS. This reference covers installation and basic usage for each binding. + +## Go + +**Installation:** + +```bash +go get github.com/kreuzberg-dev/kreuzberg/packages/go/v5 +``` + +**Basic Extraction:** + +```go +package main + +import ( + "context" + "fmt" + "github.com/kreuzberg-dev/kreuzberg/packages/go/v5/kreuzberg" +) + +func main() { + ctx := context.Background() + result, err := kreuzberg.ExtractFile(ctx, "document.pdf", nil) + if err != nil { + panic(err) + } + fmt.Println(result.Content) +} +``` + +See the [Go binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go) for complete API reference. 
+ +## Ruby + +**Installation:** + +```bash +gem install kreuzberg +``` + +Or in your Gemfile: + +```ruby +gem 'kreuzberg' +``` + +**Basic Extraction:** + +```ruby +require 'kreuzberg' + +result = Kreuzberg.extract_file_sync('document.pdf') +puts result.content +``` + +See the [Ruby binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/ruby) for complete API reference. + +## Java + +**Installation:** +Add to your Maven `pom.xml`: + +```xml +<dependency> + <groupId>dev.kreuzberg</groupId> + <artifactId>kreuzberg</artifactId> + <version>4.2.x</version> +</dependency> +``` + +**Basic Extraction:** + +```java +import dev.kreuzberg.Kreuzberg; +import dev.kreuzberg.ExtractionResult; + +public class Example { + public static void main(String[] args) throws Exception { + ExtractionResult result = Kreuzberg.extractFile("document.pdf"); + System.out.println(result.getContent()); + } +} +``` + +See the [Java binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/java) for complete API reference. + +## C# + +**Installation:** + +```bash +dotnet add package Kreuzberg +``` + +**Basic Extraction:** + +```csharp +using Kreuzberg; + +var result = KreuzbergClient.ExtractFileSync("document.pdf"); +Console.WriteLine(result.Content); +``` + +See the [C# binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/csharp) for complete API reference. + +## PHP + +**Installation:** + +```bash +composer require kreuzberg/kreuzberg +``` + +**Basic Extraction:** + +```php +extractFile('document.pdf'); +echo $result->content; +``` + +See the [PHP binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/php) for complete API reference. 
+ +## Elixir + +**Installation:** +Add to your `mix.exs` dependencies: + +```elixir +def deps do + [ + kreuzberg: "~> 4.2" + ] +end +``` + +**Basic Extraction:** + +```elixir +{:ok, result} = Kreuzberg.extract_file("document.pdf") +IO.puts(result.content) +``` + +See the [Elixir binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/elixir) for complete API reference. + +## WebAssembly (WASM) + +**Installation:** + +```bash +npm install @kreuzberg/wasm +``` + +**Basic Extraction:** + +```typescript +import * as fs from "node:fs"; // Node.js only; other runtimes need their own way to load bytes +import { extractBytes } from "@kreuzberg/wasm"; + +const fileData = await fs.promises.readFile("document.pdf"); +const result = await extractBytes(fileData, "application/pdf"); +console.log(result.content); +``` + +Supports browsers, Deno, and Cloudflare Workers. + +See the [WASM binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/typescript) for complete API reference. + +## Docker + +**Installation:** +Pull the official image from GitHub Container Registry: + +```bash +docker pull ghcr.io/kreuzberg-dev/kreuzberg +``` + +**API Server Mode:** + +```bash +docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg serve --host 0.0.0.0 +``` + +**CLI Mode:** + +```bash +docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg extract /data/document.pdf +``` + +**MCP Server Mode:** + +```bash +docker run -i ghcr.io/kreuzberg-dev/kreuzberg mcp +``` + +Image sizes: + +- Core image: 1.0-1.3GB +- Full image: ~1.0-1.3GB + +See the [Docker guide](https://docs.kreuzberg.dev/guides/docker/) for deployment details. + +## Platform Support + +All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and macOS. Windows support varies by binding. Refer to the main [README](https://github.com/kreuzberg-dev/kreuzberg) for the platform compatibility matrix. 
diff --git a/plugins/kreuzberg/skills/kreuzberg/references/python-api.md b/plugins/kreuzberg/skills/kreuzberg/references/python-api.md new file mode 100644 index 0000000..bf6adaf --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/python-api.md @@ -0,0 +1,1440 @@ +# Kreuzberg Python API Reference + +Comprehensive documentation for the Kreuzberg Python API. All extraction logic and heavy lifting is implemented in high-performance Rust, with Python adding OCR backends (EasyOCR, PaddleOCR) and custom post-processor support. + +## Extraction Functions + +### Synchronous File Extraction + +```python +def extract_file_sync( + file_path: str | Path, + mime_type: str | None = None, + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> ExtractionResult +``` + +Extract content from a file (synchronous). + +**Parameters:** + +- `file_path` (str | Path): Path to the file +- `mime_type` (str | None): Optional MIME type hint (auto-detected if None) +- `config` (ExtractionConfig | None): Extraction configuration (uses defaults if None) +- `easyocr_kwargs` (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.) +- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options (lang, use_angle_cls, show_log, etc.) 
+ +**Returns:** ExtractionResult with content, metadata, and tables + +**Example:** + +```python +from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig + +# Basic usage +result = extract_file_sync("document.pdf") + +# With Tesseract configuration +config = ExtractionConfig( + ocr=OcrConfig( + backend="tesseract", + language="eng", + tesseract_config=TesseractConfig(psm=6, enable_table_detection=True), + ) +) +result = extract_file_sync("invoice.pdf", config=config) + +# With EasyOCR custom options +config = ExtractionConfig(ocr=OcrConfig(backend="easyocr", language="eng")) +result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True}) +``` + +### Asynchronous File Extraction + +```python +async def extract_file( + file_path: str | Path, + mime_type: str | None = None, + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> ExtractionResult +``` + +Extract content from a file (asynchronous). Same parameters and behavior as `extract_file_sync`. + +### Synchronous Bytes Extraction + +```python +def extract_bytes_sync( + data: bytes | bytearray, + mime_type: str, + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> ExtractionResult +``` + +Extract content from bytes (synchronous). 
+ +**Parameters:** + +- `data` (bytes | bytearray): File content as bytes or bytearray +- `mime_type` (str): MIME type of the data (required for format detection) +- `config` (ExtractionConfig | None): Extraction configuration +- `easyocr_kwargs` (dict | None): EasyOCR initialization options +- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options + +**Returns:** ExtractionResult with content, metadata, and tables + +### Asynchronous Bytes Extraction + +```python +async def extract_bytes( + data: bytes | bytearray, + mime_type: str, + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> ExtractionResult +``` + +Extract content from bytes (asynchronous). Same parameters and behavior as `extract_bytes_sync`. + +### Batch File Extraction + +```python +async def batch_extract_files( + paths: list[str | Path], + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> list[ExtractionResult] +``` + +Extract content from multiple files in parallel (asynchronous). + +**Parameters:** + +- `paths` (list[str | Path]): List of file paths +- `config` (ExtractionConfig | None): Extraction configuration +- `easyocr_kwargs` (dict | None): EasyOCR initialization options +- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options + +**Returns:** List of ExtractionResults (one per file) + +### Batch File Extraction (Synchronous) + +```python +def batch_extract_files_sync( + paths: list[str | Path], + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> list[ExtractionResult] +``` + +Extract content from multiple files in parallel (synchronous). 
+ +### Batch Bytes Extraction + +```python +async def batch_extract_bytes( + data_list: list[bytes | bytearray], + mime_types: list[str], + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> list[ExtractionResult] +``` + +Extract content from multiple byte arrays in parallel (asynchronous). + +**Parameters:** + +- `data_list` (list[bytes | bytearray]): List of file contents as bytes/bytearray +- `mime_types` (list[str]): List of MIME types (one per data item) +- `config` (ExtractionConfig | None): Extraction configuration +- `easyocr_kwargs` (dict | None): EasyOCR initialization options +- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options + +**Returns:** List of ExtractionResults (one per data item) + +### Batch Bytes Extraction (Synchronous) + +```python +def batch_extract_bytes_sync( + data_list: list[bytes | bytearray], + mime_types: list[str], + config: ExtractionConfig | None = None, + *, + easyocr_kwargs: dict[str, Any] | None = None, + paddleocr_kwargs: dict[str, Any] | None = None, +) -> list[ExtractionResult] +``` + +Extract content from multiple byte arrays in parallel (synchronous). + +### Per-File Config in Batch Functions + +As of v4.5.0, per-file configuration overrides are passed as an optional `file_configs` parameter on the unified batch functions: + +```python +def batch_extract_files_sync( + paths: list[str | Path], + config: ExtractionConfig | None = None, + *, + file_configs: list[FileExtractionConfig | None] | None = None, + easyocr_kwargs: dict[str, Any] | None = None, +) -> list[ExtractionResult] +``` + +The `file_configs` list must have the same length as `paths`. Each element is either a `FileExtractionConfig` override or `None` to use batch defaults. The same parameter is available on `batch_extract_files`, `batch_extract_bytes_sync`, and `batch_extract_bytes`. 
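The same-length requirement on `file_configs` is easiest to satisfy by deriving the list from `paths` rather than writing it by hand. The following is a minimal sketch, assuming overrides are chosen by file extension. `build_file_configs` is a hypothetical helper, not part of Kreuzberg, and the string values are stand-ins for what would be `FileExtractionConfig` instances in real use.

```python
from __future__ import annotations

from pathlib import Path
from typing import Mapping, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_file_configs(
    paths: Sequence[str | Path],
    overrides_by_suffix: Mapping[str, T],
) -> list[Optional[T]]:
    """Hypothetical helper: return one entry per path, in input order.

    Each entry is the override registered for the file's lowercase
    extension, or None so that the batch-level default applies.
    """
    return [overrides_by_suffix.get(Path(p).suffix.lower()) for p in paths]


paths = ["report.pdf", "scan.TIFF", "notes.md"]

# Stand-in value; in real use this would be a FileExtractionConfig instance.
overrides = {".tiff": "ocr-override"}

file_configs = build_file_configs(paths, overrides)

# The list is aligned with `paths`, so the length requirement holds by construction.
assert len(file_configs) == len(paths)
```

The resulting list could then be passed as `file_configs=file_configs` alongside `paths` to one of the batch functions described above.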
+ +> **Note:** The separate `batch_extract_files_with_configs_sync` / `batch_extract_files_with_configs` / `batch_extract_bytes_with_configs_sync` / `batch_extract_bytes_with_configs` functions have been removed in v4.5.0. + +## Configuration Classes + +### ExtractionConfig + +Main extraction configuration for document processing. All attributes are optional and use sensible defaults when not specified. + +**Attributes:** + +| Field | Type | Default | Description | +| ---------------------------- | ------------------------------- | ------------- | ----------------------------------------------------------------------------------------- | +| `use_cache` | bool | True | Enable caching of extraction results to improve performance on repeated extractions | +| `enable_quality_processing` | bool | True | Enable quality post-processing to clean and normalize extracted text | +| `ocr` | OcrConfig \| None | None | OCR configuration for extracting text from images. None = OCR disabled | +| `force_ocr` | bool | False | Force OCR processing even for searchable PDFs that contain extractable text | +| `chunking` | ChunkingConfig \| None | None | Text chunking configuration for dividing content into manageable chunks. None = disabled | +| `images` | ImageExtractionConfig \| None | None | Image extraction configuration for extracting images FROM documents. 
None = no extraction | +| `pdf_options` | PdfConfig \| None | None | PDF-specific options like password handling and metadata extraction | +| `token_reduction` | TokenReductionConfig \| None | None | Token reduction configuration for reducing token count in extracted content | +| `language_detection` | LanguageDetectionConfig \| None | None | Language detection configuration for identifying document language(s) | +| `keywords` | KeywordConfig \| None | None | Keyword extraction configuration for identifying important terms and phrases | +| `postprocessor` | PostProcessorConfig \| None | None | Post-processor configuration for custom text processing | +| `max_concurrent_extractions` | int \| None | num_cpus \* 2 | Maximum concurrent extractions in batch operations | +| `html_options` | HtmlConversionOptions \| None | None | HTML conversion options for converting documents to markdown | +| `pages` | PageConfig \| None | None | Page extraction configuration for tracking page boundaries | +| `security_limits` | dict[str, int] \| None | None | Security limits configuration | +| `result_format` | str | "unified" | Result format: "unified" or "element_based" | +| `output_format` | str | "plain" | Output content format: "plain", "markdown", "djot", or "html" | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, ChunkingConfig, OcrConfig + +# Basic extraction with defaults +config = ExtractionConfig() + +# Enable chunking with 512-char chunks and 100-char overlap +config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100)) + +# Enable OCR with Tesseract +config = ExtractionConfig(ocr=OcrConfig(backend="tesseract", language="eng")) + +# Multiple options +config = ExtractionConfig( + use_cache=True, + enable_quality_processing=True, + output_format="markdown", + result_format="unified" +) +``` + +### FileExtractionConfig + +Per-file extraction overrides for batch operations. All fields optional (`None` = use batch default). 
+ +**Key fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`. + +Excluded (batch-level only): `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`. + +```python +per_file = FileExtractionConfig( + force_ocr=True, + ocr=OcrConfig(backend="tesseract", language="deu"), +) +``` + +### OcrConfig + +OCR configuration for extracting text from images. + +**Attributes:** + +| Field | Type | Default | Description | +| ------------------ | ----------------------- | ----------- | ----------------------------------------------------------------------------------------------------- | +| `backend` | str | "tesseract" | OCR backend: "tesseract", "easyocr", or "paddleocr" | +| `language` | str | "eng" | Language code (ISO 639-3 three-letter: "eng", "deu", "fra" or ISO 639-1 two-letter: "en", "de", "fr") | +| `tesseract_config` | TesseractConfig \| None | None | Tesseract-specific configuration (only used when backend="tesseract") | + +**Example:** + +```python +from kreuzberg import OcrConfig + +# Tesseract with German language +config = OcrConfig(backend="tesseract", language="deu") + +# EasyOCR for faster recognition +config = OcrConfig(backend="easyocr", language="eng") + +# PaddleOCR for production deployments +config = OcrConfig(backend="paddleocr", language="chi_sim") +``` + +### TesseractConfig + +Detailed Tesseract OCR configuration for advanced tuning. Fine-tune Tesseract OCR behavior for specific document types and quality levels. 
+ +**Attributes:** + +| Field | Type | Default | Description | +| ------------------------------------ | -------------------------------- | ---------- | ----------------------------------------------------------------------------------------- | +| `language` | str | "eng" | OCR language (ISO 639-3 three-letter code) | +| `psm` | int | 3 | Page Segmentation Mode: 0 (detection only), 3 (auto), 6 (uniform block), 11 (sparse text) | +| `output_format` | str | "markdown" | Output format for OCR results | +| `oem` | int | 3 | OCR Engine Mode: 0 (legacy), 1 (LSTM), 2 (both), 3 (auto) | +| `min_confidence` | float | 0.0 | Minimum confidence threshold (0.0-1.0) for accepting OCR results | +| `preprocessing` | ImagePreprocessingConfig \| None | None | Image preprocessing configuration before OCR | +| `enable_table_detection` | bool | True | Enable automatic table detection and extraction | +| `table_min_confidence` | float | 0.0 | Minimum confidence for table detection (0.0-1.0) | +| `table_column_threshold` | int | 50 | Minimum pixel width between columns | +| `table_row_threshold_ratio` | float | 0.5 | Minimum row height ratio | +| `use_cache` | bool | True | Cache OCR results for improved performance | +| `classify_use_pre_adapted_templates` | bool | True | Use pre-adapted character templates | +| `language_model_ngram_on` | bool | False | Enable language model n-gram processing | +| `tessedit_dont_blkrej_good_wds` | bool | True | Don't block-reject good words | +| `tessedit_dont_rowrej_good_wds` | bool | True | Don't row-reject good words | +| `tessedit_enable_dict_correction` | bool | True | Enable dictionary-based spelling correction | +| `tessedit_char_whitelist` | str | "" | Whitelist of characters to recognize (empty = all) | +| `tessedit_char_blacklist` | str | "" | Blacklist of characters to ignore | +| `tessedit_use_primary_params_model` | bool | True | Use primary parameters model | +| `textord_space_size_is_variable` | bool | True | Allow variable space sizes | 
+| `thresholding_method` | bool | False | Thresholding method for binarization | + +**Example:** + +```python +from kreuzberg import TesseractConfig, ImagePreprocessingConfig + +# General document OCR +config = TesseractConfig(psm=3, oem=3) + +# Invoice/form OCR with table detection +config = TesseractConfig(psm=6, oem=2, enable_table_detection=True, min_confidence=0.6) + +# High-precision technical document OCR +config = TesseractConfig( + psm=3, + oem=2, + preprocessing=ImagePreprocessingConfig(denoise=True, contrast_enhance=True, auto_rotate=True), + min_confidence=0.7, + tessedit_enable_dict_correction=True, +) + +# Numeric-only OCR (for receipts, barcodes) +config = TesseractConfig(psm=6, tessedit_char_whitelist="0123456789.-,", min_confidence=0.8) + +# Multiple language document +config = TesseractConfig(language="eng+fra+deu", psm=3, oem=2) +``` + +### ChunkingConfig + +Text chunking configuration for dividing content into chunks. Chunking is useful for preparing content for embedding, indexing, or processing with length-limited systems (like LLM context windows). + +**Attributes:** + +| Field | Type | Default | Description | +| ------------- | ----------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- | +| `max_chars` | int | 1000 | Maximum number of characters per chunk. Chunks larger than this will be split intelligently at sentence/paragraph boundaries | +| `max_overlap` | int | 200 | Overlap between consecutive chunks in characters. Creates context bridges between chunks for smoother processing | +| `embedding` | EmbeddingConfig \| None | None | Configuration for generating embeddings for each chunk using ONNX models. None = no embeddings | +| `preset` | str \| None | None | Use a preset chunking configuration (overrides individual settings if provided). 
Use list_embedding_presets() to see available presets | + +**IMPORTANT:** The fields are `max_chars` and `max_overlap` (NOT `max_characters` or `overlap`). + +**Example:** + +```python +from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType + +# Basic chunking with defaults +config = ExtractionConfig(chunking=ChunkingConfig()) + +# Custom chunk size with overlap +config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100)) + +# Chunking with embeddings +config = ExtractionConfig( + chunking=ChunkingConfig( + max_chars=512, + embedding=EmbeddingConfig(model=EmbeddingModelType.preset("balanced")) + ) +) + +# Using preset configuration +config = ExtractionConfig(chunking=ChunkingConfig(preset="semantic")) +``` + +### PdfConfig + +PDF-specific extraction configuration. + +**Attributes:** + +| Field | Type | Default | Description | +| ------------------ | ----------------------- | ------- | --------------------------------------------------------------------------------------------------- | +| `extract_images` | bool | False | Extract images from PDF documents | +| `passwords` | list[str] \| None | None | List of passwords to try when opening encrypted PDFs. Try each password in order until one succeeds | +| `extract_metadata` | bool | True | Extract PDF metadata (title, author, creation date, etc.) | +| `hierarchy` | HierarchyConfig \| None | None | Document hierarchy detection configuration. 
None = no hierarchy detection | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig + +# Basic PDF configuration +config = ExtractionConfig(pdf_options=PdfConfig()) + +# Extract metadata and images from PDF +config = ExtractionConfig(pdf_options=PdfConfig(extract_images=True, extract_metadata=True)) + +# Handle encrypted PDFs +config = ExtractionConfig(pdf_options=PdfConfig(passwords=["password123", "fallback_password"])) + +# Enable hierarchy detection +config = ExtractionConfig(pdf_options=PdfConfig(hierarchy=HierarchyConfig(k_clusters=6))) +``` + +### ImageExtractionConfig + +Configuration for extracting images FROM documents. This is NOT for preprocessing images before OCR. + +**Attributes:** + +| Field | Type | Default | Description | +| --------------------- | ---- | ------- | ------------------------------------------------------------------------------------ | +| `extract_images` | bool | True | Enable image extraction from documents | +| `target_dpi` | int | 300 | Target DPI for image normalization. Images are resampled to this DPI for consistency | +| `max_image_dimension` | int | 4096 | Maximum width or height for extracted images. Larger images are downscaled to fit | +| `auto_adjust_dpi` | bool | True | Automatically adjust DPI based on image content quality | +| `min_dpi` | int | 72 | Minimum DPI threshold. Images with lower DPI are upscaled | +| `max_dpi` | int | 600 | Maximum DPI threshold. Images with higher DPI are downscaled | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, ImageExtractionConfig + +# Basic image extraction +config = ExtractionConfig(images=ImageExtractionConfig()) + +# Extract images with custom DPI settings +config = ExtractionConfig( + images=ImageExtractionConfig(target_dpi=150, max_image_dimension=2048, auto_adjust_dpi=False) +) +``` + +### EmbeddingConfig + +Embedding generation configuration for text chunks. 
Configures embedding generation using ONNX models via fastembed-rs. + +**Attributes:** + +| Field | Type | Default | Description | +| ------------------------ | ------------------ | ----------------- | ------------------------------------------------------------------------------------------- | +| `model` | EmbeddingModelType | Preset "balanced" | The embedding model to use (preset, fastembed, or custom) | +| `normalize` | bool | True | Whether to normalize embedding vectors to unit length (recommended for cosine similarity) | +| `batch_size` | int | 32 | Number of texts to process simultaneously. Higher values use more memory but may be faster | +| `show_download_progress` | bool | False | Display progress during embedding model download | +| `cache_dir` | str \| None | None | Custom directory for caching downloaded models (defaults to ~/.cache/kreuzberg/embeddings/) | + +**Example:** + +```python +from kreuzberg import EmbeddingConfig, EmbeddingModelType + +# Basic preset embedding (recommended) +config = EmbeddingConfig() + +# Specific preset with settings +config = EmbeddingConfig( + model=EmbeddingModelType.preset("balanced"), + normalize=True, + batch_size=64 +) + +# Custom ONNX model +config = EmbeddingConfig( + model=EmbeddingModelType.custom(model_id="sentence-transformers/all-MiniLM-L6-v2", dimensions=384) +) + +# With custom cache directory +config = EmbeddingConfig(cache_dir="/path/to/model/cache") +``` + +### EmbeddingModelType + +Embedding model type selector with multiple configurations. + +**Static Methods:** + +```python +@staticmethod +def preset(name: str) -> EmbeddingModelType +``` + +Use a preset configuration (recommended for most use cases). Available presets: balanced, compact, large. + +```python +@staticmethod +def fastembed(model: str, dimensions: int) -> EmbeddingModelType +``` + +Use a specific fastembed model by name. 
+ +```python +@staticmethod +def custom(model_id: str, dimensions: int) -> EmbeddingModelType +``` + +Use a custom ONNX model from HuggingFace (e.g., sentence-transformers/\*). + +**Example:** + +```python +from kreuzberg import EmbeddingModelType, list_embedding_presets + +# Using the balanced preset (recommended for general use) +model = EmbeddingModelType.preset("balanced") + +# Using a specific fast embedding model +model = EmbeddingModelType.fastembed(model="BAAI/bge-small-en-v1.5", dimensions=384) + +# Using a custom HuggingFace model +model = EmbeddingModelType.custom( + model_id="sentence-transformers/all-MiniLM-L6-v2", + dimensions=384 +) + +# Listing available presets +presets = list_embedding_presets() +print(f"Available presets: {presets}") +``` + +### TokenReductionConfig + +Configuration for reducing token count in extracted content. Reduces token count to lower costs when working with LLM APIs. + +**Attributes:** + +| Field | Type | Default | Description | +| -------------------------- | ---- | ------- | ------------------------------------------------------------------------------------------------ | +| `mode` | str | "off" | Token reduction mode: "off", "light", "moderate", "aggressive", or "maximum" | +| `preserve_important_words` | bool | True | Preserve capitalized words, technical terms, and proper nouns even in aggressive reduction modes | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, TokenReductionConfig + +# Moderate token reduction +config = ExtractionConfig( + token_reduction=TokenReductionConfig(mode="moderate", preserve_important_words=True) +) + +# Maximum reduction for large batches +config = ExtractionConfig( + token_reduction=TokenReductionConfig(mode="maximum", preserve_important_words=True) +) + +# No reduction (default) +config = ExtractionConfig( + token_reduction=TokenReductionConfig(mode="off") +) +``` + +### LanguageDetectionConfig + +Configuration for detecting document language(s). 
+ +**Attributes:** + +| Field | Type | Default | Description | +| ----------------- | ----- | ------- | --------------------------------------------------------------------------------------------------- | +| `enabled` | bool | True | Enable language detection for extracted content | +| `min_confidence` | float | 0.8 | Minimum confidence threshold (0.0-1.0) for language detection | +| `detect_multiple` | bool | False | Detect multiple languages in the document. When False, only the most confident language is returned | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file_sync + +# Basic language detection +config = ExtractionConfig(language_detection=LanguageDetectionConfig()) + +# Detect multiple languages with lower confidence threshold +config = ExtractionConfig( + language_detection=LanguageDetectionConfig(detect_multiple=True, min_confidence=0.6) +) + +# Access detected languages in result +result = extract_file_sync("multilingual.pdf", config=config) +print(f"Languages: {result.detected_languages}") +``` + +### KeywordConfig + +Keyword extraction configuration. + +**Attributes:** + +| Field | Type | Default | Description | +| -------------- | ------------------ | ------- | ----------------------------------------------------------------------------- | +| `algorithm` | KeywordAlgorithm | - | Keyword extraction algorithm (KeywordAlgorithm.Yake or KeywordAlgorithm.Rake) | +| `max_keywords` | int | 10 | Maximum number of keywords to extract | +| `min_score` | float | 0.0 | Minimum score threshold | +| `ngram_range` | tuple[int, int] | (1, 3) | N-gram range for keyword extraction | +| `language` | str \| None | "en" | Optional language hint | +| `yake_params` | YakeParams \| None | None | YAKE-specific tuning parameters | +| `rake_params` | RakeParams \| None | None | RAKE-specific tuning parameters | + +### PageConfig + +Page extraction and tracking configuration. 
+ +**Attributes:** + +| Field | Type | Default | Description | +| --------------------- | ---- | -------------------------------------- | -------------------------------------------- | +| `extract_pages` | bool | False | Enable page tracking and per-page extraction | +| `insert_page_markers` | bool | False | Insert page markers into content | +| `marker_format` | str | "\\n\\n\\n\\n" | Marker template containing {page_num} | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, PageConfig + +config = ExtractionConfig(pages=PageConfig(extract_pages=True)) +``` + +### PostProcessorConfig + +Configuration for post-processors in the extraction pipeline. + +**Attributes:** + +| Field | Type | Default | Description | +| --------------------- | ----------------- | ------- | ----------------------------------------------------------- | +| `enabled` | bool | True | Enable post-processors in the extraction pipeline | +| `enabled_processors` | list[str] \| None | None | Whitelist of processor names to run. None = run all enabled | +| `disabled_processors` | list[str] \| None | None | Blacklist of processor names to skip. None = none disabled | + +**Example:** + +```python +from kreuzberg import ExtractionConfig, PostProcessorConfig + +# Basic post-processing with defaults +config = ExtractionConfig(postprocessor=PostProcessorConfig()) + +# Enable only specific processors +config = ExtractionConfig( + postprocessor=PostProcessorConfig( + enabled=True, + enabled_processors=["normalize_whitespace", "fix_encoding"] + ) +) + +# Disable specific processors +config = ExtractionConfig( + postprocessor=PostProcessorConfig( + enabled=True, + disabled_processors=["experimental_cleanup"] + ) +) + +# Disable all post-processing +config = ExtractionConfig(postprocessor=PostProcessorConfig(enabled=False)) +``` + +### ImagePreprocessingConfig + +Configuration for preprocessing images before OCR. This is NOT for extracting images from documents. 
+ +**Attributes:** + +| Field | Type | Default | Description | +| --------------------- | ---- | ------- | ------------------------------------------------- | +| `target_dpi` | int | 300 | Target DPI for image normalization before OCR | +| `auto_rotate` | bool | True | Automatically detect and correct image rotation | +| `deskew` | bool | True | Correct skewed images to improve OCR accuracy | +| `denoise` | bool | False | Apply denoising filters to reduce noise in images | +| `contrast_enhance` | bool | False | Enhance contrast to improve text readability | +| `binarization_method` | str | "otsu" | Method for converting images to black and white | +| `invert_colors` | bool | False | Invert colors (white text on black background) | + +**Example:** + +```python +from kreuzberg import TesseractConfig, ImagePreprocessingConfig + +# Basic preprocessing for OCR +config = TesseractConfig(preprocessing=ImagePreprocessingConfig()) + +# Aggressive preprocessing for low-quality scans +config = TesseractConfig( + preprocessing=ImagePreprocessingConfig( + target_dpi=300, + denoise=True, + contrast_enhance=True, + auto_rotate=True, + deskew=True + ) +) +``` + +## ExtractionResult + +Result object returned by extraction functions. + +**Attributes:** + +| Field | Type | Description | +| --------------------- | ------------------------------ | -------------------------------------------------------------------------------- | +| `content` | str | Main extracted text content in the specified output_format | +| `mime_type` | str | MIME type of the processed document | +| `metadata` | Metadata | Extracted document metadata (title, author, created_at, format_type, etc.) 
| +| `tables` | list[ExtractedTable] | Extracted tables from the document | +| `detected_languages` | list[str] \| None | Detected language codes (e.g., ["en", "de"]) if language detection is enabled | +| `chunks` | list[Chunk] \| None | Text chunks if chunking is enabled (each chunk has content, embedding, metadata) | +| `images` | list[ExtractedImage] \| None | Extracted images if image extraction is enabled | +| `pages` | list[PageContent] \| None | Per-page content and metadata if page extraction is enabled | +| `elements` | list[Element] \| None | Semantic elements if result_format="element_based" | +| `output_format` | str \| None | Format of the content field (plain, markdown, djot, html) | +| `result_format` | str \| None | Result format used (unified or element_based) | +| `extracted_keywords` | list[ExtractedKeyword] \| None | Extracted keywords with relevance scores if keyword extraction enabled | +| `quality_score` | float \| None | Overall quality score for the extraction result (0.0-1.0) | +| `processing_warnings` | list[ProcessingWarning] | Non-fatal warnings encountered during extraction pipeline | + +**Methods:** + +```python +def get_page_count(self) -> int +``` + +Get the total number of pages in the document. + +```python +def get_chunk_count(self) -> int +``` + +Get the total number of chunks if chunking is enabled. + +```python +def get_detected_language(self) -> str | None +``` + +Get the most confident detected language code. + +```python +def get_metadata_field(self, field_name: str) -> Any | None +``` + +Get a specific metadata field by name. 
+ +**Example:** + +```python +from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig + +config = ExtractionConfig( + chunking=ChunkingConfig(max_chars=512), + output_format="markdown" +) +result = extract_file_sync("document.pdf", config=config) + +print(f"Content preview: {result.content[:200]}") +print(f"MIME type: {result.mime_type}") +print(f"Page count: {result.get_page_count()}") +print(f"Chunk count: {result.get_chunk_count()}") +print(f"Detected language: {result.get_detected_language()}") + +if result.tables: + print(f"Found {len(result.tables)} tables") + +if result.chunks: + first_chunk = result.chunks[0] + print(f"First chunk: {first_chunk.content[:100]}") + if first_chunk.embedding: + print(f"Embedding dimensions: {len(first_chunk.embedding)}") +``` + +## Error Classes + +All exceptions inherit from `KreuzbergError`, the base exception class. + +### KreuzbergError + +Base exception class for all Kreuzberg errors. + +```python +class KreuzbergError(Exception): + """Base exception for all Kreuzberg errors.""" +``` + +### ParsingError + +Raised when document parsing fails. + +```python +class ParsingError(KreuzbergError): + """Document parsing failed (corrupt, malformed, etc.).""" +``` + +### OCRError + +Raised when OCR processing fails. + +```python +class OCRError(KreuzbergError): + """OCR operation failed.""" +``` + +### ValidationError + +Raised when validation fails. + +```python +class ValidationError(KreuzbergError): + """Validation failed (invalid parameters, constraints, format mismatches).""" +``` + +### MissingDependencyError + +Raised when required dependencies are not available. 
+ +```python +class MissingDependencyError(KreuzbergError): + """Required dependency not available (easyocr, paddleocr, tesseract, etc.).""" + + @staticmethod + def create_for_package(dependency_group: str, functionality: str, package_name: str) -> MissingDependencyError +``` + +**Example:** + +```python +from kreuzberg import extract_file_sync, MissingDependencyError, OCRError, ParsingError + +try: + result = extract_file_sync("document.pdf") +except ParsingError as e: + print(f"Failed to parse document: {e}") +except OCRError as e: + print(f"OCR failed: {e}") +except MissingDependencyError as e: + print(f"Missing dependency: {e}") +``` + +## Utility Functions + +### MIME Type Detection + +```python +def detect_mime_type(data: bytes | bytearray) -> str +``` + +Detect MIME type from file bytes using magic number detection. + +**Parameters:** + +- `data` (bytes | bytearray): File content as bytes or bytearray + +**Returns:** Detected MIME type (e.g., "application/pdf", "image/png") + +```python +def detect_mime_type_from_path(path: str | Path) -> str +``` + +Detect MIME type from file path by reading the file and detecting its MIME type. + +**Parameters:** + +- `path` (str | Path): Path to the file + +**Returns:** Detected MIME type + +**Raises:** + +- `OSError`: If file cannot be read (file not found, permission denied, etc.) +- `RuntimeError`: If MIME type detection fails + +**Example:** + +```python +from kreuzberg import detect_mime_type, detect_mime_type_from_path + +# From bytes +pdf_bytes = b"%PDF-1.4\n" +mime_type = detect_mime_type(pdf_bytes) + +# From path +mime_type = detect_mime_type_from_path("document.pdf") +``` + +### MIME Type Validation + +```python +def validate_mime_type(mime_type: str) -> str +``` + +Validate a MIME type string and return the canonical form. + +```python +def get_extensions_for_mime(mime_type: str) -> list[str] +``` + +Get file extensions associated with a MIME type. 
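
To make the idea of magic-number detection and extension lookup concrete, here is a minimal pure-Python sketch. It is not Kreuzberg's implementation (detection happens in the Rust core). `MAGIC_SIGNATURES`, `EXTENSIONS`, `sniff_mime_type`, and `extensions_for` are hypothetical names, and the tables cover only a handful of well-known formats.

```python
from __future__ import annotations

# Hypothetical signature table for illustration only; a real detector
# recognises many more formats and handles container types such as ZIP.
MAGIC_SIGNATURES: list[tuple[bytes, str]] = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]

# Hypothetical MIME type -> extensions table (extensions without the dot).
EXTENSIONS: dict[str, list[str]] = {
    "application/pdf": ["pdf"],
    "image/png": ["png"],
    "image/jpeg": ["jpg", "jpeg"],
    "application/zip": ["zip"],
}


def sniff_mime_type(data: bytes | bytearray) -> str:
    """Return a MIME type by comparing the leading bytes to known signatures."""
    head = bytes(data[:16])  # only the first few bytes are needed
    for signature, mime_type in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return mime_type
    # Fallback used in this sketch when no signature matches.
    return "application/octet-stream"


def extensions_for(mime_type: str) -> list[str]:
    """Return the known extensions for a MIME type, or an empty list."""
    return list(EXTENSIONS.get(mime_type, []))
```

Sniffing the content rather than trusting the file name is what lets byte-based extraction work on uploads and streams that carry no reliable extension.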
+ +**Example:** + +```python +from kreuzberg import validate_mime_type, get_extensions_for_mime + +canonical = validate_mime_type("application/pdf") +extensions = get_extensions_for_mime("application/pdf") # Returns ["pdf"] +``` + +### Configuration Loading + +```python +def load_extraction_config_from_file(path: str | Path) -> ExtractionConfig +``` + +Load extraction configuration from a specific file. + +**Parameters:** + +- `path` (str | Path): Path to the configuration file (.toml, .yaml, or .json) + +**Returns:** ExtractionConfig parsed from the file + +**Raises:** + +- `FileNotFoundError`: If the configuration file does not exist +- `RuntimeError`: If the file cannot be read or parsed +- `ValueError`: If the file format is invalid or unsupported + +```python +def discover_extraction_config() -> ExtractionConfig | None +``` + +Discover extraction configuration from the environment (deprecated). + +Attempts to locate a Kreuzberg configuration file using: + +1. KREUZBERG_CONFIG_PATH environment variable +2. Search for kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json in current and parent directories + +**Returns:** ExtractionConfig if found, None otherwise + +**Note:** Deprecated in favor of `load_extraction_config_from_file` for more predictable behavior. + +**Example:** + +```python +from kreuzberg import load_extraction_config_from_file, extract_file_sync + +# Load from specific file +config = load_extraction_config_from_file("kreuzberg.toml") +result = extract_file_sync("document.pdf", config=config) + +# Auto-discover configuration +import os +os.environ["KREUZBERG_CONFIG_PATH"] = "config/kreuzberg.yaml" +# Then extraction will use the discovered config +``` + +## Plugin System + +### Registering Post-Processors + +```python +def register_post_processor(processor: Any) -> None +``` + +Register a Python PostProcessor with the Rust core. Once registered, the processor will be called automatically after extraction to enrich results. 
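
Because `register_post_processor` accepts `Any`, it can help to check an object locally before handing it over. The sketch below assumes the three required methods documented for post-processors (`name`, `process`, `processing_stage`) and the three documented stage names. `PostProcessorLike`, `check_post_processor`, and `UppercaseProcessor` are hypothetical helpers, not part of Kreuzberg.

```python
from typing import Any, Protocol, runtime_checkable

VALID_STAGES = ("early", "middle", "late")


@runtime_checkable
class PostProcessorLike(Protocol):
    """Structural type: any object exposing these three methods matches."""

    def name(self) -> str: ...
    def process(self, result: Any) -> Any: ...
    def processing_stage(self) -> str: ...


def check_post_processor(processor: object) -> None:
    """Raise early if the object would not satisfy the documented contract."""
    # isinstance() on a runtime_checkable Protocol only checks method presence.
    if not isinstance(processor, PostProcessorLike):
        raise TypeError("processor must define name(), process() and processing_stage()")
    if not processor.name():
        raise ValueError("name() must return a non-empty string")
    stage = processor.processing_stage()
    if stage not in VALID_STAGES:
        raise ValueError(f"processing_stage() returned {stage!r}, expected one of {VALID_STAGES}")


class UppercaseProcessor:
    """Toy processor used only to exercise the check above."""

    def name(self) -> str:
        return "uppercase"

    def processing_stage(self) -> str:
        return "late"

    def process(self, result: Any) -> Any:
        return result


check_post_processor(UppercaseProcessor())  # passes silently
```

A check like this surfaces a missing method as a Python exception at registration time instead of as a failure inside the extraction pipeline.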
+ +**Required Methods:** + +- `name() -> str`: Return processor name (must be non-empty) +- `process(result: ExtractionResult) -> ExtractionResult`: Process and enrich the extraction result +- `processing_stage() -> str`: Return "early", "middle", or "late" + +**Optional Methods:** + +- `initialize() -> None`: Called when processor is registered +- `shutdown() -> None`: Called when processor is unregistered + +**Example:** + +```python +from kreuzberg import register_post_processor, ExtractionResult + +class EntityExtractor: + def name(self) -> str: + return "entity_extraction" + + def processing_stage(self) -> str: + return "early" + + def process(self, result: ExtractionResult) -> ExtractionResult: + entities = {"PERSON": ["John Doe"], "ORG": ["Microsoft"]} + result.metadata["entities"] = entities + return result + +register_post_processor(EntityExtractor()) +``` + +### Registering OCR Backends + +```python +def register_ocr_backend(backend: Any) -> None +``` + +Register a Python OCR backend with the Rust core. 
+ +**Required Methods:** + +- `name() -> str`: Return backend name (must be non-empty) +- `supported_languages() -> list[str]`: Return list of supported language codes +- `process_image(image_bytes: bytes, language: str) -> OcrResult`: Process image and return OCR result +- `process_file(path: str, language: str) -> OcrResult`: Process file and return OCR result +- `initialize() -> None`: Called when backend is registered +- `shutdown() -> None`: Called when backend is unregistered +- `version() -> str`: Return backend version string + +**Example:** + +```python +from kreuzberg import register_ocr_backend + +class MyOcrBackend: + def name(self) -> str: + return "my-ocr" + + def supported_languages(self) -> list[str]: + return ["eng", "deu", "fra"] + + def process_image(self, image_bytes: bytes, language: str) -> dict: + return { + "content": "extracted text", + "metadata": {"confidence": 0.95}, + "tables": [] + } + +register_ocr_backend(MyOcrBackend()) +``` + +### Registering Validators + +```python +def register_validator(validator: Any) -> None +``` + +Register a Python Validator with the Rust core. Validators are called automatically after extraction to validate results. 
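
The following is a rough sketch of how a validator pass could be ordered, assuming the optional hooks documented for validators: `priority()` defaulting to 50 with higher values running first, and `should_validate()` defaulting to True. The real dispatch happens inside the Rust core; `run_validators` and the three demo classes are hypothetical.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any

DEFAULT_PRIORITY = 50  # documented default when priority() is not defined


def run_validators(validators: list[Any], result: Any) -> list[str]:
    """Run validators highest priority first and return the names that ran."""

    def priority_of(validator: Any) -> int:
        hook = getattr(validator, "priority", None)
        return hook() if hook is not None else DEFAULT_PRIORITY

    ran: list[str] = []
    # sorted() is stable, so equal priorities keep their registration order.
    for validator in sorted(validators, key=priority_of, reverse=True):
        should_validate = getattr(validator, "should_validate", None)
        if should_validate is not None and not should_validate(result):
            continue  # validator opted out for this result
        validator.validate(result)  # an exception here aborts the pass
        ran.append(validator.name())
    return ran


class NonEmpty:
    def name(self) -> str:
        return "non_empty"

    def validate(self, result: Any) -> None:
        if not result.content:
            raise ValueError("Content is empty")


class PdfOnly:
    def name(self) -> str:
        return "pdf_only"

    def priority(self) -> int:
        return 75

    def should_validate(self, result: Any) -> bool:
        return result.mime_type == "application/pdf"

    def validate(self, result: Any) -> None:
        pass


class MinLength:
    def name(self) -> str:
        return "min_length"

    def priority(self) -> int:
        return 100

    def validate(self, result: Any) -> None:
        if len(result.content) < 100:
            raise ValueError("Content too short")


# Stand-in for an extraction result; only the fields used above are present.
demo_result = SimpleNamespace(content="x" * 120, mime_type="text/plain")
order = run_validators([NonEmpty(), PdfOnly(), MinLength()], demo_result)
```

In this run `MinLength` goes first because of its priority of 100, `PdfOnly` is skipped because the MIME type does not match, and `NonEmpty` runs last with the default priority.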
+ +**Required Methods:** + +- `name() -> str`: Return validator name (must be non-empty) +- `validate(result: ExtractionResult) -> None`: Validate the extraction result (raise error to fail) + +**Optional Methods:** + +- `should_validate(result: ExtractionResult) -> bool`: Check if validator should run (defaults to True) +- `priority() -> int`: Return priority (defaults to 50, higher runs first) + +**Example:** + +```python +from kreuzberg import register_validator, ValidationError, ExtractionResult + +class MinLengthValidator: + def name(self) -> str: + return "min_length_validator" + + def priority(self) -> int: + return 100 + + def validate(self, result: ExtractionResult) -> None: + if len(result.content) < 100: + raise ValidationError("Content too short") + +register_validator(MinLengthValidator()) +``` + +### Plugin Management Functions + +```python +def list_post_processors() -> list[str] +``` + +List names of all registered post-processors. + +```python +def list_validators() -> list[str] +``` + +List names of all registered validators. + +```python +def list_ocr_backends() -> list[str] +``` + +List names of all available OCR backends. + +```python +def unregister_post_processor(name: str) -> None +``` + +Unregister a post-processor by name. + +```python +def unregister_validator(name: str) -> None +``` + +Unregister a validator by name. + +```python +def unregister_ocr_backend(name: str) -> None +``` + +Unregister an OCR backend by name. + +```python +def clear_post_processors() -> None +``` + +Clear all registered post-processors. + +```python +def clear_validators() -> None +``` + +Clear all registered validators. + +```python +def clear_ocr_backends() -> None +``` + +Clear all registered OCR backends. + +## Format Enums + +### OutputFormat + +Output format for extraction results. 
+ +```python +class OutputFormat(str, Enum): + PLAIN = "plain" # Plain text format + MARKDOWN = "markdown" # Markdown format + DJOT = "djot" # Djot lightweight markup format + HTML = "html" # HTML format +``` + +### ResultFormat + +Result format controlling extraction output structure. + +```python +class ResultFormat(str, Enum): + UNIFIED = "unified" # All content in `content` field + ELEMENT_BASED = "element_based" # Unstructured-compatible output with semantic elements +``` + +## Error Handling + +### Error Code Functions + +```python +def get_last_error_code() -> int +``` + +Get the last error code from the FFI layer. + +**Returns:** + +- 0 (SUCCESS): No error occurred +- 1 (GENERIC_ERROR): Generic unspecified error +- 2 (PANIC): A panic occurred in the Rust core +- 3 (INVALID_ARGUMENT): Invalid argument provided +- 4 (IO_ERROR): I/O operation failed +- 5 (PARSING_ERROR): Document parsing failed +- 6 (OCR_ERROR): OCR operation failed +- 7 (MISSING_DEPENDENCY): Required dependency not available + +```python +def get_error_details() -> dict[str, Any] +``` + +Get detailed error information from the FFI layer. + +**Returns:** dict with keys: + +- `message` (str): Human-readable error message +- `error_code` (int): Numeric error code (0-7) +- `error_type` (str): Error type name (e.g., "validation", "ocr") +- `source_file` (str | None): Source file path if available +- `source_function` (str | None): Function name if available +- `source_line` (int): Line number (0 if unknown) +- `context_info` (str | None): Additional context if available +- `is_panic` (bool): Whether error came from a panic + +```python +def classify_error(message: str) -> int +``` + +Classify an error message into a Kreuzberg error code. + +**Parameters:** + +- `message` (str): The error message to classify + +**Returns:** int error code (0-7) representing the classification + +```python +def error_code_name(code: int) -> str +``` + +Get the human-readable name of an error code. 
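
As a local illustration of the numeric codes listed for `get_last_error_code`, the mapping can be mirrored with an `IntEnum`. This is only a sketch: `ErrorCode` and `describe_error_code` are hypothetical, and the strings returned by the real `error_code_name` (for example "validation" or "ocr") may differ from the constant names used here.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Mirror of the documented FFI error codes (0-7)."""

    SUCCESS = 0
    GENERIC_ERROR = 1
    PANIC = 2
    INVALID_ARGUMENT = 3
    IO_ERROR = 4
    PARSING_ERROR = 5
    OCR_ERROR = 6
    MISSING_DEPENDENCY = 7


def describe_error_code(code: int) -> str:
    """Return the constant name for a code, or a marker for unknown values."""
    try:
        return ErrorCode(code).name
    except ValueError:
        # Codes outside 0-7 are not documented; report them explicitly.
        return f"UNKNOWN({code})"
```

Keeping the codes in an enum lets calling code compare against `ErrorCode.OCR_ERROR` rather than a bare `6`.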
+ +**Parameters:** + +- `code` (int): Numeric error code (0-7) + +**Returns:** Human-readable error code name (e.g., "validation", "ocr") + +**Example:** + +```python +from kreuzberg import ( + classify_error, + error_code_name, + extract_file_sync, + get_error_details, + get_last_error_code, +) + +try: + result = extract_file_sync("document.pdf") +except Exception as e: + code = get_last_error_code() + if code: + print(f"Error code: {code} ({error_code_name(code)})") + + details = get_error_details() + print(f"Error: {details['message']}") + print(f"Type: {details['error_type']}") + + classified = classify_error(str(e)) + print(f"Classified as: {error_code_name(classified)}") +``` + +## Validation Functions + +### Parameter Validation + +```python +def validate_chunking_params(max_chars: int, max_overlap: int) -> bool +``` + +Validate chunking parameters. + +```python +def validate_confidence(confidence: float) -> bool +``` + +Validate confidence value (0.0-1.0). + +```python +def validate_dpi(dpi: int) -> bool +``` + +Validate DPI value. + +```python +def validate_tesseract_psm(psm: int) -> bool +``` + +Validate Tesseract Page Segmentation Mode. + +```python +def validate_tesseract_oem(oem: int) -> bool +``` + +Validate Tesseract OCR Engine Mode. + +```python +def validate_ocr_backend(backend: str) -> bool +``` + +Validate OCR backend name. + +```python +def validate_language_code(code: str) -> bool +``` + +Validate language code format. + +```python +def validate_token_reduction_level(level: str) -> bool +``` + +Validate token reduction level. + +```python +def validate_output_format(output_format: str) -> bool +``` + +Validate output format string. + +```python +def validate_binarization_method(method: str) -> bool +``` + +Validate binarization method for image preprocessing. + +### Getting Valid Values + +```python +def get_valid_binarization_methods() -> list[str] +``` + +Get list of valid binarization methods.
+ +```python +def get_valid_language_codes() -> list[str] +``` + +Get list of valid language codes. + +```python +def get_valid_ocr_backends() -> list[str] +``` + +Get list of valid OCR backend names. + +```python +def get_valid_token_reduction_levels() -> list[str] +``` + +Get list of valid token reduction levels. + +```python +def list_embedding_presets() -> list[str] +``` + +List available embedding presets. + +```python +def get_embedding_preset(name: str) -> EmbeddingPreset | None +``` + +Get details about a specific embedding preset. + +**Example:** + +```python +from kreuzberg import ( + validate_dpi, + get_valid_binarization_methods, + list_embedding_presets, + get_embedding_preset +) + +# Validate parameters +if not validate_dpi(300): + print("Invalid DPI") + +# List valid values +binarization_methods = get_valid_binarization_methods() +presets = list_embedding_presets() + +# Get preset details +preset = get_embedding_preset("balanced") +if preset: + print(f"Balanced preset: {preset.description}") + print(f"Dimensions: {preset.dimensions}") + print(f"Recommended chunk size: {preset.chunk_size}") +``` + +## Configuration Utilities + +### Config Manipulation + +```python +def config_to_json(config: ExtractionConfig) -> str +``` + +Convert ExtractionConfig to JSON string. + +```python +def config_get_field(config: ExtractionConfig, field_name: str) -> Any | None +``` + +Get a specific field value from ExtractionConfig. + +```python +def config_merge(base: ExtractionConfig, override: ExtractionConfig) -> None +``` + +Merge override config into base config (mutates base). 
+ +**Example:** + +```python +from kreuzberg import ExtractionConfig, config_to_json, config_get_field, config_merge + +config = ExtractionConfig(use_cache=True, enable_quality_processing=False) + +# Convert to JSON +json_str = config_to_json(config) +print(json_str) + +# Get field +use_cache = config_get_field(config, "use_cache") +print(f"use_cache: {use_cache}") + +# Merge configs +override = ExtractionConfig(use_cache=False) +config_merge(config, override) +``` + +## Version Information + +```python +__version__: str +``` + +Current version of the kreuzberg package. + +**Example:** + +```python +from kreuzberg import __version__ + +print(f"Kreuzberg version: {__version__}") +``` diff --git a/plugins/kreuzberg/skills/kreuzberg/references/rust-api.md b/plugins/kreuzberg/skills/kreuzberg/references/rust-api.md new file mode 100644 index 0000000..ca86179 --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/rust-api.md @@ -0,0 +1,866 @@ +# Kreuzberg Rust API Reference + +Complete API reference for the Kreuzberg document extraction library in Rust. + +## Setup + +Add to your `Cargo.toml`: + +```toml +[dependencies] +kreuzberg = { version = "4", features = [ + "tokio-runtime", + "pdf", + "ocr", + "chunking", + "embeddings", + "language-detection", + "keywords-yake", + "keywords-rake", + "api", + "mcp" +] } +tokio = { version = "1", features = ["full"] } +``` + +### Core Features + +- **tokio-runtime**: Enables async/sync extraction (default). 
Required for `extract_file_sync`, `batch_extract_file_sync`, `batch_extract_file` +- **pdf**: PDF extraction with PDFium +- **ocr**: Tesseract-based OCR for scanned documents +- **chunking**: Text chunking for RAG pipelines +- **embeddings**: Vector embeddings generation +- **language-detection**: Detect document language +- **keywords-yake** / **keywords-rake**: Extract keywords using YAKE or RAKE +- **api**: HTTP API with Axum +- **mcp**: Model Context Protocol support + +--- + +## Core Extraction Functions + +### `extract_file` (async) + +Extract content from a file path. + +```rust +pub async fn extract_file( + path: impl AsRef<Path>, + mime_type: Option<&str>, + config: &ExtractionConfig, +) -> Result<ExtractionResult> +``` + +**Always available.** Requires async context (`#[tokio::main]`, `tokio::spawn`, etc.). + +```rust +use kreuzberg::{extract_file, ExtractionConfig}; + +#[tokio::main] +async fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + // Passing None for the MIME type lets the library detect it. + let result = extract_file("document.pdf", None, &config).await?; + println!("Content: {}", result.content); + Ok(()) +} +``` + +### `extract_bytes` (async) + +Extract content from byte data. + +```rust +pub async fn extract_bytes( + data: &[u8], + mime_type: &str, + config: &ExtractionConfig, +) -> Result<ExtractionResult> +``` + +**Always available.** Requires async context. + +```rust +use kreuzberg::{extract_bytes, ExtractionConfig}; + +#[tokio::main] +async fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let pdf_bytes = std::fs::read("document.pdf")?; + // The MIME type is required when extracting from raw bytes. + let result = extract_bytes(&pdf_bytes, "application/pdf", &config).await?; + println!("Content: {}", result.content); + Ok(()) +} +``` + +### `extract_file_sync` (sync) + +Synchronous wrapper around `extract_file`. + +```rust +pub fn extract_file_sync( + path: impl AsRef<Path>, + mime_type: Option<&str>, + config: &ExtractionConfig, +) -> Result<ExtractionResult> +``` + +**Requires tokio-runtime feature.** Blocks the current thread using a global Tokio runtime.
+ +```rust +use kreuzberg::{extract_file_sync, ExtractionConfig}; + +fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let result = extract_file_sync("document.pdf", None, &config)?; + println!("Content: {}", result.content); + Ok(()) +} +``` + +### `extract_bytes_sync` (sync) + +Synchronous wrapper around `extract_bytes`. + +```rust +pub fn extract_bytes_sync( + content: &[u8], + mime_type: &str, + config: &ExtractionConfig, +) -> Result<ExtractionResult> +``` + +**Always available.** Works in sync and async contexts. + +```rust +use kreuzberg::{extract_bytes_sync, ExtractionConfig}; + +fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let bytes = b"Hello, world!"; + let result = extract_bytes_sync(bytes, "text/plain", &config)?; + println!("Content: {}", result.content); + Ok(()) +} +``` + +### `batch_extract_file` (async, parallel) + +Extract multiple files concurrently. + +```rust +pub async fn batch_extract_file( + paths: Vec<impl AsRef<Path>>, + config: &ExtractionConfig, +) -> Result<Vec<ExtractionResult>> +``` + +**Requires tokio-runtime feature.** Processes files in parallel with automatic concurrency management (defaults to `num_cpus * 1.5`). + +```rust +use kreuzberg::{batch_extract_file, ExtractionConfig}; + +#[tokio::main] +async fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let paths = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"]; + let results = batch_extract_file(paths, &config).await?; + println!("Processed {} files", results.len()); + Ok(()) +} +``` + +### `batch_extract_bytes` (async, parallel) + +Extract multiple byte arrays concurrently. + +```rust +pub async fn batch_extract_bytes( + contents: Vec<(Vec<u8>, String)>, + config: &ExtractionConfig, +) -> Result<Vec<ExtractionResult>> +``` + +**Requires tokio-runtime feature.** Each tuple is `(bytes, mime_type)`.
+ +```rust +use kreuzberg::{batch_extract_bytes, ExtractionConfig}; + +#[tokio::main] +async fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let contents = vec![ + (b"PDF content".to_vec(), "application/pdf".to_string()), + (b"Text content".to_vec(), "text/plain".to_string()), + ]; + let results = batch_extract_bytes(contents, &config).await?; + println!("Processed {} inputs", results.len()); + Ok(()) +} +``` + +### `batch_extract_file_sync` (sync, parallel) + +Synchronous wrapper for batch file extraction. + +```rust +pub fn batch_extract_file_sync( + paths: Vec<impl AsRef<Path>>, + config: &ExtractionConfig, +) -> Result<Vec<ExtractionResult>> +``` + +**Requires tokio-runtime feature.** Uses the global runtime for concurrency. + +```rust +use kreuzberg::{batch_extract_file_sync, ExtractionConfig}; + +fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let paths = vec!["doc1.pdf", "doc2.pdf"]; + let results = batch_extract_file_sync(paths, &config)?; + println!("Processed {} files", results.len()); + Ok(()) +} +``` + +### `batch_extract_bytes_sync` (sync, parallel) + +Synchronous wrapper for batch byte extraction. + +```rust +pub fn batch_extract_bytes_sync( + contents: Vec<(Vec<u8>, String)>, + config: &ExtractionConfig, +) -> Result<Vec<ExtractionResult>> +``` + +**Always available.** Each tuple is `(bytes, mime_type)`. + +```rust +use kreuzberg::{batch_extract_bytes_sync, ExtractionConfig}; + +fn main() -> kreuzberg::Result<()> { + let config = ExtractionConfig::default(); + let contents = vec![ + (b"content 1".to_vec(), "text/plain".to_string()), + (b"content 2".to_vec(), "text/plain".to_string()), + ]; + let results = batch_extract_bytes_sync(contents, &config)?; + println!("Processed {} inputs", results.len()); + Ok(()) +} +``` + +### `FileExtractionConfig` + +Per-file overrides for batch operations, passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants). All fields are `Option<T>`; `None` means the batch default is used. + +> **Note (v4.5.0):** The separate `batch_extract_file_with_configs` / `batch_extract_bytes_with_configs` functions have been removed. Per-file configs are now an optional parameter on the unified batch functions.
+ +```rust +pub struct FileExtractionConfig { + pub enable_quality_processing: Option, + pub ocr: Option, + pub force_ocr: Option, + pub chunking: Option, + pub images: Option, + pub pdf_options: Option, + pub token_reduction: Option, + pub language_detection: Option, + pub pages: Option, + pub postprocessor: Option, + pub output_format: Option, + pub include_document_structure: Option, +} +``` + +Excluded batch-level fields: `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`. + +--- + +## Configuration + +### `ExtractionConfig` + +Main configuration struct for all extraction operations. + +```rust +pub struct ExtractionConfig { + /// Enable caching (default: true) + pub use_cache: bool, + + /// Enable quality post-processing (default: true) + pub enable_quality_processing: bool, + + /// OCR configuration (None = OCR disabled) + pub ocr: Option, + + /// Force OCR even for searchable PDFs (default: false) + pub force_ocr: bool, + + /// Text chunking configuration (None = disabled) + pub chunking: Option, + + /// Image extraction configuration (None = disabled) + pub images: Option, + + /// PDF-specific options (requires pdf feature) + #[cfg(feature = "pdf")] + pub pdf_options: Option, + + /// Token reduction configuration (None = disabled) + pub token_reduction: Option, + + /// Language detection configuration (None = disabled) + pub language_detection: Option, + + /// Page extraction configuration (None = disabled) + pub pages: Option, + + /// Keyword extraction configuration (requires keywords-yake or keywords-rake) + #[cfg(any(feature = "keywords-yake", feature = "keywords-rake"))] + pub keywords: Option, + + /// Post-processor configuration (None = use defaults) + pub postprocessor: Option, + + /// HTML to Markdown conversion options (requires html feature) + #[cfg(feature = "html")] + pub html_options: Option, + + /// Maximum concurrent extractions in batch (None = num_cpus * 1.5) + pub max_concurrent_extractions: Option, + + /// 
Result structure format (default: Unified) + /// Uses types::OutputFormat (Unified | ElementBased) + pub result_format: types::OutputFormat, + + /// Security limits for archives (requires archives feature) + #[cfg(feature = "archives")] + pub security_limits: Option, + + /// Content output format (default: Plain) + /// Uses config::OutputFormat (Plain | Markdown | Djot | Html) + pub output_format: OutputFormat, +} +``` + +#### Creating Configs + +```rust +use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat}; + +// Default configuration +let config = ExtractionConfig::default(); + +// With OCR +let config = ExtractionConfig { + ocr: Some(OcrConfig { + backend: "tesseract".to_string(), + ..Default::default() + }), + ..Default::default() +}; + +// With chunking +let config = ExtractionConfig { + chunking: Some(ChunkingConfig { + max_characters: 512, + overlap: 50, + ..Default::default() + }), + output_format: OutputFormat::Markdown, + ..Default::default() +}; +``` + +--- + +## Output Formats + +There are two separate enums both named `OutputFormat` in different modules: + +### Content `OutputFormat` (`core::config::formats::OutputFormat`) + +Controls the format of the `content` field text. Used by `ExtractionConfig::output_format`. + +```rust +pub enum OutputFormat { + /// Plain text (default) + Plain, + /// Markdown formatted + Markdown, + /// Djot markup format + Djot, + /// HTML format + Html, +} +``` + +### Result `OutputFormat` (`types::extraction::OutputFormat`) + +Controls the result structure. Used by `ExtractionConfig::result_format`. 
+ +```rust +pub enum OutputFormat { + /// Unified format with all content in `content` field (default) + Unified, + /// Element-based format with semantic element extraction + ElementBased, +} +``` + +```rust +use kreuzberg::{ExtractionConfig, OutputFormat}; + +let config = ExtractionConfig { + output_format: OutputFormat::Markdown, // content format (Plain/Markdown/Djot/Html) + // result_format uses types::OutputFormat (Unified/ElementBased) — defaults to Unified + ..Default::default() +}; +``` + +--- + +## Extraction Result + +### `ExtractionResult` + +Result returned by all extraction functions. + +```rust +pub struct ExtractionResult { + /// Main extracted content + pub content: String, + + /// Document MIME type + pub mime_type: Cow<'static, str>, + + /// Metadata about extraction + pub metadata: Metadata, + + /// Extracted tables (HTML/Markdown) + pub tables: Vec, + + /// Detected languages (if language-detection enabled) + pub detected_languages: Option>, + + /// Text chunks (if chunking enabled) + pub chunks: Option>, + + /// Extracted images (if image extraction enabled) + pub images: Option>, + + /// Per-page content (if page extraction enabled) + pub pages: Option>, + + /// Semantic elements (if element-based format enabled) + pub elements: Option>, + + /// Djot document structure (if extracting Djot) + pub djot_content: Option, + + /// Extracted keywords with relevance scores (if keyword extraction enabled) + pub extracted_keywords: Option>, + + /// Quality score for extraction result (0.0-1.0) + pub quality_score: Option, + + /// Non-fatal warnings during processing pipeline + pub processing_warnings: Vec, +} +``` + +### `ExtractedKeyword` + +Extracted keyword with relevance score and position information. + +```rust +pub struct ExtractedKeyword { + /// Keyword text + pub text: String, + + /// Relevance score (0.0-1.0) + pub score: f32, + + /// Algorithm used for extraction ("tfidf", "textrank", "yake", etc.) 
+ pub algorithm: String, + + /// Character positions in content (if available) + pub positions: Option>, +} +``` + +### `ProcessingWarning` + +Non-fatal warning encountered during document processing. + +```rust +pub struct ProcessingWarning { + /// Component that generated the warning + pub source: String, + + /// Warning message describing the issue + pub message: String, +} +``` + +### `Chunk` + +Text chunk with optional embedding. + +```rust +pub struct Chunk { + /// Chunk text content + pub content: String, + + /// Optional embedding vector + pub embedding: Option>, + + /// Chunk metadata + pub metadata: ChunkMetadata, +} + +pub struct ChunkMetadata { + pub byte_start: usize, + pub byte_end: usize, + pub token_count: Option, + pub chunk_index: usize, + pub total_chunks: usize, + pub first_page: Option, + pub last_page: Option, +} +``` + +### `ExtractedImage` + +Image extracted from document. + +```rust +pub struct ExtractedImage { + /// Raw image bytes + pub data: Bytes, + + /// Format: "jpeg", "png", "webp", etc. 
+ pub format: Cow<'static, str>, + + /// Zero-indexed position + pub image_index: usize, + + /// Page number (1-indexed) + pub page_number: Option, + + /// Image dimensions + pub width: Option, + pub height: Option, + + /// Colorspace: "RGB", "CMYK", "Gray" + pub colorspace: Option, + + /// Bits per component + pub bits_per_component: Option, + + /// Whether this is a mask image + pub is_mask: bool, + + /// Image description + pub description: Option, + + /// Nested OCR result (if OCRed) + pub ocr_result: Option>, +} +``` + +--- + +## Error Handling + +### `KreuzbergError` enum + +```rust +pub enum KreuzbergError { + /// File system errors (always bubble up) + Io(std::io::Error), + + /// Document parsing errors + Parsing { + message: String, + source: Option>, + }, + + /// OCR processing errors + Ocr { + message: String, + source: Option>, + }, + + /// Configuration/input validation errors + Validation { + message: String, + source: Option>, + }, + + /// Cache operation errors + Cache { + message: String, + source: Option>, + }, + + /// Image processing errors + ImageProcessing { + message: String, + source: Option>, + }, + + /// Serialization errors (JSON, MessagePack) + Serialization { + message: String, + source: Option>, + }, + + /// Missing system dependency (e.g. 
Tesseract) + MissingDependency(String), + + /// Plugin-specific errors + Plugin { + message: String, + plugin_name: String, + }, + + /// Mutex/RwLock poisoning + LockPoisoned(String), + + /// Unsupported MIME type or format + UnsupportedFormat(String), + + /// Other errors + Other(String), +} +``` + +#### Error Constructors + +```rust +use kreuzberg::KreuzbergError; + +// Create errors +let err = KreuzbergError::parsing("invalid PDF"); +let err = KreuzbergError::ocr("Tesseract failed"); +let err = KreuzbergError::validation("config invalid"); +let err = KreuzbergError::unsupported_format("application/unknown"); +let err = KreuzbergError::missing_dependency("tesseract"); + +// With source +let source = std::io::Error::new(std::io::ErrorKind::NotFound, "file missing"); +let err = KreuzbergError::parsing_with_source("corrupt PDF", source); +``` + +#### Handling Errors + +```rust +use kreuzberg::extract_file; + +match extract_file("doc.pdf", None, &config).await { + Ok(result) => println!("Success: {}", result.content), + Err(kreuzberg::KreuzbergError::Io(e)) => { + println!("File error: {}", e); + } + Err(kreuzberg::KreuzbergError::UnsupportedFormat(fmt)) => { + println!("Unsupported: {}", fmt); + } + Err(e) => println!("Other error: {}", e), +} +``` + +--- + +## MIME Type Detection + +### `detect_mime_type` + +Detect MIME type from file path. + +```rust +pub fn detect_mime_type(path: impl AsRef) -> Result +``` + +```rust +use kreuzberg::detect_mime_type; + +let mime = detect_mime_type("document.pdf")?; +assert_eq!(mime, "application/pdf"); +``` + +### `detect_mime_type_from_bytes` + +Detect MIME type from byte data. + +```rust +pub fn detect_mime_type_from_bytes(data: &[u8]) -> Result +``` + +### `validate_mime_type` + +Check if a MIME type is supported. 
+ +```rust +pub fn validate_mime_type(mime_type: &str) -> Result<()> +``` + +```rust +use kreuzberg::validate_mime_type; + +validate_mime_type("application/pdf")?; // OK +validate_mime_type("application/unknown")?; // Error +``` + +### `get_extensions_for_mime` + +Get file extensions for a MIME type. + +```rust +pub fn get_extensions_for_mime(mime_type: &str) -> Vec +``` + +```rust +use kreuzberg::get_extensions_for_mime; + +let exts = get_extensions_for_mime("application/pdf"); +// ["pdf"] + +let exts = get_extensions_for_mime("text/plain"); +// ["txt", "text"] +``` + +### MIME Type Constants + +```rust +use kreuzberg::{ + PDF_MIME_TYPE, + PLAIN_TEXT_MIME_TYPE, + HTML_MIME_TYPE, + MARKDOWN_MIME_TYPE, + JSON_MIME_TYPE, + XML_MIME_TYPE, + DOCX_MIME_TYPE, + POWER_POINT_MIME_TYPE, + EXCEL_MIME_TYPE, +}; + +assert_eq!(PDF_MIME_TYPE, "application/pdf"); +assert_eq!(PLAIN_TEXT_MIME_TYPE, "text/plain"); +``` + +--- + +## Plugin Registry + +Access extractors, OCR backends, and validators. + +### `get_document_extractor_registry` + +Get all available document extractors. + +```rust +pub fn get_document_extractor_registry() -> Arc> +``` + +### `get_ocr_backend_registry` + +Get all available OCR backends. + +```rust +pub fn get_ocr_backend_registry() -> Arc> +``` + +### `get_post_processor_registry` + +Get all available post-processors. + +```rust +pub fn get_post_processor_registry() -> Arc> +``` + +### `get_validator_registry` + +Get all available validators. 
+
+```rust
+pub fn get_validator_registry() -> Arc>
+```
+
+---
+
+## Complete Example
+
+```rust
+use kreuzberg::{
+    extract_file, ExtractionConfig, OutputFormat,
+    ChunkingConfig, OcrConfig, LanguageDetectionConfig,
+};
+
+#[tokio::main]
+async fn main() -> kreuzberg::Result<()> {
+    // Configure extraction
+    let config = ExtractionConfig {
+        output_format: OutputFormat::Markdown,
+        chunking: Some(ChunkingConfig {
+            max_characters: 512,
+            overlap: 50,
+            ..Default::default()
+        }),
+        language_detection: Some(LanguageDetectionConfig::default()),
+        ocr: Some(OcrConfig {
+            backend: "tesseract".to_string(),
+            ..Default::default()
+        }),
+        force_ocr: false,
+        ..Default::default()
+    };
+
+    // Extract from file
+    let result = extract_file("document.pdf", None, &config).await?;
+
+    // Use results
+    println!("Content:\n{}", result.content);
+    println!("MIME: {}", result.mime_type);
+
+    if let Some(langs) = result.detected_languages {
+        println!("Languages: {:?}", langs);
+    }
+
+    if let Some(chunks) = result.chunks {
+        println!("Chunks: {}", chunks.len());
+        for chunk in chunks {
+            // Preview the first 50 characters. Slicing by byte index
+            // (`&content[..50]`) can panic inside a multi-byte UTF-8 character.
+            let preview: String = chunk.content.chars().take(50).collect();
+            println!("  - {}", preview);
+        }
+    }
+
+    if let Some(images) = result.images {
+        println!("Images: {}", images.len());
+    }
+
+    if let Some(pages) = result.pages {
+        println!("Pages: {}", pages.len());
+    }
+
+    Ok(())
+}
+```
+
+---
+
+## Result Type Alias
+
+```rust
+pub type Result<T> = std::result::Result<T, KreuzbergError>;
+```
+
+All fallible operations return `Result<T>`, where the error type is `KreuzbergError`.
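The alias pattern described here can be tried in isolation. The sketch below assumes a stand-in error type: `SketchError`, `check_mime`, and `count_supported` are hypothetical names invented for illustration and are not part of Kreuzberg. It only shows why a one-parameter alias lets functions write `sketch::Result<()>` (as the examples write `kreuzberg::Result<()>`) and propagate errors with `?`.

```rust
mod sketch {
    use std::fmt;

    /// Stand-in error type; the real `KreuzbergError` has many more variants.
    #[derive(Debug)]
    pub enum SketchError {
        UnsupportedFormat(String),
        Validation(String),
    }

    impl fmt::Display for SketchError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                SketchError::UnsupportedFormat(m) => write!(f, "unsupported format: {}", m),
                SketchError::Validation(m) => write!(f, "validation error: {}", m),
            }
        }
    }

    impl std::error::Error for SketchError {}

    /// One-parameter alias: the error type is fixed, only the success type varies.
    pub type Result<T> = std::result::Result<T, SketchError>;

    /// Hypothetical check that accepts only two MIME types.
    pub fn check_mime(mime: &str) -> Result<()> {
        match mime {
            "application/pdf" | "text/plain" => Ok(()),
            other => Err(SketchError::UnsupportedFormat(other.to_string())),
        }
    }

    /// `?` forwards the error unchanged because both functions share the alias.
    pub fn count_supported(mimes: &[&str]) -> Result<usize> {
        if mimes.is_empty() {
            return Err(SketchError::Validation("no MIME types given".to_string()));
        }
        for mime in mimes {
            check_mime(mime)?;
        }
        Ok(mimes.len())
    }
}

use sketch::{check_mime, count_supported, SketchError};
```

Keeping the alias inside a module mirrors how callers refer to it by path, and avoids shadowing the standard `Result` in code that still needs the two-parameter form.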
+ +--- + +## Feature Flags Summary + +| Feature | Availability | Dependencies | +| ------------------ | ------------ | ---------------------------------------------- | +| tokio-runtime | Default | Tokio runtime for async/sync | +| pdf | Default | PDFium | +| ocr | Optional | Tesseract | +| chunking | Optional | text-splitter | +| embeddings | Optional | FastEmbed, requires tokio-runtime | +| language-detection | Optional | whatlang | +| keywords-yake | Optional | yake-rust | +| keywords-rake | Optional | rake | +| api | Optional | Axum, requires tokio-runtime | +| mcp | Optional | Model Context Protocol, requires tokio-runtime | + +--- + +## Version + +This reference is for Kreuzberg 4.x. diff --git a/plugins/kreuzberg/skills/kreuzberg/references/supported-formats.md b/plugins/kreuzberg/skills/kreuzberg/references/supported-formats.md new file mode 100644 index 0000000..e50a66f --- /dev/null +++ b/plugins/kreuzberg/skills/kreuzberg/references/supported-formats.md @@ -0,0 +1,223 @@ +# Supported Formats Reference + +Kreuzberg supports 91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction. All formats support text and metadata extraction. Additional capabilities like OCR and table detection are noted per format. 
+ +## Office Documents + +### Word Processing + +| Format | Extensions | MIME Type | Capabilities | +| ------------------ | ------------------------ | ------------------------------------------------------------------------- | --------------------------------------------------------------- | +| Microsoft Word | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Full text extraction, tables, embedded images, metadata, styles | +| Word Macro-Enabled | `.docm` | `application/vnd.ms-word.document.macroEnabled.12` | Macro-enabled document extraction, metadata | +| Word Template | `.dotx`, `.dotm`, `.dot` | Various Word template MIME types | Template document extraction, metadata | +| OpenDocument Text | `.odt` | `application/vnd.oasis.opendocument.text` | Full text extraction, tables, embedded images, metadata, styles | + +### Spreadsheets + +| Format | Extensions | MIME Type | Capabilities | +| ------------------------ | ---------- | ---------------------------------------------------------------------- | -------------------------------------------------------- | +| Excel Workbook | `.xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Sheet data, cell values, formulas, cell metadata, charts | +| Excel Macro-Enabled | `.xlsm` | `application/vnd.ms-excel.sheet.macroEnabled.12` | Sheet data, formulas, macros (text only), metadata | +| Excel Binary | `.xlsb` | `application/vnd.ms-excel.sheet.binary.macroEnabled.12` | Binary sheet data extraction, metadata | +| Excel Legacy | `.xls` | `application/vnd.ms-excel` | Legacy sheet data extraction, metadata | +| Excel Add-in | `.xla` | `application/vnd.ms-excel` | Add-in data extraction | +| Excel Macro Add-in | `.xlam` | `application/vnd.ms-excel.addin.macroEnabled.12` | Macro add-in metadata | +| Excel Template | `.xltm` | `application/vnd.ms-excel.template.macroEnabled.12` | Template data and metadata | +| Excel Template (XML) | `.xltx` | 
`application/vnd.openxmlformats-officedocument.spreadsheetml.template` | XML template data and metadata | +| Excel Template (Legacy) | `.xlt` | `application/vnd.ms-excel` | Legacy template data extraction | +| OpenDocument Spreadsheet | `.ods` | `application/vnd.oasis.opendocument.spreadsheet` | Sheet data, cell values, formulas, metadata | + +### Presentations + +| Format | Extensions | MIME Type | Capabilities | +| ----------------------- | ------------------------ | --------------------------------------------------------------------------- | ---------------------------------------------------- | +| PowerPoint Presentation | `.pptx` | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Slide text, speaker notes, embedded images, metadata | +| PowerPoint Legacy | `.ppt` | `application/vnd.ms-powerpoint` | Legacy slide text extraction, metadata | +| PowerPoint Slideshow | `.ppsx` | `application/vnd.openxmlformats-officedocument.presentationml.slideshow` | Slideshow content, speaker notes, metadata | +| PowerPoint Template | `.potx`, `.potm`, `.pot` | Various PowerPoint template MIME types | Template slide extraction, metadata | + +### PDF + +| Format | Extensions | MIME Type | Capabilities | +| ------------------------ | ---------- | ----------------- | -------------------------------------------------------------------------------------------------- | +| Portable Document Format | `.pdf` | `application/pdf` | Text extraction, tables, embedded images, metadata, OCR (when needed), password protection support | + +### eBooks + +| Format | Extensions | MIME Type | Capabilities | +| ----------- | ---------- | ------------------------------- | ------------------------------------------------------ | +| EPUB | `.epub` | `application/epub+zip` | Chapter text, metadata, embedded resources, navigation | +| FictionBook | `.fb2` | `application/x-fictionbook+xml` | Book content, metadata, chapter structure | + +### Database + +| Format | Extensions | 
MIME Type | Capabilities | +| ------ | ---------- | ------------------- | ----------------------------------------------------- | +| dBASE | `.dbf` | `application/x-dbf` | Table data extraction as markdown, field type support | + +### Hangul + +| Format | Extensions | MIME Type | Capabilities | +| --------------------- | --------------- | ----------------------------------------------- | --------------------------------------- | +| Hangul Word Processor | `.hwp`, `.hwpx` | `application/x-hwp`, `application/haansofthwpx` | Korean document format, text extraction | + +## Images (OCR-Enabled) + +### Raster Images + +| Format | Extensions | MIME Type | Capabilities | +| ------ | --------------- | ------------ | ---------------------------------------------------------------------------- | +| PNG | `.png` | `image/png` | OCR text extraction, table detection, EXIF metadata, dimensions, color space | +| JPEG | `.jpg`, `.jpeg` | `image/jpeg` | OCR text extraction, table detection, EXIF metadata, color profile | +| GIF | `.gif` | `image/gif` | OCR text extraction, animation metadata, dimensions | +| WebP | `.webp` | `image/webp` | OCR text extraction, metadata, lossy/lossless detection | +| Bitmap | `.bmp` | `image/bmp` | OCR text extraction, dimensions, color depth | +| TIFF | `.tiff`, `.tif` | `image/tiff` | OCR text extraction, multi-page support, EXIF metadata, compression info | + +### Advanced Image Formats + +| Format | Extensions | MIME Type | Capabilities | +| ------------------ | ------------------------------ | ------------------------- | -------------------------------------------------------------------------------- | +| JPEG 2000 | `.jp2` | `image/jp2` | OCR via pure Rust decoder (hayro-jpeg2000), table detection, resolution metadata | +| JPEG 2000 Extended | `.jpx` | `image/jpx` | Advanced JPEG 2000 features, high-resolution content, metadata | +| JPEG 2000 Compound | `.jpm` | `image/jpm` | Compound image support, mixed content | +| Motion JPEG 2000 | `.mj2` 
| `video/mj2` | JPEG 2000 video/sequence metadata | +| JBIG2 | `.jbig2`, `.jb2` | `image/jbig2` | Bi-level image OCR, high compression, technical documents | +| Portable PixMap | `.pnm`, `.pbm`, `.pgm`, `.ppm` | `image/x-portable-pixmap` | OCR for plain image formats, raw pixel data | + +### Vector Graphics + +| Format | Extensions | MIME Type | Capabilities | +| ------------------------ | ---------- | --------------- | ------------------------------------------------------------------------- | +| Scalable Vector Graphics | `.svg` | `image/svg+xml` | DOM parsing, embedded text extraction, graphics metadata, vector elements | + +## Web & Data + +### Markup & Structured Text + +| Format | Extensions | MIME Type | Capabilities | +| ---------------- | --------------- | ----------------------- | ---------------------------------------------------------------------------------- | +| HyperText Markup | `.html`, `.htm` | `text/html` | DOM parsing, text extraction, metadata (Open Graph, Twitter Card), link extraction | +| XHTML | `.xhtml` | `application/xhtml+xml` | XHTML parsing, metadata extraction, semantic structure | +| XML | `.xml` | `application/xml` | DOM parsing, namespace handling, text extraction, structure analysis | + +### Structured Data Formats + +| Format | Extensions | MIME Type | Capabilities | +| ------ | --------------- | --------------------------- | ---------------------------------------------------------- | +| JSON | `.json` | `application/json` | Schema detection, nested structure parsing, validation | +| YAML | `.yaml`, `.yml` | `application/x-yaml` | Hierarchical data parsing, custom tags, nested structures | +| TOML | `.toml` | `application/toml` | Configuration parsing, table structures, type preservation | +| CSV | `.csv` | `text/csv` | Delimiter detection, header inference, type detection | +| TSV | `.tsv` | `text/tab-separated-values` | Tab-separated value parsing, header detection | + +### Text & Markup Languages + +| Format | Extensions | 
MIME Type | Capabilities | +| ---------------- | ------------------ | ----------------- | ------------------------------------------------- | +| Plain Text | `.txt` | `text/plain` | Raw text extraction, encoding detection | +| Markdown | `.md`, `.markdown` | `text/markdown` | CommonMark parsing, GFM extensions, front matter | +| Djot | `.djot` | `text/djot` | Djot format parsing, semantic structure | +| reStructuredText | `.rst` | `text/x-rst` | RST parsing, directive handling, role extraction | +| Org Mode | `.org` | `text/org` | Org mode structure, outline parsing, metadata | +| Rich Text Format | `.rtf` | `application/rtf` | Text with formatting extraction, font information | + +## Email & Archives + +### Email Formats + +| Format | Extensions | MIME Type | Capabilities | +| ----------------- | ---------- | ---------------------------- | -------------------------------------------------------------------------------------- | +| Email Message | `.eml` | `message/rfc822` | Headers (from, to, subject, date), body (HTML/plain text), attachments, threading info | +| Microsoft Outlook | `.msg` | `application/vnd.ms-outlook` | Outlook headers, body content, attachments, recipient metadata | + +### Archive Formats + +| Format | Extensions | MIME Type | Capabilities | +| ----------- | ---------- | ----------------------------- | ---------------------------------------------------------- | +| ZIP Archive | `.zip` | `application/zip` | File listing, nested archive support, compression metadata | +| Tar Archive | `.tar` | `application/x-tar` | File listing, permission metadata, nested archives | +| Gzip Tar | `.tgz` | `application/gzip` | Compressed archive listing, metadata | +| Gzip | `.gz` | `application/gzip` | Compressed file metadata | +| 7-Zip | `.7z` | `application/x-7z-compressed` | File listing, compression info, nested archives | + +## Academic & Scientific + +### Citation Formats + +| Format | Extensions | MIME Type | Capabilities | +| ----------------------- | 
----------- | ---------------------------------------- | ------------------------------------------------- | +| BibTeX | `.bib` | `text/bibtex` | Structured parsing, entry types, field extraction | +| BibLaTeX | `.biblatex` | `text/bibtex` | Extended BibTeX format, advanced field support | +| RIS | `.ris` | `application/x-research-info-systems` | Structured RIS format parsing, type detection | +| NIH RIS | `.nbib` | `application/x-research-info-systems` | NIH/PubMed format, structured citation data | +| EndNote | `.enw` | `application/x-endnote` | EndNote XML format, citation metadata | +| Citation Style Language | `.csl` | `application/vnd.citationstyles.csl+xml` | CSL JSON/XML parsing, style definitions | + +### Scientific & Technical Formats + +| Format | Extensions | MIME Type | Capabilities | +| ---------------- | ---------------- | -------------------------- | ----------------------------------------------------------- | +| LaTeX | `.tex`, `.latex` | `application/x-latex` | LaTeX source parsing, commands, document structure | +| Typst | `.typ` | `text/plain` | Typst markup parsing, document structure | +| JATS XML | `.jats` | `application/xml` | PubMed JATS parsing, article structure, metadata | +| Jupyter Notebook | `.ipynb` | `application/x-ipynb+json` | Cell extraction (code + markdown), output parsing, metadata | +| DocBook | `.docbook` | `application/docbook+xml` | DocBook XML parsing, semantic structure | + +### Documentation Formats + +| Format | Extensions | MIME Type | Capabilities | +| ----------- | ---------- | ------------------------ | ----------------------------------------------- | +| OPML | `.opml` | `application/x-opml+xml` | Outline parsing, hierarchy extraction, metadata | +| Perl POD | `.pod` | `text/x-pod` | Perl documentation parsing, section extraction | +| Manual Page | `.mdoc` | `text/plain` | UNIX manual page parsing, section structure | +| Troff/Groff | `.troff` | `text/troff` | Typesetting markup parsing, document structure | + 
+## Format Capabilities Summary + +### Text Extraction + +All 91+ formats support full or partial text extraction. Document structure and encoding are automatically detected. + +### Metadata Support + +Comprehensive metadata extraction includes: + +- Document properties (title, author, subject, creation date, modification date) +- Format-specific metadata (page count, dimensions, encoding, language) +- EXIF data (for images) +- Document statistics (word count, character count) + +### OCR (Optical Character Recognition) + +OCR is available for image formats: + +- **Raster Images**: PNG, JPEG, GIF, WebP, BMP, TIFF +- **Advanced Formats**: JPEG 2000, JBIG2, PNM/PBM/PGM/PPM +- **Configurable Backends**: Tesseract (all languages), EasyOCR, PaddleOCR (Python), Guten (Node.js) + +### Table Detection + +Smart table detection and reconstruction available for: + +- PDF documents (native tables and scanned content with OCR) +- Office documents (Excel, Word) +- Images (via OCR backends) +- HTML/XML (from markup structure) + +### Archive & Nested Document Support + +Archives and nested formats support file listing and sequential extraction: + +- ZIP, TAR, TGZ, 7Z archives +- Email attachments +- Nested archives within archives + +## Getting Started + +For language-specific examples and detailed API documentation, see the [API Reference](https://docs.kreuzberg.dev/reference/api-python/). + +For OCR configuration and backend selection, see the [OCR Backends Guide](https://docs.kreuzberg.dev/guides/ocr/). + +For comprehensive format details and format detection, see the [Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/). 
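As a rough illustration of how the Extensions and MIME Type columns in the tables above relate, here is a hand-rolled lookup over a few rows copied from them. It is a sketch under stated assumptions: `KNOWN` and `mime_for_path` are hypothetical names, the list is deliberately incomplete, and it looks only at the file extension. In real code, the library's own `detect_mime_type` is the function to call.

```rust
use std::path::Path;

/// A handful of (extension, MIME type) rows taken from the tables; not the full list.
const KNOWN: &[(&str, &str)] = &[
    ("pdf", "application/pdf"),
    (
        "docx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ),
    ("png", "image/png"),
    ("jpg", "image/jpeg"),
    ("jpeg", "image/jpeg"),
    ("md", "text/markdown"),
    ("csv", "text/csv"),
    ("eml", "message/rfc822"),
];

/// Look up a MIME type from the file extension, ignoring ASCII case.
/// Returns `None` when there is no extension or it is not in `KNOWN`.
pub fn mime_for_path(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    KNOWN
        .iter()
        .find(|(known_ext, _)| *known_ext == ext.as_str())
        .map(|(_, mime)| *mime)
}
```

Several extensions can share one MIME type (`.jpg` and `.jpeg` here), which is why the sketch stores one row per extension rather than one per format.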
diff --git a/plugins/kreuzberg/skills/picking-a-format/SKILL.md b/plugins/kreuzberg/skills/picking-a-format/SKILL.md
new file mode 100644
index 0000000..082e53f
--- /dev/null
+++ b/plugins/kreuzberg/skills/picking-a-format/SKILL.md
@@ -0,0 +1,94 @@
+---
+name: picking-a-format
+description: Use when choosing an output format for extracted documents — text, markdown, djot, html, or JSON. Maps consumer (LLM, parser, archive) to the right `--format` / `--content-format` pair.
+---
+
+# Picking a format
+
+Kreuzberg has two orthogonal format knobs, plus an independent token-reduction
+setting. Get them right up front and the downstream code stays simple.
+
+| Knob                | What it controls                                   | Values                                              | Default                              |
+| ------------------- | -------------------------------------------------- | --------------------------------------------------- | ------------------------------------ |
+| `--format`          | How the CLI prints the result                      | `text`, `json`                                      | `text` (`extract`), `json` (`batch`) |
+| `--content-format`  | How extracted content is rendered inside `result`  | `plain`, `markdown`, `djot`, `html`                 | `plain`                              |
+| `--token-reduction` | Strip whitespace / boilerplate for LLM contexts    | `off`, `light`, `moderate`, `aggressive`, `maximum` | `off`                                |
+
+`--format json` always returns the full `ExtractionResult` (content +
+metadata + tables + images). `--format text` prints just `content`.
+`--content-format` is what shows up inside that `content` field.
+
+## Decision tree
+
+```text
+Who consumes the output?
+├── LLM (Claude, GPT, Gemini, local) — embed/prompt context +│ --format text --content-format markdown +├── Vector store / RAG indexer +│ --format json --content-format markdown +│ (markdown preserves structure for chunking) +├── Downstream parser that expects machine-readable JSON +│ --format json --content-format plain +│ (cleanest text + structured metadata) +├── Human review / archival +│ --format text --content-format markdown +├── HTML re-rendering / web display +│ --format json --content-format html +├── Lossless intermediate for pandoc / academic tooling +│ --format json --content-format djot +└── Token-budget-constrained pipeline + --format text --content-format plain + (drops markup; add --token-reduction moderate for further savings) +``` + +## Examples + +Feed a PDF directly into an LLM: + +```bash +kreuzberg extract paper.pdf --content-format markdown +``` + +Index a corpus into a RAG store with tables and headings preserved: + +```bash +kreuzberg batch docs/*.pdf --format json --content-format markdown \ + | jq -c '.[] | {path: .metadata.path, content: .content, tables: .tables}' +``` + +Strip a file to bare text for a token-tight summarizer: + +```bash +kreuzberg extract long.pdf \ + --content-format plain \ + --token-reduction moderate +``` + +Pull metadata only, ignore content: + +```bash +kreuzberg extract file.pdf --format json | jq '.metadata' +``` + +## When in doubt + +- **Default to `markdown`** as the content format. It is the best + compromise across LLMs, RAG, and human review, and Kreuzberg has the + most faithful renderer for it. +- Reach for `plain` only when downstream cannot tolerate any markup. +- Reach for `djot` only if you're already in a djot/pandoc pipeline. +- Reach for `html` only when re-rendering for the web. + +## Token-reduction (orthogonal) + +`--token-reduction` collapses whitespace, strips repeated headers/footers, +and trims boilerplate. 
It composes with any `--content-format`: + +- `off` (default), `light`, `moderate`, `aggressive`, `maximum`. + +Use `moderate` as a safe starting point for LLM context windows. `maximum` +is lossy — verify before relying on it. + +See `references/cli-reference.md` for the full flag set and +`references/configuration.md` for the equivalent `output_format` and +`token_reduction` keys in `kreuzberg.toml`. diff --git a/plugins/kreuzcrawl/.factory-plugin/plugin.json b/plugins/kreuzcrawl/.factory-plugin/plugin.json new file mode 100644 index 0000000..a65e62c --- /dev/null +++ b/plugins/kreuzcrawl/.factory-plugin/plugin.json @@ -0,0 +1,23 @@ +{ + "name": "kreuzcrawl", + "version": "0.1.0", + "description": "Web crawling and scraping with HTML→Markdown and headless-Chrome fallback.", + "author": { + "name": "Kreuzberg, Inc.", + "email": "support@kreuzberg.dev", + "url": "https://kreuzberg.dev" + }, + "homepage": "https://kreuzberg.dev", + "repository": "https://github.com/kreuzberg-dev/plugins", + "license": "MIT", + "category": "web-scraping", + "keywords": [ + "web-scraping", + "crawling", + "html-to-markdown", + "headless-chrome" + ], + "brandColor": "#7C3AED", + "icon": "./assets/icon.svg", + "logo": "./assets/logo.png" +} diff --git a/plugins/kreuzcrawl/README.md b/plugins/kreuzcrawl/README.md new file mode 100644 index 0000000..71826cd --- /dev/null +++ b/plugins/kreuzcrawl/README.md @@ -0,0 +1,102 @@ +# kreuzcrawl + +Crawl, scrape, and convert websites to Markdown using the local `kreuzcrawl` CLI in your agent. + + + +## Install + +### From the marketplace (recommended) + +Pending review for official Claude marketplace. + +Self-host: + +```text +/plugin marketplace add kreuzberg-dev/plugins +/plugin install kreuzcrawl@kreuzberg +``` + +### Binary requirement + +Install the `kreuzcrawl` CLI: + +```bash +brew install kreuzberg-dev/tap/kreuzcrawl +# or +cargo install kreuzcrawl-cli +``` + +Headless fallback requires Chrome/Chromium on your system. 
The CLI launches Chrome on demand; you can skip installing it if you only plan to use `--browser-mode never`.
+
+## Skills shipped
+
+| Skill | Trigger |
+|-------|---------|
+| **kreuzcrawl** | Crawl, scrape, and convert websites to Markdown using the local kreuzcrawl CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out. |
+| **crawling-a-site** | Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `kreuzcrawl crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection. |
+| **scraping-html-to-markdown** | Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `kreuzcrawl scrape `, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages. |
+| **headless-fallback** | Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto\|always\|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost. |
+
+## MCP tools
+
+The `kreuzcrawl` MCP server exposes:
+
+- `scrape` — fetch and convert a single URL to Markdown or JSON.
+- `crawl` — follow links across a domain, bounded by depth and page count.
+- `map` — enumerate URLs from sitemaps and link extraction.
+- `interact` — drive a headless browser with click, type, scroll actions.
+
+## Configuration
+
+Pass flags or use inline JSON via `--config`:
+
+```bash
+kreuzcrawl scrape https://example.com \
+  --format markdown \
+  --browser-mode auto \
+  --timeout 30000
+```
+
+For complex configs, use JSON:
+
+```bash
+kreuzcrawl crawl https://example.com \
+  --config '{"depth":3,"max_pages":200,"concurrent":8,"respect_robots_txt":true}'
+```
+
+See the `kreuzcrawl` and `crawling-a-site` skills for the full flag surface.
+
+## Examples
+
+Fetch a single page and print Markdown:
+
+```text
+kreuzcrawl scrape https://example.com/article --format markdown
+```
+
+Crawl a site at depth 3 with rate limiting:
+
+```text
+kreuzcrawl crawl https://example.com --depth 3 --max-pages 200 --concurrent 8 --stay-on-domain --format markdown
+```
+
+Enumerate URLs from a sitemap:
+
+```text
+kreuzcrawl map https://example.com --limit 500
+```
+
+## Versioning
+
+The plugin version tracks the marketplace `VERSION` file. See [CHANGELOG.md](../../CHANGELOG.md) for release notes.
+
+## License
+
+MIT.
+
+## See also
+
+- **Marketplace**: [kreuzberg-dev/plugins](https://github.com/kreuzberg-dev/plugins)
+- **Upstream**: [kreuzberg-dev/kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl)
+- **Sibling plugins**: [kreuzberg](../kreuzberg/README.md), [kreuzberg-cloud](../kreuzberg-cloud/README.md)
diff --git a/plugins/kreuzcrawl/assets/icon.svg b/plugins/kreuzcrawl/assets/icon.svg
new file mode 100644
index 0000000..4910078
--- /dev/null
+++ b/plugins/kreuzcrawl/assets/icon.svg
@@ -0,0 +1,14 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/plugins/kreuzcrawl/assets/logo.png b/plugins/kreuzcrawl/assets/logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..be845f95611fd5ca5c736233c860a89a5112b559
Binary files /dev/null and b/plugins/kreuzcrawl/assets/logo.png differ
diff --git a/plugins/kreuzcrawl/skills/crawling-a-site/SKILL.md b/plugins/kreuzcrawl/skills/crawling-a-site/SKILL.md
new file mode 100644
index 0000000..ec93c21
--- /dev/null
+++ b/plugins/kreuzcrawl/skills/crawling-a-site/SKILL.md
@@ -0,0 +1,142 @@
+---
+name: crawling-a-site
+description: >-
+  Use when the user wants to follow links across a domain and capture every
+  reachable page as Markdown. Covers `kreuzcrawl crawl` with depth, page
+  caps, concurrency, rate limiting, domain scoping, robots, and output
+  selection.
+---
+
+# Crawling a site
+
+Reach for `kreuzcrawl crawl` when one URL is not enough — the user wants
+the docs site, the blog, the marketing pages, or the whole domain.
+
+## Quick recipe
+
+```bash
+kreuzcrawl crawl https://example.com \
+  --depth 3 \
+  --max-pages 200 \
+  --concurrent 8 \
+  --rate-limit 250 \
+  --stay-on-domain \
+  --respect-robots-txt \
+  --format markdown
+```
+
+Defaults you should usually override:
+
+- `--depth 2` is shallow — set it explicitly.
+- `--max-pages` is unbounded by default; cap it for any unknown site.
+- `--concurrent 10` is aggressive for small hosts; drop to 4-8 for
+  third-party sites.
+
+## Flag surface
+
+| Flag | Default | Purpose |
+| ----------------------- | ------- | ---------------------------------------------------------- |
+| `--depth`, `-d` | `2` | Maximum hop count from the seed URL. |
+| `--max-pages`, `-n` | — | Hard cap on pages fetched. Set this on any unknown site. |
+| `--concurrent`, `-c` | `10` | Parallel in-flight requests. |
+| `--rate-limit` | `200` | Milliseconds between requests to the same origin. |
+| `--stay-on-domain` | off | Skip links that leave the seed domain. |
+| `--respect-robots-txt` | off | Honour `robots.txt`. Pass it for any third-party host. |
+| `--proxy` | — | HTTP, HTTPS, or SOCKS5 proxy URL. |
+| `--user-agent` | — | Override the request UA. Be honest. |
+| `--timeout` | `30000` | Per-request timeout in ms. |
+| `--browser-mode` | `auto` | `auto`, `always`, `never` — see the headless-fallback skill. |
+| `--browser-endpoint` | — | External CDP `ws://` URL. |
+| `--format` | `json` | `json` or `markdown`. |
+| `--config` | — | Inline JSON or `@file.json` for the full `CrawlConfig`. |
+
+Multiple seed URLs are accepted positionally — the engine fans out with
+`batch_crawl` and aggregates results.
+
+## When to pick which flags
+
+### Docs sites you own
+
+```bash
+kreuzcrawl crawl https://docs.example.com \
+  --depth 5 --max-pages 1000 --concurrent 16 --rate-limit 100 \
+  --stay-on-domain --format markdown > docs.md
+```
+
+Higher concurrency and lower rate limits are fine on infrastructure you
+control.
+
+### Third-party sites
+
+```bash
+kreuzcrawl crawl https://blog.unknown.example \
+  --depth 2 --max-pages 50 --concurrent 4 --rate-limit 500 \
+  --stay-on-domain --respect-robots-txt --format markdown
+```
+
+Stay shallow, cap pages, throttle hard, obey robots.
+
+### Multi-seed batch
+
+```bash
+kreuzcrawl crawl \
+  https://example.com/blog \
+  https://example.com/docs \
+  https://example.com/pricing \
+  --depth 2 --max-pages 100 --stay-on-domain --format json
+```
+
+JSON output for batch is an array of `{ seed_url, result }` entries — each
+`result` is a full crawl payload or `{ error: ... }`.
+
+## Output
+
+### Markdown mode
+
+```text
+---
+URL: https://example.com/page-one
+---
+# Page One
+
+… markdown content …
+
+---
+URL: https://example.com/page-two
+---
+…
+```
+
+### JSON mode
+
+Top-level `CrawlResult` with `pages: [...]`. Each page carries the rendered
+Markdown plus metadata, links, images, JSON-LD, and HTTP response info. Read
+`result.pages[i].markdown.content` for the Markdown string.
+
+## Politeness checklist
+
+- Pass `--respect-robots-txt` on every third-party crawl.
+- Cap `--max-pages` — a runaway BFS can issue tens of thousands of requests.
+- Bump `--rate-limit` for hosts that show signs of stress (5xx, slowdowns).
+- Identify yourself via `--user-agent kreuzcrawl (contact@example.com)`.
+
+## Common pitfalls
+
+- **No pages returned.** The seed page may be JS-only — the engine falls
+  back to headless automatically in `--browser-mode auto`, but `never` mode
+  will silently produce an empty crawl. Re-run with `--browser-mode always`
+  or check the headless-fallback skill.
+- **Crawl leaves the domain.** Pass `--stay-on-domain`. Combine with
+  `allow_subdomains: true` in `--config` JSON to include subdomains.
+- **Slow crawl.** The default rate limit is 200 ms per origin — multiple
+  seed URLs on the same host still share the bucket. Spread seeds across
+  hosts or raise `--concurrent` for unrelated origins.
+- **Memory growth.** Each page carries full Markdown plus structured data.
+  Stream JSON output to a file rather than holding it in memory; set
+  `--max-pages` aggressively if downstream cannot keep up.
+
+## When to reach for `map` instead
+
+If the user only needs the list of URLs (sitemap analysis, link planning,
+seeding another tool), use `kreuzcrawl map ` — it skips rendering and
+returns a flat `MapResult` with hundreds of URLs in seconds.
diff --git a/plugins/kreuzcrawl/skills/headless-fallback/SKILL.md b/plugins/kreuzcrawl/skills/headless-fallback/SKILL.md
new file mode 100644
index 0000000..0f3bb8f
--- /dev/null
+++ b/plugins/kreuzcrawl/skills/headless-fallback/SKILL.md
@@ -0,0 +1,140 @@
+---
+name: headless-fallback
+description: >-
+  Use when a static fetch returns nothing useful and the page needs a real
+  browser. Covers `--browser-mode auto|always|never`, external CDP via
+  `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the
+  performance cost.
+---
+
+# Headless fallback
+
+Some pages are unscrapable without a real browser — SPA shells, infinite
+scroll, Cloudflare interstitials, JS-rendered article bodies. Kreuzcrawl
+ships with an optional headless-Chrome backend driven by chromiumoxide.
+
+## Modes
+
+```text
+--browser-mode auto    # default — try static first, fall back to browser on JS/WAF
+--browser-mode always  # skip static, go straight to browser
+--browser-mode never   # static only, fail closed
+```
+
+### `auto` (default)
+
+The engine fetches statically, then inspects the response. It launches
+headless Chrome and re-fetches when it sees:
+
+- WAF responses from one of 8 detected vendors (Cloudflare, Akamai, AWS WAF,
+  Imperva, DataDome, PerimeterX, Sucuri, F5).
+- SPA shells: `