
Add Plasmate integration#467

Open
dbhurley wants to merge 1 commit into
deepset-ai:mainfrom
dbhurley:add-plasmate

@dbhurley dbhurley commented May 2, 2026

Adds the Plasmate integration to the showcase.

Following @anakin87's suggestion in deepset-ai/haystack#11056 — that PR proposed a PlasmateFetcher directly inside the Haystack core, which @anakin87 redirected here. The integration package itself already lives at plasmate-labs/haystack-plasmate (Haystack 2.0 components: PlasmateWebFetcher and PlasmateSOMConverter).

What Plasmate is

Open-source (Apache 2.0) browser engine that produces the Semantic Object Model (SOM) — a flat, typed JSON representation of a web page optimized for LLM consumption. Drop-in alternative to LinkContentFetcher / HTMLToDocument that delivers an order of magnitude lower token cost per page.

Public benchmark (38 sites, weekly updated): https://webtaskbench.com — currently 29.6× average compression, 9.8× median, 118.5× peak (cloud.google.com).
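The gap between the average and the median above is what you would expect when a few pages compress far more than the rest. A minimal sketch of that effect, using hypothetical per-site ratios (illustrative values only, not WebTaskBench data) and a hypothetical `summarize` helper:

```python
from statistics import mean, median

# Hypothetical per-site compression ratios (raw HTML tokens / SOM tokens).
# Illustrative values only; these are NOT taken from the benchmark.
ratios = [3.0, 5.0, 8.0, 10.0, 12.0, 40.0, 120.0]


def summarize(values):
    """Return the average, median, and peak of a list of ratios."""
    return {
        "average": mean(values),
        "median": median(values),
        "peak": max(values),
    }


summary = summarize(ratios)
# A couple of large outliers pull the average well above the median.
print(summary)
```

With a skewed distribution like this one, the median is usually the more representative figure for a typical page.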

Files changed

  • integrations/plasmate.md — integration markdown matching the existing pattern (apify.md, anthropic.md, etc.) with Overview / Installation / Components / RAG pipeline example / License sections.

Logo

/logos/plasmate.png is referenced but not added in this PR. Happy to add it in a follow-up commit if you let me know your size/format preference (most existing logos look like ~512px PNG with transparency). Or feel free to drop one in directly.

Thanks for the steer to the right venue.

Plasmate is an open-source (Apache 2.0) browser engine for AI agents that
produces the Semantic Object Model (SOM) — a flat, typed JSON document
representing a web page in a form optimized for LLM consumption.

This integration adds PlasmateWebFetcher and PlasmateSOMConverter components
for Haystack 2.0 RAG pipelines as a drop-in alternative to LinkContentFetcher
and HTMLToDocument, with ~17x average token reduction across the public
WebTaskBench benchmark.

Reopening the redirect from deepset-ai/haystack#11056 (closed Apr 13 with
@anakin87 suggesting this venue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dbhurley dbhurley requested a review from a team as a code owner May 2, 2026 18:36
@kacperlukawski kacperlukawski self-requested a review May 15, 2026 10:38
Member

@kacperlukawski kacperlukawski left a comment

Thank you, I left some comments.

In general, it seems the package does not exist. How did you test the code snippets? Installing the package should fail, as PyPI does not recognize the name.
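One way to reproduce this check is to query PyPI's JSON endpoint for the project name. The following is a minimal sketch, assuming the standard `https://pypi.org/pypi/<name>/json` endpoint; the helper names `pypi_json_url` and `is_on_pypi` are hypothetical and exist only for illustration:

```python
import urllib.error
import urllib.request


def pypi_json_url(package: str) -> str:
    """Build the PyPI JSON API URL for a project name."""
    return f"https://pypi.org/pypi/{package}/json"


def is_on_pypi(package: str, timeout: float = 10.0) -> bool:
    """Return True if PyPI serves metadata for the project, False on a 404.

    Any other HTTP error is re-raised, since it says nothing about
    whether the project exists.
    """
    try:
        with urllib.request.urlopen(pypi_json_url(package), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Calling `is_on_pypi("haystack-plasmate")` needs network access; a `False` result would match the failed install described above.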

Comment thread integrations/plasmate.md

[Plasmate](https://plasmate.app) is an open-source (Apache 2.0) browser engine designed from the ground up for AI agents. Instead of rendering pixels, Plasmate produces the Semantic Object Model (SOM) — a flat, typed JSON document representing a web page in a form optimized for LLM consumption.

Across 38 measured production sites, Plasmate achieves an average **~17× token reduction** versus raw HTML, with peaks above **100×** on large SaaS marketing pages. The reproducible benchmark is published at [webtaskbench.com](https://webtaskbench.com), and the SOM/1.0 format is specified at [somspec.org](https://somspec.org/spec).
Member

I would prefer not to put benchmark results on the integration page. They may soon become outdated, and we also don't want to give readers the impression that we recommend a specific tool. Moreover, the benchmark itself seems to be affiliated with Plasmate.

Comment thread integrations/plasmate.md
Comment on lines +138 to +140
## SOM directives — making your own pages Plasmate-friendly

If you publish content that you would like AI agents to read efficiently, advertise a SOM endpoint via [SOM Directives in robots.txt](https://somspec.org/directives). The five-line opt-in tells any compatible agent to fetch a structured representation of your pages instead of the full HTML rendering. Verify your site is SOM-ready at [somready.com](https://somready.com).
Member

It doesn't seem to be related to the integration itself. I suggest removing that part.

Comment thread integrations/plasmate.md
  - name: Plasmate Labs
    socials:
      github: plasmate-labs
pypi: https://pypi.org/project/haystack-plasmate
Member

The package does not seem to exist.

Comment thread integrations/plasmate.md
repo: https://github.com/plasmate-labs/haystack-plasmate
type: Data Ingestion
report_issue: https://github.com/plasmate-labs/haystack-plasmate/issues
logo: /logos/plasmate.png
Member

Logo does not exist.

2 participants