Add pluggable embedding backends#369

Open
aminsmd wants to merge 1 commit into Watts-Lab:main from aminsmd:feat/pluggable-embedding-backend

Conversation


aminsmd commented Apr 21, 2026

Summary

  • add a pluggable embedding_fn interface to FeatureBuilder
  • lazy-load the default sentence-transformers and RoBERTa models instead of initializing them at import time
  • keep custom vector caches backend-specific via embedding_backend_id / embedding_dim
  • fix Discursive Diversity fallback handling for non-default embedding dimensions
  • add focused regression tests and a README usage example
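The lazy-loading change in the bullets above can be sketched roughly as follows. The class and attribute names here are illustrative only, not the package's actual internals; the point is that the heavy default models are no longer built at import time:

```python
# Illustrative sketch of lazy model loading (hypothetical names,
# not the actual FeatureBuilder internals).
class LazyBackend:
    def __init__(self, loader):
        self._loader = loader   # callable that builds the heavy model
        self._model = None      # nothing is loaded at construction time

    @property
    def model(self):
        # Build the default model only on first use, then cache it.
        if self._model is None:
            self._model = self._loader()
        return self._model

# The expensive load runs once, on first access, not at import.
calls = []
backend = LazyBackend(lambda: calls.append(1) or "sbert-model")
assert calls == []       # nothing loaded yet
_ = backend.model
_ = backend.model
assert calls == [1]      # loader ran exactly once
```

With this pattern, constructing a FeatureBuilder that supplies a custom `embedding_fn` never pays the cost of downloading or initializing the default sentence-transformers and RoBERTa models.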

Details

This keeps the default behavior unchanged when no custom encoder is provided.

For custom backends, users can now do:

fb = FeatureBuilder(
    ...,
    embedding_fn=my_encoder,
    embedding_backend_id="openai-text-embedding-3-small",
    embedding_dim=1536,
)

The vector cache path is namespaced for custom backends so switching embedding sources does not silently reuse incompatible cached vectors.
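A rough sketch of that namespacing idea, with a hypothetical path layout (the actual cache structure in the PR may differ):

```python
from pathlib import Path

def vector_cache_dir(base, backend_id=None):
    # Default backend keeps the legacy cache location; custom backends
    # get their own subdirectory keyed by embedding_backend_id, so
    # switching encoders never reuses incompatible cached vectors.
    base = Path(base)
    if backend_id is None:
        return base / "default"
    safe = backend_id.replace("/", "_")  # keep the id filesystem-safe
    return base / "custom" / safe

assert vector_cache_dir("cache") == Path("cache/default")
assert vector_cache_dir("cache", "openai-text-embedding-3-small") \
    == Path("cache/custom/openai-text-embedding-3-small")
```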

Validation

  • python -m pytest tests/test_pluggable_embeddings.py tests/test_discursive_diversity_custom_embeddings.py -q
  • full FeatureBuilder run completed on a real dataset with:
    • OpenAI text-embedding-3-small for vector-based features
    • default CardiffNLP RoBERTa sentiment model for sentiment features

Notes

  • no new required dependency on openai was added to the package; the OpenAI example remains user-land code supplied via embedding_fn
  • the Discursive Diversity (DD) fix was necessary to support custom embedding dimensions when chunk-level fallback vectors are used
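The dimension issue behind the DD fix can be illustrated as follows. This is a sketch, not the package's actual code; it assumes (plausibly, given the note above) that fallback chunk vectors previously took the default 768-dim shape regardless of the configured backend:

```python
import numpy as np

def fallback_chunk_vector(dim=768):
    # Hypothetical illustration: a fallback vector parameterized by the
    # configured embedding_dim. If the fallback were fixed at the
    # 768-dim default, chunks from a 1536-dim backend (e.g. OpenAI
    # text-embedding-3-small) would be shape-incompatible with real
    # chunk vectors in the Discursive Diversity computation.
    return np.zeros(dim)

assert fallback_chunk_vector().shape == (768,)
assert fallback_chunk_vector(1536).shape == (1536,)
```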
