feat(fts): add stemmer token filter based on Snowball 3.1.1 by egolearner · Pull Request #513 · alibaba/zvec

egolearner · 2026-06-22T11:47:21Z

Implement a stemmer token filter for FTS that reduces words to their root form using the Snowball stemming library. Supports 34+ languages configurable via stemmer_lang in extra_params (defaults to english).
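To illustrate the defaulting rule described above, here is a minimal sketch of how a language could be resolved from the extra parameters. The `ExtraParams` alias and `resolve_stemmer_lang` are hypothetical names, not taken from the PR, and a plain string map stands in for the parsed JSON object. Treating an empty value the same as a missing key is also an assumption.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the parsed extra_params JSON object.
using ExtraParams = std::unordered_map<std::string, std::string>;

// Returns the configured stemmer language, falling back to "english"
// when "stemmer_lang" is absent (or empty, which is an assumption here).
inline std::string resolve_stemmer_lang(const ExtraParams &extra) {
  auto it = extra.find("stemmer_lang");
  if (it == extra.end() || it->second.empty()) {
    return "english";
  }
  return it->second;
}
```

With this sketch, an empty parameter set resolves to `english`, while `{"stemmer_lang": "german"}` resolves to `german`.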

Changes:

- Integrate Snowball 3.1.1 as a thirdparty static library
- Add StemmerTokenFilter with thread_local stemmer cache (lock-free)
- Register 'stemmer' filter in TokenizerFactory
- Add unit tests and FtsColumnIndexer end-to-end tests
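The thread_local cache mentioned in the list above can be sketched roughly as follows. This is not the PR's code: `ToyStemmer` and `thread_local_stemmer` are hypothetical names, and the toy stemmer only strips a trailing "s" so that the example needs no external library. The point is the caching pattern: each thread owns its own map of stemmer objects, so lookups need no lock.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a per-language stemmer handle. In the PR this
// role is played by a Snowball stemmer; this toy only strips a trailing "s".
struct ToyStemmer {
  explicit ToyStemmer(std::string l) : lang(std::move(l)) {}

  std::string stem(const std::string &word) const {
    if (word.size() > 1 && word.back() == 's') {
      return word.substr(0, word.size() - 1);
    }
    return word;
  }

  std::string lang;
};

// Returns the calling thread's stemmer for `lang`, creating it on first use.
// The map is thread_local, so no other thread can touch it and no mutex is
// required; the trade-off is one stemmer instance per thread per language.
inline ToyStemmer &thread_local_stemmer(const std::string &lang) {
  thread_local std::unordered_map<std::string, std::unique_ptr<ToyStemmer>>
      cache;
  std::unique_ptr<ToyStemmer> &slot = cache[lang];
  if (!slot) {
    slot = std::make_unique<ToyStemmer>(lang);
  }
  return *slot;
}
```

Repeated calls from the same thread with the same language return the same object, while a different thread would get its own separate instance.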

feihongxu0824 · 2026-06-30T11:24:21Z

                filter_name.c_str());
      return nullptr;
    }
+    if (!filter->init(extra_json)) {


The header file and the Python docstring should document how to use the stemmer's extra JSON, shouldn't they? They should also mention the stemmer filter itself.

feihongxu0824 · 2026-06-30T11:30:04Z

+  }
+  auto *test_stemmer = sb_stemmer_new(language_.c_str(), nullptr);
+  if (!test_stemmer) {
+    LOG_ERROR("[StemmerTokenFilter] failed to create stemmer for language: %s",


This error is raised fairly deep in the call path. Could the check be moved into schema validation?
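One way to read this suggestion is to check the configured language when the schema is validated, before any filter is constructed. The sketch below assumes hypothetical names (`kKnownLangs`, `validate_stemmer_lang`) and a hard-coded illustrative subset of languages. A real implementation would more likely take the list from the stemming library itself (libstemmer exposes `sb_stemmer_list()` for this).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Illustrative subset only (an assumption); not the list used by the PR.
constexpr std::array<const char *, 4> kKnownLangs = {"english", "french",
                                                     "german", "spanish"};

// Returns an empty string when `lang` is accepted, otherwise an error
// message that a schema validator could surface to the caller.
inline std::string validate_stemmer_lang(const std::string &lang) {
  const bool known =
      std::any_of(kKnownLangs.begin(), kKnownLangs.end(),
                  [&lang](const char *candidate) { return lang == candidate; });
  if (known) {
    return "";
  }
  return "unsupported stemmer_lang: " + lang;
}
```

Running the check this early means a misconfigured language is reported when the schema is created rather than when the first document is indexed.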

feihongxu0824 · 2026-07-01T02:13:07Z

 TokenFilterPtr TokenizerFactory::create_filter(const std::string &filter_name) {
  if (filter_name == "lowercase") {
    return std::make_shared<LowercaseTokenFilter>();
+  } else if (filter_name == "stemmer") {


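When a stemmer is configured in Elasticsearch, the internal default is porter, which differs from zvec's behavior (Snowball + english). This needs a code comment and should be surfaced in the documentation.

Reference: https://www.elastic.co/docs/reference/text-analysis/analysis-stemmer-tokenfilter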


egolearner requested review from chinaux and zhourrr as code owners June 22, 2026 11:47

egolearner force-pushed the feat/fts-stemmer-token-filter branch from 9ddb69a to dd53084 Compare June 22, 2026 11:48

github-actions Bot assigned egolearner Jun 22, 2026

egolearner force-pushed the feat/fts-stemmer-token-filter branch from dd53084 to 11961a4 Compare June 23, 2026 02:21

feihongxu0824 reviewed Jul 1, 2026


egolearner added 4 commits July 2, 2026 11:47

fix ci

b1875a7

fix(fts): validate stemmer filter config

02f52a9

docs(fts): clarify extra params by component

151a562

egolearner force-pushed the feat/fts-stemmer-token-filter branch from c5a25cc to 151a562 Compare July 2, 2026 04:06

egolearner requested a review from Cuiyus as a code owner July 2, 2026 04:06

egolearner wants to merge 4 commits into alibaba:main from egolearner:feat/fts-stemmer-token-filter
