feat(fts): add stemmer token filter based on Snowball 3.1.1 #513

Open
egolearner wants to merge 4 commits into alibaba:main from egolearner:feat/fts-stemmer-token-filter

@egolearner

Collaborator

Implement a stemmer token filter for FTS that reduces words to their root form using the Snowball stemming library. Supports 34+ languages configurable via stemmer_lang in extra_params (defaults to english).

Changes:

  • Integrate Snowball 3.1.1 as a thirdparty static library
  • Add StemmerTokenFilter with thread_local stemmer cache (lock-free)
  • Register 'stemmer' filter in TokenizerFactory
  • Add unit tests and FtsColumnIndexer end-to-end tests
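
The thread_local cache mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeStemmer` and `get_thread_stemmer` are hypothetical names, and `FakeStemmer` stands in for a libstemmer handle that a real filter would create with `sb_stemmer_new` and release with `sb_stemmer_delete`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a Snowball stemmer handle (sb_stemmer in libstemmer).
struct FakeStemmer {
  std::string language;
};

// Returns the calling thread's stemmer for `language`, creating it on first use.
// The map is thread_local, so each thread owns its own instances and no mutex
// is needed around lookups or insertions.
FakeStemmer *get_thread_stemmer(const std::string &language) {
  assert(!language.empty());
  thread_local std::unordered_map<std::string, std::unique_ptr<FakeStemmer>> cache;
  auto it = cache.find(language);
  if (it == cache.end()) {
    it = cache.emplace(language,
                       std::make_unique<FakeStemmer>(FakeStemmer{language}))
             .first;
  }
  return it->second.get();
}
```

Repeated calls with the same language on one thread return the same object, while a different language or a different thread gets its own instance. The trade-off is one stemmer per thread per language instead of a single shared, locked one.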

@egolearner egolearner force-pushed the feat/fts-stemmer-token-filter branch from 9ddb69a to dd53084 Compare June 22, 2026 11:48
@egolearner egolearner force-pushed the feat/fts-stemmer-token-filter branch from dd53084 to 11961a4 Compare June 23, 2026 02:21
filter_name.c_str());
return nullptr;
}
if (!filter->init(extra_json)) {

Collaborator

The header file and the Python docstring should document how to use the stemmer's extra JSON, shouldn't they? The stemmer filter itself should be documented there as well.

Collaborator Author

done

}
auto *test_stemmer = sb_stemmer_new(language_.c_str(), nullptr);
if (!test_stemmer) {
LOG_ERROR("[StemmerTokenFilter] failed to create stemmer for language: %s",

Collaborator

This error is raised fairly deep in the call path. Could the check be moved into schema validation?

Collaborator Author

ok
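
One way the early check suggested in this thread could look is sketched below. Everything here is an assumption for illustration: `StemmerLangCheck`, `is_supported_stemmer_lang`, and `check_stemmer_lang` are hypothetical names, an empty string is assumed to mean "not set", and the language list is a small illustrative subset. A real check would more likely build the list from libstemmer's `sb_stemmer_list()`.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Illustrative subset of Snowball algorithm names, not the full list.
bool is_supported_stemmer_lang(const std::string &lang) {
  static const std::array<const char *, 5> kLangs = {
      "english", "french", "german", "russian", "spanish"};
  return std::any_of(kLangs.begin(), kLangs.end(),
                     [&](const char *name) { return lang == name; });
}

// Hypothetical result type for a schema-level check.
struct StemmerLangCheck {
  bool ok;
  std::string language;  // resolved language when ok
  std::string error;     // message when !ok
};

// Resolves the default (english, per the PR description) and rejects unknown
// languages before any index or filter object is constructed.
StemmerLangCheck check_stemmer_lang(const std::string &requested) {
  const std::string lang = requested.empty() ? "english" : requested;
  if (!is_supported_stemmer_lang(lang)) {
    return {false, "", "unsupported stemmer_lang: " + lang};
  }
  return {true, lang, ""};
}
```

With a check like this at schema validation time, a bad `stemmer_lang` is reported when the schema is submitted, and the `sb_stemmer_new` null check inside the filter becomes a last line of defense.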

TokenFilterPtr TokenizerFactory::create_filter(const std::string &filter_name) {
if (filter_name == "lowercase") {
return std::make_shared<LowercaseTokenFilter>();
} else if (filter_name == "stemmer") {

Collaborator

When stemmer is configured in Elasticsearch, it defaults to Porter internally, which differs from zvec's behavior (Snowball + english). This needs a code comment and should also be surfaced in the documentation.

Reference: https://www.elastic.co/docs/reference/text-analysis/analysis-stemmer-tokenfilter

Collaborator Author

ok
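
For context, the name-based dispatch under review, together with the kind of comment requested in this thread, might look like the sketch below. The `TokenFilter` interface and both filter bodies are simplified assumptions, since the PR's real classes are not shown here. In particular, the `StemmerTokenFilter` stub returns tokens unchanged and does no stemming.

```cpp
#include <algorithm>
#include <cctype>
#include <memory>
#include <string>

// Hypothetical, simplified interface for illustration only.
struct TokenFilter {
  virtual ~TokenFilter() = default;
  virtual std::string apply(const std::string &token) const = 0;
};
using TokenFilterPtr = std::shared_ptr<TokenFilter>;

struct LowercaseTokenFilter : TokenFilter {
  std::string apply(const std::string &token) const override {
    std::string out = token;
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
      return static_cast<char>(std::tolower(c));
    });
    return out;
  }
};

// Placeholder: a real filter would call into Snowball here.
struct StemmerTokenFilter : TokenFilter {
  std::string language = "english";
  std::string apply(const std::string &token) const override { return token; }
};

TokenFilterPtr create_filter(const std::string &filter_name) {
  if (filter_name == "lowercase") {
    return std::make_shared<LowercaseTokenFilter>();
  } else if (filter_name == "stemmer") {
    // NOTE: defaults to the Snowball "english" algorithm. Elasticsearch's
    // stemmer filter defaults to Porter, so results may differ.
    return std::make_shared<StemmerTokenFilter>();
  }
  return nullptr;  // unknown filter name
}
```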

@egolearner egolearner force-pushed the feat/fts-stemmer-token-filter branch from c5a25cc to 151a562 Compare July 2, 2026 04:06
@egolearner egolearner requested a review from Cuiyus as a code owner July 2, 2026 04:06