Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries #2928
vitalibondar wants to merge 1 commit into
Search::Query#remove_invalid_search_characters used gsub(/[^\w"]/, " "). In Ruby, \w matches ASCII [a-zA-Z0-9_] only, so any query containing non-Latin characters (Cyrillic, CJK, Greek, Arabic, …) was reduced to whitespace. The blank query then failed Search::Query's presence validation and Search::Record.for_query returned `none`: zero results, even though the FTS index contains the content.
Switch to the POSIX [[:word:]] class, which is Unicode-aware in Ruby (Onigmo), so word characters in any script are preserved. The same ASCII-only \w appears in Search::Stemmer (Trilogy/MySQL path) and is fixed for consistency.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
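The difference between the two character classes can be checked in plain Ruby without the app. The following is a minimal sketch: `sanitise_ascii` and `sanitise_unicode` are hypothetical names that copy only the two `gsub` patterns from this change, and the sample words are illustrative.

```ruby
# Hypothetical helpers that isolate the old and new sanitising patterns.
def sanitise_ascii(terms)
  terms.gsub(/[^\w"]/, " ")          # \w is ASCII-only in Ruby
end

def sanitise_unicode(terms)
  terms.gsub(/[^[[:word:]]"]/, " ")  # POSIX bracket is Unicode-aware
end

# Illustrative sample words, one per script.
samples = { latin: "cards", cyrillic: "картки", greek: "κάρτα", cjk: "検索" }

samples.each do |script, word|
  old_blank = sanitise_ascii(word).strip.empty?
  new_blank = sanitise_unicode(word).strip.empty?
  puts "#{script}: blank with \\w? #{old_blank}, blank with [[:word:]]? #{new_blank}"
end
```

Only the Latin sample survives the `\w` pattern; all four survive `[[:word:]]`.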
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds Unicode-aware search sanitization to support non‑Latin queries (e.g., Cyrillic) and verifies behavior with a new model test.
Changes:
- Add a test covering searches with Cyrillic strings.
- Update term sanitization / stemming preprocessing regexes to allow non‑Latin “word” characters.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/models/search_test.rb | Adds regression test for Cyrillic search queries. |
| app/models/search/stemmer.rb | Updates punctuation-stripping regex to use POSIX word class. |
| app/models/search/query.rb | Updates invalid-character removal regex to use POSIX word class. |
 def stem(value)
   if value.present?
-    value.gsub(/[^\w\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")
+    value.gsub(/[^[[:word:]]\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")
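To see what the stemmer change does to tokenisation, here is a hedged sketch of the preprocessing half of `stem` only. The `tokens` helper and the two constants are hypothetical, and the `STEMMER.stem` call is deliberately left out because the stemmer library is not part of this excerpt.

```ruby
# Hypothetical helper: punctuation stripping and splitting as in Search::Stemmer.stem,
# with the actual stemming step omitted.
def tokens(value, pattern)
  value.gsub(pattern, " ").split(/\s+/).map(&:downcase)
end

ASCII_ONLY = /[^\w\s]/            # pattern before the change
UNICODE    = /[^[[:word:]]\s]/    # pattern after the change

p tokens("Fix-it картки!", ASCII_ONLY) # => ["fix", "it"]
p tokens("Fix-it картки!", UNICODE)    # => ["fix", "it", "картки"]
```

With the ASCII-only pattern the Cyrillic word never reaches the stemmer, so it would be missing from both the indexed and the queried token lists.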
 def remove_invalid_search_characters(terms)
-  terms.gsub(/[^\w"]/, " ")
+  terms.gsub(/[^[[:word:]]"]/, " ")
results = Search::Record.for(@user.account_id).search("картки", user: @user)
assert results.find { |it| it.card_id == card.id }
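The regression test above depends on the app's fixtures. The chain it exercises (sanitise, blank terms, no results) can be sketched in plain Ruby under loud assumptions: `QuerySketch` and `for_query` are hypothetical stand-ins for `Search::Query` and `Search::Record.for_query`, and a substring match replaces the real FTS lookup.

```ruby
# Hypothetical stand-in for Search::Query: sanitised terms plus a presence check.
class QuerySketch
  attr_reader :terms

  def initialize(raw, pattern)
    sanitised = raw.gsub(pattern, " ").strip
    @terms = sanitised.empty? ? nil : sanitised
  end

  def valid?
    !terms.nil?
  end
end

# Hypothetical stand-in for Search::Record.for_query.
# An invalid query short-circuits to an empty result, like the `none` branch.
def for_query(query, index)
  return [] unless query.valid?
  index.select { |doc| doc.include?(query.terms) } # substring match, not real FTS
end

index  = ["нові картки", "new cards"]
before = for_query(QuerySketch.new("картки", /[^\w"]/), index)
after  = for_query(QuerySketch.new("картки", /[^[[:word:]]"]/), index)

puts "before: #{before.inspect}"
puts "after:  #{after.inspect}"
```

The index is identical in both calls; only the sanitising pattern decides whether the lookup runs at all.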
Thanks for the automated review! Quick notes on the three comments:
"a[b фільтр".gsub(/[^[[:word:]]"]/, " ") # => "a b фільтр" (the "[" is stripped)
"a[b фільтр".gsub(/[^\p{Word}"]/, " ") # => "a b фільтр" (identical)
Search and filter return zero results for any query containing non-Latin characters (Ukrainian/Russian Cyrillic, CJK, Greek, Arabic, …), even when matching cards exist.
Root cause
`Search::Query#remove_invalid_search_characters` sanitises input with `terms.gsub(/[^\w"]/, " ")`. In Ruby (Onigmo) `\w` matches ASCII `[a-zA-Z0-9_]` only; it does not match Unicode letters. So a non-Latin query is reduced to whitespace, `terms` becomes blank → `Search::Query` fails its `presence` validation → `Search::Record.for_query` takes the `else none` branch → no results.
The FTS index itself is fine: non-Latin content is indexed and raw `MATCH` works. The bug is purely in query sanitisation, so it affects both the SQLite and Trilogy adapters.
Fix
Use the POSIX `[[:word:]]` class, which is Unicode-aware in Ruby, so word characters in any script are preserved: `terms.gsub(/[^[[:word:]]"]/, " ")`.
The same ASCII-only `\w` appears in `Search::Stemmer.stem` (used on the Trilogy/MySQL path for both indexing and querying), fixed here too for consistency.
Test
Added a Cyrillic case to `SearchTest` mirroring the existing hyphenated-string test. It fails on `main` (zero results) and passes with this change.
Verification
Confirmed on a live self-hosted deployment (`ghcr.io/basecamp/fizzy:main`): before the patch a Cyrillic query returns `terms => nil` and `0` results; after, the same query returns the expected cards (e.g. `search("картки")` → 294 results), while Latin search is unaffected.
🤖 Generated with Claude Code