Skip to content

Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries#2928

Open
vitalibondar wants to merge 1 commit into
basecamp:mainfrom
vitalibondar:fix-non-latin-search
Open

Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries#2928
vitalibondar wants to merge 1 commit into
basecamp:mainfrom
vitalibondar:fix-non-latin-search

Conversation

@vitalibondar
Copy link
Copy Markdown

Search and filter return zero results for any query containing non-Latin characters (Ukrainian/Russian Cyrillic, CJK, Greek, Arabic, …), even when matching cards exist.

Root cause

Search::Query#remove_invalid_search_characters sanitises input with terms.gsub(/[^\w"]/, " "). In Ruby (Onigmo) \w matches ASCII [a-zA-Z0-9_] only — it does not match Unicode letters. So a non-Latin query is reduced to whitespace, terms becomes blank → Search::Query fails its presence validation → Search::Record.for_query takes the else none branch → no results.

The FTS index itself is fine: non-Latin content is indexed and raw MATCH works. The bug is purely in query sanitisation, so it affects both the SQLite and Trilogy adapters.

Fix

Use the POSIX [[:word:]] class, which is Unicode-aware in Ruby, so word characters in any script are preserved:

- terms.gsub(/[^\w"]/, " ")
+ terms.gsub(/[^[[:word:]]"]/, " ")

The same ASCII-only \w appears in Search::Stemmer.stem (used on the Trilogy/MySQL path for both indexing and querying), fixed here too for consistency.

Test

Added a Cyrillic case to SearchTest mirroring the existing hyphenated-string test. It fails on main (zero results) and passes with this change.

Verification

Confirmed on a live self-hosted deployment (ghcr.io/basecamp/fizzy:main): before the patch a Cyrillic query returns terms => nil and 0 results; after, the same query returns the expected cards (e.g. search("картки") → 294 results), while Latin search is unaffected.

🤖 Generated with Claude Code

Search::Query#remove_invalid_search_characters used gsub(/[^\w"]/, " ").
In Ruby, \w matches ASCII [a-zA-Z0-9_] only, so any query containing
non-Latin characters (Cyrillic, CJK, Greek, Arabic, …) was reduced to
whitespace. The blank query then failed Search::Query's presence
validation and Search::Record.for_query returned `none` — zero results,
even though the FTS index contains the content.

Switch to the POSIX [[:word:]] class, which is Unicode-aware in Ruby
(Onigmo), so word characters in any script are preserved. The same
ASCII-only \w appears in Search::Stemmer (Trilogy/MySQL path) and is
fixed for consistency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 7, 2026 23:30
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Unicode-aware search sanitization to support non‑Latin queries (e.g., Cyrillic) and verifies behavior with a new model test.

Changes:

  • Add a test covering searches with Cyrillic strings.
  • Update term sanitization / stemming preprocessing regexes to allow non‑Latin “word” characters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
test/models/search_test.rb Adds regression test for Cyrillic search queries.
app/models/search/stemmer.rb Updates punctuation-stripping regex to use POSIX word class.
app/models/search/query.rb Updates invalid-character removal regex to use POSIX word class.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

def stem(value)
if value.present?
value.gsub(/[^\w\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")
value.gsub(/[^[[:word:]]\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")

def remove_invalid_search_characters(terms)
terms.gsub(/[^\w"]/, " ")
terms.gsub(/[^[[:word:]]"]/, " ")
Comment on lines +49 to +50
results = Search::Record.for(@user.account_id).search("картки", user: @user)
assert results.find { |it| it.card_id == card.id }
@vitalibondar
Copy link
Copy Markdown
Author

Thanks for the automated review! Quick notes on the three comments:

app/models/search/query.rb:36 and app/models/search/stemmer.rb:8[[:word:]] is intentional and correct. In Ruby (Onigmo), the inner [...] is a nested character class, not a literal [. So /[^[[:word:]]"]/ means "anything that is not a word character and not a quote" — there is no stray literal [ admitted to the set. Verified:

"a[b фільтр".gsub(/[^[[:word:]]"]/, " ")  # => "a b фільтр"  (the "[" is stripped)
"a[b фільтр".gsub(/[^\p{Word}"]/,   " ")  # => "a b фільтр"  (identical)

/[^[[:word:]]"]/ and /[^\p{Word}"]/ produce byte-identical output across the inputs I tried (Cyrillic, brackets, quotes, hyphenated). Happy to switch to \p{Word} if you prefer it for readability — no behavioural difference.

test/models/search_test.rb — assertion style. I mirrored the existing assert results.find { |it| it.card_id == card.id } pattern used by the neighbouring tests in this file for consistency. Glad to switch to any?/card_id inclusion if that is the preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants