Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries by vitalibondar · Pull Request #2928 · basecamp/fizzy

vitalibondar · 2026-06-07T23:30:26Z

Search and filter return zero results for any query containing non-Latin characters (Ukrainian/Russian Cyrillic, CJK, Greek, Arabic, …), even when matching cards exist.

Root cause

Search::Query#remove_invalid_search_characters sanitises input with terms.gsub(/[^\w"]/, " "). In Ruby (Onigmo) \w matches ASCII [a-zA-Z0-9_] only — it does not match Unicode letters. So a non-Latin query is reduced to whitespace, terms becomes blank → Search::Query fails its presence validation → Search::Record.for_query takes the else none branch → no results.

The FTS index itself is fine: non-Latin content is indexed and raw MATCH works. The bug is purely in query sanitisation, so it affects both the SQLite and Trilogy adapters.

Fix

Use the POSIX [[:word:]] class, which is Unicode-aware in Ruby, so word characters in any script are preserved:

- terms.gsub(/[^\w"]/, " ")
+ terms.gsub(/[^[[:word:]]"]/, " ")

The same ASCII-only \w appears in Search::Stemmer.stem (used on the Trilogy/MySQL path for both indexing and querying), fixed here too for consistency.

Test

Added a Cyrillic case to SearchTest mirroring the existing hyphenated-string test. It fails on main (zero results) and passes with this change.

Verification

Confirmed on a live self-hosted deployment (ghcr.io/basecamp/fizzy:main): before the patch a Cyrillic query returns terms => nil and 0 results; after, the same query returns the expected cards (e.g. search("картки") → 294 results), while Latin search is unaffected.

🤖 Generated with Claude Code

Search::Query#remove_invalid_search_characters used gsub(/[^\w"]/, " "). In Ruby, \w matches ASCII [a-zA-Z0-9_] only, so any query containing non-Latin characters (Cyrillic, CJK, Greek, Arabic, …) was reduced to whitespace. The blank query then failed Search::Query's presence validation and Search::Record.for_query returned `none` — zero results, even though the FTS index contains the content. Switch to the POSIX [[:word:]] class, which is Unicode-aware in Ruby (Onigmo), so word characters in any script are preserved. The same ASCII-only \w appears in Search::Stemmer (Trilogy/MySQL path) and is fixed for consistency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Unicode-aware search sanitization to support non‑Latin queries (e.g., Cyrillic) and verifies behavior with a new model test.

Changes:

Add a test covering searches with Cyrillic strings.
Update term sanitization / stemming preprocessing regexes to allow non‑Latin “word” characters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
test/models/search_test.rb	Adds regression test for Cyrillic search queries.
app/models/search/stemmer.rb	Updates punctuation-stripping regex to use POSIX word class.
app/models/search/query.rb	Updates invalid-character removal regex to use POSIX word class.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  def stem(value)
    if value.present?
-      value.gsub(/[^\w\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")
+      value.gsub(/[^[[:word:]]\s]/, " ").split(/\s+/).map { |word| STEMMER.stem(word.downcase) }.join(" ")



    def remove_invalid_search_characters(terms)
-      terms.gsub(/[^\w"]/, " ")
+      terms.gsub(/[^[[:word:]]"]/, " ")


+    results = Search::Record.for(@user.account_id).search("картки", user: @user)
+    assert results.find { |it| it.card_id == card.id }


vitalibondar · 2026-06-07T23:40:45Z

Thanks for the automated review! Quick notes on the three comments:

app/models/search/query.rb:36 and app/models/search/stemmer.rb:8 — [[:word:]] is intentional and correct. In Ruby (Onigmo), the inner [...] is a nested character class, not a literal [. So /[^[[:word:]]"]/ means "anything that is not a word character and not a quote" — there is no stray literal [ admitted to the set. Verified:

"a[b фільтр".gsub(/[^[[:word:]]"]/, " ")  # => "a b фільтр"  (the "[" is stripped)
"a[b фільтр".gsub(/[^\p{Word}"]/,   " ")  # => "a b фільтр"  (identical)

/[^[[:word:]]"]/ and /[^\p{Word}"]/ produce byte-identical output across the inputs I tried (Cyrillic, brackets, quotes, hyphenated). Happy to switch to \p{Word} if you prefer it for readability — no behavioural difference.

test/models/search_test.rb — assertion style. I mirrored the existing assert results.find { |it| it.card_id == card.id } pattern used by the neighbouring tests in this file for consistency. Glad to switch to any?/card_id inclusion if that is the preference.

Copilot AI review requested due to automatic review settings June 7, 2026 23:30

claude Bot reviewed Jun 7, 2026

View reviewed changes

Copilot AI reviewed Jun 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries#2928

Fix search returning no results for non-Latin (Cyrillic/CJK/Greek/…) queries#2928
vitalibondar wants to merge 1 commit into
basecamp:mainfrom
vitalibondar:fix-non-latin-search

vitalibondar commented Jun 7, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

vitalibondar commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		results = Search::Record.for(@user.account_id).search("картки", user: @user)
		assert results.find { \|it\| it.card_id == card.id }

Conversation

vitalibondar commented Jun 7, 2026

Root cause

Fix

Test

Verification

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

vitalibondar commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants