Fix PPL CalciteException for non-ASCII string literals (e.g. Chinese characters) #5504

Open
gingeekrishna wants to merge 2 commits into opensearch-project:main from gingeekrishna:fix/21880-ppl-non-ascii-string-literal

gingeekrishna commented Jun 2, 2026

Summary

Hi @dai-chen

PPL queries containing non-ASCII string literals (Chinese, Arabic, etc.) fail with a CalciteException on OpenSearch 3.6.0, while the identical query worked on 3.1 and the equivalent SQL query works fine on 3.6.0.

Root cause: In CalciteRexNodeVisitor.visitLiteral(), the STRING case builds a VARCHAR/CHAR type using typeFactory.createSqlType(SqlTypeName.VARCHAR) without specifying a charset. Calcite defaults to ISO-8859-1, which cannot encode non-Latin characters, so RexBuilder.makeLiteral() fails when it reaches NlsString.<init>() (see the stack trace below).

Fix: Explicitly create the type with UTF-8 charset and IMPLICIT collation via typeFactory.createTypeWithCharsetAndCollation() for both the CHAR(1) and VARCHAR branches of the STRING literal case.

org.apache.calcite.runtime.CalciteException: Failed to encode '未处置' in character set 'ISO-8859-1'
    at org.apache.calcite.util.NlsString.<init>(NlsString.java:155)
    at org.apache.calcite.rex.RexBuilder.clean(RexBuilder.java:2296)
    at org.apache.calcite.rex.RexBuilder.makeLiteral(RexBuilder.java:2070)
    at org.opensearch.sql.calcite.CalciteRexNodeVisitor.visitLiteral(CalciteRexNodeVisitor.java:127)
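
For reviewers who want to see the failure mode in the trace above without a Calcite setup, here is a minimal standalone sketch using only the JDK. It is not part of this PR; the class name `CharsetEncodingDemo` and the helper `canEncode` are hypothetical names chosen for illustration. It only shows that ISO-8859-1 cannot represent the Chinese literal while UTF-8 can, which is the condition that makes NlsString reject the value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; it is not part of this PR.
class CharsetEncodingDemo {

    /** Returns true when every character of text is representable in charset. */
    static boolean canEncode(String text, Charset charset) {
        // Create a fresh encoder per call; CharsetEncoder instances are stateful.
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String chinese = "\u672a\u5904\u7f6e"; // the Chinese literal from the stack trace
        String latin = "caf\u00e9";            // a Latin-1 string with an accented letter

        System.out.println("ISO-8859-1 can encode the Chinese literal: "
                + canEncode(chinese, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode the Chinese literal: "
                + canEncode(chinese, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 can encode the Latin-1 string: "
                + canEncode(latin, StandardCharsets.ISO_8859_1));
    }
}
```

Accented Latin-1 text passes the ISO-8859-1 check, which may be why the problem only surfaces with scripts such as Chinese or Arabic.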

Changes

| File | Change |
| --- | --- |
| CalciteRexNodeVisitor.java | Use UTF-8 charset when creating CHAR/VARCHAR types for string literals |
| CalciteRexNodeVisitorTest.java | Add regression test with Chinese, Arabic, and single non-ASCII character literals |

Test plan

  • testVisitLiteralNonAsciiStringDoesNotThrow: verifies that Chinese (未处置), Arabic (مرحبا), and single non-ASCII character literals build successfully without throwing CalciteException
  • All existing CalciteRexNodeVisitorTest tests continue to pass

Fixes opensearch-project/OpenSearch#21880

visitLiteral() built VARCHAR/CHAR types using
typeFactory.createSqlType(SqlTypeName.VARCHAR) without specifying a
charset. Calcite defaults to ISO-8859-1, which cannot encode non-Latin
characters, causing a CalciteException at query time.

Fix: explicitly create the type with UTF-8 charset and IMPLICIT collation
via typeFactory.createTypeWithCharsetAndCollation() for both the CHAR(1)
and VARCHAR branches of the STRING literal case.

This is a regression introduced in 3.6.0 when the PPL/Calcite
integration was added. SQL queries were unaffected because the SQL path
uses a different literal-building flow.

Fixes opensearch-project/OpenSearch#21880

Signed-off-by: Radhakrishnan Pachyappan <gingeekrishna@gmail.com>

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an explicit UTF-8 charset/collation when producing Calcite string literals to prevent non-ASCII literals from throwing, and introduces a regression test for the reported failure.

Changes:

  • Update visitLiteral to build CHAR/VARCHAR types with UTF-8 charset and implicit collation.
  • Add a regression test covering Chinese/Arabic literals and the CHAR(1) path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| core/src/main/java/org/opensearch/sql/calcite/CalciteRexNodeVisitor.java | Forces UTF-8 charset/collation for string literals to avoid Calcite NlsString rejection of non-ASCII. |
| core/src/test/java/org/opensearch/sql/calcite/CalciteRexNodeVisitorTest.java | Adds regression coverage for non-ASCII string literal visitation and CHAR(1) behavior. |

Comment thread core/src/test/java/org/opensearch/sql/calcite/CalciteRexNodeVisitorTest.java Outdated

github-actions Bot commented Jun 2, 2026

PR Reviewer Guide 🔍

(Review updated until commit cd5d733)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions Bot commented Jun 2, 2026

PR Code Suggestions ✨

Latest suggestions up to cd5d733
Explore these optional code suggestions:

Category: Possible issue
Validate single character for CHAR type

The CHAR type creation should validate that the string length is exactly 1 before
creating the type. If value.toString() somehow produces a multi-character string
despite the length check, this could cause inconsistencies between the type
definition and actual data.

core/src/main/java/org/opensearch/sql/calcite/CalciteRexNodeVisitor.java [141-146]

+String strValue = value.toString();
+if (strValue.length() != 1) {
+    throw new IllegalStateException("Expected single character for CHAR type, got: " + strValue.length());
+}
 return rexBuilder.makeLiteral(
-    value.toString(),
+    strValue,
     typeFactory.createTypeWithCharsetAndCollation(
         typeFactory.createSqlType(SqlTypeName.CHAR),
         StandardCharsets.UTF_8,
         SqlCollation.IMPLICIT));

Suggestion importance [1-10]: 3

Why: The suggestion adds defensive validation, but the code already checks value.toString().length() == 1 at line 138 before entering this branch. Adding redundant validation would be unnecessary and reduce code readability. The suggestion overlooks the existing guard condition.

Impact: Low

Previous suggestions

Suggestions up to commit 9e379cd
Category: Possible issue
Use VARCHAR for single-character strings

The single-character string handling creates a CHAR type, but multi-byte UTF-8
characters (like Chinese) may require more than one byte. Consider using VARCHAR for
all strings to avoid potential truncation or encoding issues with non-ASCII single
characters.

core/src/main/java/org/opensearch/sql/calcite/CalciteRexNodeVisitor.java [141-146]

 return rexBuilder.makeLiteral(
     value.toString(),
     typeFactory.createTypeWithCharsetAndCollation(
-        typeFactory.createSqlType(SqlTypeName.CHAR),
+        typeFactory.createSqlType(SqlTypeName.VARCHAR),
         StandardCharsets.UTF_8,
-        SqlCollation.IMPLICIT));
+        SqlCollation.IMPLICIT),
+    true);

Suggestion importance [1-10]: 3

Why: While the concern about multi-byte UTF-8 characters is valid, the PR explicitly uses the UTF-8 charset, which handles multi-byte characters correctly. The CHAR vs VARCHAR distinction is intentional per the comment "To align Spark/PostgreSQL, Char(1) is useful, such as cast('1' to boolean) should return true". The test at lines 114-117 confirms that single-character handling works correctly with UTF-8.

Impact: Low
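
The byte-length concern raised in this suggestion can be checked in isolation. The following is a minimal JDK-only sketch, not code from this PR; the class `CharLengthDemo` and its helpers are hypothetical names. It assumes the single-character branch is selected by the `value.toString().length() == 1` check quoted in the review above, and shows that `String.length()` counts UTF-16 code units rather than UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; it is not part of this PR.
class CharLengthDemo {

    /** UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of bytes needed to store the string in UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmpChar = "\u672a";         // one Chinese character in the Basic Multilingual Plane
        String supplementary = "\uD83D\uDE00"; // one code point (U+1F600) stored as a surrogate pair

        for (String s : new String[] {bmpChar, supplementary}) {
            System.out.println("code points=" + codePoints(s)
                    + ", UTF-16 units=" + utf16Units(s)
                    + ", UTF-8 bytes=" + utf8Bytes(s));
        }
    }
}
```

Under that assumption, a single Chinese character still satisfies a `length() == 1` guard even though it occupies three bytes in UTF-8, so the CHAR(1) path is chosen by character count, not byte count. A supplementary character such as an emoji has `length() == 2` and would presumably fall through to the VARCHAR branch.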

- Remove unused realRexBuilder variable (context.rexBuilder is already
  a real ExtendedRexBuilder backed by TYPE_FACTORY via the constructor)
- Add charset assertions to verify resulting RelDataType carries UTF-8,
  so future accidental charset drops are caught
- Remove unused RexBuilder import

Signed-off-by: Radhakrishnan Pachyappan <gingeekrishna@gmail.com>

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit cd5d733

Development

Successfully merging this pull request may close these issues.

[BUG] PPL CalciteException: Failed to encode Chinese characters in ISO-8859-1 on 3.6.0 (works on 3.1)

2 participants