Gemma4 parser #4179
Open
przepeck wants to merge 38 commits into
Conversation
dkalinowski
reviewed
May 6, 2026
Contributor
Pull request overview
This PR adds Gemma4 tool-call parsing support to OVMS LLM I/O processing and adjusts the VLM legacy unary path to preserve/handle special tokens by reusing the streaming text collection logic (per CVS-184756). It also updates model-prep scripts and introduces a dedicated Gemma4 output parser test suite.
Changes:
- Added a new `Gemma4ToolParser` and wired it into `OutputParser` selection.
- Modified VLM legacy unary response preparation to reconstruct the full text from streamer callbacks (to retain special tokens) and updated OpenAI API handler interfaces accordingly.
- Added Gemma4 tokenizer/model preparation steps and extensive unit tests for Gemma4 tool-call parsing (unary + streaming scenarios).
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| windows_prepare_llm_models.bat | Adds Gemma4 tokenizer download step for Windows test model prep. |
| prepare_llm_models.sh | Adds Gemma4 tokenizer conversion/download for Linux test model prep. |
| src/test/llm/output_parsers/gemma4_output_parser_test.cpp | New unit tests covering Gemma4 tool-call parsing and streaming chunk behavior. |
| src/test/http_openai_handler_test.cpp | Updates tests to match the new VLM unary serialization API signature. |
| src/llm/visual_language_model/legacy/servable.cpp | Implements unary “streaming-style” text accumulation to preserve special tokens; passes full text into unary serialization. |
| src/llm/io_processing/utils.hpp | Adds new utility declarations used by Gemma4 parsing/serialization. |
| src/llm/io_processing/utils.cpp | Implements JSON argument writing and delimiter search respecting nested structures/quotes. |
| src/llm/io_processing/output_parser.cpp | Registers gemma4 tool parser type. |
| src/llm/io_processing/gemma4/tool_parser.hpp | New Gemma4 tool parser interface + streaming state machine declarations. |
| src/llm/io_processing/gemma4/tool_parser.cpp | New Gemma4 tool parser implementation (unary + streaming). |
| src/llm/BUILD | Adds Bazel targets/deps for Gemma4 parser and RapidJSON usage in utils. |
| src/llm/apis/openai_api_handler.hpp | Updates VLM unary serialization interface to accept explicit textResponse. |
| src/llm/apis/openai_completions.hpp | Updates VLM unary serialization signature. |
| src/llm/apis/openai_completions.cpp | Uses provided textResponse for VLM unary serialization. |
| src/llm/apis/openai_responses.hpp | Updates VLM unary serialization signature. |
| src/llm/apis/openai_responses.cpp | Uses provided textResponse for VLM unary serialization. |
| spelling-whitelist.txt | Adds the new Gemma4 test file to the whitelist list. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:497
- `serializeUnaryResponse(VLMDecodedResults&, textResponse)` only emits a choice when `textResponse` is non-empty. If the model legitimately generates an empty string, this returns an empty `choices` array, which is not a valid OpenAI-style response (a single choice with empty content/tool_calls should still be returned). Consider always creating one choice and using `textResponse` as-is (even when empty).
```cpp
if (!textResponse.empty()) {
    SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);
    // Workaround to use OVMS unary parsers: get tokens from string.
    // This way we have detokenized text from GenAI and calculate tokens, to further convert back to text again in parseOutputIfNeeded.
    auto generatedTokens = encodeTextToTokens(textResponse);
    SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
    ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
    jsonResponse.StartObject();
    // finish_reason: "stop" in regular scenario, "tool_calls" if output contains tool calls
    auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
    jsonResponse.FinishReason(finishReason.value_or("unknown"));
    // index: integer; choice index, only n=1 supported anyway
    jsonResponse.Index(index++);
    // TODO: logprobs: object/null; log probability information for the choice
    if (endpoint == Endpoint::CHAT_COMPLETIONS) {
        jsonResponse.MessageObject(parsedOutput);
    } else if (endpoint == Endpoint::COMPLETIONS) {
        jsonResponse.Text(parsedOutput);
    }
    // finish message object
    jsonResponse.EndObject();
}
// finish choices array
jsonResponse.EndArray();
```
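The fix Copilot suggests can be sketched as a minimal standalone illustration (this is not the OVMS serializer; `serializeChoices` and the hand-built JSON are hypothetical, and real code would JSON-escape `textResponse`): always emit exactly one choice, passing the text through even when it is empty.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: serialize a choices array that always contains
// exactly one choice, even when the generated text is empty, so the
// response stays a valid OpenAI-style payload.
std::string serializeChoices(const std::string& textResponse) {
    std::string out = "{\"choices\":[";
    out += "{\"index\":0,\"finish_reason\":\"stop\",";
    // Empty content is emitted as "", never as a missing choice.
    // NOTE: textResponse is not escaped here; a real serializer must escape it.
    out += "\"message\":{\"role\":\"assistant\",\"content\":\"" + textResponse + "\"}}";
    out += "]}";
    return out;
}
```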
src/llm/apis/openai_responses.cpp:676
- `serializeUnaryResponse(VLMDecodedResults&, textResponse)` skips adding any `ParsedOutput` when `textResponse` is empty, so `serializeUnaryResponseImpl()` is called with an empty vector. If the model output is legitimately empty, this likely produces a response without any output items/content, which is invalid for the Responses API. Consider emitting a single empty `ParsedOutput` (or otherwise ensuring one output item exists) even when `textResponse` is empty.
```cpp
std::vector<ParsedOutput> parsedOutputs;
if (!textResponse.empty()) {
    if (outputParser != nullptr) {
        // Same workaround as in chat completions
        auto generatedTokens = encodeTextToTokens(textResponse);
        parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
    } else {
        // Fast path: no output parser, use decoded text directly.
        ParsedOutput output;
        output.content = textResponse;
        parsedOutputs.push_back(std::move(output));
    }
}
return serializeUnaryResponseImpl(parsedOutputs);
```
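The suggested remedy can be sketched as follows (hypothetical helper name, with `ParsedOutput` reduced to its `content` field): guarantee at least one output item before serialization.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the real ParsedOutput.
struct ParsedOutput {
    std::string content;
};

// Hypothetical sketch of the fix: the serializer never receives an
// empty vector, so the Responses API always gets one output item.
std::vector<ParsedOutput> buildOutputs(const std::string& textResponse) {
    std::vector<ParsedOutput> parsedOutputs;
    if (!textResponse.empty()) {
        parsedOutputs.push_back(ParsedOutput{textResponse});
    }
    // Fallback: emit a single empty ParsedOutput for legitimately empty output.
    if (parsedOutputs.empty()) {
        parsedOutputs.push_back(ParsedOutput{});
    }
    return parsedOutputs;
}
```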
Comment on lines +39 to +67
```cpp
std::string Gemma4ToolParser::parseArrayParameter(std::string argumentStr) {
    size_t pos = 1;
    std::string parsedArguments = "[";

    while (pos != std::string::npos) {
        size_t stringStartPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, pos);
        if (stringStartPos == std::string::npos) {
            break;
        }
        stringStartPos += TOOL_ARGS_STRING_INDICATOR.size();
        size_t stringEndPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, stringStartPos);
        if (stringEndPos == std::string::npos) {
            break;
        }

        std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
        size_t quotePos = 0;
        while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
            originalStr.insert(quotePos, "\\");
            quotePos += 2;
        }
        parsedArguments += "\"" + originalStr + "\",";

        pos = stringEndPos + TOOL_ARGS_STRING_INDICATOR.size() + 1;
    }

    parsedArguments.back() = ']';

    return parsedArguments;
```
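The escaping step above can be illustrated standalone (the input format and `TOOL_ARGS_STRING_INDICATOR` are abstracted away; this sketch also shows how the empty-input edge case can yield `[]` rather than overwriting the last character of `"["`, which the excerpt's `parsedArguments.back() = ']'` would get wrong when no string was found):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the quote-escaping idea: each element's interior double
// quotes are backslash-escaped before the elements are joined into a
// JSON string array.
std::string toJsonStringArray(const std::vector<std::string>& items) {
    std::string out = "[";
    for (const std::string& item : items) {
        std::string escaped;
        for (char c : item) {
            if (c == '"') escaped += '\\';  // escape interior quotes
            escaped += c;
        }
        out += "\"" + escaped + "\",";
    }
    if (out.back() == ',') out.back() = ']';  // replace trailing comma
    else out += ']';                          // empty input -> "[]"
    return out;
}
```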
Comment on lines +38 to +62
```cpp
void writeArgumentOfAnyType(const rapidjson::Value& arg, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
    if (arg.IsString()) {
        writer.String(arg.GetString());
    } else if (arg.IsInt64()) {
        writer.Int64(arg.GetInt64());
    } else if (arg.IsDouble()) {
        writer.Double(arg.GetDouble());
    } else if (arg.IsBool()) {
        writer.Bool(arg.GetBool());
    } else if (arg.IsArray()) {
        writer.StartArray();
        for (auto& elem : arg.GetArray()) {
            writeArgumentOfAnyType(elem, writer);
        }
        writer.EndArray();
    } else if (arg.IsObject()) {
        writer.StartObject();
        for (auto it = arg.MemberBegin(); it != arg.MemberEnd(); ++it) {
            writer.Key(it->name.GetString());
            writeArgumentOfAnyType(it->value, writer);
        }
        writer.EndObject();
    } else {
        writer.String("");
    }
```
Comment on lines +356 to +386
```cpp
std::optional<rapidjson::Document> Gemma4ToolParser::parseChunk(const std::string& chunk, ov::genai::GenerationFinishReason finishReason) {
    if (chunk.empty()) {
        return std::nullopt;
    }

    this->streamingContent += chunk;

    if (parseNewContent()) {
        if (this->currentState == State::ToolCallParameters) {
            return BaseOutputParser::wrapFirstDelta(this->toolCall.name, toolCallIndex);
        }
        if (this->currentState == State::ToolCallEnded) {
            return wrapDeltaArgs(this->toolCall.arguments, toolCallIndex);
        }
        if (this->currentState == State::Content) {
            size_t contentEnd = this->streamingContent.find(TOOL_CALL_START_TAG, this->streamingPosition);
            std::string content;
            if (contentEnd != std::string::npos) {
                content = this->streamingContent.substr(this->streamingPosition, contentEnd - this->streamingPosition);
            } else {
                content = this->streamingContent.substr(this->streamingPosition);
            }
            this->streamingPosition += content.size();
            if (!content.empty()) {
                return wrapDeltaContent(content);
            }
        }
        if (this->currentState == State::AfterToolCall) {
            this->currentState = State::Content;
        }
    }
```
dtrawins
reviewed
May 6, 2026
dkalinowski
reviewed
May 8, 2026
```cpp
    int singleQuoteDepth = 0;

    for (size_t i = startPos; i < str.length(); ++i) {
        if (bracketDepth == 0 && braceDepth == 0 && quoteDepth == 0 && singleQuoteDepth == 0 &&
```
Collaborator
Is it also used by other parsers? We should also run BFCL for the other parsers that use it so we are sure there is no regression (I think LFM uses it?).
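For reference, the depth-tracking search that snippet belongs to can be sketched as a standalone function (variable names follow the excerpt, but the exact semantics are assumed, and single-quote handling is omitted): the delimiter only matches when it sits outside every bracket, brace, and quoted string.

```cpp
#include <cassert>
#include <string>

// Sketch of a delimiter search respecting nested structures and quotes.
size_t findTopLevel(const std::string& str, char delimiter, size_t startPos = 0) {
    int bracketDepth = 0, braceDepth = 0;
    bool inQuote = false;
    for (size_t i = startPos; i < str.length(); ++i) {
        char c = str[i];
        if (inQuote) {
            if (c == '\\') ++i;              // skip escaped character
            else if (c == '"') inQuote = false;
            continue;
        }
        if (c == '"') inQuote = true;
        else if (c == '[') ++bracketDepth;
        else if (c == ']') --bracketDepth;
        else if (c == '{') ++braceDepth;
        else if (c == '}') --braceDepth;
        else if (c == delimiter && bracketDepth == 0 && braceDepth == 0)
            return i;                        // found at top level
    }
    return std::string::npos;
}
```

The reviewer's concern stands regardless of the implementation details: if this utility is shared (e.g. by the LFM parser), BFCL runs for those parsers are the right regression check.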
dtrawins
approved these changes
May 8, 2026
dkalinowski
reviewed
May 11, 2026
🛠 Summary
CVS-184756
Enables the Gemma4 model and changes the VLM pipeline to accept special tokens in streaming.
🧪 Checklist