qwen: Add support for Qwen3-Embedding models in encode_only mode by Priyanshu31102003 · Pull Request #14758 · NVIDIA/TensorRT-LLM

Priyanshu31102003 · 2026-05-29T19:23:40Z

Description

This PR introduces support for Qwen3-Embedding architectures within the Qwen model implementation. It enables encode_only mode by bypassing the language modeling head projection when an embedding or reward model architecture is detected.

Changes

config.py: Added an is_embedding attribute to QWenConfig, plus auto-detection logic in from_hugging_face that sets it for embedding-specific architectures (e.g., Qwen2ForRewardModel, Qwen2ForEmbedding, Qwen3ForEmbedding).
model.py: Updated QWenForCausalLM.__init__ to skip lm_head allocation when is_embedding is set. Updated the custom key dictionary passed to ModelWeightsLoader so that the q_layernorm/k_layernorm weights are mapped to the checkpoint's q_norm/k_norm names.
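A minimal sketch of the architecture-based detection described above, assuming the HuggingFace `config.json` has already been parsed into a dict with an `architectures` list. The helper name `detect_is_embedding` and the exact-name membership test are illustrative assumptions; the PR's `from_hugging_face` logic may match names differently.

```python
# Hypothetical sketch: the architecture names come from the PR description,
# but this helper is illustrative and is not TensorRT-LLM code.
EMBEDDING_ARCHITECTURES = frozenset({
    "Qwen2ForRewardModel",
    "Qwen2ForEmbedding",
    "Qwen3ForEmbedding",
})


def detect_is_embedding(hf_config: dict) -> bool:
    """Return True if any declared architecture is an embedding/reward variant."""
    architectures = hf_config.get("architectures") or []
    return any(arch in EMBEDDING_ARCHITECTURES for arch in architectures)


if __name__ == "__main__":
    print(detect_is_embedding({"architectures": ["Qwen3ForEmbedding"]}))  # True
    print(detect_is_embedding({"architectures": ["Qwen3ForCausalLM"]}))   # False
```

A frozenset keeps the list of recognized names in one place, so supporting a new variant is a one-line change.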

Summary by CodeRabbit

New Features
- Extended Qwen model support to include dedicated embedding architectures with specialized configuration handling.
- Implemented automatic detection and proper initialization of embedding-only model variants during HuggingFace conversion.
- Added weight-name mapping for embedding model types so the loaded weights match the model structure.

coderabbitai · 2026-05-29T19:28:20Z

Walkthrough

This PR extends Qwen model support to embedding-only variants by introducing an is_embedding configuration flag detected during HuggingFace loading. When enabled, the model skips language model head construction and adjusts weight mapping for embedding architectures.

Changes

Embedding-only Qwen model support

Layer / File(s)	Summary
Config parameter and HuggingFace detection `tensorrt_llm/models/qwen/config.py`	`QWenConfig.__init__` adds `is_embedding: bool = False` parameter and stores it. `from_hugging_face` extends `qwen_type` list to include `qwen3_embedding`, detects embedding models from architecture names, and passes the derived flag to the constructor.
Model conditioning for embedding-only mode `tensorrt_llm/models/qwen/model.py`	`QWenForCausalLM.__init__` skips building a vocabulary `ColumnLinear` head and sets `lm_head = None` when `config.is_embedding` is true. Weight conversion custom key mapping adds logic to map `q_layernorm`/`k_layernorm` to `q_norm`/`k_norm` for embedding models.
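The effect of `is_embedding` on head construction can be sketched with a toy model. `ToyConfig` and `ToyQwen` are hypothetical stand-ins, not TensorRT-LLM classes, and the plain-Python projection only plays the role of the vocabulary `ColumnLinear`. The point is that with the flag set, no head is allocated and the final hidden states are returned unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class ToyConfig:
    hidden_size: int
    vocab_size: int
    is_embedding: bool = False


class ToyQwen:
    """Illustrative stand-in for QWenForCausalLM; not the real class."""

    def __init__(self, config: ToyConfig):
        self.config = config
        # Embedding-only mode: leave the vocabulary projection unallocated.
        self.lm_head: Optional[Callable[[Vector], Vector]] = None
        if not config.is_embedding:
            # Stand-in for the vocab ColumnLinear: a fixed dense projection
            # with one row of hidden_size weights per vocabulary entry.
            weights = [[0.01 * (v + 1)] * config.hidden_size
                       for v in range(config.vocab_size)]
            self.lm_head = lambda h: [sum(w * x for w, x in zip(row, h))
                                      for row in weights]

    def forward(self, hidden: Vector) -> Vector:
        # The decoder stack is elided; `hidden` plays the final hidden state.
        if self.lm_head is None:
            return hidden  # encode_only: hidden states are the output
        return self.lm_head(hidden)  # causal LM: project to vocab logits
```

Checking `lm_head is None` at forward time, rather than re-reading the config flag, keeps the output path tied to what was actually built.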

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

[Feature]: Add Qwen3-Embedding model support (0.6B/4B/8B) for LLM API encode_only mode #14715: Adds is_embedding config and skips lm_head for embedding models, directly implementing the Qwen3-Embedding encode-only behavior.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description adequately explains the purpose and changes, but omits required sections from the template: Title format (ticket/type), Test Coverage, and PR Checklist.	Add PR title with proper format [ticket/issue][type], include Test Coverage section listing relevant tests, and complete the PR Checklist with required items.
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding support for Qwen3-Embedding models in encode_only mode, which aligns with the primary modifications to config.py and model.py.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.



coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/models/qwen/config.py (1)

Lines 38-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve is_embedding in config serialization.

QWenConfig.to_dict() is still hand-maintained and never emits this new field. Any config round-trip will silently clear embedding-only mode and rebuild a regular LM head on reload.

Suggested fix

     def to_dict(self):
         output = super().to_dict()
         # Serialize the fields added in QWenConfig
         output['mlp_bias'] = self.mlp_bias
         output['attn_bias'] = self.attn_bias
         output['rotary_base'] = self.rotary_base
         output['rotary_scaling'] = self.rotary_scaling
         output[
             'disable_weight_only_quant_plugin'] = self.disable_weight_only_quant_plugin
         output['use_logn_attn'] = self.use_logn_attn
         output['mlp_only_layers'] = self.mlp_only_layers
         output['decoder_sparse_step'] = self.decoder_sparse_step
+        output['is_embedding'] = self.is_embedding
         output['moe'] = self.moe.to_dict()
         return output
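To illustrate why the missing `to_dict()` entry matters, here is a trimmed-down, hypothetical config (`MiniQWenConfig`, not the real `QWenConfig`) whose `from_dict` simply forwards keys to the constructor. This assumes deserialization falls back to constructor defaults for absent keys, which is the failure mode the comment describes.

```python
from typing import Any, Dict


class MiniQWenConfig:
    """Hypothetical, trimmed-down config used only to illustrate the round-trip."""

    def __init__(self, rotary_base: float = 10000.0, is_embedding: bool = False):
        self.rotary_base = rotary_base
        self.is_embedding = is_embedding

    def to_dict(self) -> Dict[str, Any]:
        # Every constructor field must be emitted here, or it is lost on reload.
        return {
            "rotary_base": self.rotary_base,
            "is_embedding": self.is_embedding,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniQWenConfig":
        # Absent keys fall back to constructor defaults (is_embedding=False),
        # which is how a field dropped by to_dict() silently reverts.
        return cls(**data)


original = MiniQWenConfig(is_embedding=True)
restored = MiniQWenConfig.from_dict(original.to_dict())
assert restored.is_embedding  # survives only because to_dict() emits it
```

Deleting the `"is_embedding"` line from `to_dict()` would make the final assertion fail, reproducing the silent fallback to a regular LM head.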

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/models/qwen/config.py` around lines 38 - 49, QWenConfig
currently stores is_embedding but QWenConfig.to_dict() does not emit it, so
include "is_embedding" in the dict returned by QWenConfig.to_dict();
additionally ensure any deserialization path (e.g., QWenConfig.from_dict or the
constructor signature) accepts and preserves the is_embedding flag so a
round-trip retains embedding-only mode rather than rebuilding a regular LM head.
Reference: QWenConfig.to_dict, QWenConfig.from_dict/__init__, and the
is_embedding attribute.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f865eaba-b1e3-4935-ae6f-832589d3f3da

📥 Commits

Reviewing files that changed from the base of the PR and between ebbbec4 and e4fb020.

📒 Files selected for processing (2)

tensorrt_llm/models/qwen/config.py
tensorrt_llm/models/qwen/model.py

coderabbitai · 2026-05-29T19:28:23Z

+            elif getattr(config, 'is_embedding', False):
+                custom_dict = {
+                    "q_layernorm": "q_norm",
+                    "k_layernorm": "k_norm",
+                }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep qwen3_embedding on the Qwen3 attention path.

This new mapping assumes embedding checkpoints expose q_norm/k_norm, but QWenDecoderLayer still enables qk_layernorm only for ('qwen3', 'qwen3_moe'). If a checkpoint reports model_type == 'qwen3_embedding', the loader and model definition diverge, and config.py also still leaves attn_bias=True for that type.
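The `custom_dict` in the diff renames parameter-name components between the TensorRT-LLM model and the checkpoint. The helper below is a hypothetical sketch of that translation, assuming a per-component substitution on dot-separated names; the real `ModelWeightsLoader` logic and its other renames are not reproduced here.

```python
from typing import Dict

# Mapping shown in the diff: TRT-LLM-side name -> checkpoint-side name.
CUSTOM_DICT: Dict[str, str] = {
    "q_layernorm": "q_norm",
    "k_layernorm": "k_norm",
}


def to_checkpoint_name(trtllm_name: str, custom_dict: Dict[str, str]) -> str:
    """Hypothetical helper: rename dot-separated components via custom_dict."""
    parts = trtllm_name.split(".")
    return ".".join(custom_dict.get(part, part) for part in parts)


print(to_checkpoint_name("transformer.layers.0.attention.q_layernorm.weight",
                         CUSTOM_DICT))
# transformer.layers.0.attention.q_norm.weight
```

The rename only has something to act on if the layer actually builds `q_layernorm`/`k_layernorm`; when `qk_layernorm` is off for a given `qwen_type`, those parameters never exist and the checkpoint's `q_norm`/`k_norm` tensors go unused, which is the divergence the comment points out.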

Suggested fix

- qk_layernorm = config.qwen_type in ('qwen3', 'qwen3_moe')
+ qk_layernorm = config.qwen_type in (
+     'qwen3',
+     'qwen3_moe',
+     'qwen3_embedding',
+ )

Also mirror the same qwen3_embedding addition in tensorrt_llm/models/qwen/config.py where attn_bias is derived from the Qwen3-only tuple.
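One way to keep the two derived settings from drifting apart is to drive both from a single tuple. The names `QWEN3_TYPES`, `uses_qk_layernorm`, and `default_attn_bias` are hypothetical, and the `attn_bias` rule (Qwen3-family types have no attention bias, other types keep it) is inferred from the review comment rather than copied from `config.py`.

```python
# Hypothetical refactor: one tuple drives both derived settings, so adding a
# new Qwen3 variant cannot update one check and miss the other.
QWEN3_TYPES = ("qwen3", "qwen3_moe", "qwen3_embedding")


def uses_qk_layernorm(qwen_type: str) -> bool:
    # Qwen3-family attention applies layernorm to Q and K.
    return qwen_type in QWEN3_TYPES


def default_attn_bias(qwen_type: str) -> bool:
    # Assumed rule from the review: Qwen3-family types drop the attention bias.
    return qwen_type not in QWEN3_TYPES
```

With a shared constant, the loader mapping, `QWenDecoderLayer`, and the config derivation would all agree on what counts as a Qwen3-style checkpoint.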

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/models/qwen/model.py` around lines 385 - 389, The embedding checkpoint type 'qwen3_embedding' must be treated like the Qwen3 attention path: update the embedding-name-to-parameter mapping block (where getattr(config, 'is_embedding', False) builds custom_dict with "q_layernorm"/"k_layernorm") and ensure the model loader and QWenDecoderLayer recognize 'qwen3_embedding' the same way as 'qwen3' and 'qwen3_moe'; also update the attn_bias derivation in tensorrt_llm/models/qwen/config.py (the tuple that currently lists ('qwen3','qwen3_moe')) to include 'qwen3_embedding' so config.attn_bias, model_type checks, and QWenDecoderLayer's qk_layernorm handling remain consistent for that checkpoint.

Signed-off-by: jet1technology-tech <jet1technology@ryngo.in>

github-actions Bot assigned Priyanshu31102003 May 29, 2026

Priyanshu31102003 force-pushed the feature-qwen3-embedding-support branch from e4fb020 to 222d506 (May 29, 2026 19:27)

coderabbitai Bot reviewed May 29, 2026


qwen: Add support for Qwen3-Embedding models in encode_only mode

c52800f


Priyanshu31102003 force-pushed the feature-qwen3-embedding-support branch from 222d506 to c52800f (May 29, 2026 19:40)

Priyanshu31102003 wants to merge 1 commit into NVIDIA:main from Priyanshu31102003:feature-qwen3-embedding-support