qwen: Add support for Qwen3-Embedding models in encode_only mode #14758
Priyanshu31102003 wants to merge 1 commit into
Force-pushed: e4fb020 to 222d506
📝 Walkthrough
This PR extends Qwen model support to embedding-only variants by introducing an `is_embedding` attribute on `QWenConfig`.
Changes
Embedding-only Qwen model support
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/models/qwen/config.py (1)
38-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Preserve `is_embedding` in config serialization.
`QWenConfig.to_dict()` is still hand-maintained and never emits this new field. Any config round-trip will silently clear embedding-only mode and rebuild a regular LM head on reload.
Suggested fix
  def to_dict(self):
      output = super().to_dict()
      # Serialize the fields added in QWenConfig
      output['mlp_bias'] = self.mlp_bias
      output['attn_bias'] = self.attn_bias
      output['rotary_base'] = self.rotary_base
      output['rotary_scaling'] = self.rotary_scaling
      output[
          'disable_weight_only_quant_plugin'] = self.disable_weight_only_quant_plugin
      output['use_logn_attn'] = self.use_logn_attn
      output['mlp_only_layers'] = self.mlp_only_layers
      output['decoder_sparse_step'] = self.decoder_sparse_step
+     output['is_embedding'] = self.is_embedding
      output['moe'] = self.moe.to_dict()
      return output
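The round-trip problem raised in this comment can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `SketchBaseConfig` and `SketchQWenConfig` are hypothetical stand-ins with two fields each, not the real TensorRT-LLM classes, and `from_dict` here is a simplified guess at how a deserialization path falls back to defaults.

```python
from dataclasses import dataclass


@dataclass
class SketchBaseConfig:
    """Hypothetical stand-in for the parent config class."""
    architecture: str = "QWenForCausalLM"

    def to_dict(self) -> dict:
        return {"architecture": self.architecture}


@dataclass
class SketchQWenConfig(SketchBaseConfig):
    """Hypothetical stand-in for QWenConfig, reduced to two extra fields."""
    attn_bias: bool = True
    is_embedding: bool = False

    def to_dict(self) -> dict:
        output = super().to_dict()
        # Hand-maintained serialization: every new field has to be added here.
        output["attn_bias"] = self.attn_bias
        output["is_embedding"] = self.is_embedding
        return output

    @classmethod
    def from_dict(cls, data: dict) -> "SketchQWenConfig":
        # A missing key falls back to the default, which is how the flag
        # would be silently cleared if to_dict() never emitted it.
        return cls(
            architecture=data.get("architecture", "QWenForCausalLM"),
            attn_bias=data.get("attn_bias", True),
            is_embedding=data.get("is_embedding", False),
        )


original = SketchQWenConfig(is_embedding=True)
reloaded = SketchQWenConfig.from_dict(original.to_dict())
```

If the `output["is_embedding"]` line were dropped from `to_dict()`, `reloaded.is_embedding` would come back as `False`, which is the silent downgrade to a regular LM head that the comment warns about.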
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/models/qwen/model.py`:
- Around line 385-389: The embedding checkpoint type 'qwen3_embedding' must be
treated like the Qwen3 attention path: update the embedding-name-to-parameter
mapping block (where getattr(config, 'is_embedding', False) builds custom_dict
with "q_layernorm"/"k_layernorm") and ensure the model loader and
QWenDecoderLayer recognize 'qwen3_embedding' the same way as 'qwen3' and
'qwen3_moe'; also update the attn_bias derivation in
tensorrt_llm/models/qwen/config.py (the tuple that currently lists
('qwen3','qwen3_moe')) to include 'qwen3_embedding' so config.attn_bias,
model_type checks, and QWenDecoderLayer's qk_layernorm handling remain
consistent for that checkpoint.
---
Outside diff comments:
In `@tensorrt_llm/models/qwen/config.py`:
- Around line 38-49: QWenConfig currently stores is_embedding but
QWenConfig.to_dict() does not emit it, so include "is_embedding" in the dict
returned by QWenConfig.to_dict(); additionally ensure any deserialization path
(e.g., QWenConfig.from_dict or the constructor signature) accepts and preserves
the is_embedding flag so a round-trip retains embedding-only mode rather than
rebuilding a regular LM head. Reference: QWenConfig.to_dict,
QWenConfig.from_dict/__init__, and the is_embedding attribute.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f865eaba-b1e3-4935-ae6f-832589d3f3da
📒 Files selected for processing (2)
tensorrt_llm/models/qwen/config.py
tensorrt_llm/models/qwen/model.py
elif getattr(config, 'is_embedding', False):
    custom_dict = {
        "q_layernorm": "q_norm",
        "k_layernorm": "k_norm",
    }
Keep qwen3_embedding on the Qwen3 attention path.
This new mapping assumes embedding checkpoints expose q_norm/k_norm, but QWenDecoderLayer still enables qk_layernorm only for ('qwen3', 'qwen3_moe'). If a checkpoint reports model_type == 'qwen3_embedding', the loader and model definition diverge, and config.py also still leaves attn_bias=True for that type.
Suggested fix
- qk_layernorm = config.qwen_type in ('qwen3', 'qwen3_moe')
+ qk_layernorm = config.qwen_type in (
+     'qwen3',
+     'qwen3_moe',
+     'qwen3_embedding',
+ )
Also mirror the same `qwen3_embedding` addition in `tensorrt_llm/models/qwen/config.py`, where `attn_bias` is derived from the Qwen3-only tuple.
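One way to keep the two checks from drifting apart is to derive both from a single tuple. This is a minimal sketch, not code from the PR: `QWEN3_FAMILY`, `uses_qk_layernorm`, and `default_attn_bias` are hypothetical names, and the rule that attention bias is disabled for Qwen3-family types is an assumption inferred from the comment's remark that `config.py` "still leaves attn_bias=True" for `qwen3_embedding`.

```python
# Hypothetical shared constant; the name is an assumption for this sketch.
QWEN3_FAMILY = ("qwen3", "qwen3_moe", "qwen3_embedding")


def uses_qk_layernorm(qwen_type: str) -> bool:
    """Q/K layer norm is enabled only for Qwen3-family checkpoints."""
    return qwen_type in QWEN3_FAMILY


def default_attn_bias(qwen_type: str) -> bool:
    """Assumed rule: attention bias stays on except for the Qwen3 family."""
    return qwen_type not in QWEN3_FAMILY
```

With both helpers reading the same tuple, adding a new Qwen3 variant in one place updates the decoder layer and the config default together.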
Signed-off-by: jet1technology-tech <jet1technology@ryngo.in>
Force-pushed: 222d506 to c52800f
Description
This PR introduces support for Qwen3-Embedding architectures within the Qwen model implementation. It enables `encode_only` mode by bypassing the language modeling head projection when an embedding or reward model architecture is detected.
Changes
- Added an `is_embedding` attribute to `QWenConfig` and auto-detection logic within `from_hugging_face` for mapping embedding-specific architectures (e.g., Qwen2ForRewardModel, Qwen2ForEmbedding, Qwen3ForEmbedding).
- Changed `QWenForCausalLM.__init__` to suppress `lm_head` allocation when `is_embedding` is active. Updated the `ModelWeightsLoader` custom dictionary to gracefully process layer normalization weights without requiring classification target bounds.
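The detection and head-suppression behaviour described above can be sketched roughly as follows. This is a hedged illustration, not the PR's code: `detect_is_embedding` and `SketchModel` are hypothetical names, the Hugging Face config is modelled as a plain dict with an `architectures` list, and the `lm_head` is represented by a shape tuple rather than a real layer. Only the three architecture names come from the description.

```python
from typing import Optional, Tuple

# Architecture names taken from the PR description.
EMBEDDING_ARCHITECTURES = frozenset({
    "Qwen2ForRewardModel",
    "Qwen2ForEmbedding",
    "Qwen3ForEmbedding",
})


def detect_is_embedding(hf_config: dict) -> bool:
    """Return True if any listed architecture is embedding-only."""
    architectures = hf_config.get("architectures") or []
    return any(name in EMBEDDING_ARCHITECTURES for name in architectures)


class SketchModel:
    """Hypothetical stand-in for the head handling in a causal LM wrapper."""

    def __init__(self, hidden_size: int, vocab_size: int, is_embedding: bool):
        self.hidden_size = hidden_size
        # In encode-only mode no projection to the vocabulary is allocated.
        self.lm_head: Optional[Tuple[int, int]] = (
            None if is_embedding else (hidden_size, vocab_size))


flag = detect_is_embedding({"architectures": ["Qwen3ForEmbedding"]})
model = SketchModel(hidden_size=8, vocab_size=32, is_embedding=flag)
```

Keeping the architecture names in one set makes it straightforward to see which checkpoints take the encode-only path, and callers can test `model.lm_head is None` before attempting any logits computation.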