Describe the bug
With transformers >= 5.6 installed, from_single_file fails when loading a model whose CLIP text encoder is a CLIPTextModel (e.g. SD 1.x).
In transformers 5.6, CLIPTextModel was flattened: its submodules (embeddings, encoder, final_layer_norm) are now assigned directly on the model and the text_model attribute was removed (CLIPTextModelWithProjection still has text_model, so SDXL-style encoders are unaffected). See huggingface/transformers#46285.
create_diffusers_clip_model_from_ldm in diffusers/loaders/single_file_utils.py reads model.text_model.embeddings.position_embedding.weight.shape[-1], which raises:
AttributeError: 'CLIPTextModel' object has no attribute 'text_model'
diffusers declares transformers>=4.41.2 with no upper bound, so this combination installs without warning.
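One possible direction for a fix, as a minimal sketch only: resolve the embeddings through `text_model` when it exists and fall back to the model itself otherwise. The helper names below (`get_clip_text_embeddings`, `get_position_embedding_dim`) are hypothetical and not part of diffusers, and the stand-in objects only mimic the two attribute layouts described above so the snippet runs without transformers installed.

```python
from types import SimpleNamespace


def get_clip_text_embeddings(model):
    """Return the embeddings module for either CLIPTextModel layout.

    Hypothetical helper, not an existing diffusers function.
    """
    # Nested layout: submodules live under `model.text_model`.
    # Flattened layout (transformers >= 5.6, per this report): they live on `model`.
    inner = getattr(model, "text_model", model)
    return inner.embeddings


def get_position_embedding_dim(model):
    """Last dimension of the position embedding weight, for either layout."""
    return get_clip_text_embeddings(model).position_embedding.weight.shape[-1]


def _fake_embeddings(dim):
    # Stand-in exposing only `.position_embedding.weight.shape`.
    weight = SimpleNamespace(shape=(77, dim))
    return SimpleNamespace(position_embedding=SimpleNamespace(weight=weight))


# Stand-ins for the two layouts (assumed shapes, for illustration only).
nested_model = SimpleNamespace(text_model=SimpleNamespace(embeddings=_fake_embeddings(768)))
flat_model = SimpleNamespace(embeddings=_fake_embeddings(768))

nested_dim = get_position_embedding_dim(nested_model)
flat_dim = get_position_embedding_dim(flat_model)
```

The `getattr` fallback should also behave this way on a real `torch.nn.Module`, since a missing attribute raises `AttributeError` there; I have not checked whether other places in single_file_utils.py access `text_model` as well.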
Reproduction
import torch
from transformers import CLIPTextModel
from diffusers.loaders.single_file_utils import create_diffusers_clip_model_from_ldm
# Build an SD1.x-style LDM CLIP state dict: keys under "cond_stage_model.transformer.<hf-key>"
ref = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
checkpoint = {f"cond_stage_model.transformer.{k}": v for k, v in ref.state_dict().items()}
create_diffusers_clip_model_from_ldm(
    CLIPTextModel,
    checkpoint=checkpoint,
    config="openai/clip-vit-large-patch14",
    local_files_only=False,
)
This is the same code path used by StableDiffusionPipeline.from_single_file(<sd1.5 .safetensors>).
Logs
Traceback (most recent call last):
  File ".../diffusers/loaders/single_file_utils.py", line 1702, in create_diffusers_clip_model_from_ldm
    position_embedding_dim = model.text_model.embeddings.position_embedding.weight.shape[-1]
  File ".../torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(...)
AttributeError: 'CLIPTextModel' object has no attribute 'text_model'
System Info
- diffusers: 0.37.0 (also present on main / 0.38.0 — same line is unchanged)
- transformers: reproduces on 5.6.0 – 5.9.0 (works on <= 5.5.x)
- huggingface_hub: 1.17.0
- torch: 2.7.1+cu128
- accelerate: 1.8.1
- Python: 3.12.9
- Platform: Windows-11
Who can help?
No response