Conversation
Code Review
This pull request introduces support for the JoyAI-Image model, including a new inference pipeline, a 3D Transformer architecture, and a text encoder wrapper. The changes encompass model configurations, VRAM management, state dict converters, and comprehensive documentation with examples for inference and training. Feedback points out several issues to address: unused code with hardcoded device strings, typos in prompt templates and dependency versions, and inconsistencies between the code and documentation regarding text-only encoding and multi-image support.
diffsynth/models/joyai_image_dit.py
Outdated
```python
def get_cu_seqlens(text_mask, img_len):
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
    for i in range(batch_size):
        s = text_len[i] + img_len
        s1 = i * max_len + s
        s2 = (i + 1) * max_len
        cu_seqlens[2 * i + 1] = s1
        cu_seqlens[2 * i + 2] = s2
    return cu_seqlens
```
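The review summary flags hardcoded device strings in this file. A device-agnostic rewrite could infer the device from the input mask instead of hardcoding `"cuda"` — the sketch below is illustrative, not the PR's code:

```python
import torch

def get_cu_seqlens(text_mask, img_len):
    # Sketch of a device-agnostic variant: inherit the device (and thus
    # CPU/MPS compatibility) from the input tensor rather than hardcoding "cuda".
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros(
        [2 * batch_size + 1], dtype=torch.int32, device=text_mask.device
    )
    for i in range(batch_size):
        s = text_len[i] + img_len
        cu_seqlens[2 * i + 1] = i * max_len + s
        cu_seqlens[2 * i + 2] = (i + 1) * max_len
    return cu_seqlens
```

This keeps the cumulative-sequence-length logic identical while letting the helper run wherever `text_mask` lives.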
diffsynth/models/joyai_image_dit.py
Outdated
```python
    dtype: Optional[torch.dtype] = None,
    device: Optional[torch.device] = None,
    dit_modulation_type: str = "wanx",
    theta: int = 256,
```
The default value for theta is set to 256, which is unusually low for Rotary Positional Embeddings (RoPE). While the model configuration in model_configs.py overrides this to 10000, it is safer to set a more standard default value in the class definition to avoid unexpected behavior if the model is initialized directly.
Suggested change:

```diff
-    theta: int = 256,
+    theta: int = 10000,
```
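To see why the small base matters, here is an illustrative computation assuming the standard RoPE parameterization (inverse frequencies `theta^(-2i/dim)`); the function name is made up for this sketch:

```python
import math

def rope_inv_freq(dim, theta):
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

lo = rope_inv_freq(64, 256)     # the unusual default in the class definition
hi = rope_inv_freq(64, 10000)   # the conventional RoPE base

# Wavelength of the slowest-rotating pair is 2*pi / inv_freq; a larger theta
# gives much longer maximum wavelengths, keeping long sequences distinguishable.
print(2 * math.pi / lo[-1], 2 * math.pi / hi[-1])
```

With `theta=256` the longest wavelength collapses to a few thousand positions, which is why the conventional `10000` is the safer default even though `model_configs.py` overrides it.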
```python
prompt_template_encode = {
    'image':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
    'multiple_images':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
    'video':
        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
}
```
The prompt templates contain `\\n` (a literal backslash followed by 'n') preceded by a space, which appears to be a typo for a standard newline `\n`. This might result in incorrect prompt formatting for the text encoder.
Suggested change:

```diff
 prompt_template_encode = {
     'image':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
     'multiple_images':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
     'video':
-        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
+        "<|im_start|>system\n\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
 }
```
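A quick illustration of what the typo does to the rendered prompt — the strings below are shortened stand-ins, not the actual templates:

```python
# In Python source, "\\n" is a literal backslash plus the character 'n',
# not a newline, so the tokenizer receives a stray backslash in the prompt.
broken = "system\n \\nDescribe"  # renders as: system, newline, space, backslash, "nDescribe"
fixed = "system\n\nDescribe"     # renders as: system, two newlines, "Describe"

print(repr(broken))
print(repr(fixed))
```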
diffsynth/pipelines/joyai_image.py
Outdated
```python
def _encode_text_only(self, pipe, prompt, max_sequence_length):
    # TODO: may support for text-only encoding in the future.
    raise NotImplementedError("Text-only encoding is not implemented yet. Please provide edit_images for now.")
```
diffsynth/pipelines/joyai_image.py
Outdated
```python
if isinstance(edit_images, Image.Image):
    edit_images = [edit_images]
pipe.load_models_to_device(self.onload_model_names)
assert len(edit_images) == 1, "Currently only supports single edit image for reference. Multiple edit images will be supported in the future."
```
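Since the model only supports a single reference image, one option is to reject lists up front with a clear error instead of asserting after model loading. The helper below is a hypothetical sketch, not code from the PR:

```python
def validate_edit_image(edit_image):
    # Hypothetical validation helper: fail fast with an actionable message
    # if a caller passes multiple edit images, which the model cannot use.
    if isinstance(edit_image, (list, tuple)):
        raise ValueError(
            "JoyAI-Image supports a single edit image; got a sequence of "
            f"{len(edit_image)} images."
        )
    return edit_image
```

Validating before `load_models_to_device` avoids paying the model-loading cost only to fail on bad input.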
docs/en/Model_Details/JoyAI-Image.md
Outdated
For more information on installation, please refer to [Setup Dependencies](../Pipeline_Usage/Setup.md).

> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.
The required transformers version 4.57.0 does not exist (the current latest is 4.48.x). This is likely a typo and should be corrected to the intended version (e.g., 4.47.0).
Suggested change:

```diff
-> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.
+> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.47.0,<4.48.0`.
```
```diff
@@ -0,0 +1,24 @@
+def JoyAIImageDiTStateDictConverter(state_dict):
```
Do not change the state_dict format. Use the original name.
diffsynth/pipelines/joyai_image.py
Outdated
```python
    cfg_scale: float = 5.0,
    # Image
    input_image: Image.Image = None,
    edit_images: Union[Image.Image, List[Image.Image]] = None,
```
This model doesn't support multi-image editing. `edit_images` can only be a single image, not a list.