Joyai image edit#1393

Open
mi804 wants to merge 9 commits into modelscope:main from mi804:joyai-image-edit

Conversation

@mi804 (Collaborator) commented Apr 14, 2026

No description provided.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for the JoyAI-Image model, including a new inference pipeline, a 3D Transformer architecture, and a text encoder wrapper. The changes encompass model configurations, VRAM management, state dict converters, and comprehensive documentation with examples for inference and training. Feedback points out several issues to address: unused code with hardcoded device strings, typos in prompt templates and dependency versions, and inconsistencies between the code and documentation regarding text-only encoding and multi-image support.

Comment on lines +154 to +165
```python
def get_cu_seqlens(text_mask, img_len):
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
    for i in range(batch_size):
        s = text_len[i] + img_len
        s1 = i * max_len + s
        s2 = (i + 1) * max_len
        cu_seqlens[2 * i + 1] = s1
        cu_seqlens[2 * i + 2] = s2
    return cu_seqlens
```

medium

The function `get_cu_seqlens` is defined but not used anywhere in the model or pipeline. Additionally, it contains a hardcoded `device="cuda"` at line 158, which would break compatibility with CPU or NPU devices if it were used. It is recommended to remove this unused function.
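If the function were kept instead of removed, a minimal device-agnostic variant would inherit the device from the input mask rather than hardcoding `"cuda"`. A sketch:

```python
import torch

def get_cu_seqlens(text_mask, img_len):
    # Cumulative sequence lengths for varlen attention. The only change
    # from the PR version is taking the device from text_mask itself.
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros(
        [2 * batch_size + 1], dtype=torch.int32, device=text_mask.device)
    for i in range(batch_size):
        s = text_len[i] + img_len          # valid text tokens + image tokens
        cu_seqlens[2 * i + 1] = i * max_len + s
        cu_seqlens[2 * i + 2] = (i + 1) * max_len
    return cu_seqlens
```

This works unchanged on CPU, CUDA, or NPU tensors, since the output simply follows wherever the mask lives.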

```python
dtype: Optional[torch.dtype] = None,
device: Optional[torch.device] = None,
dit_modulation_type: str = "wanx",
theta: int = 256,
```

medium

The default value for theta is set to 256, which is unusually low for Rotary Positional Embeddings (RoPE). While the model configuration in model_configs.py overrides this to 10000, it is safer to set a more standard default value in the class definition to avoid unexpected behavior if the model is initialized directly.

Suggested change

```diff
-theta: int = 256,
+theta: int = 10000,
```
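For context, RoPE derives its per-dimension rotation frequencies from `theta`, and a small base compresses the usable frequency range. The sketch below is illustrative (standard RoPE frequency math, not the model's actual code) and shows how `theta=256` versus `theta=10000` changes the slowest frequency:

```python
import torch

def rope_inv_freq(dim: int, theta: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

# With a small base the lowest frequency stays comparatively high, so
# distant positions rotate faster and alias sooner; 10000 is the
# conventional default that keeps long-range positions distinguishable.
low_base = rope_inv_freq(64, 256.0)
high_base = rope_inv_freq(64, 10000.0)
```

The first frequency is 1.0 in both cases; the difference shows up in the tail, where the smaller base leaves noticeably larger frequencies.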

Comment on lines +151 to +158
```python
prompt_template_encode = {
    'image':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
    'multiple_images':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
    'video':
        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
}
```

medium

The prompt templates contain `\\n` (a literal backslash followed by `n`) preceded by a space, which appears to be a typo for a standard newline `\n`. This may result in incorrect prompt formatting for the text encoder.

Suggested change

```diff
-prompt_template_encode = {
-    'image':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
-    'multiple_images':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
-    'video':
-        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
-}
+prompt_template_encode = {
+    'image':
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
+    'multiple_images':
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
+    'video':
+        "<|im_start|>system\n\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
+}
```
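A quick illustration of why this matters: in a Python string literal, `\\n` produces two characters (a backslash and the letter `n`), while `\n` produces a single newline, so the tokenizer would see a stray backslash as ordinary text rather than a line break:

```python
# "\\n" in source yields a literal backslash followed by 'n';
# "\n" yields a single newline character.
broken = "system\n \\nDescribe"  # mirrors the typo in the template
fixed = "system\n\nDescribe"     # intended formatting
```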


```python
def _encode_text_only(self, pipe, prompt, max_sequence_length):
    # TODO: may support for text-only encoding in the future.
    raise NotImplementedError("Text-only encoding is not implemented yet. Please provide edit_images for now.")
```

medium

Raising `NotImplementedError` for text-only encoding contradicts the README, which claims support for text-to-image generation. Please either implement this feature or update the documentation to reflect the current limitation.

```python
if isinstance(edit_images, Image.Image):
    edit_images = [edit_images]
pipe.load_models_to_device(self.onload_model_names)
assert len(edit_images) == 1, "Currently only supports single edit image for reference. Multiple edit images will be supported in the future."
```

medium

The assertion `len(edit_images) == 1` contradicts the README and the function's docstring, which state that multiple images are supported. If the current implementation is limited to a single image, the documentation should be updated to avoid misleading users.
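While the single-image limitation stands, one way to keep the code and docs consistent is to normalize the input and fail with a clear error instead of a bare assertion. A sketch with a hypothetical helper name (`normalize_edit_images` is not part of the PR):

```python
from typing import List, Union

from PIL import Image

def normalize_edit_images(
    edit_images: Union[Image.Image, List[Image.Image]],
) -> List[Image.Image]:
    # Accept either a single image or a list, but enforce the current
    # single-reference limitation with an explicit, descriptive error.
    if isinstance(edit_images, Image.Image):
        edit_images = [edit_images]
    if len(edit_images) != 1:
        raise ValueError(
            "JoyAI-Image currently supports exactly one edit image; "
            f"got {len(edit_images)}.")
    return edit_images
```

A `ValueError` survives `python -O` (which strips `assert` statements) and gives callers something they can catch deliberately.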


For more information on installation, please refer to [Setup Dependencies](../Pipeline_Usage/Setup.md).

> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.

medium

The required `transformers` version 4.57.0 does not exist (the current latest is 4.48.x). This is likely a typo and should be corrected to the intended version (e.g., 4.47.0).

Suggested change

```diff
-> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.
+> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.47.0,<4.48.0`.
```

```diff
@@ -0,0 +1,24 @@
def JoyAIImageDiTStateDictConverter(state_dict):
```
Collaborator

Do not change the `state_dict` format; use the original names.

```python
cfg_scale: float = 5.0,
# Image
input_image: Image.Image = None,
edit_images: Union[Image.Image, List[Image.Image]] = None,
```
Collaborator

This model doesn't support multi-image editing; `edit_images` can only be a single image, not a list.

2 participants