Conversation
Code Review
This pull request introduces support for the JoyAI-Image model, including a new inference pipeline, a 3D Transformer architecture, and a text encoder wrapper. The changes encompass model configurations, VRAM management, state dict converters, and comprehensive documentation with examples for inference and training. Feedback points out several issues to address: unused code with hardcoded device strings, typos in prompt templates and dependency versions, and inconsistencies between the code and documentation regarding text-only encoding and multi-image support.
diffsynth/models/joyai_image_dit.py
Outdated
```python
def get_cu_seqlens(text_mask, img_len):
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
    for i in range(batch_size):
        s = text_len[i] + img_len
        s1 = i * max_len + s
        s2 = (i + 1) * max_len
        cu_seqlens[2 * i + 1] = s1
        cu_seqlens[2 * i + 2] = s2
    return cu_seqlens
```
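The review summary flags hardcoded device strings in this file. A device-agnostic rewrite could infer the device from the input mask instead of hardcoding `"cuda"` — the sketch below is illustrative, not the PR's code:

```python
import torch

def get_cu_seqlens(text_mask, img_len):
    # Sketch of a device-agnostic variant: inherit the device (and thus
    # CPU/MPS compatibility) from the input tensor rather than hardcoding "cuda".
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros(
        [2 * batch_size + 1], dtype=torch.int32, device=text_mask.device
    )
    for i in range(batch_size):
        s = text_len[i] + img_len
        cu_seqlens[2 * i + 1] = i * max_len + s
        cu_seqlens[2 * i + 2] = (i + 1) * max_len
    return cu_seqlens
```

This keeps the cumulative-sequence-length logic identical while letting the helper run wherever `text_mask` lives.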
diffsynth/models/joyai_image_dit.py
Outdated
```python
    dtype: Optional[torch.dtype] = None,
    device: Optional[torch.device] = None,
    dit_modulation_type: str = "wanx",
    theta: int = 256,
```
The default value for theta is set to 256, which is unusually low for Rotary Positional Embeddings (RoPE). While the model configuration in model_configs.py overrides this to 10000, it is safer to set a more standard default value in the class definition to avoid unexpected behavior if the model is initialized directly.
Suggested change:

```diff
-    theta: int = 256,
+    theta: int = 10000,
```
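To see why the small base matters, here is an illustrative computation assuming the standard RoPE parameterization (inverse frequencies `theta^(-2i/dim)`); the function name is made up for this sketch:

```python
import math

def rope_inv_freq(dim, theta):
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

lo = rope_inv_freq(64, 256)     # the unusual default in the class definition
hi = rope_inv_freq(64, 10000)   # the conventional RoPE base

# Wavelength of the slowest-rotating pair is 2*pi / inv_freq; a larger theta
# gives much longer maximum wavelengths, keeping long sequences distinguishable.
print(2 * math.pi / lo[-1], 2 * math.pi / hi[-1])
```

With `theta=256` the longest wavelength collapses to a few thousand positions, which is why the conventional `10000` is the safer default even though `model_configs.py` overrides it.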
```python
prompt_template_encode = {
    'image':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
    'multiple_images':
        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
    'video':
        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
}
```
The prompt templates contain `\\n` (a literal backslash followed by 'n') preceded by a space, which appears to be a typo for a standard newline `\n`. This might result in incorrect prompt formatting for the text encoder.
Suggested change:

```diff
 prompt_template_encode = {
     'image':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
     'multiple_images':
-        "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
+        "<|im_start|>system\n\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n{}<|im_start|>assistant\n",
     'video':
-        "<|im_start|>system\n \\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
+        "<|im_start|>system\n\nDescribe the video by detailing the following aspects:\n1. The main content and theme of the video.\n2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.\n3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.\n4. background environment, light, style and atmosphere.\n5. camera angles, movements, and transitions used in the video:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
 }
```
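A quick illustration of what the typo does to the rendered prompt — the strings below are shortened stand-ins, not the actual templates:

```python
# In Python source, "\\n" is a literal backslash plus the character 'n',
# not a newline, so the tokenizer receives a stray backslash in the prompt.
broken = "system\n \\nDescribe"  # renders as: system, newline, space, backslash, "nDescribe"
fixed = "system\n\nDescribe"     # renders as: system, two newlines, "Describe"

print(repr(broken))
print(repr(fixed))
```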
diffsynth/pipelines/joyai_image.py
Outdated
```python
def _encode_text_only(self, pipe, prompt, max_sequence_length):
    # TODO: may support for text-only encoding in the future.
    raise NotImplementedError("Text-only encoding is not implemented yet. Please provide edit_images for now.")
```
diffsynth/pipelines/joyai_image.py
Outdated
```python
if isinstance(edit_images, Image.Image):
    edit_images = [edit_images]
pipe.load_models_to_device(self.onload_model_names)
assert len(edit_images) == 1, "Currently only supports single edit image for reference. Multiple edit images will be supported in the future."
```
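Since the model only supports a single reference image, one option is to reject lists up front with a clear error instead of asserting after model loading. The helper below is a hypothetical sketch, not code from the PR:

```python
def validate_edit_image(edit_image):
    # Hypothetical validation helper: fail fast with an actionable message
    # if a caller passes multiple edit images, which the model cannot use.
    if isinstance(edit_image, (list, tuple)):
        raise ValueError(
            "JoyAI-Image supports a single edit image; got a sequence of "
            f"{len(edit_image)} images."
        )
    return edit_image
```

Validating before `load_models_to_device` avoids paying the model-loading cost only to fail on bad input.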
docs/en/Model_Details/JoyAI-Image.md
Outdated
For more information on installation, please refer to [Setup Dependencies](../Pipeline_Usage/Setup.md).

> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.
The required transformers version 4.57.0 does not exist (the current latest is 4.48.x). This is likely a typo and should be corrected to the intended version (e.g., 4.47.0).
Suggested change:

```diff
-> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.57.0,<4.58.0`.
+> **Note**: JoyAI-Image requires a specific version of `transformers`, please install `transformers>=4.47.0,<4.48.0`.
```
```diff
@@ -0,0 +1,24 @@
+def JoyAIImageDiTStateDictConverter(state_dict):
```
Do not change the state_dict format. Use the original name.
diffsynth/pipelines/joyai_image.py
Outdated
```python
    cfg_scale: float = 5.0,
    # Image
    input_image: Image.Image = None,
    edit_images: Union[Image.Image, List[Image.Image]] = None,
```
This model doesn't support multi-image editing. `edit_images` can only be a single image, not a list.