Conversation
….from_pretrained method
| @@ -1674,21 +1678,6 @@ def download(cls, pretrained_model_name, **kwargs) -> str | os.PathLike: | |||
| custom_class_name = config_dict["_class_name"][1] | |||
|
|
|||
| load_pipe_from_hub = custom_pipeline is not None and f"{custom_pipeline}.py" in filenames | |||
There was a problem hiding this comment.
Just to note this remains to control repo_id:
diffusers/src/diffusers/pipelines/pipeline_loading_utils.py
Lines 441 to 458 in dc8d903
It helps distinguish between:
a) custom_pipeline is e.g. my_pipeline and that filename exists in pretrained_model_name's files
b) custom_pipeline is Hub repo (and pipeline.py is used)
Maybe could be renamed load_pipe_from_hub -> hub_contains_custom_pipeline?
|
|
||
| class CustomPipelineTests(unittest.TestCase): | ||
| def test_load_custom_pipeline(self): | ||
| with self.assertRaises(ValueError): |
There was a problem hiding this comment.
Should we investigate the ValueError messaging as well (it should have something related to the use of trust_remote_code or not something else)?
There was a problem hiding this comment.
And on main, this should not have yielded a ValueError, right? That is how we know, for one instance, that it's broken.
There was a problem hiding this comment.
On main:
pretrained |
custom_pipeline |
trust_remote_code? |
|---|---|---|
| hub/repoA | my_pipeline |
✅ |
| hub/repoA | one_step_unet[1] |
❌ |
| hub/repoA | hub/repoB | ❌ |
| any local directory | any | ❌ |
PR:
pretrained |
custom_pipeline |
trust_remote_code? |
|---|---|---|
| hub/repoA | my_pipeline |
✅ |
| hub/repoA | one_step_unet[1] |
✅ |
| hub/repoA | hub/repoB | ✅ |
| any local directory | any | ✅ |
[1] or any community pipeline name
This case is more implicit vs explicit consent, but on main there is potential for misuse by combining the "trusted" nature of community pipeline names and third party Hub repos.
A user may copy an example like:
pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")or
pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="pipeline_stable_diffusion_xl_controlnet_adapter_inpaint")or
pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="pipeline_stable_diffusion_x/_controlnet_adapter_inpaint")The first two are harmless from diffusers/community-pipelines-mirror, the third is malicious with a user registered as pipeline_stable_diffusion_x with a repo name _controlnet_adapter_inpaint. There are many community pipelines so many potential username/repo name combinations that could easily be missed.
Considering that I think community pipeline names should remain trusted, WDYT? We can just remove this to do so.
diffusers/src/diffusers/utils/dynamic_modules_utils.py
Lines 340 to 345 in 78a5028
| revision: str | None = None, | ||
| local_files_only: bool = False, | ||
| local_dir: str | None = None, | ||
| trust_remote_code: bool = False, |
There was a problem hiding this comment.
Would it make sense to add the ValueError to the caller sites of get_cached_module_file instead? Because the function itself isn't specifically tied to custom pipelines, I think.
| custom_class_name = config_dict["_class_name"][1] | ||
|
|
||
| load_pipe_from_hub = custom_pipeline is not None and f"{custom_pipeline}.py" in filenames | ||
| load_components_from_hub = len(custom_components) > 0 |
There was a problem hiding this comment.
See case 3 - this code is in download which is only reached when we use a Hub path with from_pretrained. It is replaced by the check in get_cached_module_file, specifically the local code path.
Consider this scenario:
hf download rotcasuoicilam/SuperCoolNewModel --local-dir rotcasuoicilam/SuperCoolNewModelfrom diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("rotcasuoicilam/SuperCoolNewModel")rotcasuoicilam/SuperCoolNewModel contains malicious custom components, user downloads the Hub repo assuming it is safe, Diffusers loads the custom components without the user's consent.
There was a problem hiding this comment.
I didn't get it.
hf download rotcasuoicilam/SuperCoolNewModel --local-dir rotcasuoicilam/SuperCoolNewModelis agnostic to DiffusionPipeline.from_pretrained(...).
There was a problem hiding this comment.
- A malicious actor uploads a model with malicious custom components
- Either:
2a. The user follows instructions that say to download the model first then run from the local path
2b. The user chooses to download the model first out of personal preference DiffusionPipeline.from_pretrained(the_local_path)- pwned
There was a problem hiding this comment.
Let's add a test case for this scenario then.
There was a problem hiding this comment.
On main:
FAILED tests/pipelines/test_pipelines.py::CustomPipelineTests::test_custom_components_from_local_dir - AssertionError: ValueError not raised
PR:
1 passed
There was a problem hiding this comment.
A bit more elaborate explanation for other reviewers (feel free to correct).
The critical branching point is:
if not os.path.isdir(pretrained_model_name_or_path):
# ... calls cls.download() which had the trust_remote_code check
else:
cached_folder = pretrained_model_name_or_path When you call from_pretrained("rotcasuoicilam/SuperCoolNewModel"):
os.path.isdir("rotcasuoicilam/SuperCoolNewModel")is checked.- If the user previously ran
hf download ... --local-dir rotcasuoicilam/SuperCoolNewModel, that directory exists locally. - So
os.path.isdir()returns True, and the code takes the else branch at line 871 — it just setscached_folder = pretrained_model_name_or_pathdirectly. - The
download()method is never called.
The old trust_remote_code check for custom components lived inside download(). Since download() is skipped entirely when the path is a
local directory, the check never runs. The custom components in that local folder get loaded without any consent gate.
That's why the fix moves the trust_remote_code check into get_cached_module_file — that's where the actual import of custom .py files
happens, and it runs regardless of whether the code came through download() or the local else branch.
There was a problem hiding this comment.
Well just to be clear, using the same local-dir as Hub repo is a possible trick to hide the attack as anyone who didn't pre-download wouldn't be affected but any local directory is affected, and the local directory could be from other sources like snapshot_download or git clone.
|
@bot /style |
|
Style bot fixed some files and pushed the changes. |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
I was curious how common custom components are on the Hub. The results are limited but I managed to scrape 9602 Hub repo paths from the model pages. 190 were gated, 4685 actually had |
What does this PR do?
As per #13446
trust_remote_codefails under several circumstances:pretrained_model_name_or_pathas Hub repo A andcustom_pipelineas Hub repo B,trust_remote_codeis bypassed and remote code runs from repo Bpretrained_model_name_or_pathas local directory andcustom_pipelineas Hub repo B,trust_remote_codeis never checked and remote code runs from repo Bpretrained_model_name_or_pathas local directory with custom components,trust_remote_codeis never checkedThis moves
trust_remote_codechecks intoget_cached_module_filewhere the actual custom module loading takes place.trust_remote_codeis passed from several code paths for complete coverage.I've added 3 separate
ValueErroringet_cached_module_fileto account for the different sources of remote code:local,gitandhub.gitpath could be considered trusted as these are fromhttps://huggingface.co/datasets/diffusers/community-pipelines-mirror,hubpath covers "custom pipelines" andlocalpath is reached for "custom components" (these are added to theallowed filesindownloadso become local files).With PR the above 3 cases are resolved:
Fixes #13446
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.