HF upload guide update #65

Draft: egrace479 wants to merge 4 commits into main from hf-upload

egrace479 (Member) commented Apr 25, 2026

Revises the Hugging Face dataset upload guide to better reflect the preferred methods.

Still ToDo:

  • refine text
  • cut extra old content
  • run linter (most errors should be resolved by finishing the first two items)

Closes #44

egrace479 requested a review from hlapp April 25, 2026 17:27

egrace479 added the enhancement and structure labels Apr 25, 2026

hlapp (Member) left a comment

@egrace479 looks pretty good. See two edits for clarification (though I'm not 100% sure these are right).

The PR is marked as draft and it looks like there may still be some placeholders (in the form of ...s), so making this a comment. If instead you meant it to be ready for merging, remove the draft status and re-assign me.


![Dataset repository Add file button](images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png){ loading=lazy }

This method is fine for smaller files (<100MB), or dataset repositories that are distributed, not well organized, have less files. If you are uploading existing files, navigate to the target folder first.

Suggested change

Original: This method is fine for smaller files (<100MB), or dataset repositories that are distributed, not well organized, have less files. If you are uploading existing files, navigate to the target folder first.

Suggested: This method is fine for smaller files (<100MB), or data uploads from distributed sources, have relatively flat structure with few directories, and/or have few files. If you are uploading existing files, navigate to the target folder first.
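
As a rough illustration of the criteria in this suggestion, the sketch below summarizes a local folder and proposes an upload route. Only the `<100MB` limit comes from the guide text. The `FEW_FILES` and `FLAT_DEPTH` thresholds and both function names are hypothetical, and requiring all three conditions is just one conservative reading of the "and/or" wording.

```python
from pathlib import Path

# "<100MB" comes from the guide text; decimal megabytes are assumed here.
WEB_UI_MAX_FILE_BYTES = 100 * 1000 * 1000
# Assumptions: the guide does not define "few files" or "relatively flat".
FEW_FILES = 10
FLAT_DEPTH = 2


def summarize_upload(folder):
    """Count files, find the largest file, and measure directory nesting."""
    root = Path(folder)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "n_files": len(files),
        "largest_bytes": max((p.stat().st_size for p in files), default=0),
        # Depth 1 means the file sits directly in the folder being uploaded.
        "max_depth": max((len(p.relative_to(root).parts) for p in files), default=0),
    }


def suggest_method(summary):
    """Suggest the web interface only for small, few, and flat uploads."""
    small = summary["largest_bytes"] < WEB_UI_MAX_FILE_BYTES
    few = summary["n_files"] <= FEW_FILES
    flat = summary["max_depth"] <= FLAT_DEPTH
    return "web interface" if (small and few and flat) else "hf CLI"
```

Calling `suggest_method(summarize_upload("path/to/data"))` on a local folder returns one of the two labels; the thresholds would need tuning to whatever the final guide text settles on.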


Hugging Face provides a comprehensive Command Line Interface (CLI) and corresponding [docs](https://huggingface.co/docs/huggingface_hub/en/guides/cli). Note that this is installed with the `huggingface_hub` python package, but can also be installed directly, then called with `hf <command>`.

The Hugging Face CLI is the ideal method for larger datasets, with more files. It works directly from HPC clusters, such as OSC. Under the hood, [`hf upload`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-upload) uses the same upload functions described below, under [Upload a Dataset with HfApi](#upload-a-dataset-with-hfapi).

Suggested change

Original: The Hugging Face CLI is the ideal method for larger datasets, with more files. It works directly from HPC clusters, such as OSC. Under the hood, [`hf upload`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-upload) uses the same upload functions described below, under [Upload a Dataset with HfApi](#upload-a-dataset-with-hfapi).

Suggested: The Hugging Face CLI is the ideal method for uploads that are large in volume, have more than a few files, and/or a folder structure with many or nested directories. It works directly from HPC clusters, such as OSC. Under the hood, [`hf upload`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-upload) uses the same upload functions described below, under [Upload a Dataset with HfApi](#upload-a-dataset-with-hfapi), but obviates the need to first write a custom script.
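
To make the relationship between the two routes concrete, here is a minimal sketch that puts an `hf upload` invocation next to a call to `HfApi.upload_folder`. The helper names, the repository ID `my-org/my-dataset`, and the paths are hypothetical placeholders. The `api` parameter exists only so the function can be exercised without network access, and an actual upload would also need prior authentication (for example via `hf auth login`).

```python
import shlex


def hf_upload_command(repo_id, local_path, path_in_repo):
    """Build the argument list for an `hf upload` call to a dataset repo."""
    return ["hf", "upload", repo_id, str(local_path), path_in_repo,
            "--repo-type", "dataset"]


def upload_with_hfapi(repo_id, local_path, path_in_repo, api=None):
    """Upload a folder to a dataset repo through the Python API."""
    if api is None:
        # Imported lazily so the command builder works without the package.
        from huggingface_hub import HfApi
        api = HfApi()
    return api.upload_folder(
        repo_id=repo_id,
        folder_path=str(local_path),
        path_in_repo=path_in_repo,
        repo_type="dataset",
    )


if __name__ == "__main__":
    # Print the shell command rather than running it.
    print(shlex.join(hf_upload_command("my-org/my-dataset", "./data", "data")))
```

The CLI form is a single line in a terminal or batch job, which is the "no custom script" advantage the suggested text points to; the Python form is more useful when the upload is one step of a larger pipeline.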

hlapp (Member) commented Apr 26, 2026

I'm also wondering how much of the git lfs part we should leave in here. It seems the cases where this is best are fairly limited, so maybe the test is: can we clearly articulate the nature of those cases (e.g., wanting full version control access)? If we're having trouble with that, we could cut that section down to merely mentioning that it exists and linking to the docs.

egrace479 (Member, Author) commented

> I'm also wondering how much of the git lfs part we should leave in here. It seems the cases where this is best are fairly limited, so maybe the test is: can we clearly articulate the nature of those cases (e.g., wanting full version control access)? If we're having trouble with that, we could cut that section down to merely mentioning that it exists and linking to the docs.

I opened the draft PR so you could see what I had so far (since you had this issue in mind). I hadn't removed all the old content yet, instead moving most of it further down the page in case I wanted to pull more for the main content. This is also why (as noted in your first comment) there are placeholders.

I will re-read tomorrow, but am happy for any feedback. My current plan:

I think lines 89-114 (git lfs and gitattributes content) should probably be deleted, while lines 116-126 should fall under an "other topics of note" style heading or be worked in as notes at relevant locations further up. HF is quite picky with merge conflicts, so I want to make sure to include a warning about that.

Labels: enhancement, structure

Projects: None yet

Development: Successfully merging this pull request may close the issue "Revise Hugging Face Upload Guide".

2 participants