gh-151788: Speed up http.server directory listing by using os.scandir()#151789
gh-151788: Speed up http.server directory listing by using os.scandir()#151789mjbommar wants to merge 5 commits into
Conversation
…candir SimpleHTTPRequestHandler.list_directory() called os.path.isdir() and os.path.islink() for every entry, issuing two stat-family syscalls per file. This is wasted work on any filesystem and dominates listing time for large directories; on network filesystems such as NFS, where each call is a round-trip, it becomes severe. Use os.scandir(), whose DirEntry objects report the type from the directory read itself (d_type / READDIRPLUS), eliminating the per-entry stats in the common case and never doing more work than before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
- Wrap entry.is_dir()/is_symlink() in try/except OSError, falling back to False to exactly mirror os.path.isdir()/islink() (matches os.walk()). - Reword the NEWS entry to avoid speedup-multiplier claims; describe it as improving list_directory() on systems with slow stat calls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Remove the performance rationale from the inline comment; keep only the note explaining the OSError fallback (correctness, not performance). - Reword the NEWS entry to state the mechanism without magnitude claims. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I have made the requested changes; please review again |
|
Thanks for making the requested changes! @picnixz: please review the changes made to this pull request. |
|
Please fix the CI first before asking for a rereview. |
Drop the :meth: cross-reference to list_directory, which is not a documented method and fails the docs "new NEWS nits" check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@picnixz sorry, thought this was just test flake. The expected checks are all green now other than your review |
- Drop the local `name` variable in favor of entry.name. - Reword the OSError comment per review. - Move the "Append /" comment above the is_dir check. - Reword the NEWS entry per review (fix call -> calls). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| displayname = entry.name + "/" | ||
| linkname = entry.name + "/" | ||
| if is_symlink: | ||
| displayname = entry.name + "@" |
There was a problem hiding this comment.
Can you avoid having an LLM doing those suggestions please?
There was a problem hiding this comment.
Sure, AFK today but can do that tonight. Emacs doesn't work so well on phone. You want PR clean without Claude?
There was a problem hiding this comment.
What I (personally) want usually is:
- human interaction (I don't want to communicate with an agent, that's out the question. I don't care whether the text has been written by the AI as long as it's not a C/C without any human review before sending it but I don't want an automated conversation)
- commits that are always reviewed by a human before even committing them; in general this means that the AI is giving you the rough idea or contents but you are responsible for cleaning it up, and addressing things that could help the next review. In this case, I considered the
name = entry.nameto be un-necessary so reducing the diff only required to keep the same code and inlining that access. It didn't mean to rewrite all the occurrences withentry.name.
Our policy is that AI is a tool that can be used but it's not something meant to generate PRs for users that would later take credit for (I'm not saying you are, I'm saying that just how AI PRs usually go: most of the time, people just want to have a contribution and so they simply leave their agent in automode).
There was a problem hiding this comment.
So, the problem generally is not the AI, it's more that whenever there is an LLM, that LLM tends to choose things that are either useless, weird, and that would have been avoided if a human had directly written the lines. If you prefer writing your code through Claude, it's ok, but keep in mind that reviewers are not responsible for guiding the AI, it's the contributor. If we keep telling their agent through our review "no, not like that", it's as if we were directly using an agent in the least efficient way, and hence we are both losing our time.
There was a problem hiding this comment.
Ah. FWIW, Claude pointed out that your suggestion resulted in a NameError at string concat below, so it proposed this as easiest way to avoid name global and I accepted it
There was a problem hiding this comment.
Oh, my bad actually it should have been displayname = linkname = name = entry.name. But yeah, in this case, I would have preferred to know why it decided to do so. And then I would have suggested:
displayname = linkname = name = entry.name
instead of the two lines. That's why I prefer that you tell me this first before changing otherwise I would get confused.
There was a problem hiding this comment.
So to reiterate: there is nothing wrong with using Claude (the only think I don't like about it is the long summaries in every PR and trivial things it reports like "all tests pass" which is something I would expect for a PR otherwise it doesn't make sense). It's just that having a 3rd actor usually creates more back and forth between the reviewer and the reviewee.
| fullname = os.path.join(path, name) | ||
| displayname = linkname = name | ||
| for entry in entries: | ||
| displayname = linkname = entry.name |
There was a problem hiding this comment.
| displayname = linkname = entry.name | |
| displayname = linkname = name = entry.name |
| displayname = entry.name + "/" | ||
| linkname = entry.name + "/" | ||
| if is_symlink: | ||
| displayname = entry.name + "@" |
There was a problem hiding this comment.
| displayname = entry.name + "/" | |
| linkname = entry.name + "/" | |
| if is_symlink: | |
| displayname = entry.name + "@" | |
| displayname = name + "/" | |
| linkname = name + "/" | |
| if is_symlink: | |
| displayname = name + "@" |
SimpleHTTPRequestHandler.list_directory()calledos.path.isdir()(astat) andos.path.islink()(anlstat) for every entry — two stat-family syscalls per file. That is wasted work on any filesystem and dominates listing time for large directories; on network filesystems such as NFS, where each call is a round-trip, it becomes severe.This switches to
os.scandir(), whoseDirEntryobjects report the entry type from the directory read itself (POSIXd_type/ NFSREADDIRPLUS), eliminating the per-entry stats in the common case. CPython already made this exact migration foros.walk(),glob, andpathlib.Path.iterdir()(gh-117727);http.serverwas simply missed.Behavior preserved
DirEntry.is_dir()/is_symlink()matchos.path.isdir/os.path.islinksemantics — same follow-symlinks behavior and same return-False-on-error behavior — verified across real dirs/files, symlink-to-dir (still rendered with@but linked with/), symlink-to-file, and broken symlinks. The existingLib/test/test_httpservers.pysuite passes unchanged (92/92), including the undecodable/unencodable filename cases that exercise surrogate-escaped names.Benchmark
Directory with 1000 files + 1000 dirs:
strace), oldstatlatency injected)Caveat (no overselling): the large NFS figures assume the filesystem reports entry types in the directory read (
d_type; local filesystems and NFS withREADDIRPLUS, the Linux default). If a mount returnsDT_UNKNOWN,os.scandir()falls back to one cachedlstatper entry — still fewer calls than today, and never worse.This PR was prepared with AI assistance (Claude Code). I reviewed the change and benchmarks and can explain it in my own words.