gh-142659: Optimize set_swap_bodies for intersection_update#148155
gh-142659: Optimize set_swap_bodies for intersection_update#148155Siyet wants to merge 5 commits intopython:mainfrom
Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
@Siyet Could you add a few benchmarks for the affected cases? |
|
Benchmarks on a dedicated server (AMD EPYC 1 vCPU, 2 GB RAM, Ubuntu 24.04, no background load, Default build (with GIL):
Free-threading build (
The end-to-end improvement is modest (~1%) because Benchmark script"""Benchmark set.intersection_update for gh-142659.
Measures the performance of intersection_update across different set sizes
and usage patterns. Each benchmark recreates the set on every iteration
to measure the full intersection_update path including set_replace_body.
"""
import statistics
import sys
import timeit
ROUNDS = 50
BENCHMARKS = {
"small (5 & 5)": {
"setup": "",
"stmt": "set(range(5)).intersection_update({3, 4, 5, 6, 7})",
"number": 2_000_000,
},
"medium (100 & 100)": {
"setup": "b = set(range(50, 150))",
"stmt": "set(range(100)).intersection_update(b)",
"number": 500_000,
},
"large (10k & 10k)": {
"setup": "b = set(range(5000, 15000))",
"stmt": "set(range(10000)).intersection_update(b)",
"number": 2_000,
},
"multi-arg (100 & 100 & 50)": {
"setup": "b = set(range(50, 150)); c = set(range(75, 125))",
"stmt": "set(range(100)).intersection_update(b, c)",
"number": 500_000,
},
"empty result (100 & 0 overlap)": {
"setup": "b = set(range(200, 300))",
"stmt": "set(range(100)).intersection_update(b)",
"number": 500_000,
},
}
def main():
print(f"Python {sys.version}")
print(f"Rounds: {ROUNDS}")
print()
print(f"{'Benchmark':<32} {'Mean (ns)':>10} {'Stdev':>10} {'Min':>10} {'Max':>10}")
print("-" * 78)
for name, bench in BENCHMARKS.items():
times = timeit.repeat(
bench["stmt"],
setup=bench["setup"],
number=bench["number"],
repeat=ROUNDS,
)
per_iter = [t / bench["number"] * 1e9 for t in times]
mean = statistics.mean(per_iter)
stdev = statistics.stdev(per_iter)
print(
f"{name:<32} {mean:>10.1f} {stdev:>9.1f} {min(per_iter):>10.1f} {max(per_iter):>10.1f}"
)
if __name__ == "__main__":
main() |
Replace the general-purpose set_swap_bodies() with a specialized set_replace_body() that exploits the invariant that src is always a uniquely-referenced temporary about to be discarded.
d3f1a20 to
37a95e6
Compare
Objects/setobject.c
Outdated
|
|
||
| assert(!PyType_IsSubtype(Py_TYPE(dst), &PyFrozenSet_Type)); | ||
| assert(!PyType_IsSubtype(Py_TYPE(src), &PyFrozenSet_Type)); | ||
| assert(Py_REFCNT(src) == 1); |
There was a problem hiding this comment.
| assert(Py_REFCNT(src) == 1); | |
| assert(_PyObject_IsUniquelyReferenced(src)); |
There was a problem hiding this comment.
This assertion is going to be prone to gc.get_objects()/gc.get_referrers() issues like in #148180.
I'm not sure if we should do anything about that.
There was a problem hiding this comment.
Will gc.get_objects and similar ever see this set? They use a stop-the-world heap traversal, which can't happen until set_intersection_update / set_intersection_update_multi_impl returns, and by the time that happens, this src set is already gone.
There was a problem hiding this comment.
I agree with @zhuyifei1999 - src is a local variable that gets freed before the function returns, so gc can't observe it.
Switched to _PyObject_IsUniquelyReferenced per @eendebakpt's suggestion regardless, since I believe it's the correct API for free-threaded builds.
| moving dst's old contents into src for proper cleanup on Py_DECREF. | ||
|
|
||
| t=set(a); a.clear(); a.update(b); b.clear(); b.update(t); del t | ||
| The caller guarantees that src is a uniquely-referenced temporary set |
There was a problem hiding this comment.
If the called is required to cleanup src immediately afterwards, maybe it is cleaner to let the set_replace_body steal the reference and do the decref.
There was a problem hiding this comment.
In set_intersection_update_multi_impl, I placed Py_DECREF(tmp) after Py_END_CRITICAL_SECTION().
My reasoning was that moving the decref inside set_replace_body would run the deallocation (and decref of all keys) while still holding the critical section lock, which I thought might be undesirable.
By keeping it outside, the caller has control over when deallocation happens.
This is my first PR though and I'm still getting familiar with the project, so if I'm wrong here - happy to rework it.
| @@ -0,0 +1,3 @@ | |||
| Optimize :meth:`set.intersection_update` by replacing the general-purpose | |||
| ``set_swap_bodies()`` with a specialized ``set_replace_body()`` that skips | |||
There was a problem hiding this comment.
set_replace_body is a function internal to cpython so I don't think you need to mention it in the news entry as this won't affect user programs.
Objects/setobject.c
Outdated
| if (a_table == a->smalltable || b_table == b->smalltable) { | ||
| memcpy(tab, a->smalltable, sizeof(tab)); | ||
|
|
||
| assert(!PyType_IsSubtype(Py_TYPE(dst), &PyFrozenSet_Type)); |
There was a problem hiding this comment.
Could you replace these two asserts with
assert(PySet_CheckExact((PyObject *) dst));
assert(PySet_CheckExact((PyObject *) src));There was a problem hiding this comment.
I'd prefer to keep the current assertions. dst can be a subclass of set (e.g. class S(set): pass; S([1,2]).intersection_update({2})), so PySet_CheckExact(dst) would fire on that legitimate case. make_new_set_basetype() normalizes the type for src, but dst is the original caller object.
The current form (!PyType_IsSubtype(..., &PyFrozenSet_Type)) checks exactly what we need: neither argument is a frozenset.
There was a problem hiding this comment.
Could you use PySet_Check instead?
Objects/setobject.c
Outdated
| memcpy(a->smalltable, b->smalltable, sizeof(tab)); | ||
| memcpy(b->smalltable, tab, sizeof(tab)); | ||
| memcpy(dst->smalltable, src->smalltable, sizeof(tab)); | ||
| memcpy(src->smalltable, tab, sizeof(tab)); |
There was a problem hiding this comment.
This memcpy and the one below it on line 1528 can be pulled out of the macro.
Also, I find this entire section for copying the smalltable a bit awkward. Right now, this section is the only time copy_small_table is called so we could just move the for loop in copy_small_table into this function. Additionally, we could make copy_small_table exist for GIL builds and have it use memcpy as the default unless the GIL is disabled. Something roughly like this:
static void
copy_small_table(setentry *dest, setentry *src)
{
#ifdef Py_GIL_DISABLED
for (Py_ssize_t i = 0; i < PySet_MINSIZE; i++) {
_Py_atomic_store_ptr_release(&dest[i].key, src[i].key);
_Py_atomic_store_ssize_relaxed(&dest[i].hash, src[i].hash);
}
#else
memcpy(dest, src, PySet_MINSIZE * sizeof(setentry));
#endif
}There was a problem hiding this comment.
Good suggestion, done in 100c491. Moved the #ifdef Py_GIL_DISABLED branch inside copy_small_table so it uses memcpy in GIL builds and atomic stores in free-threaded builds. The callsite is clean now - no preprocessor conditionals.
There was a problem hiding this comment.
Sorry if I was not clear. There were two options that I suggested:
- Never use
copy_small_tableand move the logic intoset_replace_body - Never use
memcpyinset_replace_bodyand instead callcopy_small table
I'm not sure which option is best. However, if you want to implement the second option, I would make sure to no longer call memcpy in set_replace_body.
There was a problem hiding this comment.
Wait actually on second thought I realized that you probably should not make that change. Never mind.
There was a problem hiding this comment.
No worries. Thanks for the thorough review, really appreciate it!
Remove NEWS entry for internal-only change. Refactor copy_small_table to be available in all builds: memcpy on GIL-enabled, atomic stores on free-threaded. This removes the #ifdef from set_replace_body callsite.
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Summary
Replace the general-purpose
set_swap_bodies()with a specializedset_replace_body()that exploits the invariant that the source argument is always a uniquely-referenced temporary set about to be discarded.Follow-up to the observation in #132290 (comment) and #142659.
Changes
set_swap_bodies()was designed for arbitrary two-way swaps between any two sets, but it is only called fromset_intersection_update()andset_intersection_update_multi_impl(), where the second argument (tmp) is always a freshly created temporary withPy_REFCNT == 1.The new
set_replace_body()exploits this invariant:src: the temporary is not visible to other threads, so plain assignments suffice (saves atomic fence overhead in the free-threaded build).copy_small_tableforsrc: use plainmemcpyinstead of per-entry atomic stores when writing back to src's smalltable (Py_GIL_DISABLEDpath).assert).srcis never shared (enforced viaassert), so only one direction of the shared-marking check is needed — propagate shared status fromdsttosrcfor proper deallocation of old entries.Py_hash_t hvariable.All assumptions are guarded by
assert()to document and enforce the contract.