
GH-148937: fix for free-threaded GC (RSS based defer)#148940

Open
nascheme wants to merge 5 commits intopython:mainfrom
nascheme:ft-gc-threshold-fix-main

Conversation

@nascheme
Member

@nascheme nascheme commented Apr 23, 2026

Asking the OS for the process memory usage doesn't work well given how mimalloc works: it does not promptly return freed memory to the OS, so the process RSS doesn't drop after cyclic trash is freed.

Instead of asking the OS, use mimalloc APIs to compute how much memory is being used by all mimalloc arenas. We need to stop-the-world to do this, but usually we can avoid doing a collection, so from a performance perspective it is worth it.
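The deferral idea behind this can be sketched in a few lines. This is an illustrative Python model, not CPython's actual internals; the class, method names, and the 1.5x growth factor are all hypothetical: a full collection runs only when the estimated allocator memory has grown enough since the last collection, and is otherwise deferred.

```python
# Hypothetical sketch of a memory-based GC deferral heuristic.
# Names and the growth factor are illustrative, not CPython's real code.

class DeferringGC:
    def __init__(self, growth_factor=1.5):
        self.growth_factor = growth_factor
        self.mem_after_last_collect = None  # bytes, None = no baseline yet

    def should_collect(self, estimated_mem):
        # First time through: collect to establish a baseline.
        if self.mem_after_last_collect is None:
            return True
        # Defer unless memory grew past the threshold since last collection.
        return estimated_mem >= self.mem_after_last_collect * self.growth_factor

    def record_collect(self, estimated_mem):
        # Called after a collection with the post-collection memory estimate.
        self.mem_after_last_collect = estimated_mem


policy = DeferringGC()
assert policy.should_collect(100)      # no baseline yet: collect
policy.record_collect(100)
assert not policy.should_collect(120)  # only 20% growth: defer
assert policy.should_collect(160)      # 60% growth: collect
```

The key point is that the memory estimate only needs to be cheap and roughly monotonic for this to work; it does not need to match the OS's RSS number.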

Tim Peters has a GC stress tester (linked below) that quickly shows the issue. Before this fix, running it drives the process RSS up to about 1 GB; after the fix, RSS stays at about 100 MB. For comparison, the 3.13 GC keeps RSS at about 200 MB.
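The linked script is not reproduced here, but a minimal stress test in the same spirit might look like the following sketch: repeatedly create self-referential cycles, drop them, and rely on the cyclic GC to reclaim them. With collection disabled (or deferred), cyclic trash accumulates and RSS climbs.

```python
# Illustrative GC stress sketch (NOT the linked tim-gc-test.py):
# reference counting alone cannot free self-referential cycles,
# so everything here must wait for the cyclic collector.
import gc


class Node:
    def __init__(self):
        self.ref = self  # self-cycle: reclaimable only by the cyclic GC


def make_trash(n):
    # Create n cyclic nodes and immediately drop the only external refs.
    nodes = [Node() for _ in range(n)]
    del nodes


gc.disable()          # let cyclic trash accumulate
gc.collect()          # start from a clean slate
for _ in range(10):
    make_trash(1000)
found = gc.collect()  # reclaim all 10,000 cycles at once
gc.enable()
assert found >= 10_000  # each Node (and its __dict__) is unreachable
```

A real stress tester would also track RSS over time (e.g. via resource.getrusage) to show whether freed cyclic trash is actually returned.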

tim-gc-test.py

Benchmark results

@nascheme
Member Author

Note that this adds two extra stop/start-the-world points. We need STW to call the mimalloc APIs that compute the memory usage (iterating through arenas). We could likely consolidate one or both of these with existing STW points, but I think that would make the code more complex, so I decided to keep it simple for now. I think we should backport this change to 3.14.

It's probably better to call this inside of gc_collect_main(). That way, we are not doing the STW from inside the _PyObject_GC_Link() function. This should have no significant performance impact, since we hit this only after the young object count hits the threshold.
This avoids using STW in exchange for less accurate memory usage
estimates.
@nascheme nascheme force-pushed the ft-gc-threshold-fix-main branch from ac09833 to a853c00 on April 30, 2026 at 14:40
@read-the-docs-community

read-the-docs-community Bot commented Apr 30, 2026

Documentation build overview

📚 cpython-previews | 🛠️ Build #32486627 | 📁 Comparing 819a848 against main (7686abe)


5 files changed · ± 5 modified


@nascheme
Member Author

Based on a suggestion from Sam, I changed it to instead estimate mimalloc memory use by counting full mimalloc pages. This requires a couple of changes to mimalloc itself but avoids the STW blocks, so it should perform better. The accounting happens when a page transitions from non-full to full, so it should have minimal performance overhead.
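The transition-based accounting described above can be sketched like this. It is an illustrative Python model, not mimalloc's actual code; the class, hook names, and 64 KiB page size are assumptions. The point is that the hot path only updates a counter when a page crosses the full/non-full boundary, so the memory estimate is available at any time without walking arenas or stopping the world.

```python
# Illustrative sketch (not mimalloc's real implementation) of
# estimating memory use by counting full pages: the allocator calls a
# hook only on full <-> non-full transitions, and the estimate is just
# a multiplication.

class PageAccounting:
    def __init__(self, page_size=64 * 1024):  # assumed page size
        self.page_size = page_size
        self.full_pages = 0

    def on_page_full(self):
        # Called when a page transitions non-full -> full.
        self.full_pages += 1

    def on_page_not_full(self):
        # Called when a page transitions full -> non-full
        # (e.g. a block in it was freed).
        self.full_pages -= 1

    def estimated_bytes(self):
        # Cheap estimate: no arena walk, no stop-the-world needed.
        return self.full_pages * self.page_size


acct = PageAccounting()
acct.on_page_full()
acct.on_page_full()
acct.on_page_not_full()
assert acct.full_pages == 1
assert acct.estimated_bytes() == 64 * 1024
```

The estimate undercounts partially filled pages, which is the accuracy trade-off mentioned in the commit messages above; in exchange, reading it costs nothing.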

@nascheme
Member Author

Benchmark results from cyclotron. The first table compares 3.14.3t to this PR; note that the r-trash ratios are mostly small. The second table below compares 3.13 (GIL, generational GC) with this PR.

base=./py-3.14t/bin/python vs new=/home/nas/src/cpython/python

cycle extra live t(s) r-t rss r-rss trash r-trash pause r-pause
10 0 100 21.05 0.5 39M 1.0 14k 0.2 2.60 0.3
10 0 10.0k 21.05 0.6 39M 1.0 20k 0.2 4.21 0.5
10 0 30.0k 21.06 1.0 41M 1.0 30k 0.2 4.30 0.5
10 10.0k 100 21.05 0.4 45M 0.3 8k 0.1 2.50 0.2
10 10.0k 10.0k 21.05 1.0 60M 0.4 10k 0.1 2.84 0.3
10 10.0k 30.0k 33.08 1.0 83M 0.5 30k 0.2 4.49 0.4
10 100.0k 100 21.05 0.6 130M 0.1 8k 0.1 2.70 0.3
10 100.0k 10.0k 21.05 1.0 278M 0.2 10k 0.1 2.93 0.3
10 100.0k 30.0k 21.06 0.5 594M 0.4 30k 0.2 5.04 0.3
10 300.0k 100 21.06 0.8 274M 0.1 8k 0.1 2.06 0.2
10 300.0k 10.0k 21.06 0.6 881M 0.3 20k 0.2 3.68 0.4
10 300.0k 30.0k 21.06 0.8 1.9G 0.5 30k 0.2 5.48 0.5
100 0 100 25.06 1.0 39M 1.0 16k 0.2 1.72 0.3
100 0 10.0k 33.06 1.1 41M 1.1 30k 0.3 2.52 0.5
100 0 30.0k 41.07 2.0 42M 1.1 60k 0.5 4.05 0.8
100 10.0k 100 25.05 1.2 40M 0.8 8k 0.1 2.18 0.3
100 10.0k 10.0k 29.06 1.4 41M 0.9 20k 0.2 2.18 0.3
100 10.0k 30.0k 21.06 1.0 45M 0.9 30k 0.2 2.77 0.4
100 100.0k 100 21.05 0.5 49M 0.3 8k 0.1 1.82 0.2
100 100.0k 10.0k 21.06 1.0 62M 0.4 10k 0.1 2.02 0.3
100 100.0k 30.0k 21.05 1.0 95M 0.5 30k 0.2 2.79 0.3
100 300.0k 100 25.05 1.2 64M 0.2 8k 0.1 2.44 0.3
100 300.0k 10.0k 33.05 1.6 126M 0.3 20k 0.2 2.40 0.4
100 300.0k 30.0k 21.05 0.7 242M 0.7 30k 0.3 3.27 0.5
1.0k 0 100 21.05 0.7 39M 1.0 16k 0.2 2.13 0.4
1.0k 0 10.0k 25.06 1.2 39M 1.0 30k 0.3 2.96 0.4
1.0k 0 30.0k 21.06 0.6 42M 1.0 30k 0.2 4.66 0.6
1.0k 10.0k 100 21.05 0.5 39M 1.0 16k 0.2 1.81 0.3
1.0k 10.0k 10.0k 21.05 1.0 38M 1.0 30k 0.3 2.38 0.4
1.0k 10.0k 30.0k 21.07 1.0 41M 1.0 30k 0.2 3.31 0.5
1.0k 100.0k 100 21.05 1.0 40M 0.8 9k 0.1 1.76 0.3
1.0k 100.0k 10.0k 21.05 0.8 42M 0.8 20k 0.2 2.35 0.4
1.0k 100.0k 30.0k 21.05 1.0 43M 0.8 30k 0.2 4.70 0.8
1.0k 300.0k 100 25.05 0.7 42M 0.5 9k 0.1 1.82 0.3
1.0k 300.0k 10.0k 21.06 1.0 50M 0.6 30k 0.3 2.27 0.4
1.0k 300.0k 30.0k 25.06 1.2 60M 0.8 30k 0.2 4.76 0.7

Uniform columns omitted:
wl: chain
cyc%: 100
stable: yes

Legend (base vs new, matched by wl/cycle/extra/live/cyc%):
wl workload mode (chain or tree)
cyc% fraction of allocation units made cyclic
t(s) total time for new build
r-t ratio of new/base total time (1.0 = equal, 2.0 = 2x slower)
rss peak RSS for new build
r-rss ratio of new/base peak RSS
trash max uncollected cyclic-garbage for new build
r-trash ratio of new/base max trash
pause max GC pause (ms) for new build
r-pause ratio of new/base max GC pause
stable yes if new build trash count was stable (non-rising)
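To make the ratio columns concrete, here is the arithmetic the legend implies (illustrative only, computed from values visible in the tables): each r-* column is new/base, so 1.0 means parity and values above 1.0 mean the new build is worse on that metric.

```python
# Reading the r-* columns: ratio = new / base.
def ratio(new, base):
    return new / base

# A row where both builds take the same time gives r-t = 1.0.
assert round(ratio(21.05, 21.05), 1) == 1.0

# Conversely, given a new-build value and its ratio, the base value is
# new / ratio: e.g. rss=39M with r-rss=2.2 implies the base build
# peaked at roughly 39 / 2.2 = ~17.7M.
assert abs(39 / 2.2 - 17.7) < 0.1
```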

base=/usr/bin/python3 vs new=/home/nas/src/cpython/python

cycle extra live t(s) r-t rss r-rss trash r-trash pause r-pause
10 0 100 21.05 1.0 39M 2.2 14k 2.9 2.60 1.4
10 0 10.0k 21.05 1.0 39M 1.7 20k 0.3 4.21 0.6
10 0 30.0k 21.06 0.8 41M 1.4 30k 0.2 4.30 0.3
10 10.0k 100 21.05 0.8 45M 2.1 8k 1.6 2.50 1.6
10 10.0k 10.0k 21.05 0.5 60M 0.6 10k 0.1 2.84 0.3
10 10.0k 30.0k 33.08 1.3 83M 0.4 30k 0.2 4.49 0.2
10 100.0k 100 21.05 1.0 130M 2.2 8k 1.6 2.70 1.2
10 100.0k 10.0k 21.05 0.7 278M 0.4 10k 0.1 2.93 0.1
10 100.0k 30.0k 21.06 1.0 594M 0.3 30k 0.2 5.04 0.2
10 300.0k 100 21.06 1.0 274M 1.9 8k 1.6 2.06 0.5
10 300.0k 10.0k 21.06 1.0 881M 0.4 20k 0.3 3.68 0.1
10 300.0k 30.0k 21.06 0.8 1.9G 0.4 30k 0.2 5.48 0.1
100 0 100 25.06 1.1 39M 2.1 16k 2.9 1.72 1.1
100 0 10.0k 33.06 0.9 41M 1.6 30k 0.3 2.52 0.3
100 0 30.0k 41.07 1.0 42M 1.2 60k 0.3 4.05 0.1
100 10.0k 100 25.05 1.2 40M 2.1 8k 1.5 2.18 1.1
100 10.0k 10.0k 29.06 1.1 41M 1.3 20k 0.2 2.18 0.3
100 10.0k 30.0k 21.06 0.7 45M 0.9 30k 0.2 2.77 0.1
100 100.0k 100 21.05 1.0 49M 2.5 8k 1.5 1.82 1.2
100 100.0k 10.0k 21.06 0.4 62M 0.6 10k 0.1 2.02 0.2
100 100.0k 30.0k 21.05 0.5 95M 0.4 30k 0.2 2.79 0.1
100 300.0k 100 25.05 1.1 64M 2.4 8k 1.5 2.44 1.6
100 300.0k 10.0k 33.05 0.8 126M 0.5 20k 0.2 2.40 0.3
100 300.0k 30.0k 21.05 1.0 242M 0.4 30k 0.2 3.27 0.1
1.0k 0 100 21.05 0.6 39M 1.7 16k 0.7 2.13 0.8
1.0k 0 10.0k 25.06 1.0 39M 1.6 30k 0.3 2.96 0.5
1.0k 0 30.0k 21.06 0.4 42M 1.1 30k 0.1 4.66 0.3
1.0k 10.0k 100 21.05 0.6 39M 1.7 16k 0.7 1.81 0.7
1.0k 10.0k 10.0k 21.05 0.6 38M 1.3 30k 0.3 2.38 0.4
1.0k 10.0k 30.0k 21.07 0.7 41M 1.2 30k 0.1 3.31 0.2
1.0k 100.0k 100 21.05 0.8 40M 1.8 9k 0.4 1.76 0.8
1.0k 100.0k 10.0k 21.05 0.4 42M 1.1 20k 0.2 2.35 0.3
1.0k 100.0k 30.0k 21.05 0.8 43M 0.8 30k 0.1 4.70 0.3
1.0k 300.0k 100 25.05 1.0 42M 1.6 9k 0.4 1.82 0.8
1.0k 300.0k 10.0k 21.06 1.0 50M 1.0 30k 0.3 2.27 0.3
1.0k 300.0k 30.0k 25.06 1.1 60M 0.7 30k 0.1 4.76 0.3
