Skip to content

Cheaper receive pack connectivity check#2136

Open
newren wants to merge 2 commits into
gitgitgadget:masterfrom
newren:cheap-receive-pack-connectivity-check
Open

Cheaper receive pack connectivity check#2136
newren wants to merge 2 commits into
gitgitgadget:masterfrom
newren:cheap-receive-pack-connectivity-check

Conversation

@newren
Copy link
Copy Markdown

@newren newren commented Jun 3, 2026

Prior commits have tried to reduce the cost of git-receive-pack's connectivity check:

  • bcec678 (receive-pack: only use visible refs for connectivity check, 2022-11-17)
  • 68cb0b5 (builtin/receive-pack: add option to skip connectivity check, 2025-05-20)
    But there's still a big gap between turning it off and leaving it on. This series argues that the existing --not --all --exclude-hidden=receive, which is only an optimization, is not a particularly optimal one...and that using a much smaller set of refs provides a much better tradeoff. It introduces a new --toplevel-branches option to rev-list and uses it in receive-pack's connectivity check.

newren added 2 commits June 3, 2026 10:19
In large repositories which are served at central hubs (such as at
hosting forges), there tend to be huge numbers of refs, and even huge
numbers of branches.  These branches are often of forms like

  refs/heads/${username}/${topic}
  refs/heads/${bot}/${unique_name}

Sometimes, the sheer number of branches is itself a scaling challenge.
There are cases where we'd like a cheap approximation of "the union of
the main integration branches" (e.g. "main develop next maint"), and
it's okay to also take other branches along for the ride but it's nice
to limit how many extra refs we include.

Introduce a new pseudo-revision `--toplevel-branches` for this
purpose.  It is similar to the existing `--branches` pseudo-revision
and its `--branches=<pattern>` sibling, but there's been no ergonomic
way to exclude the large maze of branches in sub-hierarchies (`*` as
a pattern matches across `/`).

The next commit will use this to speed up push time reachability checks.

Add tests to `t6018-rev-list-glob.sh`, which already constructs a
fixture with a mix of top-level branches (`main`, `someref`,
`subspace-x`) and sub-directory branches (`other/three`, `subspace/one`,
`subspace/two`), making it a natural home for the new coverage.

Signed-off-by: Elijah Newren <newren@gmail.com>
When git-receive-pack(1) verifies that a pushed pack is fully
connected, it invokes

  git rev-list --objects --stdin --not --exclude-hidden=receive --all

with the new tips piped in on stdin.  The purpose of this check is to
ensure that ref updates sent with the receive-pack are well connected,
i.e. that they don't depend upon objects in the repository that
reference non-existent objects.  The portion following "--not" is just
a performance optimization that relies on the assumption that existing
refs in the repository are well connected.  (Thus, we assume that
refs are well connected, but not necessarily all objects in the
repository are.)

Prior to bcec678 (receive-pack: only use visible refs for
connectivity check, 2022-11-17) the only flag after "--not" was "--all",
but it was noted that this "optimization" was far from optimal and the
"--exclude-hidden=receive" argument was added.  In fact, having huge
numbers of refs after "--not" is generally suboptimal.  See also commit
68cb0b5 (builtin/receive-pack: add option to skip connectivity
check, 2025-05-20), which suggests that others besides us are still
getting connectivity checks that are taking too long.  If we still need
a connectivity check, the optimal choice is probably just the O(1) set
of "primary integration branches" that exist for the repository, e.g.
"main next seen maint develop", but discovering that list of branch
names automatically for any given repository is difficult.  My timings
suggest that using "HEAD" is preferable to using "--all" and is likely
near optimal in most cases, but I know there are a few repositories out
there that leave HEAD pointing to a non-existent branch or which never
update HEAD and update other branches instead.  Thus, I propose a middle
ground: "--toplevel-branches HEAD".

Some timings across 86 pushes in a monorepo:

    Group                                        Mean   Median   Min    Max
    1: --not --exclude-hidden=receive --all      2.219  2.520    0.33   8.30
    2: --not --branches                          2.099  2.430    0.24   8.43
    3: --not <objectnames of refs/heads/[^/]*>   0.883  0.770    0.05   2.11
    4: --not HEAD                                0.853  0.700    0.00   3.55

where <objectnames of refs/heads/[^/]*> was more specifically:
    $(git for-each-ref --format='%(objectname)' 'refs/heads/[^/]*')
and thus was also paying for the overhead of invoking a subprocess.

(For reference, this repo has about 6 times as many refs as branches,
and about 13 times as many total branches as toplevel branches.)

Note also that the biggest slowdown from group 1->3 on a given push was
0.34s -> 1.05s (approximately 3.1x) and the biggest speedup from group
1->3 on a given push was 4.52s -> 0.69s (approximately 6.5x).

While we could use the above for-each-ref invocation, that'd be an
extra subprocess, and it'd run the risk of violating command line length
limits (big monorepos often still have enough toplevel branches that we
have to worry about even those).  Instead, use the new
--toplevel-branches flag added in the previous commit, updating our
rev-list reachability command to

  git rev-list --objects --stdin --not --toplevel-branches HEAD

Signed-off-by: Elijah Newren <newren@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant