Skip to content

Test misc-next (regular, SELF)#1629

Open
kdave wants to merge 10000 commits into
btrfs:ci-kvmfrom
kdave:misc-next
Open

Test misc-next (regular, SELF)#1629
kdave wants to merge 10000 commits into
btrfs:ci-kvmfrom
kdave:misc-next

Conversation

@kdave
Copy link
Copy Markdown
Member

@kdave kdave commented Apr 17, 2026

No description provided.

@kdave kdave force-pushed the misc-next branch 8 times, most recently from a464848 to 8f84140 Compare April 27, 2026 14:35
@kdave kdave force-pushed the misc-next branch 2 times, most recently from d93f97d to 7d5a51d Compare April 29, 2026 14:08
Paolo Abeni and others added 17 commits May 12, 2026 15:20
Linus Walleij says:

====================
net: ethernet: cortina: Fix various RX bugs

During review of a minor patch for a bug in the Cortina
ethernet driver, Sashiko jumped in and pointed out a number
of nasty bugs.

This series hopefully fixes all of them.

Signed-off-by: Linus Walleij <linusw@kernel.org>
====================

Link: https://patch.msgid.link/20260509-gemini-ethernet-fixes-v1-0-6c5d20ddc35b@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The legacy ARM board file for MACH_MX31ADS was removed in commit
c93197b ("ARM: imx: Remove i.MX31 board files"), but a reference
to it remained in the cs89x0 driver. Drop this unused code.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Fixes: c93197b ("ARM: imx: Remove i.MX31 board files")
Link: https://patch.msgid.link/20260509023732.42256-1-enelsonmoore@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The usual way of inserting entries which are not yet fully ready
into XArray is to have a VALID flag. The shaper code has a NOT_VALID
flag. Since XArray code does not let us create entries with marks
already set - the creation of entries is currently not atomic.

Flip the polarity of the VALID flag. This closes the tiny race
in net_shaper_pre_insert() of entries being created without
the NOT_VALID flag.

Fixes: 93954b4 ("net-shapers: implement NL set and delete operations")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We should update the entry before we mark it as valid.

Fixes: 93954b4 ("net-shapers: implement NL set and delete operations")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES
entries. When userspace supplies the same leaf handle twice, the same
old-parent pointer lands twice in old_nodes[]. The cleanup loop double
frees the parent. Of course the same parent may still be in old_nodes[]
twice if we are moving multiple of its leaves.

Note that this patch also implicitly fixes the fact that the
i >= leaves_count path forgets to set ret.

Fixes: 5d5d470 ("net-shapers: implement NL group operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-4-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add test exercising duplicate leaves.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-5-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
genlmsg_new() alloc failure path in net_shaper_nl_group_doit() forgets
to set ret before jumping to error handling.

Fixes: 5d5d470 ("net-shapers: implement NL group operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-6-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX
attribute (via net_shaper_fill_binding()) and the nested
NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but
the reply skb at the call site in net_shaper_nl_group_doit() is
allocated using net_shaper_handle_size(), which only accounts for
the nested handle.

The allocation is therefore short by nla_total_size(sizeof(u32))
(8 bytes) for the IFINDEX attribute.  In practice the slab allocator
rounds up the small allocation so the bug is latent, but the size
accounting is wrong and could bite if the reply grew further.

Introduce net_shaper_group_reply_size() that accounts for the full
reply payload and use it both at the genlmsg_new() call site and in
the defensive WARN_ONCE message.

Fixes: 5d5d470 ("net-shapers: implement NL group operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-7-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Using definitions in kernel policies is awkward right now.
On one hand we want defines for max values and such.
On the other we don't have a way of adding kernel-only defines.
Adding unnecessary defines to uAPI is a bad idea, we won't
be able to delete them. And when it comes to policy user
space should just query it via the policy dump, not use
hard coded defines.

Add a "scope" property to definitions, which will let us tell
the codegen that a definition is for kernel use only. Support
following values:
  - uapi: render into the uAPI header (default, today's behavior)
  - kernel: render to kernel header only
  - user: same as kernel but for the user-side generated header

Definitions may have a header property (definition is "external",
provided by existing header). Extend the scope to headers, too.
If definition has both scope and header properties we will only
generate the includes in the right scope.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-8-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_parse_handle() reads the user-supplied handle ID via
nla_get_u32(), accepting the full u32 range. However, the xarray key
is built by net_shaper_handle_to_index() using
FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK
is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off
the upper bits at runtime. A user-supplied NODE id like 0x04000123
becomes id 0x123.

Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC
(0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with
the sentinel used internally by the group operation to signal
"allocate a new NODE id".

Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF)
in the policy.

Fixes: 4b623f9 ("net-shapers: implement NL get operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-9-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NETDEV scope represents a singleton root shaper in the per-device
hierarchy.  All code assumes NETDEV shapers have id 0:
net_shaper_default_parent() hardcodes parent->id = 0 when returning
the NETDEV parent for QUEUE/NODE children, and the UAPI documentation
describes NETDEV scope as "the main shaper" (singular, not plural).

Make sure we reject non-0 IDs.

Fixes: 4b623f9 ("net-shapers: implement NL get operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-10-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_parse_handle() does not enforce that the user provides
the handle ID. For NODE the ID defaults to UNSPEC for all other
cases it defaults to 0.

For NETDEV 0 is the only option. For QUEUE defaulting to 0 makes
less intuitive sense. Specifically because the behavior should
(IMHO) be the same for all cases where there may be more than
one ID (QUEUE and NODE).

We should either document this as intentional or reject.
I picked the latter with no strong conviction.

Fixes: 4b623f9 ("net-shapers: implement NL get operation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260510192904.3987113-11-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski says:

====================
net: shaper: fix various minor bugs

Fix various minor bugs in the net shaper API.

First 2 patches deal with ordering issues around inserting
and publishing new shapers. Shapers are inserted "tentatively"
and marked valid only after HW op succeeded, this used to
be slightly racy.

Only other patch of note is patch 8. We want to add a Netlink
policy check on the handle ID. This necessitates patch 7.

The rest are simple and self-explanatory.

v1: https://lore.kernel.org/20260506000628.1501691-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20260510192904.3987113-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
New Infolevels for QUERY_DIR (and QUERY_INFO) levels 78 through 81 are
now being used by Windows clients and were added to the documentation.
Add defines for them (and correct some typos in documentation).  See
MS-SMB2 2.2.33 and MS-FSCC 2.4

Signed-off-by: Steve French <stfrench@microsoft.com>
The driver uses an initial IO to set the device to a default
state. That initialization is currently being done after the device
node has been created. That means that the single buffer used
for output can be altered while IO is in progress.
Move the intialization before announcement to user space.

Fixes: fac733f ("HID: force feedback support for SmartJoy PLUS PS2/USB adapter")
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
bio_integrity_add_page() already sets bip_vcnt to 1 for the bounce
segment. Overwriting it with nr_vecs breaks bip_vcnt <= bip_max_vcnt
on WRITE (bip_max_vcnt is 1), so the gap-merge checks in block/blk.h
read past the bip_vec[] flex array. On READ the read is in bounds
but lands on a saved user bvec instead of the bounce.

The line was added for split propagation, but bio_integrity_clone()
doesn't copy bip_vcnt and BIP_CLONE_FLAGS excludes BIP_COPY_USER.

Fixes: 3991657 ("block: set bip_vcnt correctly")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260511215151.346228-1-devnexen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_insert_cloned_request() already recomputes nr_phys_segments
against the bottom queue, because "the queue settings related to
segment counting may differ from the original queue." The exact same
reasoning applies to integrity segments: a stacked driver's underlying
queue can have tighter virt_boundary_mask, seg_boundary_mask, or
max_segment_size than the top queue, in which case
blk_rq_count_integrity_sg() against the bottom queue produces a
different count than the cached rq->nr_integrity_segments inherited
from the source request by blk_rq_prep_clone().

When the cached count is lower than the bottom queue's actual count,
blk_rq_map_integrity_sg() trips

	BUG_ON(segments > rq->nr_integrity_segments);

on dispatch. The same families of stacked setups that motivated the
existing nr_phys_segments recompute -- dm-multipath fanning out to
nvme-rdma in particular -- can produce this.

Mirror the nr_phys_segments handling: when the request carries
integrity, recompute nr_integrity_segments against the bottom queue
and reject the request if it exceeds the bottom queue's
max_integrity_segments. blk_rq_count_integrity_sg() and
queue_max_integrity_segments() are both already available via
<linux/blk-integrity.h>, which blk-mq.c includes.

This closes a latent gap in the stacking contract and brings the
integrity-segment accounting in line with the existing
phys-segment accounting.

Fixes: 76c313f ("blk-integrity: improved sg segment mapping")
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260511212230.27511-1-cachen@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
fdmanana and others added 18 commits May 23, 2026 00:27
btrfs_log_inode_parent() is one of the most important steps called during
a fsync operation as well as during rename and link operations on inodes
that were previously logged. Add trace events for when entering and
exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this unnamed enum for the log mode and then pass it around log
functions as an int type with the odd name "inode_only" which suggests a
boolean. So add a name to the enum and change the type everywhere to that
enum and rename the parameters to something more clear - "log_mode".
Also move the enum into tree-log.h - it will be used later by new trace
events.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_inode() is one of the most important steps called during a fsync,
as well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_all_parents() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
log_all_new_ancestors() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
log_new_dir_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_conflicting_inode() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
log_conflicting_inodes() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…play

In overwrite_item():

  There's no point in printing the root's ID if the assertion fails, since
  it can only be BTRFS_TREE_LOG_OBJECTID if it fails.

In log_new_delayed_dentries():

  There's no point in using a verbose assertion to print the value of
  ctx->logging_new_delayed_dentries because it's a boolean, so if the
  assertion fails we know its value is true (1).

So convert them to simpler assertion to make the code less verbose.
It also slightly reduces the object size, at least on x86_64 using
Debian's gcc 14.2.0-19 (if CONFIG_BTRFS_ASSERT is enabled in the kernel
config, which is the case for SUSE distributions for example).

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2028244	 197176	  15624	2241044	 223214	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2028228	 197176	  15624	2241028	 223204	fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
log_new_delayed_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_unlink_dir() is an important operation that affects inode
logging and is called during unlink and rename operations. Add a trace
event for it to help debug issues.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_snapshot_destroy() is an important operation that affects
inode logging and is called during subvolume/snapshot deletion as well as
during rmdir. Add a trace event for it to help debug issues.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_new_subvolume() is an important operation that affects
inode logging and is called during subvolume creation. Add a trace event
for it to help debug issues.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_new_name() is an important function that affects inode logging
and is called during link and rename operations. Add trace events for when
entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_sync_log() is one of the main functions called during a fsync.
Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Print the type of the inode (directory, regular file, symlink, etc) to
facilitate debugging.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
sb_write_pointer() reads the super block from the block device page cache
using read_cache_page_gfp(). This has the same race with BLKBSZSET as the
one fixed by commit 3f29d66 ("btrfs: sync read disk super and set
block size").

Take the mapping invalidate lock around read_cache_page_gfp() to
serialize the read against block size changes.

Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the beginning of the loop, we try to obtain a locked delayed ref head,
if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(),
which can return an error pointer. If the error pointer is -EAGAIN we do
a continue and go back to the beginning of the loop, which will not try
again to call btrfs_select_ref_head() since 'locked_ref' is no longer
NULL but it's ERR_PTR(-EAGAIN), and then we do:

   spin_lock(&locked_ref->lock);

against a ERR_PTR(-EAGAIN) value, generating an invalid pointer
dereference.

Fix this by ensuring that 'locked_ref' is set to NULL when
btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count'
as well, to prevent infinite looping. We do this by doing a goto to the
bottom of the loop that already sets 'locked_ref' to NULL and does a
cond_resched(), with an increment to 'count' right before the goto.
These measures were in place before the refactoring in commit 0110a4c
("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally
lost afterwards.

Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/
Fixes: 0110a4c ("btrfs: refactor __btrfs_run_delayed_refs loop")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave added 2 commits May 23, 2026 23:21
…on()

In preparation to encode more information to the error value add a step
that verifies if the value is valid (i.e. < 0). This works for
compile-time and runtime (in debugging mode).

The compile-time check recognizes direct constants and defines an array
type. An invalid condition leads to negative array size which is caught
by compiler.

The runtime check constructs the array type from the condition and only
verifies the correct size, as we don't need to tweak the size to be
negative.

The sizeof() expressions do not generate any code. In the debugging
config the warning adds about 9KiB of btrfs.ko code size.

The array size trick is needed as we can't use static_array(), not even
with __builtin_constant_p().

Sample error message:

In file included from inode.c:40:
inode.c: In function ‘__cow_file_range_inline’:
transaction.h:261:26: error: size of unnamed array is negative
  261 |         (void)sizeof(char[-!(__builtin_constant_p(error) ? (error) < 0 : 1)]);  \
      |                          ^
transaction.h:275:9: note: in expansion of macro ‘VERIFY_NEGATIVE_ERROR’
  275 |         VERIFY_NEGATIVE_ERROR(error);                           \
      |         ^~~~~~~~~~~~~~~~~~~~~
inode.c:665:17: note: in expansion of macro ‘btrfs_abort_transaction’
  665 |                 btrfs_abort_transaction(trans, 17);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: David Sterba <dsterba@suse.com>
Optimize the btrfs_abort_transaction() for size as it (by our
convention) must be put right after the error condition is detected.
The exact file:line is reported so there's a portion that must be
inlined. As this is cold code it bloats functions. In previous patch
"btrfs: move transaction abort message to __btrfs_abort_transaction()"
the error message was moved to the common helper, saving like 20KiB of
btrfs.ko and several instructions per call site and some stack space.

There's little left to be optimized, we need to keep the atomic
test_and_set_bit() and to convey that as 'first hit' to
__btrfs_abort_transaction().

Right now it's a bool, which takes 8 bytes on stack for each call but
it's 1 bit of information. We can encode that to some of the other
parameters.

For that let's use the 'error' parameter, by convention it's negative
errno so we can reliably detect if it's the first hit or a later error.
Also the negation is usually implemented by a single instruction (NEG on
x86_64) so the resulting object code is kept short.

This reduces btrfs.ko by 8K and stack in several functions by 8 bytes.

Cumulative effect with the other commit is -30K of btrfs.ko. While the
encoding is an implementation detail, it's contained within the API.
Making the transaction abort calls very light is desired.

Signed-off-by: David Sterba <dsterba@suse.com>
kdave and others added 7 commits May 23, 2026 23:22
Any commits after this one are for testing and evaluation only.

Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans() has a lockless fast path for shareable
roots. It skips reloc_mutex when root->last_trans matches the current
transaction and BTRFS_ROOT_IN_TRANS_SETUP is clear.

The writer side publishes that state in two phases: it sets
IN_TRANS_SETUP before updating root->last_trans, then clears the bit
after btrfs_init_reloc_root() finishes. However, the reader-side
smp_rmb() is before both loads, so it does not order the last_trans load
against the later bit test. A reader can observe the new last_trans value
while missing the setup bit and return before the relocation-root setup
is complete.

Read root->last_trans first, then issue the read barrier before testing
IN_TRANS_SETUP. Also use clear_bit_unlock() for the writer's final clear
and test_bit_acquire() for the successful fast path, so the lockless
return observes the setup done before the bit was cleared.

Fixes: 7585717 ("Btrfs: fix relocation races")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comments at the beginning of subpage.c are out-of-date, a lot of the
limits are already resolved.

Update them to reflect the latest status.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…maps

[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.

That limit allows us to easily grab an unsigned long without the need to
properly allocate memory for a larger bitmap.

Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.

[ENHANCEMENT]
To allow direct bitmap operations without allocating new memory,
introduce two different ways to access the subpage bitmaps:

- Return an unsigned long value
  This only happens if blocks_per_folio <= BITS_PER_LONG.

  We read out the sub-bitmap into an unsigned long, and return the
  value.
  This is the old existing method.

  This involves get_bitmap_value_##name() helper functions.
  And this time the helper functions are defined as inline functions
  instead of macros to provide better type checks.

- Return a pointer where the sub-bitmap starts
  This only happens if blocks_per_folio >= BITS_PER_LONG.

  This is the new method for sub-bitmaps larger than BITS_PER_LONG.
  Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we
  can directly access the start word of the sub-bitmap.

  This involves get_bitmap_pointer_##name() helper functions.

Then change the existing sub-bitmaps users to use the new helpers:

- Bitmap dumping
  Switch between get_bitmap_value_##name() and
  get_bitmap_pointer_##name() depending on the sub-bitmap size.

- btrfs_get_subpage_dirty_bitmap()
  Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new
  value/pointer naming.
  Since we do not support huge folios yet, there is no pointer version
  for the dirty bitmap.

  Furthermore, add the support for bs == ps cases for
  btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer
  needs to check if the folio needs subpage handling.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.

One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap,
which makes it very simple and straightforward to just grab an unsigned
long value and assign it to submit_bitmap.

Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.

[ENHANCEMENT]
Instead of using a fixed unsigned long value, change
btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer.

And for cases where an unsigned long can hold the whole bitmap,
introduce @submit_bitmap_value, and just point that pointer to that
unsigned long.

Then update all direct users of bio_ctrl->submit_bitmap to use the
pointer version.

There are several call sites that get extra changes:

- @range_bitmap inside extent_writepage_io()
  Which is only utilized to truncate the bitmap.
  Since we do not want to allocate new memory just for such temporary
  usage, change the original bitmap_set() and bitmap_and() into
  bitmap_clear() for the ranges outside of the target range.

- Getting dirty subpage bitmap inside writepage_delalloc()
  Since we're passing an unsigned long pointer now, we need to go with
  different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG,
  blocks_per_folio > BITS_PER_LONG).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With all the previous preparations, it's finally time to enable the
huge folio support.

- The max folio size
  Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed at 2MiB.

  This will ensure we have a large enough but not too large folio for
  btrfs.
  This limit applies to all systems regardless of page size.

  Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on
  CONFIG_BTRFS_EXPERIMENTAL.

  If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512,
  otherwise it's BITS_PER_LONG.

  The filemap max order will be calculated using both
  BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO.

  E.g. for 64K page size with 64K fs block size, the limit will be
  BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5.
  This will be lower than the old order (6), but folios larger than 2M
  are rarely any better for IO performance. Meanwhile excessively large
  folios can cause other problems like stalling the IO pipeline for too
  long.

  For 4K page size and 4K fs block size, the limit will be increased to
  2M from the old 256K.
  This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and
  BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to reach huge
  folio support, and the filemap max order will be 9.

- btrfs_bio_ctrl::submit_bitmap
  This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and
  this will be on-stack memory.
  This will increase on-stack memory usage by 56 bytes compared to the
  baseline (before the first patch in the series).

- Local @delalloc_bitmap inside writepage_delalloc()
  Unfortunately we cannot afford to handle an allocation error here, thus
  again we use on-stack memory.
  Thus this will increase on-stack memory usage by 56 bytes again.

So unfortunately this means during the delalloc window, the writeback path
will have +112 bytes on-stack memory usage, and for other cases the
writeback path will have +56 bytes on-stack memory usage.

The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed
after we have reworked the compression submission, so the current
on-stack submit_bitmap is mostly a workaround until then.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is
laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of
the 12 bytes a 32-bit userspace expects, because the surrounding struct
is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has
a different size and layout in 32-bit userspace than in the 64-bit
kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers.

Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed
btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields,
define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size
argument, factor the existing handler into a shared _btrfs_ioctl_get_
subvol_info() helper, and add btrfs_ioctl_get_subvol_info_32() which
fills the kernel struct and translates field-by-field into the 32-bit
struct before copy_to_user().

Signed-off-by: Daan De Meyer <daan@amutable.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.