Test misc-next (regular, SELF)#1629
Open
kdave wants to merge 10000 commits into
Open
Conversation
a464848 to
8f84140
Compare
d93f97d to
7d5a51d
Compare
Linus Walleij says: ==================== net: ethernet: cortina: Fix various RX bugs During review of a minor patch for a bug in the Cortina ethernet driver, Sashiko jumped in and pointed out a number of nasty bugs. This series hopefully fixes all of them. Signed-off-by: Linus Walleij <linusw@kernel.org> ==================== Link: https://patch.msgid.link/20260509-gemini-ethernet-fixes-v1-0-6c5d20ddc35b@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The legacy ARM board file for MACH_MX31ADS was removed in commit c93197b ("ARM: imx: Remove i.MX31 board files"), but a reference to it remained in the cs89x0 driver. Drop this unused code. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Fixes: c93197b ("ARM: imx: Remove i.MX31 board files") Link: https://patch.msgid.link/20260509023732.42256-1-enelsonmoore@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The usual way of inserting entries which are not yet fully ready into XArray is to have a VALID flag. The shaper code has a NOT_VALID flag. Since XArray code does not let us create entries with marks already set - the creation of entries is currently not atomic. Flip the polarity of the VALID flag. This closes the tiny race in net_shaper_pre_insert() of entries being created without the NOT_VALID flag. Fixes: 93954b4 ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We should update the entry before we mark it as valid. Fixes: 93954b4 ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES entries. When userspace supplies the same leaf handle twice, the same old-parent pointer lands twice in old_nodes[]. The cleanup loop double frees the parent. Of course the same parent may still be in old_nodes[] twice if we are moving multiple of its leaves. Note that this patch also implicitly fixes the fact that the i >= leaves_count path forgets to set ret. Fixes: 5d5d470 ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add test exercising duplicate leaves. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-5-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
genlmsg_new() alloc failure path in net_shaper_nl_group_doit() forgets to set ret before jumping to error handling. Fixes: 5d5d470 ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-6-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX attribute (via net_shaper_fill_binding()) and the nested NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but the reply skb at the call site in net_shaper_nl_group_doit() is allocated using net_shaper_handle_size(), which only accounts for the nested handle. The allocation is therefore short by nla_total_size(sizeof(u32)) (8 bytes) for the IFINDEX attribute. In practice the slab allocator rounds up the small allocation so the bug is latent, but the size accounting is wrong and could bite if the reply grew further. Introduce net_shaper_group_reply_size() that accounts for the full reply payload and use it both at the genlmsg_new() call site and in the defensive WARN_ONCE message. Fixes: 5d5d470 ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-7-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Using definitions in kernel policies is awkward right now. On one hand we want defines for max values and such. On the other we don't have a way of adding kernel-only defines. Adding unnecessary defines to uAPI is a bad idea, we won't be able to delete them. And when it comes to policy user space should just query it via the policy dump, not use hard coded defines. Add a "scope" property to definitions, which will let us tell the codegen that a definition is for kernel use only. Support following values: - uapi: render into the uAPI header (default, today's behavior) - kernel: render to kernel header only - user: same as kernel but for the user-side generated header Definitions may have a header property (definition is "external", provided by existing header). Extend the scope to headers, too. If definition has both scope and header properties we will only generate the includes in the right scope. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-8-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_parse_handle() reads the user-supplied handle ID via nla_get_u32(), accepting the full u32 range. However, the xarray key is built by net_shaper_handle_to_index() using FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off the upper bits at runtime. A user-supplied NODE id like 0x04000123 becomes id 0x123. Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC (0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with the sentinel used internally by the group operation to signal "allocate a new NODE id". Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF) in the policy. Fixes: 4b623f9 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-9-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NETDEV scope represents a singleton root shaper in the per-device hierarchy. All code assumes NETDEV shapers have id 0: net_shaper_default_parent() hardcodes parent->id = 0 when returning the NETDEV parent for QUEUE/NODE children, and the UAPI documentation describes NETDEV scope as "the main shaper" (singular, not plural). Make sure we reject non-0 IDs. Fixes: 4b623f9 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-10-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net_shaper_parse_handle() does not enforce that the user provides the handle ID. For NODE the ID defaults to UNSPEC for all other cases it defaults to 0. For NETDEV 0 is the only option. For QUEUE defaulting to 0 makes less intuitive sense. Specifically because the behavior should (IMHO) be the same for all cases where there may be more than one ID (QUEUE and NODE). We should either document this as intentional or reject. I picked the latter with no strong conviction. Fixes: 4b623f9 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-11-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski says: ==================== net: shaper: fix various minor bugs Fix various minor bugs in the net shaper API. First 2 patches deal with ordering issues around inserting and publishing new shapers. Shapers are inserted "tentatively" and marked valid only after HW op succeeded, this used to be slightly racy. Only other patch of note is patch 8. We want to add a Netlink policy check on the handle ID. This necessitates patch 7. The rest are simple and self-explanatory. v1: https://lore.kernel.org/20260506000628.1501691-1-kuba@kernel.org ==================== Link: https://patch.msgid.link/20260510192904.3987113-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
New Infolevels for QUERY_DIR (and QUERY_INFO) levels 78 through 81 are now being used by Windows clients and were added to the documentation. Add defines for them (and correct some typos in documentation). See MS-SMB2 2.2.33 and MS-FSCC 2.4 Signed-off-by: Steve French <stfrench@microsoft.com>
The driver uses an initial IO to set the device to a default state. That initialization is currently being done after the device node has been created. That means that the single buffer used for output can be altered while IO is in progress. Move the intialization before announcement to user space. Fixes: fac733f ("HID: force feedback support for SmartJoy PLUS PS2/USB adapter") Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
bio_integrity_add_page() already sets bip_vcnt to 1 for the bounce segment. Overwriting it with nr_vecs breaks bip_vcnt <= bip_max_vcnt on WRITE (bip_max_vcnt is 1), so the gap-merge checks in block/blk.h read past the bip_vec[] flex array. On READ the read is in bounds but lands on a saved user bvec instead of the bounce. The line was added for split propagation, but bio_integrity_clone() doesn't copy bip_vcnt and BIP_CLONE_FLAGS excludes BIP_COPY_USER. Fixes: 3991657 ("block: set bip_vcnt correctly") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260511215151.346228-1-devnexen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_insert_cloned_request() already recomputes nr_phys_segments against the bottom queue, because "the queue settings related to segment counting may differ from the original queue." The exact same reasoning applies to integrity segments: a stacked driver's underlying queue can have tighter virt_boundary_mask, seg_boundary_mask, or max_segment_size than the top queue, in which case blk_rq_count_integrity_sg() against the bottom queue produces a different count than the cached rq->nr_integrity_segments inherited from the source request by blk_rq_prep_clone(). When the cached count is lower than the bottom queue's actual count, blk_rq_map_integrity_sg() trips BUG_ON(segments > rq->nr_integrity_segments); on dispatch. The same families of stacked setups that motivated the existing nr_phys_segments recompute -- dm-multipath fanning out to nvme-rdma in particular -- can produce this. Mirror the nr_phys_segments handling: when the request carries integrity, recompute nr_integrity_segments against the bottom queue and reject the request if it exceeds the bottom queue's max_integrity_segments. blk_rq_count_integrity_sg() and queue_max_integrity_segments() are both already available via <linux/blk-integrity.h>, which blk-mq.c includes. This closes a latent gap in the stacking contract and brings the integrity-segment accounting in line with the existing phys-segment accounting. Fixes: 76c313f ("blk-integrity: improved sg segment mapping") Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260511212230.27511-1-cachen@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs_log_inode_parent() is one of the most important steps called during a fsync operation as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
We use this unnamed enum for the log mode and then pass it around log functions as an int type with the odd name "inode_only" which suggests a boolean. So add a name to the enum and change the type everywhere to that enum and rename the parameters to something more clear - "log_mode". Also move the enum into tree-log.h - it will be used later by new trace events. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_inode() is one of the most important steps called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_all_parents() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
log_all_new_ancestors() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
log_new_dir_dentries() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
add_conflicting_inode() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
log_conflicting_inodes() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…play
In overwrite_item():
There's no point in printing the root's ID if the assertion fails, since
it can only be BTRFS_TREE_LOG_OBJECTID if it fails.
In log_new_delayed_dentries():
There's no point in using a verbose assertion to print the value of
ctx->logging_new_delayed_dentries because it's a boolean, so if the
assertion fails we know its value is true (1).
So convert them to simpler assertion to make the code less verbose.
It also slightly reduces the object size, at least on x86_64 using
Debian's gcc 14.2.0-19 (if CONFIG_BTRFS_ASSERT is enabled in the kernel
config, which is the case for SUSE distributions for example).
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2028244 197176 15624 2241044 223214 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2028228 197176 15624 2241028 223204 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
log_new_delayed_dentries() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_unlink_dir() is an important operation that affects inode logging and is called during unlink and rename operations. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_snapshot_destroy() is an important operation that affects inode logging and is called during subvolume/snapshot deletion as well as during rmdir. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_new_subvolume() is an important operation that affects inode logging and is called during subvolume creation. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_new_name() is an important function that affects inode logging and is called during link and rename operations. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_sync_log() is one of the main functions called during a fsync. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Print the type of the inode (directory, regular file, symlink, etc) to facilitate debugging. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
sb_write_pointer() reads the super block from the block device page cache using read_cache_page_gfp(). This has the same race with BLKBSZSET as the one fixed by commit 3f29d66 ("btrfs: sync read disk super and set block size"). Take the mapping invalidate lock around read_cache_page_gfp() to serialize the read against block size changes. Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
In the beginning of the loop, we try to obtain a locked delayed ref head, if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(), which can return an error pointer. If the error pointer is -EAGAIN we do a continue and go back to the beginning of the loop, which will not try again to call btrfs_select_ref_head() since 'locked_ref' is no longer NULL but it's ERR_PTR(-EAGAIN), and then we do: spin_lock(&locked_ref->lock); against a ERR_PTR(-EAGAIN) value, generating an invalid pointer dereference. Fix this by ensuring that 'locked_ref' is set to NULL when btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count' as well, to prevent infinite looping. We do this by doing a goto to the bottom of the loop that already sets 'locked_ref' to NULL and does a cond_resched(), with an increment to 'count' right before the goto. These measures were in place before the refactoring in commit 0110a4c ("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally lost afterwards. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/ Fixes: 0110a4c ("btrfs: refactor __btrfs_run_delayed_refs loop") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…on()
In preparation to encode more information to the error value add a step
that verifies if the value is valid (i.e. < 0). This works for
compile-time and runtime (in debugging mode).
The compile-time check recognizes direct constants and defines an array
type. An invalid condition leads to negative array size which is caught
by compiler.
The runtime check constructs the array type from the condition and only
verifies the correct size, as we don't need to tweak the size to be
negative.
The sizeof() expressions do not generate any code. In the debugging
config the warning adds about 9KiB of btrfs.ko code size.
The array size trick is needed as we can't use static_array(), not even
with __builtin_constant_p().
Sample error message:
In file included from inode.c:40:
inode.c: In function ‘__cow_file_range_inline’:
transaction.h:261:26: error: size of unnamed array is negative
261 | (void)sizeof(char[-!(__builtin_constant_p(error) ? (error) < 0 : 1)]); \
| ^
transaction.h:275:9: note: in expansion of macro ‘VERIFY_NEGATIVE_ERROR’
275 | VERIFY_NEGATIVE_ERROR(error); \
| ^~~~~~~~~~~~~~~~~~~~~
inode.c:665:17: note: in expansion of macro ‘btrfs_abort_transaction’
665 | btrfs_abort_transaction(trans, 17);
| ^~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: David Sterba <dsterba@suse.com>
Optimize the btrfs_abort_transaction() for size as it (by our convention) must be put right after the error condition is detected. The exact file:line is reported so there's a portion that must be inlined. As this is cold code it bloats functions. In previous patch "btrfs: move transaction abort message to __btrfs_abort_transaction()" the error message was moved to the common helper, saving like 20KiB of btrfs.ko and several instructions per call site and some stack space. There's little left to be optimized, we need to keep the atomic test_and_set_bit() and to convey that as 'first hit' to __btrfs_abort_transaction(). Right now it's a bool, which takes 8 bytes on stack for each call but it's 1 bit of information. We can encode that to some of the other parameters. For that let's use the 'error' parameter, by convention it's negative errno so we can reliably detect if it's the first hit or a later error. Also the negation is usually implemented by a single instruction (NEG on x86_64) so the resulting object code is kept short. This reduces btrfs.ko by 8K and stack in several functions by 8 bytes. Cumulative effect with the other commit is -30K of btrfs.ko. While the encoding is an implementation detail, it's contained within the API. Making the transaction abort calls very light is desired. Signed-off-by: David Sterba <dsterba@suse.com>
Any commits after this one are for testing and evaluation only. Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans() has a lockless fast path for shareable roots. It skips reloc_mutex when root->last_trans matches the current transaction and BTRFS_ROOT_IN_TRANS_SETUP is clear. The writer side publishes that state in two phases: it sets IN_TRANS_SETUP before updating root->last_trans, then clears the bit after btrfs_init_reloc_root() finishes. However, the reader-side smp_rmb() is before both loads, so it does not order the last_trans load against the later bit test. A reader can observe the new last_trans value while missing the setup bit and return before the relocation-root setup is complete. Read root->last_trans first, then issue the read barrier before testing IN_TRANS_SETUP. Also use clear_bit_unlock() for the writer's final clear and test_bit_acquire() for the successful fast path, so the lockless return observes the setup done before the bit was cleared. Fixes: 7585717 ("Btrfs: fix relocation races") Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
The comments at the beginning of subpage.c are out-of-date, a lot of the limits are already resolved. Update them to reflect the latest status. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…maps [CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. That limit allows us to easily grab an unsigned long without the need to properly allocate memory for a larger bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] To allow direct bitmap operations without allocating new memory, introduce two different ways to access the subpage bitmaps: - Return an unsigned long value This only happens if blocks_per_folio <= BITS_PER_LONG. We read out the sub-bitmap into an unsigned long, and return the value. This is the old existing method. This involves get_bitmap_value_##name() helper functions. And this time the helper functions are defined as inline functions instead of macros to provide better type checks. - Return a pointer where the sub-bitmap starts This only happens if blocks_per_folio >= BITS_PER_LONG. This is the new method for sub-bitmaps larger than BITS_PER_LONG. Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we can directly access the start word of the sub-bitmap. This involves get_bitmap_pointer_##name() helper functions. Then change the existing sub-bitmaps users to use the new helpers: - Bitmap dumping Switch between get_bitmap_value_##name() and get_bitmap_pointer_##name() depending on the sub-bitmap size. - btrfs_get_subpage_dirty_bitmap() Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new value/pointer naming. Since we do not support huge folios yet, there is no pointer version for the dirty bitmap. Furthermore, add the support for bs == ps cases for btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer needs to check if the folio needs subpage handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap, which makes it very simple and straightforward to just grab an unsigned long value and assign it to submit_bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] Instead of using a fixed unsigned long value, change btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer. And for cases where an unsigned long can hold the whole bitmap, introduce @submit_bitmap_value, and just point that pointer to that unsigned long. Then update all direct users of bio_ctrl->submit_bitmap to use the pointer version. There are several call sites that get extra changes: - @range_bitmap inside extent_writepage_io() Which is only utilized to truncate the bitmap. Since we do not want to allocate new memory just for such temporary usage, change the original bitmap_set() and bitmap_and() into bitmap_clear() for the ranges outside of the target range. - Getting dirty subpage bitmap inside writepage_delalloc() Since we're passing an unsigned long pointer now, we need to go with different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG, blocks_per_folio > BITS_PER_LONG). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
With all the previous preparations, it's finally time to enable the huge folio support. - The max folio size Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed at 2MiB. This will ensure we have a large enough but not too large folio for btrfs. This limit applies to all systems regardless of page size. Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on CONFIG_BTRFS_EXPERIMENTAL. If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512, otherwise it's BITS_PER_LONG. The filemap max order will be calculated using both BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO. E.g. for 64K page size with 64K fs block size, the limit will be BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5. This will be lower than the old order (6), but folios larger than 2M are rarely any better for IO performance. Meanwhile excessively large folios can cause other problems like stalling the IO pipeline for too long. For 4K page size and 4K fs block size, the limit will be increased to 2M from the old 256K. This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to reach huge folio support, and the filemap max order will be 9. - btrfs_bio_ctrl::submit_bitmap This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and this will be on-stack memory. This will increase on-stack memory usage by 56 bytes compared to the baseline (before the first patch in the series). - Local @delalloc_bitmap inside writepage_delalloc() Unfortunately we cannot afford to handle an allocation error here, thus again we use on-stack memory. Thus this will increase on-stack memory usage by 56 bytes again. So unfortunately this means during the delalloc window, the writeback path will have +112 bytes on-stack memory usage, and for other cases the writeback path will have +56 bytes on-stack memory usage. The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed after we have reworked the compression submission, so the current on-stack submit_bitmap is mostly a workaround until then. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of the 12 bytes a 32-bit userspace expects, because the surrounding struct is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has a different size and layout in 32-bit userspace than in the 64-bit kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers. Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields, define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size argument, factor the existing handler into a shared _btrfs_ioctl_get_ subvol_info() helper, and add btrfs_ioctl_get_subvol_info_32() which fills the kernel struct and translates field-by-field into the 32-bit struct before copy_to_user(). Signed-off-by: Daan De Meyer <daan@amutable.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.