
Test for-next ARM64 64K (regular, SELF)#1630

Open
kdave wants to merge 10000 commits into ci-arm-kvm from for-next

Conversation

@kdave kdave commented Apr 17, 2026


No description provided.

@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from ad252c6 to af81080 Compare April 18, 2026 04:42
@kdave kdave force-pushed the for-next branch 2 times, most recently from 30c6cb0 to 73d4bbd Compare April 22, 2026 19:47
@kdave kdave force-pushed the for-next branch 2 times, most recently from 26f5cfa to 2189fe7 Compare April 24, 2026 11:09
@kdave kdave force-pushed the for-next branch 2 times, most recently from 5280eae to 52d1b61 Compare April 27, 2026 14:33
@adam900710 adam900710 force-pushed the for-next branch 3 times, most recently from 40c2283 to 09752d4 Compare April 28, 2026 00:45
@kdave kdave force-pushed the for-next branch 2 times, most recently from 29451dd to dc188da Compare April 28, 2026 06:01
@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from 4a55cf6 to 436ac81 Compare May 3, 2026 08:53
@fdmanana fdmanana force-pushed the for-next branch 2 times, most recently from e32c6db to 49a0b34 Compare May 4, 2026 15:50
@kdave kdave force-pushed the for-next branch 4 times, most recently from 4137f02 to f2ac86e Compare May 12, 2026 15:03
@kdave kdave force-pushed the for-next branch 2 times, most recently from db2485b to 0c78978 Compare May 16, 2026 00:59
adam900710 and others added 4 commits June 22, 2026 19:45
…ck groups

A swap file on btrfs will pin down block groups that cover the swap file
extent.

Pinned down block groups will be skipped for scrub and relocation.

This degradation of critical btrfs maintenance operations has never been
properly communicated to end users, and has already caused problems
including:

- Scrub finishes too quickly
  Because the enabled swap file has pinned down most of the block
  groups, any file extents in those block groups, even those not used
  by the swap file, will be skipped by scrub.

- Unbalanced data and metadata usage that relocation cannot fix
  For the same reason, pinned down block groups will not be considered
  as relocation targets, thus data extents that are not used by the
  swap file will still be skipped by relocation.

Although we already have kernel messages for both scrub and balance, the
balance one is still info level.

To better communicate those potential long term problems, add the
following output into dmesg:

- Change the message level to warn for __btrfs_balance()

- Total pinned down block group number and size during swapfile activation
- Total released block group number and size during swapfile deactivation
  The above messages have info level.

- The fact that pinned down block groups will not be scrubbed or
  balanced
  The above message has warning level.

The example output would look like the following, for enabling a 1.2G
swapfile, which pinned down 2G of block groups:

 BTRFS info (device dm-3): swapfile activated on root 5 ino 257, pinned down 2147483648 bytes from 2 block group(s)
 BTRFS warning (device dm-3): block groups with swapfile extents will not be scrubbed or balanced
 Adding 1257468k swap on /mnt/btrfs/foobar.  Priority:-1 extents:1 across:1257468k
 BTRFS info (device dm-3): swapfile deactivated on root 5 ino 257, released 2147483648 bytes from 2 block group(s)
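
As a rough illustration of how a 1.2G swapfile can pin down 2G of block
groups, the sketch below counts the block groups touched by one
contiguous extent. It assumes uniformly sized block groups that start at
multiples of their size, which is a simplification, and
demo_pinned_bytes() with its parameters is a hypothetical name, not a
kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: total size of the block groups touched by the
 * extent [start, start + len), assuming every block group has the same
 * size and starts at a multiple of that size (a simplification).
 */
uint64_t demo_pinned_bytes(uint64_t start, uint64_t len, uint64_t bg_size)
{
	uint64_t first_bg, last_bg;

	assert(bg_size > 0);
	if (len == 0)
		return 0;

	first_bg = start / bg_size;
	last_bg = (start + len - 1) / bg_size;

	/* Each touched block group is pinned as a whole. */
	return (last_bg - first_bg + 1) * bg_size;
}
```

Under these assumptions, an extent of roughly 1.2G starting at a 1GiB
boundary touches two block groups, so 2147483648 bytes are pinned even
though most of the second block group is not used by the swapfile.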

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable-sized buffer buf in struct btrfs_ioctl_search_args_v2 is
declared as __u64[], but it holds a packed byte stream of search results,
where all offsets into the buffer are in bytes.

Declaring buf as __u64[] makes it easy for user space to write incorrect
pointer arithmetic: adding a byte offset directly to a __u64 pointer
scales the offset by 8, landing at byte position offset*8 instead of
offset.

This recently caused an infinite loop in btrfs-progs: the accessor read
all-zero data from misaddressed items, which fed zeroed search keys back
into the ioctl loop and spun forever. The issue was worked around at the
time by disabling TREE_SEARCH_V2 entirely in btrfs-progs (d73e69824854:
"btrfs-progs: temporarily disable usage of v2 of search tree ioctl").

The kernel side already treats buf as a byte buffer, so change the
declaration to __u8[] to match the actual semantics and prevent similar
misuse in user space. The change is ABI compatible: both the structure size
and alignment are unchanged.
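
The scaling problem and the ABI claim can be checked with a small user
space sketch. The structs below are simplified, hypothetical stand-ins
(only a size field followed by the flexible array), not the real
struct btrfs_ioctl_search_args_v2, and the demo_* names are made up for
illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the two declarations of buf. */
struct demo_args_u64 {
	uint64_t buf_size;
	uint64_t buf[];
};

struct demo_args_u8 {
	uint64_t buf_size;
	uint8_t buf[];
};

/* Size, alignment and the offset of buf are identical in both layouts. */
static_assert(sizeof(struct demo_args_u64) == sizeof(struct demo_args_u8),
	      "size changed");
static_assert(alignof(struct demo_args_u64) == alignof(struct demo_args_u8),
	      "alignment changed");
static_assert(offsetof(struct demo_args_u64, buf) ==
	      offsetof(struct demo_args_u8, buf), "offset of buf changed");

/* Backing storage for the pointer arithmetic demo: 64 * 8 = 512 bytes. */
static uint64_t demo_storage[64];

/* Byte position reached when a byte offset is added to a u64 pointer. */
size_t demo_landed_u64(size_t byte_off)
{
	uint64_t *buf = demo_storage;

	assert(byte_off <= 64);	/* stay inside demo_storage */
	return (size_t)((unsigned char *)(buf + byte_off) -
			(unsigned char *)demo_storage);
}

/* Byte position reached when the same offset is added to a u8 pointer. */
size_t demo_landed_u8(size_t byte_off)
{
	uint8_t *buf = (uint8_t *)demo_storage;

	assert(byte_off <= sizeof(demo_storage));
	return (size_t)((buf + byte_off) - (uint8_t *)demo_storage);
}
```

With a byte offset of 32, demo_landed_u64() ends up at byte 256 while
demo_landed_u8() ends up at byte 32, which is the offset*8 mismatch
described above.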

Fixes: cc68a8a ("btrfs: new ioctl TREE_SEARCH_V2")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: You-Kai Zheng <ykzheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside btrfs we always pair -EUCLEAN error with an error message to
indicate which data is corrupted.

However there are 3 cases inside lzo decompression where there is no
error message for corrupted headers.

Add those missing error messages to show exactly where the corruption
is.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
A crafted btrfs image can trigger the following crash:

  BUG: unable to handle page fault for address: ffffd1dc42884000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  CPU: 9 UID: 0 PID: 1034 Comm: poc Not tainted 7.1.0-rc4-custom+ #383 PREEMPT(full)  46af0a92938a63be7132e0dfd71e62327c51d5c2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:memcpy+0xc/0x10
  Call Trace:
   <TASK>
   read_extent_buffer+0xe4/0x100 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
   btrfs_get_name+0x15e/0x1e0 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
   reconnect_path+0x165/0x390
   exportfs_decode_fh_raw+0x337/0x400
   ? drop_caches_sysctl_handler+0xb0/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:memcpy+0xc/0x10
  Kernel panic - not syncing: Fatal exception

[CAUSE]
The crafted image has the following corrupted INODE_REF item:

         item 9 key (258 INODE_REF 257) itemoff 11544 itemsize 4106
         	index 2 namelen 4096 name: d\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000

The itemsize matches the namelen, but the namelen is 4096, far larger
than the normal name length limit (BTRFS_NAME_LEN, 255).

Meanwhile the buffer for @name is only 255 bytes, so this causes an
out-of-bounds access and the above crash.

[FIX]
Add extra namelen verification for INODE_REF, just like what we have
done in ROOT_REF checks.

Now the crafted image can be rejected gracefully:

 BTRFS critical (device dm-2): corrupt leaf: root=5 block=30572544 slot=14 ino=259, invalid inode ref name length, has 4096 expect [1, 255]
 BTRFS error (device dm-2): read time tree block corruption detected on logical 30572544 mirror 2
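
A minimal user space sketch of this kind of check follows. It is not the
tree-checker code: demo_check_inode_ref() is a hypothetical name, and it
assumes each ref is laid out as an 8-byte index followed by a 2-byte
little-endian name length and then the name, which matches the itemsize
of 4106 (10 + 4096) in the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NAME_LEN	255	/* mirrors BTRFS_NAME_LEN from the text */
#define DEMO_REF_HDR	10	/* assumed: 8-byte index + 2-byte name_len */

/*
 * Hypothetical checker: walk all refs packed into one item.
 * Returns 0 if every ref is sane, -1 where corruption would be reported.
 */
int demo_check_inode_ref(const uint8_t *item, size_t item_size)
{
	size_t cur = 0;

	assert(item != NULL);
	if (item_size == 0)
		return -1;

	while (cur < item_size) {
		size_t left = item_size - cur;
		uint16_t namelen;

		/* The fixed-size header must fit in what is left. */
		if (left < DEMO_REF_HDR)
			return -1;

		/* name_len is stored little-endian after the index. */
		namelen = (uint16_t)(item[cur + 8] | (item[cur + 9] << 8));

		/* The new check: name length must be in [1, 255]. */
		if (namelen == 0 || namelen > DEMO_NAME_LEN)
			return -1;

		/* The name itself must also fit inside the item. */
		if (left - DEMO_REF_HDR < namelen)
			return -1;

		cur += DEMO_REF_HDR + namelen;
	}
	return 0;
}
```

An item like the crafted one passes a pure "does the name fit in the
item" test, since 10 + 4096 equals the item size, and is only caught by
the explicit upper bound on the name length.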

Reported-by: Xiang Mei <xmei5@asu.edu>
Link: https://lore.kernel.org/linux-btrfs/aik0hEV6ehKx6Ldv@Air.local/
Acked-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
[ Rebase, add a Link: tag, add a simple cause analysis ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
adam900710 and others added 4 commits June 22, 2026 19:49
V2 space cache has been the default mkfs option since btrfs-progs v5.15,
and commit 1e7bec1 ("btrfs: emit a warning about space cache v1
being deprecated") has already added a warning to show v1 space cache
has been deprecated.

It has been long enough that we should remove v1 space cache completely.

As the first step, disable v1 space cache by:

- Make the "space_cache" mount option fall back to "nospace_cache"

- Make "space_cache=v1" fall back to "nospace_cache"

  This is safer than forcing "space_cache=v2", as forcing v2 cache
  requires removal of v1 cache and regenerating v2 cache.
  Such an operation can be slow and takes extra metadata space, thus
  it is not always safe for existing filesystems.

With this done, a v1 cache mount will always fall back to no space cache,
and mount options will not be able to force v1 space cache usage.
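
The resulting option mapping can be sketched as below. This is only an
illustration of the fallback rules listed above, not the kernel's mount
option parser; the enum and demo_parse_space_cache() are hypothetical
names, and warning output is left out on purpose.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result of parsing one space cache related option. */
enum demo_cache_mode {
	DEMO_NO_SPACE_CACHE,
	DEMO_SPACE_CACHE_V2,
	DEMO_UNKNOWN_OPTION,
};

/* Map a single mount option string to the cache mode that ends up used. */
enum demo_cache_mode demo_parse_space_cache(const char *opt)
{
	assert(opt != NULL);

	if (strcmp(opt, "space_cache=v2") == 0)
		return DEMO_SPACE_CACHE_V2;

	/* Both spellings that used to select v1 now fall back. */
	if (strcmp(opt, "space_cache") == 0 ||
	    strcmp(opt, "space_cache=v1") == 0)
		return DEMO_NO_SPACE_CACHE;

	if (strcmp(opt, "nospace_cache") == 0)
		return DEMO_NO_SPACE_CACHE;

	return DEMO_UNKNOWN_OPTION;
}
```

Falling back to no space cache rather than to v2 keeps the mount cheap,
since nothing has to be removed or regenerated on disk.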

For example, even for a fs with v1 cache:

  # btrfs ins dump-super test.img
  superblock: bytenr=65536, device=test.img
  ---------------------------------------------------------
  csum_type		0 (crc32c)
  csum_size		4
  csum			0xdce44b2c [match]
  bytenr		65536
  flags			0x1
  			( WRITTEN )
  magic			_BHRfS_M [match]
  fsid			7d7c3bba-8211-4206-868d-10eedd5703f8
  metadata_uuid		00000000-0000-0000-0000-000000000000
  label
  generation		9
  root			30605312
  [...]
  compat_ro_flags	0x0                     <<< No FST feature
  incompat_flags	0x361
  			( MIXED_BACKREF |
  			  BIG_METADATA |
  			  EXTENDED_IREF |
  			  SKINNY_METADATA |
  			  NO_HOLES )
  cache_generation	9                       <<< Matches generation
  uuid_tree_generation	9

Mounting it will result in no space cache being used rather than the v1 space cache:

  # mount test.img /mnt/btrfs
  # dmesg -t | tail -n 5
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

Even forcing the v1 cache will not work; the mount falls back to the
usual nospace_cache:

  # mount test.img -o space_cache=v1 /mnt/btrfs
  # dmesg -t | tail -n 6
  BTRFS warning: v1 space cache is deprecated, fallback to no space cache
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

And there will be no way to force converting a v2 cache back to v1; such
an attempt will only clear the free space tree and fall back to no space
cache.

  # mkfs.btrfs -f -O fst,^bgt test.img
  # mount -o clear_cache,space_cache=v1 test.img /mnt/btrfs
  # dmesg -t | tail -n 11
  BTRFS warning: v1 space cache is deprecated, fallback to no space cache
  BTRFS: device fsid f59daad2-3ab5-4f33-b752-a36cfb09b674 devid 1 transid 8 /dev/loop0 (7:0) scanned by mount (1419)
  BTRFS info (device loop0): first mount of filesystem f59daad2-3ab5-4f33-b752-a36cfb09b674
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): rebuilding free space tree
  BTRFS info (device loop0): disabling free space tree
  BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  BTRFS info (device loop0): checking UUID tree
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): force clearing of disk cache
  # mount | grep /mnt/btrfs
  /mnt/test.img on /mnt/btrfs type btrfs (rw,relatime,discard=async,nospace_cache,subvolid=5,subvol=/)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit bac3c29 ("btrfs: remove 2K block size support") there
is no 2K block size support inside btrfs anymore.

Remove the stale comments of btrfs_supported_blocksize().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since v5.15 btrfs has support for block size < page size, but we still
only support 4K block size, while there is no special reason that we
cannot support 8K/16K/32K block sizes for 64K page size.

That 4K limit is completely arbitrary, and mostly to reduce test runtime
so we do not need to test all the extra block size combinations.

However that also limits user choice. Some users may understand what
they are doing and want larger block sizes.  In that case, the fixed
4K block size for the subpage routine is in the way.

Just remove that fixed 4K requirement for block size < page size.

This should not affect regular end users, since mkfs has used a 4K block
size by default for quite a while, and the existing bs == ps support
remains.

But for power users, this allows extra block size support, and may
provide extra test coverage.
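
The effect on the accepted block sizes can be sketched as follows. Both
functions are hypothetical simplifications written for this
illustration: they only model block size <= page size, assume a 4K
minimum block size, and are not the body of
btrfs_supported_blocksize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SZ_4K	4096u	/* assumed minimum block size */

static bool demo_is_power_of_2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Old rule as described above: only 4K, or block size == page size. */
bool demo_supported_old(uint32_t blocksize, uint32_t page_size)
{
	assert(demo_is_power_of_2(page_size));
	return blocksize == DEMO_SZ_4K || blocksize == page_size;
}

/* Relaxed rule: any power of two from 4K up to the page size. */
bool demo_supported_new(uint32_t blocksize, uint32_t page_size)
{
	assert(demo_is_power_of_2(page_size));
	return demo_is_power_of_2(blocksize) &&
	       blocksize >= DEMO_SZ_4K && blocksize <= page_size;
}
```

For a 64K page size, 8K, 16K and 32K are rejected by the old rule and
accepted by the relaxed one, while 4K and 64K are accepted by both.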

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Decentralize transaction aborts in create_reloc_root(), so that it is
obvious which call failed and what caused the transaction abort.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@kdave kdave force-pushed the for-next branch 2 times, most recently from 71c1ff7 to 876e956 Compare June 22, 2026 23:57
adam900710 and others added 3 commits June 24, 2026 14:04
When dumping a tree block, btrfs_header::owner is printed as
unsigned, which can result in numbers that are hard to read, e.g.:

  BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607

For the above output, 18446744073709551607 is (s64)-9, the root id of the
data reloc tree.

Apart from those predefined root ids that are already negative, existing
subvolume trees will never show negative values when the owner is
printed as signed, as subvolume ids can only use the lower 48 bits. So
there will be no output change for existing subvolumes, and thus no
extra confusion.
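
The difference can be reproduced in user space with a short sketch,
assuming the intent is to print the owner as a signed 64-bit value.
demo_format_owner() is a hypothetical helper, not the kernel's
print-tree code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a tree block owner as a signed value.
 * The unsigned to signed conversion is implementation-defined in C,
 * but wraps as expected on the usual two's complement targets.
 */
int demo_format_owner(char *out, size_t len, uint64_t owner)
{
	assert(out != NULL && len > 0);
	return snprintf(out, len, "%" PRId64, (int64_t)owner);
}
```

Formatting 18446744073709551607 this way gives "-9", while a regular
subvolume id such as 256 is printed unchanged.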

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…erge

On a zoned FS, btrfs_delayed_refs_rsv_refill() returns -EAGAIN whenever
the over-committed metadata plus the zone_unusable bytes exceeds the
usable size in a metadata block-group to avoid heavy over-commit of
metadata and early ENOSPC in one transaction.

If this happens during reclaim, the transaction gets aborted.

Treat -EAGAIN as a soft, retryable condition in case of block-group
reclaim.
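
The intended error classification can be sketched like this. The enum
and demo_reclaim_action() are hypothetical and only model the decision
described above, not the actual reclaim code path.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for one reclaim attempt. */
enum demo_action {
	DEMO_CONTINUE,		/* no error */
	DEMO_RETRY_LATER,	/* soft failure, try the block group again */
	DEMO_ABORT,		/* hard failure, abort the transaction */
};

/* Decide what to do with the return value seen during reclaim. */
enum demo_action demo_reclaim_action(int ret)
{
	assert(ret <= 0);	/* kernel style: 0 or negative errno */

	if (ret == 0)
		return DEMO_CONTINUE;
	if (ret == -EAGAIN)
		return DEMO_RETRY_LATER;
	return DEMO_ABORT;
}
```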

Reported-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: 7bcb04d ("btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The comment is wrong: it's not about storing the IDs of new directories
that were already created, but about storing utimes values for
directories (both new and existing). It was copy pasted from
SEND_MAX_DIR_CREATED_CACHE_SIZE and never updated afterwards.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
fdmanana and others added 4 commits June 24, 2026 17:59
…mon prefixes

In case the current inode's path is a prefix of the given path, the helper
is_current_inode_path() will return true, which causes the single caller
to reset the current inode's path. While this is not a functional issue,
it makes the caller recompute the current inode's path later. It could
also become a problem in the future in case we get new callers for
is_current_inode_path() in more sensitive contexts.

Example: the current inode path is "/foo/bar" and the path we compare
against is "/foo/bar_xyz".

Fix this by returning true only if we have exact matches.
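
The difference between the two comparisons can be shown with plain C
strings. This is a sketch only: the send code works on its own path
structure, and demo_prefix_match() and demo_exact_match() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Problematic variant: true whenever @cur is a prefix of @path. */
bool demo_prefix_match(const char *cur, const char *path)
{
	assert(cur != NULL && path != NULL);
	return strncmp(path, cur, strlen(cur)) == 0;
}

/* Fixed variant: true only if both paths are exactly the same. */
bool demo_exact_match(const char *cur, const char *path)
{
	size_t cur_len;

	assert(cur != NULL && path != NULL);
	cur_len = strlen(cur);
	return cur_len == strlen(path) && memcmp(cur, path, cur_len) == 0;
}
```

With "/foo/bar" as the current path and "/foo/bar_xyz" as the given
path, the prefix variant matches while the exact variant does not.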

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[BUG]
There is a syzbot report that the check inside get_new_location()
triggered:

 BTRFS info (device loop0): found 31 extents, stage: move data extents
 BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607
        item 0 key (256 INODE_ITEM 0) itemoff 3835 itemsize 160
                inode generation 5 transid 0 size 0 nbytes 0
                block group 0 mode 40755 links 1 uid 0 gid 0
                rdev 0 sequence 0 flags 0x0
                atime 1669132761.0
                ctime 1669132761.0
                mtime 1669132761.0
                otime 0.0
        item 1 key (256 INODE_REF 256) itemoff 3823 itemsize 12
                index 0 name_len 2
        item 2 key (258 INODE_ITEM 0) itemoff 3663 itemsize 160
                inode generation 1 transid 16 size 733184 nbytes 106496
                block group 0 mode 100600 links 0 uid 0 gid 0
                rdev 0 sequence 24 flags 0x18
        item 3 key (258 EXTENT_DATA 0) itemoff 3595 itemsize 68
                generation 16 type 0
                inline extent data size 47 ram_bytes 4096 compression 1
 [...]
        item 27 key (18446744073709551611 ORPHAN_ITEM 258) itemoff 2376 itemsize 0
 BTRFS error (device loop0): unexpected non-zero offset in file extent item for data reloc inode 258 key offset 0 offset 9277520992061368337
 ------------[ cut here ]------------
 btrfs_abort_should_print_stack(__error)

[CAUSE]
The above dump tree shows the first file extent item is inlined, which
should make no sense for data reloc inodes, as such inodes just
represent where the data extents are in the relocation destination chunk.

However the relocation path preallocates space for each block,
then dirties them, cluster by cluster.
It's possible to have a single block at the beginning of the block
group, and no other block in the same cluster.

So relocation will preallocate a file extent for that block and dirty
the first block.
Then memory pressure forces the data reloc inode to be written back, before
any other blocks are dirtied/allocated.

Finally commit 3eaf5f0 ("btrfs: extract inlined creation into a dedicated
delalloc helper") changed the sequence of delalloc. Before that commit we
always tried NOCOW first, so that dirtied block will be written back into
the preallocated space, and appear as a regular extent.

But with that commit, we always try inline first, and since compression
is forced, we try compressing the first block, and then inline the
compressed data, resulting in the above inlined file extent in the data
reloc tree.

Then the check in get_new_location() will check the file offset, without
checking if the file extent is inlined or not, resulting in the above
failure.

[FIX]
Do not allow compression for data reloc inodes.

Since data reloc inode sizes are always block aligned, as long as we do
not compress, @data_len will always be at least one block, and
that will cause can_cow_file_range_inline() to return false, thus no
inlined extent will be created.
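
The size argument can be illustrated with a one-condition sketch.
demo_can_inline() is hypothetical and models only the "data length must
be smaller than one block" rule mentioned above; the real
can_cow_file_range_inline() has further conditions that are left out
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, reduced check: the data length is the compressed size
 * if the range was compressed, otherwise the uncompressed size, and it
 * must be smaller than one block to be inlined.
 */
bool demo_can_inline(uint64_t size, uint64_t compressed_size,
		     uint32_t blocksize)
{
	uint64_t data_len = compressed_size ? compressed_size : size;

	assert(blocksize > 0);
	return data_len < blocksize;
}
```

One 4096 byte block compressed down to 47 bytes, as in the dump above,
passes this check, while the same block without compression does not.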

Reported-by: syzbot+d950c6ba09b79f6e1864@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a373dc5.764cf64f.168fbe.0001.GAE@google.com/
Fixes: 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper")
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Commit a6908f8 ("btrfs: validate data reloc tree file extent item
members") introduced extra checks on file extent items for data reloc
inodes, but it checks the file extent offset without checking if the file
extent is inlined.

This can lead to either false alerts (as the offset member is inside the
inlined data) or even reading beyond the item range.

This has already triggered a warning in a syzbot report.
Although the root fix is to avoid compression for data reloc inodes, for
the sake of consistency, reject inlined file extents first.

Fixes: a6908f8 ("btrfs: validate data reloc tree file extent item members")
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Btrfs' nodesize and sectorsize are both u32 values, so there is no need
to use u64 for local usage.

Furthermore some call sites also use "blocksize" or "bs" for sectorsize,
so change them to use the minimal type u32 as well.

Since we're here, also reorder those local variables so that they won't
cause extra holes in stack memory, and constify the
sectorsize/nodesize/blocksize/bs usage.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
adam900710 and others added 2 commits June 25, 2026 14:48
Btrfs does not support variable stripe length yet; all RAID0/5/6/10
chunks have a fixed stripe length of 64K for now.

Furthermore, btrfs_fs_info::stripesize is not the real chunk stripe
length; it always has the same value as sectorsize.

Remove btrfs_fs_info::stripesize, and for the only callsite utilizing
that member, replace it with fs_info->sectorsize instead.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
…etattr()

btrfs_getattr() unconditionally reads BTRFS_I(inode)->new_delalloc_bytes
and adds it (sector-aligned) to stat->blocks for every inode type.
However, new_delalloc_bytes lives in a union with last_dir_index_offset:

    union {
        u64 new_delalloc_bytes;     /* files only */
        u64 last_dir_index_offset;  /* directories only */
    };

For a directory inode this memory holds last_dir_index_offset, which is
set during directory logging (e.g. flush_dir_items_batch()) to the
offset of the last logged BTRFS_DIR_INDEX_KEY.  That offset grows with
the number of entries ever created in the directory (dir indexes are
monotonic and never reused), so it can be arbitrarily large.

As a result, after a directory has been logged (e.g. via an fsync that
triggers directory logging), btrfs_getattr() reports inflated st_blocks
for that directory.  The inflation is purely in-core and disappears
after the inode is evicted and reloaded (btrfs_alloc_inode() zeroes the
union), e.g. after a remount.

Reproducer (on a btrfs filesystem):

    D=/mnt/btrfs/d
    mkdir -p $D
    for i in $(seq 1 20000); do touch $D/f$i; done
    sync                      # commit, push dir index high
    touch $D/trigger          # dirty the dir in a new transaction
    xfs_io -c fsync $D        # log the directory -> sets last_dir_index_offset
    stat -c '%b' $D           # st_blocks is now inflated (e.g. 40)
    # umount + mount -> st_blocks drops back to the correct value

The evict path already knows this union is type-dependent and guards the
corresponding WARN_ON with !S_ISDIR() in btrfs_destroy_inode(); only
btrfs_getattr() was missing the equivalent check.

Only read new_delalloc_bytes for regular files, which are the only
inodes that ever set it.
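
The aliasing can be reproduced in user space with a reduced model.
struct demo_inode and the two functions below are hypothetical
stand-ins, not the btrfs structures, and the block count formula
(align to the block size, then count 512 byte units) is an assumption
made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced inode with the same kind of union. */
struct demo_inode {
	bool is_regular;
	uint64_t bytes;				/* bytes used on disk */
	union {
		uint64_t new_delalloc_bytes;	/* files only */
		uint64_t last_dir_index_offset;	/* directories only */
	};
};

static uint64_t demo_align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) / a * a;
}

/* Reads the union for every inode type, like the code before the fix. */
uint64_t demo_blocks_buggy(const struct demo_inode *inode, uint32_t bs)
{
	uint64_t delalloc = inode->new_delalloc_bytes;

	assert(bs > 0);
	return (demo_align_up(inode->bytes, bs) +
		demo_align_up(delalloc, bs)) >> 9;
}

/* Reads new_delalloc_bytes only for regular files. */
uint64_t demo_blocks_fixed(const struct demo_inode *inode, uint32_t bs)
{
	uint64_t delalloc = inode->is_regular ? inode->new_delalloc_bytes : 0;

	assert(bs > 0);
	return (demo_align_up(inode->bytes, bs) +
		demo_align_up(delalloc, bs)) >> 9;
}
```

For a directory whose last_dir_index_offset is around 20002, the buggy
variant rounds that value up to 20480 and reports 40 blocks with a 4K
block size, while the fixed variant reports 0.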

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
fdmanana added 6 commits June 26, 2026 18:24
…tion

While running fsstress with autodefrag and flushoncommit, hit a deadlock
due to the fact that defrag reserves delalloc space while it's holding
dirty and locked folios, besides the extent range lock. The stack traces
are the following:

   [430958.624136] task:kworker/u50:3   state:D stack:0     pid:20365 tgid:20365 ppid:2      task_flags:0x4208060 flags:0x00080000
   [430958.626267] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
   [430958.627821] Call Trace:
   [430958.628351]  <TASK>
   [430958.628990]  __schedule+0x4be/0x10f0
   [430958.629791]  ? preempt_count_add+0x69/0xa0
   [430958.630605]  schedule+0x26/0xd0
   [430958.631327]  wait_current_trans+0x102/0x160 [btrfs]
   [430958.632414]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.633515]  start_transaction+0x374/0x900 [btrfs]
   [430958.634601]  btrfs_commit_current_transaction+0x1d/0x70 [btrfs]
   [430958.635982]  flush_space+0xca/0x5e0 [btrfs]
   [430958.636996]  ? _raw_spin_unlock+0x15/0x30
   [430958.637894]  ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs]
   [430958.639217]  ? _raw_spin_unlock+0x15/0x30
   [430958.640030]  ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs]
   [430958.641462]  do_async_reclaim_metadata_space+0x84/0x190 [btrfs]
   [430958.642711]  btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs]
   [430958.644015]  process_one_work+0x19d/0x3a0
   [430958.644873]  worker_thread+0x1c4/0x330
   [430958.645668]  ? __pfx_worker_thread+0x10/0x10
   [430958.646535]  kthread+0xfc/0x130
   [430958.647285]  ? __pfx_kthread+0x10/0x10
   [430958.648068]  ret_from_fork+0x1f7/0x2c0
   [430958.648894]  ? __pfx_kthread+0x10/0x10
   [430958.649713]  ret_from_fork_asm+0x1a/0x30
   [430958.650536]  </TASK>
   [430958.651036] task:kworker/u49:7   state:D stack:0     pid:52990 tgid:52990 ppid:2      task_flags:0x4208060 flags:0x00080000
   [430958.653709] Workqueue: writeback wb_workfn (flush-btrfs-334)
   [430958.655110] Call Trace:
   [430958.655737]  <TASK>
   [430958.656284]  __schedule+0x4be/0x10f0
   [430958.657178]  ? __blk_flush_plug+0xe9/0x140
   [430958.658188]  schedule+0x26/0xd0
   [430958.658982]  io_schedule+0x42/0x70
   [430958.659850]  folio_wait_bit_common+0x12b/0x330
   [430958.660954]  ? folio_wait_bit_common+0x100/0x330
   [430958.662157]  ? __pfx_wake_page_function+0x10/0x10
   [430958.663328]  extent_write_cache_pages+0x599/0x830 [btrfs]
   [430958.664496]  ? acpi_fwnode_get_reference_args+0x1fa/0x270
   [430958.665579]  btrfs_writepages+0x77/0x130 [btrfs]
   [430958.666614]  ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   [430958.667846]  do_writepages+0xc6/0x160
   [430958.668596]  __writeback_single_inode+0x42/0x310
   [430958.669535]  writeback_sb_inodes+0x231/0x570
   [430958.670583]  wb_writeback+0x8a/0x340
   [430958.671383]  wb_workfn+0xbf/0x450
   [430958.672058]  ? finish_task_switch.isra.0+0xc1/0x350
   [430958.673026]  process_one_work+0x19d/0x3a0
   [430958.673814]  worker_thread+0x1c4/0x330
   [430958.674565]  ? __pfx_worker_thread+0x10/0x10
   [430958.675440]  kthread+0xfc/0x130
   [430958.676084]  ? __pfx_kthread+0x10/0x10
   [430958.676832]  ret_from_fork+0x1f7/0x2c0
   [430958.677582]  ? __pfx_kthread+0x10/0x10
   [430958.678369]  ret_from_fork_asm+0x1a/0x30
   [430958.679171]  </TASK>
   [430958.679644] task:btrfs-cleaner   state:D stack:0     pid:296750 tgid:296750 ppid:2      task_flags:0x208040 flags:0x00080000
   [430958.681812] Call Trace:
   [430958.682318]  <TASK>
   [430958.682762]  __schedule+0x4be/0x10f0
   [430958.683542]  schedule+0x26/0xd0
   [430958.684264]  handle_reserve_ticket+0x1b9/0x2c0 [btrfs]
   [430958.685366]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.686520]  reserve_bytes+0x283/0x4c0 [btrfs]
   [430958.687610]  btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs]
   [430958.688860]  btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs]
   [430958.690263]  btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs]
   [430958.691675]  btrfs_defrag_file+0x903/0x1110 [btrfs]
   [430958.692879]  btrfs_run_defrag_inodes+0x334/0x430 [btrfs]
   [430958.694005]  cleaner_kthread+0x97/0x1c0 [btrfs]
   [430958.694969]  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
   [430958.696232]  kthread+0xfc/0x130
   [430958.696954]  ? __pfx_kthread+0x10/0x10
   [430958.697763]  ret_from_fork+0x1f7/0x2c0
   [430958.698521]  ? __pfx_kthread+0x10/0x10
   [430958.699348]  ret_from_fork_asm+0x1a/0x30
   [430958.700217]  </TASK>
   [430958.716533] task:fsstress        state:D stack:0     pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000
   [430958.718780] Call Trace:
   [430958.719366]  <TASK>
   [430958.719817]  __schedule+0x4be/0x10f0
   [430958.720611]  ? preempt_count_add+0x69/0xa0
   [430958.721465]  schedule+0x26/0xd0
   [430958.722150]  wb_wait_for_completion+0x79/0xc0
   [430958.723109]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.724173]  __writeback_inodes_sb_nr+0xc5/0xf0
   [430958.725081]  try_to_writeback_inodes_sb+0x55/0x70
   [430958.726075]  btrfs_commit_transaction+0x19d/0xeb0 [btrfs]
   [430958.727337]  ? start_transaction+0x343/0x900 [btrfs]
   [430958.728422]  btrfs_mksubvol+0x28b/0x4e0 [btrfs]
   [430958.729445]  btrfs_mksnapshot+0x74/0xa0 [btrfs]
   [430958.730511]  __btrfs_ioctl_snap_create+0x194/0x210 [btrfs]
   [430958.732245]  btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs]
   [430958.733636]  btrfs_ioctl+0x7ec/0x2a70 [btrfs]
   [430958.734665]  ? __virt_addr_valid+0xe4/0x180
   [430958.735534]  ? __check_object_size+0x1cd/0x1f0
   [430958.736613]  ? kmem_cache_free+0x146/0x380
   [430958.737645]  ? _raw_spin_unlock+0x15/0x30
   [430958.738660]  ? do_sys_openat2+0x83/0xd0
   [430958.739637]  __x64_sys_ioctl+0x92/0xe0
   [430958.740576]  do_syscall_64+0x60/0x590
   [430958.741512]  ? clear_bhb_loop+0x60/0xb0
   [430958.742485]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [430958.743772] RIP: 0033:0x7f4431e108db
   [430958.744668] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [430958.746327] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db
   [430958.747816] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005
   [430958.749479] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
   [430958.751216] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0
   [430958.752929] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003
   [430958.754684]  </TASK>

What happens is the following:

1) The cleaner kthread is running autodefrag, and in defrag_one_range()
   it acquired all the folios for the range and locked them.

   Then it locked the extent range in the inode's iotree.

   It got two subranges from defrag_collect_targets(), the first one
   with folio A and the second one with folio B.

   After it defragged the first subrange, folio A remains locked and
   dirty - it's only unlocked when defrag_one_range() returns.

   When it attempts to defrag the second subrange (containing folio B),
   btrfs_delalloc_reserve_space() creates a space reservation ticket,
   due to lack of free metadata space, and blocks waiting for the async
   metadata reclaim task to free space and wake it up;

2) The async reclaim metadata task attempts to commit the current
   transaction, but it blocks because there is another task that
   started the commit first;

3) A task creating a snapshot is committing the transaction and
   because the fs was mounted with flushoncommit, it calls
   try_to_writeback_inodes_sb(), which spawns a task to flush
   delalloc and waits for it to complete;

4) The task flushing delalloc (kworker/u49:7), finds that folio A for
   the inode being defragged is dirty, so it tries to lock it...

   But it blocks because folio A is locked by the defrag task (the
   cleaner kthread) which is blocked waiting for the reservation
   ticket to be served, but the async reclaim metadata task is
   blocked waiting for the transaction commit, which in turn is
   blocked waiting for the delalloc flush task, which is trying to
   lock folio A, resulting in a deadlock.
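
The cycle described in steps 1 to 4 can be sketched as a wait-for graph. This is only an illustrative userspace model, not kernel code: the enum labels, the two graphs and is_deadlocked() are hypothetical names made up for the example. The "fixed" graph assumes the defrag task no longer holds folio A while it waits for its reservation, so the flush task is never blocked on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tasks from the description above (illustrative labels only). */
enum task {
	TASK_DEFRAG,        /* cleaner kthread running autodefrag */
	TASK_ASYNC_RECLAIM, /* async metadata reclaim task */
	TASK_COMMIT,        /* snapshot creation committing the transaction */
	TASK_FLUSH,         /* delalloc flush worker that wants folio A */
	NR_TASKS,
};

/* graph[t] is the task that t is blocked on, or -1 if t can make progress. */
const int waits_for_buggy[NR_TASKS] = {
	[TASK_DEFRAG]        = TASK_ASYNC_RECLAIM, /* reservation ticket */
	[TASK_ASYNC_RECLAIM] = TASK_COMMIT,        /* transaction commit */
	[TASK_COMMIT]        = TASK_FLUSH,         /* flushoncommit writeback */
	[TASK_FLUSH]         = TASK_DEFRAG,        /* lock on folio A */
};

/* Assumed effect of the fix: folio A is not locked while waiting for space. */
const int waits_for_fixed[NR_TASKS] = {
	[TASK_DEFRAG]        = TASK_ASYNC_RECLAIM,
	[TASK_ASYNC_RECLAIM] = TASK_COMMIT,
	[TASK_COMMIT]        = TASK_FLUSH,
	[TASK_FLUSH]         = -1,
};

/*
 * Follow the wait-for chain from @start. Each task waits on at most one other
 * task, so taking NR_TASKS steps without reaching a runnable task means some
 * task was visited twice, i.e. the chain contains a cycle.
 */
bool is_deadlocked(const int *graph, int start)
{
	int cur = start;

	assert(start >= 0 && start < NR_TASKS);
	for (size_t steps = 0; steps < NR_TASKS; steps++) {
		cur = graph[cur];
		if (cur < 0)
			return false;
	}
	return true;
}
```

Calling is_deadlocked(waits_for_buggy, TASK_DEFRAG) reports a cycle, while the same call on waits_for_fixed does not.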

The same type of problem can happen if the async reclaim task starts to
flush delalloc, as that requires both locking the folio and the extent
range in the inode's iotree, and in this case we don't need the fs to
be mounted with flushoncommit. This type of problem has occurred several
times in the past with reflinks for example, where we had a dirty folio
while holding the extent range locked and then starting a transaction
blocked waiting for the async reclaim task due to lack of free metadata
space.

So fix this by reserving delalloc space before locking folios and locking
the extent range in the inode's iotree. We cannot simply unlock the
folios for each subrange given by defrag_collect_targets() after we defrag
it because the same folio may also be present in the next subrange (due to
large folios).
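
The change in ordering can be illustrated with a small userspace model. It is a sketch under stated assumptions, not the patch itself: struct defrag_model and every function below are hypothetical stand-ins, and reserve_space() merely records whether a potentially blocking reservation was attempted while locks were held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state a defrag pass manipulates. */
struct defrag_model {
	bool space_reserved;
	bool folios_locked;
	bool extent_locked;
	bool blocked_with_locks; /* waited for space while holding locks */
};

/* Models a reservation that may block waiting for the reclaim task. */
void reserve_space(struct defrag_model *m)
{
	if (m->folios_locked || m->extent_locked)
		m->blocked_with_locks = true;
	m->space_reserved = true;
}

void lock_folios_and_extent(struct defrag_model *m)
{
	m->folios_locked = true;
	m->extent_locked = true;
}

void unlock_folios_and_extent(struct defrag_model *m)
{
	assert(m->folios_locked && m->extent_locked);
	m->extent_locked = false;
	m->folios_locked = false;
}

/* Old ordering: the reservation happens with folios and extent range locked. */
void defrag_range_old(struct defrag_model *m)
{
	lock_folios_and_extent(m);
	reserve_space(m);
	unlock_folios_and_extent(m);
}

/* New ordering: reserve first, so any wait happens with nothing locked. */
void defrag_range_new(struct defrag_model *m)
{
	reserve_space(m);
	lock_folios_and_extent(m);
	unlock_folios_and_extent(m);
}
```

In this model only defrag_range_old() ever sets blocked_with_locks, which is the condition that lets the flush task and the reclaim task wait on each other.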

Fixes: 22b398e ("btrfs: defrag: introduce helper to defrag a contiguous prepared range")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Syzbot reported the following warning recently:

   [  157.672472][ T6611] BTRFS info (device loop0): turning on flush-on-commit
   [  157.672488][ T6611] BTRFS info (device loop0): enabling free space tree
   [  157.672504][ T6611] BTRFS info (device loop0): enabling auto defrag
   [  157.672555][ T6611] BTRFS info (device loop0): use lzo compression, level 1
   [  157.672574][ T6611] BTRFS info (device loop0): max_inline set to 4096
   [  158.094512][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [  160.073968][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
   [  160.418911][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
   [  160.432287][ T6662] loop2: detected capacity change from 0 to 32768
   [  160.438859][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
   [  160.459589][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [  160.459616][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
   [  160.634366][ T1187] ------------[ cut here ]------------
   [  160.634376][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
   [  160.634387][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
   [  160.634412][ T1187] Modules linked in:
   [  160.634423][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
   [  160.634435][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
   [  160.634442][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
   [  160.634456][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
   [  160.634468][ T1187] Code: 53 a3 45 (...)
   [  160.634482][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
   [  160.634490][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
   [  160.634497][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
   [  160.634503][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
   [  160.634509][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
   [  160.634516][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
   [  160.634523][ T1187] FS:  0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
   [  160.634531][ T1187] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [  160.634537][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
   [  160.634547][ T1187] Call Trace:
   [  160.634551][ T1187]  <TASK>
   [  160.634560][ T1187]  btrfs_put_ordered_extent+0x18f/0x430
   [  160.634577][ T1187]  btrfs_finish_one_ordered+0xf63/0x2680
   [  160.634598][ T1187]  ? __pfx_btrfs_finish_one_ordered+0x10/0x10
   [  160.634611][ T1187]  ? do_raw_spin_lock+0x12b/0x2f0
   [  160.634622][ T1187]  ? lock_acquire+0x106/0x350
   [  160.634636][ T1187]  ? __pfx_do_raw_spin_lock+0x10/0x10
   [  160.634650][ T1187]  btrfs_work_helper+0x38b/0xc20
   [  160.634666][ T1187]  ? process_scheduled_works+0xa70/0x1860
   [  160.634679][ T1187]  process_scheduled_works+0xb5d/0x1860
   [  160.634703][ T1187]  ? __pfx_process_scheduled_works+0x10/0x10
   [  160.634716][ T1187]  ? assign_work+0x3d5/0x5e0
   [  160.634729][ T1187]  worker_thread+0xa53/0xfc0
   [  160.634752][ T1187]  kthread+0x388/0x470
   [  160.634765][ T1187]  ? __pfx_worker_thread+0x10/0x10
   [  160.635870][ T1187]  ? __pfx_kthread+0x10/0x10
   [  160.635891][ T1187]  ret_from_fork+0x514/0xb70
   [  160.635907][ T1187]  ? __pfx_ret_from_fork+0x10/0x10
   [  160.635917][ T1187]  ? __switch_to+0xc79/0x1410
   [  160.635934][ T1187]  ? __pfx_kthread+0x10/0x10
   [  160.635948][ T1187]  ret_from_fork_asm+0x1a/0x30
   [  160.635969][ T1187]  </TASK>
   [  160.635975][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...

It means a delayed iput was added after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.

This happens when using autodefrag and is more likely to happen if we use
flushoncommit too. The steps are the following:

1) Unmount starts, all delalloc is flushed and we enter close_ctree();

2) In close_ctree() we park the cleaner kthread, but while we wait for it
   to park, it's in:

     btrfs_run_defrag_inodes()
        btrfs_run_defrag_inode()
           btrfs_defrag_file()
              defrag_one_cluster()
                 defrag_one_range()
                    defrag_one_locked_target()

   And dirties some folios from an inode;

3) The cleaner kthread parks and we proceed in close_ctree(), waiting
   for all ordered extents, running delayed iputs and setting the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;

4) Later in close_ctree() we call btrfs_commit_super(), which commits the
   current transaction. Because we are mounted with flushoncommit, the
   transaction commit flushes delalloc and waits for the resulting ordered
   extents to complete;

5) The ordered extents from the flushed delalloc created by autodefrag
   complete and create delayed iputs, triggering the warning:

     WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));

   in btrfs_add_delayed_iput();

6) Further below in close_ctree() we will hit the following assertion:

     ASSERT(list_empty(&fs_info->delayed_iputs));

   Since we don't expect any more delayed iputs.

Fix this by flushing delalloc and waiting for the ordered extents in
close_ctree(), right after we have parked the cleaner kthread and waited
for autodefrag.
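
The sequence above can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: struct fs_model and all the functions are hypothetical, and it assumes that every flushed dirty folio ends up producing exactly one delayed iput.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state that matters during unmount. */
struct fs_model {
	int dirty_folios;     /* delalloc left behind by autodefrag */
	int delayed_iputs;    /* pending delayed iputs */
	bool no_delayed_iput; /* models BTRFS_FS_STATE_NO_DELAYED_IPUT */
	bool warned;          /* models the WARN_ON_ONCE() firing */
};

void add_delayed_iput(struct fs_model *m)
{
	if (m->no_delayed_iput)
		m->warned = true;
	m->delayed_iputs++;
}

/* Assumption: each flushed folio completes an ordered extent that adds an iput. */
void flush_delalloc_and_wait(struct fs_model *m)
{
	assert(m->dirty_folios >= 0);
	while (m->dirty_folios > 0) {
		m->dirty_folios--;
		add_delayed_iput(m);
	}
}

void run_delayed_iputs(struct fs_model *m)
{
	m->delayed_iputs = 0;
}

/* Returns true if unmount finishes with no warning and no leftover iputs. */
bool unmount_model(struct fs_model *m, bool flush_after_park)
{
	/* Step 2: autodefrag dirties folios while the cleaner is parking. */
	m->dirty_folios += 2;

	/* The fix: flush and wait right after the cleaner has parked. */
	if (flush_after_park)
		flush_delalloc_and_wait(m);

	/* Step 3: run delayed iputs, then forbid new ones. */
	run_delayed_iputs(m);
	m->no_delayed_iput = true;

	/* Step 4: the final commit flushes delalloc due to flushoncommit. */
	flush_delalloc_and_wait(m);

	return !m->warned && m->delayed_iputs == 0;
}
```

With flush_after_park set to false the model reproduces both the warning and the non-empty delayed iput list; with it set to true the final commit finds nothing left to flush.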

Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
There's no need to have one loop to defrag each subrange and then another
one to free each subrange (struct defrag_target_range).
We can do it in a single loop, freeing each subrange after defragging,
plus no need to delete each subrange from the list since we immediately
free it.
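
As a rough illustration of the single-loop pattern, here is a userspace sketch with a plain singly linked list. The struct and function names are hypothetical, and summing the lengths merely stands in for the real per-subrange work; kernel code would normally walk a struct list_head with list_for_each_entry_safe() instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a subrange entry. */
struct target_range {
	unsigned long start;
	unsigned long len;
	struct target_range *next;
};

/* Prepend a new entry; on allocation failure the list is left unchanged. */
struct target_range *push_range(struct target_range *head,
				unsigned long start, unsigned long len)
{
	struct target_range *r = malloc(sizeof(*r));

	if (!r)
		return head;
	r->start = start;
	r->len = len;
	r->next = head;
	return r;
}

/*
 * Process and free every entry in one pass. The next pointer is saved
 * before the entry is freed, and no unlinking is needed because the
 * whole list is being torn down.
 */
unsigned long process_and_free(struct target_range *head)
{
	unsigned long total = 0;

	while (head) {
		struct target_range *next = head->next;

		total += head->len; /* stand-in for the per-subrange work */
		free(head);
		head = next;
	}
	return total;
}
```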

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of
them in a very specific error path.
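
The idea behind scope-based freeing can be sketched in userspace with the GCC/Clang cleanup attribute. AUTO_FREE(), auto_free() and sum_first() are hypothetical names for this example and only mirror the concept; they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

int nr_frees; /* counts real frees so the behaviour can be observed */

/* Cleanup handlers receive a pointer to the variable going out of scope. */
void auto_free(void *p)
{
	void *buf = *(void **)p;

	if (buf)
		nr_frees++;
	free(buf);
}

/* Declares a pointer that is freed automatically when it leaves scope. */
#define AUTO_FREE(name) *name __attribute__((cleanup(auto_free))) = NULL

/* Sums 1..n through a temporary array; no return path needs a free(). */
int sum_first(size_t n, long *out)
{
	long AUTO_FREE(vals);

	assert(out);
	if (n == 0)
		return -EINVAL; /* vals is still NULL, nothing to free */

	vals = calloc(n, sizeof(*vals));
	if (!vals)
		return -ENOMEM;

	*out = 0;
	for (size_t i = 0; i < n; i++) {
		vals[i] = (long)i + 1;
		*out += vals[i];
	}
	return 0; /* vals is freed here by the cleanup handler */
}
```

Because the pointer starts out as NULL, early returns before the allocation stay safe, which is what makes dropping the explicit calls in error paths possible.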

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When freeing the entries from the list there is no need to initialize
the list member in an entry, since we are immediately freeing it. So use
simple list_del() instead of list_del_init().
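
To show why re-initializing the list member is wasted work when the entry is about to be freed, here is a minimal userspace reimplementation of a circular doubly linked list. All mini_* names and struct item are hypothetical; the real kernel list_del() additionally poisons the removed entry's pointers, which this sketch does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct mini_list { struct mini_list *next, *prev; };

void mini_list_init(struct mini_list *h)
{
	h->next = h;
	h->prev = h;
}

void mini_list_add_tail(struct mini_list *n, struct mini_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink only: the entry's own pointers are left stale. */
void mini_list_del(struct mini_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Unlink and make the entry a valid empty list again (two extra stores). */
void mini_list_del_init(struct mini_list *e)
{
	mini_list_del(e);
	mini_list_init(e);
}

struct item {
	int value;
	struct mini_list link;
};

bool add_item(struct mini_list *head, int value)
{
	struct item *it = malloc(sizeof(*it));

	if (!it)
		return false;
	it->value = value;
	mini_list_add_tail(&it->link, head);
	return true;
}

/* Free every entry; a plain delete suffices since each entry is freed next. */
int free_all(struct mini_list *head)
{
	int freed = 0;

	assert(head);
	while (head->next != head) {
		struct mini_list *e = head->next;
		struct item *it = (struct item *)((char *)e - offsetof(struct item, link));

		mini_list_del(e);
		free(it);
		freed++;
	}
	return freed;
}
```

mini_list_del_init() only pays off when the entry may later be tested for emptiness or deleted again, which cannot happen once it has been freed.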

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
There's no need to call list_del_init() against each entry when freeing
the list, as the list is local and we are freeing the entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>