Test for-next ARM64 64K (regular, SELF)#1630
Open
kdave wants to merge 10000 commits into
…ck groups

A swap file on btrfs pins down the block groups that cover the swap file extents. Pinned down block groups are skipped by scrub and relocation.

This degradation of critical btrfs maintenance operations has never been properly communicated to end users, and has already caused problems including:

- Scrub finished too quickly
  Because the enabled swap file has pinned down most of the block groups, any file extents in those block groups, even those not used by the swap file, are skipped by scrub.

- Unbalanced data and metadata usage, while relocation won't help
  For the same reason, pinned down block groups are not considered as relocation targets, thus data extents that are not used by the swap file can still be skipped by relocation.

Although we already have kernel messages for both scrub and balance, the balance one is still info level.

To better communicate those potential long term problems, add the following output into dmesg:

- Change the message level to warn for __btrfs_balance()

- Total pinned down block group number and size during swapfile activation
- Total released block group number and size during swapfile deactivation
  The above messages have info level.

- The fact that pinned down block groups will not be scrubbed nor balanced
  The above message has warning level.

The example output would look like the following, for enabling a 1.2G swapfile, which pinned down 2G of block groups:

  BTRFS info (device dm-3): swapfile activated on root 5 ino 257, pinned down 2147483648 bytes from 2 block group(s)
  BTRFS warning (device dm-3): block groups with swapfile extents will not be scrubbed or balanced
  Adding 1257468k swap on /mnt/btrfs/foobar. Priority:-1 extents:1 across:1257468k
  BTRFS info (device dm-3): swapfile deactivated on root 5 ino 257, released 2147483648 bytes from 2 block group(s)

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
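The numbers in the example output (a roughly 1.2G swapfile pinning 2 block groups, 2147483648 bytes in total) follow from simple range arithmetic. The sketch below assumes fixed 1 GiB block groups laid out back to back and a single contiguous swap extent starting at a block group boundary; `pinned_block_groups()` is a hypothetical helper for illustration, not a kernel function, and real block group sizes vary.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of fixed-size block groups that the
 * contiguous extent [start, start + len) overlaps.
 */
static uint64_t pinned_block_groups(uint64_t start, uint64_t len,
				    uint64_t bg_size)
{
	if (len == 0 || bg_size == 0)
		return 0;
	/* Index of the last touched group minus the first, inclusive. */
	return (start + len - 1) / bg_size - start / bg_size + 1;
}
```

With `len = 1257468 * 1024` and `bg_size = 1 << 30`, the extent crosses one block group boundary, so two whole block groups are pinned even though the swapfile itself is much smaller than 2 GiB.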
The variable-sized buffer buf in struct btrfs_ioctl_search_args_v2 is declared as __u64[], but it holds a packed byte stream of search results, where all offsets into the buffer are in bytes. Declaring buf as __u64[] makes it easy for user space to write incorrect pointer arithmetic: adding a byte offset directly to a __u64 pointer scales the offset by 8, landing at byte position offset*8 instead of offset. This recently caused an infinite loop in btrfs-progs: the accessor read all-zero data from misaddressed items, which fed zeroed search keys back into the ioctl loop and spun forever. The issue was worked around at the time by disabling TREE_SEARCH_V2 entirely in btrfs-progs (d73e69824854: "btrfs-progs: temporarily disable usage of v2 of search tree ioctl"). The kernel side already treats buf as a byte buffer, so change the declaration to __u8[] to match the actual semantics and prevent similar misuse in user space. The change is ABI compatible: both the structure size and alignment are unchanged. Fixes: cc68a8a ("btrfs: new ioctl TREE_SEARCH_V2") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: You-Kai Zheng <ykzheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
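The pointer scaling problem described above can be shown in a few lines of user-space C. This is a minimal sketch: `struct args_u64` and `struct args_u8` are simplified stand-ins for the old and new declarations (the real structure also carries a search key), and the two `landed_at_*()` helpers are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the old and the new buffer declaration. */
struct args_u64 { uint64_t buf_size; uint64_t buf[]; };
struct args_u8  { uint64_t buf_size; uint8_t  buf[]; };

/* The layout does not change: same size, alignment and buf offset. */
_Static_assert(sizeof(struct args_u64) == sizeof(struct args_u8), "size");
_Static_assert(_Alignof(struct args_u64) == _Alignof(struct args_u8), "align");
_Static_assert(offsetof(struct args_u64, buf) == offsetof(struct args_u8, buf),
	       "buf offset");

/*
 * Buggy pattern: adding a byte offset to a uint64_t pointer advances by
 * byte_off * 8 bytes. The caller must keep the result inside the buffer.
 */
static size_t landed_at_u64(const uint64_t *buf, size_t byte_off)
{
	const uint64_t *item = buf + byte_off;

	return (size_t)((const unsigned char *)item -
			(const unsigned char *)buf);
}

/* Correct pattern: a byte pointer advances by exactly byte_off bytes. */
static size_t landed_at_u8(const uint8_t *buf, size_t byte_off)
{
	const uint8_t *item = buf + byte_off;

	return (size_t)(item - buf);
}
```

For a byte offset of 32, the first helper lands at byte 256 while the second lands at byte 32, which is the misaddressing the message describes.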
Inside btrfs we always pair -EUCLEAN error with an error message to indicate which data is corrupted. However there are 3 cases inside lzo decompression where there is no error message for corrupted headers. Add those missing error messages to show exactly where the corruption is. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[BUG] A crafted btrfs image can trigger the following crash: BUG: unable to handle page fault for address: ffffd1dc42884000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page CPU: 9 UID: 0 PID: 1034 Comm: poc Not tainted 7.1.0-rc4-custom+ #383 PREEMPT(full) 46af0a92938a63be7132e0dfd71e62327c51d5c2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:memcpy+0xc/0x10 Call Trace: <TASK> read_extent_buffer+0xe4/0x100 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f] btrfs_get_name+0x15e/0x1e0 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f] reconnect_path+0x165/0x390 exportfs_decode_fh_raw+0x337/0x400 ? drop_caches_sysctl_handler+0xb0/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- RIP: 0010:memcpy+0xc/0x10 Kernel panic - not syncing: Fatal exception [CAUSE] TThe crafted image has the following corrupted INODE_REF item: item 9 key (258 INODE_REF 257) itemoff 11544 itemsize 4106 index 2 namelen 4096 name: d\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\00
0 The itemsize matches the namelen, but the namelen is 4096, way larger than normal name length limit (BTRFS_NAME_LEN, 255). Meanwhile the memory of the @name is only 255 byte sized, this will cause out-of-boundary access, and cause the above crash. [FIX] Add extra namelen verification for INODE_REF, just like what we have done in ROOT_REF checks. Now the crafted image can be rejected gracefully: BTRFS critical (device dm-2): corrupt leaf: root=5 block=30572544 slot=14 ino=259, invalid inode ref name length, has 4096 expect [1, 255] BTRFS error (device dm-2): read time tree block corruption detected on logical 30572544 mirror 2 Reported-by: Xiang Mei <xmei5@asu.edu> Link: https://lore.kernel.org/linux-btrfs/aik0hEV6ehKx6Ldv@Air.local/ Acked-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Weiming Shi <bestswngs@gmail.com> [ Rebase, add a Link: tag, add an simple cause analyze ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
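The range check named in the new error message ("expect [1, 255]") can be sketched as follows. This is not the tree-checker code: `inode_ref_namelen_ok()` is a hypothetical helper, it only looks at a single ref inside the item, and the 10-byte header size is an assumption derived from the dump above (itemsize 4106 minus namelen 4096).

```c
#include <stdbool.h>
#include <stdint.h>

#define NAME_LEN_MAX 255u /* BTRFS_NAME_LEN, per the message above */
#define REF_HDR_SIZE 10u  /* assumed from the dump: 4106 - 4096 */

/*
 * Hypothetical check for one ref in an INODE_REF item: the name length
 * must be within [1, NAME_LEN_MAX] and must fit inside the item.
 */
static bool inode_ref_namelen_ok(uint32_t item_size, uint32_t namelen)
{
	if (namelen < 1 || namelen > NAME_LEN_MAX)
		return false;
	return REF_HDR_SIZE + namelen <= item_size;
}
```

The crafted item (itemsize 4106, namelen 4096) passes the "fits inside the item" test but fails the upper bound, which is why the size check alone was not enough.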
V2 space cache has been the default mkfs option since btrfs-progs v5.15, and commit 1e7bec1 ("btrfs: emit a warning about space cache v1 being deprecated") has already added a warning to show v1 space cache has been deprecated. It has been long enough that we should remove v1 space cache completely. As the first step, disable v1 space cache by: - Make "space_cache" mount option fallback to "nospace_cache" - Make "space_cache=v1" fall back to "nospace_cache" This is safer than forcing "space_cache=v2", as forcing v2 cache requires removal of v1 cache and regenerating v2 cache. Such operation can be slow, and takes extra metadata space, thus it is not always safe for existing filesystems. With this done, v1 cache mount will always fallback to nospace cache, and mount option will not be able to force v1 space cache usage. For example, even for a fs with v1 cache: # btrfs ins dump-super test.img superblock: bytenr=65536, device=test.img --------------------------------------------------------- csum_type 0 (crc32c) csum_size 4 csum 0xdce44b2c [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 metadata_uuid 00000000-0000-0000-0000-000000000000 label generation 9 root 30605312 [...] 
compat_ro_flags 0x0 <<< No FST feature
incompat_flags 0x361 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES )
cache_generation 9 <<< Matches generation
uuid_tree_generation 9

Attempting to mount it will result in no space cache rather than v1 space cache:

  # mount test.img /mnt/btrfs
  # dmesg -t | tail -n 5
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

Even forcing v1 cache will not work; it falls back to the usual nospace_cache:

  # mount test.img -o space_cache=v1 /mnt/btrfs
  # dmesg -t | tail -n 6
  BTRFS warning: v1 space cache is deprecated, fallback to no space cache
  BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
  BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
  BTRFS info (device loop0): using crc32c checksum algorithm
  BTRFS info (device loop0): turning on async discard
  BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

And there will be no way to force converting a v2 cache back to v1; such an attempt will only clear the free space tree and fall back to no space cache.
# mkfs.btrfs -f -O fst,^bgt test.img # mount -o clear_cache,space_cache=v1 test.img /mnt/btrfs # dmesg -t | tail -n 11 BTRFS warning: v1 space cache is deprecated, fallback to no space cache BTRFS: device fsid f59daad2-3ab5-4f33-b752-a36cfb09b674 devid 1 transid 8 /dev/loop0 (7:0) scanned by mount (1419) BTRFS info (device loop0): first mount of filesystem f59daad2-3ab5-4f33-b752-a36cfb09b674 BTRFS info (device loop0): using crc32c checksum algorithm BTRFS info (device loop0): rebuilding free space tree BTRFS info (device loop0): disabling free space tree BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) BTRFS info (device loop0): checking UUID tree BTRFS info (device loop0): turning on async discard BTRFS info (device loop0): force clearing of disk cache # mount | grep /mnt/btrfs /mnt/test.img on /mnt/btrfs type btrfs (rw,relatime,discard=async,nospace_cache,subvolid=5,subvol=/) Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
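The option mapping described above can be summarized in a small table-like function. This is only a sketch of the resulting behaviour, assuming the option strings are compared verbatim; `resolve_space_cache()` and `enum cache_mode` are hypothetical names, not the kernel's mount option parser.

```c
#include <string.h>

/* Hypothetical result type for this illustration. */
enum cache_mode { CACHE_INVALID = -1, CACHE_NONE, CACHE_V2 };

/* Map a space cache mount option to the mode that ends up in effect. */
static enum cache_mode resolve_space_cache(const char *opt)
{
	if (strcmp(opt, "nospace_cache") == 0)
		return CACHE_NONE;
	if (strcmp(opt, "space_cache=v2") == 0)
		return CACHE_V2;
	/* Deprecated v1 requests no longer select v1, they fall back. */
	if (strcmp(opt, "space_cache") == 0 ||
	    strcmp(opt, "space_cache=v1") == 0)
		return CACHE_NONE;
	return CACHE_INVALID;
}
```

Falling back to no cache rather than to v2 matches the reasoning in the message: no conversion work is triggered on an existing filesystem.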
Since commit bac3c29 ("btrfs: remove 2K block size support") there is no 2K block size support inside btrfs anymore. Remove the stale comments of btrfs_supported_blocksize(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Since v5.15 btrfs has support for block size < page size, but we still only support 4K block size, while there is no special reason that we cannot support 8K/16K/32K block sizes for 64K page size. That 4K limit is completely arbitrary, and mostly to reduce test runtime so we do not need to test all the extra block size combinations. However that also limits the user choices, some users may understand what they are doing, and want larger block sizes. In that case, fixed 4K block size for subpage routine is blocking our way. Just remove that fixed 4K requirement for block size < page size. This should not affect regular end users, since mkfs is already using 4K block size as default for quite a while, and the existing bs == ps support is always there. But for power users, this allows extra block size support, and may provide extra test coverage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
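The difference between the old and the relaxed rule can be sketched as two predicates. This is an illustration under stated assumptions, not the body of btrfs_supported_blocksize(): both function names are hypothetical, only the block size <= page size case is modelled, and the power-of-two requirement is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_BLOCKSIZE 4096u /* the 4K block size named above */

static bool is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Old rule as described above: below the page size, only 4K is accepted. */
static bool bs_supported_old(uint32_t bs, uint32_t page_size)
{
	return bs == page_size || bs == MIN_BLOCKSIZE;
}

/* Relaxed rule (assumed shape): any power of two from 4K to the page size. */
static bool bs_supported_new(uint32_t bs, uint32_t page_size)
{
	return is_pow2(bs) && bs >= MIN_BLOCKSIZE && bs <= page_size;
}
```

On a 64K page size system, 8K, 16K and 32K are rejected by the first predicate and accepted by the second, while 4K and 64K are accepted by both.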
Decentralize transaction aborts in create_reloc_root(), so that it is obvious which call failed and what caused the transaction abort. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
When dumping a tree block, btrfs_header::owner is printed as unsigned, which can result in numbers that are hard to read, e.g.: BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607 For the above output, 18446744073709551607 is (s64)-9, the root id of data reloc tree. Despite those predefined root ids that are already negative, existing subvolume trees will not have any negative values, as subvolume trees can only utilize the lower 48 bits, so there will be no output change for existing subvolumes, thus no extra confusion. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
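The readability difference is easy to reproduce in user space. The sketch below uses a hypothetical `format_owner()` helper and assumes the usual two's complement conversion from uint64_t to int64_t (implementation-defined before C23, but universal on Linux targets).

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a tree owner either as unsigned (old
 * output) or as signed (new output). Returns snprintf()'s result.
 */
static int format_owner(char *out, size_t len, uint64_t owner, int as_signed)
{
	if (as_signed)
		return snprintf(out, len, "%" PRId64, (int64_t)owner);
	return snprintf(out, len, "%" PRIu64, owner);
}
```

Formatting 18446744073709551607 as signed yields "-9", while a small positive id such as 256 prints identically in both modes, so output for ordinary subvolume trees does not change.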
…erge On a zoned FS, btrfs_delayed_refs_rsv_refill() returns -EAGAIN whenever the over-committed metadata plus the zone_unusable bytes exceeds the usable size in a metadata block-group to avoid heavy over-commit of metadata and early ENOSPC in one transaction. If this happens while doing reclaim, the transaction is getting aborted. Treat -EAGAIN as a soft, retryable condition in case of block-group reclaim. Reported-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 7bcb04d ("btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The comment is wrong: it is not about storing the ID of new directories that were already created, but about storing utimes values for directories (both new and existing). It was copy-pasted from SEND_MAX_DIR_CREATED_CACHE_SIZE and never updated afterwards. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
…mon prefixes In case the current inode's path is a prefix of the given path, the helper is_current_inode_path() will return true, which causes the single caller to reset the current inode's path. While this is not a functional issue, it makes the caller recompute the current inode's path later. It could also become a problem in the future in case we get new callers for is_current_inode_path() in more sensitive contexts. Example: the current inode path is "/foo/bar" and the path we compare against is "/foo/bar_xyz". Fix this by returning true only if we have exact matches. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
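The "/foo/bar" versus "/foo/bar_xyz" example can be reproduced with plain C strings. This is a sketch of the two comparison styles only; the function names are hypothetical and the send code works on its own path structures rather than NUL-terminated strings.

```c
#include <stdbool.h>
#include <string.h>

/* Prefix-style comparison: true whenever cur is a prefix of path. */
static bool path_matches_prefix(const char *cur, const char *path)
{
	return strncmp(cur, path, strlen(cur)) == 0;
}

/* Exact comparison: the lengths must match as well as the bytes. */
static bool path_matches_exact(const char *cur, const char *path)
{
	size_t cur_len = strlen(cur);

	return strlen(path) == cur_len && memcmp(cur, path, cur_len) == 0;
}
```

With the paths from the message, the prefix-style comparison reports a match and the exact one does not.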
[BUG] There is a syzbot report that the check inside get_new_location() triggered: BTRFS info (device loop0): found 31 extents, stage: move data extents BTRFS info (device loop0): leaf 8908800 gen 16 total ptrs 28 free space 1676 owner 18446744073709551607 item 0 key (256 INODE_ITEM 0) itemoff 3835 itemsize 160 inode generation 5 transid 0 size 0 nbytes 0 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x0 atime 1669132761.0 ctime 1669132761.0 mtime 1669132761.0 otime 0.0 item 1 key (256 INODE_REF 256) itemoff 3823 itemsize 12 index 0 name_len 2 item 2 key (258 INODE_ITEM 0) itemoff 3663 itemsize 160 inode generation 1 transid 16 size 733184 nbytes 106496 block group 0 mode 100600 links 0 uid 0 gid 0 rdev 0 sequence 24 flags 0x18 item 3 key (258 EXTENT_DATA 0) itemoff 3595 itemsize 68 generation 16 type 0 inline extent data size 47 ram_bytes 4096 compression 1 [...] item 27 key (18446744073709551611 ORPHAN_ITEM 258) itemoff 2376 itemsize 0 BTRFS error (device loop0): unexpected non-zero offset in file extent item for data reloc inode 258 key offset 0 offset 9277520992061368337 ------------[ cut here ]------------ btrfs_abort_should_print_stack(__error) [CAUSE] The above dump tree shows the first file extent item is inlined, which should make no sense for data reloc inodes, as such inodes just represent where the data extents are in the relocation destination chunk. However the relocation path preallocates space for each block, then dirties them, cluster by cluster. It's possible to have a single block at the beginning of the block group, and no other block in the same cluster. So relocation will preallocate a file extent for that block and dirty the first block. Then memory pressure forces the data reloc inode to be written back, before any other blocks are dirtied/allocated. Finally commit 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper") changed the sequence of delalloc. 
Before that commit we always tried NOCOW first, so that dirtied block will be written back into the preallocated space, and appear as a regular extent. But with that commit, we always try inline first, and since compression is forced, we try compressing the first block, and then inline the compressed data, resulting in the above inlined file extent in the data reloc tree. Then the check in get_new_location() will check the file offset, without checking if the file extent is inlined or not, resulting in the above failure. [FIX] Do not allow compression for data reloc inodes. Since data reloc inode sizes are always block aligned, as long as we do not compress, @data_len will always be at least one block, and that will cause can_cow_file_range_inline() to return false, thus no inlined extent will be created. Reported-by: syzbot+d950c6ba09b79f6e1864@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6a373dc5.764cf64f.168fbe.0001.GAE@google.com/ Fixes: 3eaf5f0 ("btrfs: extract inlined creation into a dedicated delalloc helper") Cc: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>
Commit a6908f8 ("btrfs: validate data reloc tree file extent item members") introduced extra checks on file extent items for data reloc inodes, but it checks the file extent offset without checking if the file extent is inlined. This can lead to either false alerts (as the offset member is inside the inlined data) or even reading beyond the item range. This has already triggered a warning in a syzbot report. Although the root fix is to avoid compression for data reloc inodes, for the sake of consistency, reject inlined file extents first. Fixes: a6908f8 ("btrfs: validate data reloc tree file extent item members") Cc: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>
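The ordering issue can be sketched with a toy structure. This is an illustration only: `struct toy_file_extent` is not the on-disk layout and `check_reloc_extent()` is a hypothetical helper. The point is that the offset member is only meaningful for non-inline extents, so the type has to be tested first.

```c
#include <stdint.h>

/* Extent type values; type 0 is the inline type shown in the dump above. */
enum { FE_INLINE = 0, FE_REG = 1, FE_PREALLOC = 2 };

enum fe_verdict { FE_OK, FE_BAD_INLINE, FE_BAD_OFFSET };

/* Toy model of a file extent item. */
struct toy_file_extent {
	uint8_t type;
	uint64_t offset; /* only meaningful when type != FE_INLINE */
};

/* Reject inline extents before interpreting any non-inline member. */
static enum fe_verdict check_reloc_extent(const struct toy_file_extent *fe)
{
	if (fe->type == FE_INLINE)
		return FE_BAD_INLINE;
	if (fe->offset != 0)
		return FE_BAD_OFFSET;
	return FE_OK;
}
```

An inline extent whose data bytes happen to decode as a huge "offset" is then reported as an unexpected inline extent instead of as a bogus non-zero offset.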
Btrfs' nodesize and sectorsize are both u32 values, so there is no need to use u64 for local usage. Furthermore, some call sites also use "blocksize" or "bs" for sectorsize; change them to use the minimal type u32 as well. Since we're here, also reorder those local variables so that they won't cause extra holes in stack memory, and constify the sectorsize/nodesize/blocksize/bs usage. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com>
Btrfs does not support variable stripe length yet, all RAID0/5/6/10 chunks have the fixed stripe length 64K for now. Furthermore, btrfs_fs_info::stripesize is not the real chunk stripe length, it's always the same value as sectorsize. Remove btrfs_fs_info::stripesize, and for the only callsite utilizing that member, replace it with fs_info->sectorsize instead. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>
…etattr()
btrfs_getattr() unconditionally reads BTRFS_I(inode)->new_delalloc_bytes
and adds it (sector-aligned) to stat->blocks for every inode type.
However, new_delalloc_bytes lives in a union with last_dir_index_offset:
    union {
            u64 new_delalloc_bytes;      /* files only */
            u64 last_dir_index_offset;   /* directories only */
    };
For a directory inode this memory holds last_dir_index_offset, which is
set during directory logging (e.g. flush_dir_items_batch()) to the
offset of the last logged BTRFS_DIR_INDEX_KEY. That offset grows with
the number of entries ever created in the directory (dir indexes are
monotonic and never reused), so it can be arbitrarily large.
As a result, after a directory has been logged (e.g. via an fsync that
triggers directory logging), btrfs_getattr() reports inflated st_blocks
for that directory. The inflation is purely in-core and disappears
after the inode is evicted and reloaded (btrfs_alloc_inode() zeroes the
union), e.g. after a remount.
Reproducer (on a btrfs filesystem):
    D=/mnt/btrfs/d
    mkdir -p $D
    for i in $(seq 1 20000); do touch $D/f$i; done
    sync                  # commit, push dir index high
    touch $D/trigger      # dirty the dir in a new transaction
    xfs_io -c fsync $D    # log the directory -> sets last_dir_index_offset
    stat -c '%b' $D       # st_blocks is now inflated (e.g. 40)
    # umount + mount -> st_blocks drops back to the correct value
The evict path already knows this union is type-dependent and guards the
corresponding WARN_ON with !S_ISDIR() in btrfs_destroy_inode(); only
btrfs_getattr() was missing the equivalent check.
Only read new_delalloc_bytes for regular files, which are the only
inodes that ever set it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
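The effect of reading the wrong union member can be modelled in user space. The sketch below is a toy model, not btrfs_getattr(): `struct toy_inode` and both helpers are hypothetical, the block count formula (align to the block size, then divide by 512) is an assumption about how the sector-aligned value is turned into st_blocks, and the directory is assumed to have no bytes of its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy in-core inode; the union mirrors the one quoted in the message. */
struct toy_inode {
	bool is_regular;      /* stands in for S_ISREG(inode->i_mode) */
	uint64_t inode_bytes; /* bytes already accounted for the inode */
	union {
		uint64_t new_delalloc_bytes;    /* files only */
		uint64_t last_dir_index_offset; /* directories only */
	};
};

static uint64_t align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) / a * a;
}

/* Old behaviour: treats the union as delalloc bytes for every inode. */
static uint64_t st_blocks_unguarded(const struct toy_inode *in, uint32_t bs)
{
	return (align_up(in->inode_bytes, bs) +
		align_up(in->new_delalloc_bytes, bs)) >> 9;
}

/* Fixed behaviour: only regular files contribute delalloc bytes. */
static uint64_t st_blocks_guarded(const struct toy_inode *in, uint32_t bs)
{
	uint64_t delalloc = in->is_regular ? in->new_delalloc_bytes : 0;

	return (align_up(in->inode_bytes, bs) + align_up(delalloc, bs)) >> 9;
}
```

Under these assumptions, a directory whose last logged index offset is 20002 reports 40 blocks through the unguarded path with a 4096-byte block size, and 0 through the guarded one; regular files give the same result either way.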
…tion While running fsstress with autodefrag and flushoncommit, hit a deadlock due to the fact that defrag reserves delalloc space while it's holding dirty and locked folios, besides the extent range lock. The stack traces are the following: [430958.624136] task:kworker/u50:3 state:D stack:0 pid:20365 tgid:20365 ppid:2 task_flags:0x4208060 flags:0x00080000 [430958.626267] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [430958.627821] Call Trace: [430958.628351] <TASK> [430958.628990] __schedule+0x4be/0x10f0 [430958.629791] ? preempt_count_add+0x69/0xa0 [430958.630605] schedule+0x26/0xd0 [430958.631327] wait_current_trans+0x102/0x160 [btrfs] [430958.632414] ? __pfx_autoremove_wake_function+0x10/0x10 [430958.633515] start_transaction+0x374/0x900 [btrfs] [430958.634601] btrfs_commit_current_transaction+0x1d/0x70 [btrfs] [430958.635982] flush_space+0xca/0x5e0 [btrfs] [430958.636996] ? _raw_spin_unlock+0x15/0x30 [430958.637894] ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs] [430958.639217] ? _raw_spin_unlock+0x15/0x30 [430958.640030] ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs] [430958.641462] do_async_reclaim_metadata_space+0x84/0x190 [btrfs] [430958.642711] btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs] [430958.644015] process_one_work+0x19d/0x3a0 [430958.644873] worker_thread+0x1c4/0x330 [430958.645668] ? __pfx_worker_thread+0x10/0x10 [430958.646535] kthread+0xfc/0x130 [430958.647285] ? __pfx_kthread+0x10/0x10 [430958.648068] ret_from_fork+0x1f7/0x2c0 [430958.648894] ? __pfx_kthread+0x10/0x10 [430958.649713] ret_from_fork_asm+0x1a/0x30 [430958.650536] </TASK> [430958.651036] task:kworker/u49:7 state:D stack:0 pid:52990 tgid:52990 ppid:2 task_flags:0x4208060 flags:0x00080000 [430958.653709] Workqueue: writeback wb_workfn (flush-btrfs-334) [430958.655110] Call Trace: [430958.655737] <TASK> [430958.656284] __schedule+0x4be/0x10f0 [430958.657178] ? 
__blk_flush_plug+0xe9/0x140 [430958.658188] schedule+0x26/0xd0 [430958.658982] io_schedule+0x42/0x70 [430958.659850] folio_wait_bit_common+0x12b/0x330 [430958.660954] ? folio_wait_bit_common+0x100/0x330 [430958.662157] ? __pfx_wake_page_function+0x10/0x10 [430958.663328] extent_write_cache_pages+0x599/0x830 [btrfs] [430958.664496] ? acpi_fwnode_get_reference_args+0x1fa/0x270 [430958.665579] btrfs_writepages+0x77/0x130 [btrfs] [430958.666614] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] [430958.667846] do_writepages+0xc6/0x160 [430958.668596] __writeback_single_inode+0x42/0x310 [430958.669535] writeback_sb_inodes+0x231/0x570 [430958.670583] wb_writeback+0x8a/0x340 [430958.671383] wb_workfn+0xbf/0x450 [430958.672058] ? finish_task_switch.isra.0+0xc1/0x350 [430958.673026] process_one_work+0x19d/0x3a0 [430958.673814] worker_thread+0x1c4/0x330 [430958.674565] ? __pfx_worker_thread+0x10/0x10 [430958.675440] kthread+0xfc/0x130 [430958.676084] ? __pfx_kthread+0x10/0x10 [430958.676832] ret_from_fork+0x1f7/0x2c0 [430958.677582] ? __pfx_kthread+0x10/0x10 [430958.678369] ret_from_fork_asm+0x1a/0x30 [430958.679171] </TASK> [430958.679644] task:btrfs-cleaner state:D stack:0 pid:296750 tgid:296750 ppid:2 task_flags:0x208040 flags:0x00080000 [430958.681812] Call Trace: [430958.682318] <TASK> [430958.682762] __schedule+0x4be/0x10f0 [430958.683542] schedule+0x26/0xd0 [430958.684264] handle_reserve_ticket+0x1b9/0x2c0 [btrfs] [430958.685366] ? __pfx_autoremove_wake_function+0x10/0x10 [430958.686520] reserve_bytes+0x283/0x4c0 [btrfs] [430958.687610] btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs] [430958.688860] btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs] [430958.690263] btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs] [430958.691675] btrfs_defrag_file+0x903/0x1110 [btrfs] [430958.692879] btrfs_run_defrag_inodes+0x334/0x430 [btrfs] [430958.694005] cleaner_kthread+0x97/0x1c0 [btrfs] [430958.694969] ? 
__pfx_cleaner_kthread+0x10/0x10 [btrfs] [430958.696232] kthread+0xfc/0x130 [430958.696954] ? __pfx_kthread+0x10/0x10 [430958.697763] ret_from_fork+0x1f7/0x2c0 [430958.698521] ? __pfx_kthread+0x10/0x10 [430958.699348] ret_from_fork_asm+0x1a/0x30 [430958.700217] </TASK> [430958.716533] task:fsstress state:D stack:0 pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000 [430958.718780] Call Trace: [430958.719366] <TASK> [430958.719817] __schedule+0x4be/0x10f0 [430958.720611] ? preempt_count_add+0x69/0xa0 [430958.721465] schedule+0x26/0xd0 [430958.722150] wb_wait_for_completion+0x79/0xc0 [430958.723109] ? __pfx_autoremove_wake_function+0x10/0x10 [430958.724173] __writeback_inodes_sb_nr+0xc5/0xf0 [430958.725081] try_to_writeback_inodes_sb+0x55/0x70 [430958.726075] btrfs_commit_transaction+0x19d/0xeb0 [btrfs] [430958.727337] ? start_transaction+0x343/0x900 [btrfs] [430958.728422] btrfs_mksubvol+0x28b/0x4e0 [btrfs] [430958.729445] btrfs_mksnapshot+0x74/0xa0 [btrfs] [430958.730511] __btrfs_ioctl_snap_create+0x194/0x210 [btrfs] [430958.732245] btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs] [430958.733636] btrfs_ioctl+0x7ec/0x2a70 [btrfs] [430958.734665] ? __virt_addr_valid+0xe4/0x180 [430958.735534] ? __check_object_size+0x1cd/0x1f0 [430958.736613] ? kmem_cache_free+0x146/0x380 [430958.737645] ? _raw_spin_unlock+0x15/0x30 [430958.738660] ? do_sys_openat2+0x83/0xd0 [430958.739637] __x64_sys_ioctl+0x92/0xe0 [430958.740576] do_syscall_64+0x60/0x590 [430958.741512] ? 
clear_bhb_loop+0x60/0xb0 [430958.742485] entry_SYSCALL_64_after_hwframe+0x76/0x7e [430958.743772] RIP: 0033:0x7f4431e108db [430958.744668] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [430958.746327] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db [430958.747816] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005 [430958.749479] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [430958.751216] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0 [430958.752929] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003 [430958.754684] </TASK> What happens is the following: 1) The cleaner kthread is running autodefrag, and in defrag_one_range() it acquired all the folios for the range and locked them. Then it locked the extent range in the inode's iotree. It got two subranges from defrag_collect_targets(), the first one with folio A and the second one with folio B. After it defraged the first subrange, folio A remains locked and dirty - it's only unlocked when defrag_one_range() returns. When it attempts to defrag the second subrange (containing folio B), btrfs_delalloc_reserve_space() creates a space reservation ticket, due to lack of free metadata space and blocks waiting for the async metadata reclaim task to free space and wake it up; 2) The async reclaim metadata task attempts to commit the current transaction, but it blocks because there is another task that started the commit first; 3) A task creating a snapshot is committing the transaction and because the fs was mounted with flushoncommit, it calls try_to_writeback_inodes_sb(), which spawns a task to flush delalloc and waits for it to complete; 4) The task flushing delalloc (kworker/u49:7), finds that folio A for the inode being defragged is dirty, so it tries to lock it... 
But it blocks because folio A is locked by the defrag task (the cleaner kthread) which is blocked waiting for the reservation ticket to be served, but the async reclaim metadata task is blocked waiting for the transaction commit, which in turn is blocked waiting for the delalloc flush task, which is trying to lock folio A, resulting in a deadlock. The same type of problem can happen if the async reclaim task starts to flush delalloc, as that requires both locking the folio and the extent range in the inode's io tree, and in this case we don't need the fs to be mounted with flushoncommit. This type of problem has occurred several times in the past, with reflinks for example, where we had a dirty folio while holding the extent range locked and then starting a transaction blocked waiting for the async reclaim task due to lack of free metadata space. So fix this by reserving delalloc space before locking folios and locking the extent range in the inode's iotree. We cannot simply unlock the folios for each subrange given by defrag_collect_targets() after we defrag it because the same folio may also be present in the next subrange (due to large folios). Fixes: 22b398e ("btrfs: defrag: introduce helper to defrag a contiguous prepared range") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
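The four-way dependency described in steps 1) to 4) forms a cycle in a wait-for graph. The sketch below models it with one outgoing edge per task; the task names and `in_wait_cycle()` are hypothetical and exist only to illustrate why moving the reservation ahead of the folio locks breaks the cycle.

```c
#include <stdbool.h>

/* The four tasks from the stack traces above. */
enum task { CLEANER, RECLAIM, COMMITTER, FLUSHER, NR_TASKS };

/*
 * waits_for[t] is the task that t is blocked on, or -1 if t can run.
 * Returns true if following the chain from start leads back to start.
 */
static bool in_wait_cycle(const int waits_for[NR_TASKS], int start)
{
	int cur = start;

	for (int steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_for[cur];
		if (cur < 0)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

Before the fix the edges are cleaner -> reclaim -> committer -> flusher -> cleaner, which is a cycle. Once the reservation happens before any folio is locked, the flusher no longer waits on the cleaner, the last edge disappears, and the chain terminates.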
Syzbot reported the following warning recently:
[ 157.672472][ T6611] BTRFS info (device loop0): turning on flush-on-commit
[ 157.672488][ T6611] BTRFS info (device loop0): enabling free space tree
[ 157.672504][ T6611] BTRFS info (device loop0): enabling auto defrag
[ 157.672555][ T6611] BTRFS info (device loop0): use lzo compression, level 1
[ 157.672574][ T6611] BTRFS info (device loop0): max_inline set to 4096
[ 158.094512][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[ 160.073968][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
[ 160.418911][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
[ 160.432287][ T6662] loop2: detected capacity change from 0 to 32768
[ 160.438859][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
[ 160.459589][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[ 160.459616][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
[ 160.634366][ T1187] ------------[ cut here ]------------
[ 160.634376][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
[ 160.634387][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
[ 160.634412][ T1187] Modules linked in:
[ 160.634423][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
[ 160.634435][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
[ 160.634442][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
[ 160.634456][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
[ 160.634468][ T1187] Code: 53 a3 45 (...)
[ 160.634482][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
[ 160.634490][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
[ 160.634497][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 160.634503][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
[ 160.634509][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
[ 160.634516][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
[ 160.634523][ T1187] FS: 0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
[ 160.634531][ T1187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 160.634537][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
[ 160.634547][ T1187] Call Trace:
[ 160.634551][ T1187] <TASK>
[ 160.634560][ T1187] btrfs_put_ordered_extent+0x18f/0x430
[ 160.634577][ T1187] btrfs_finish_one_ordered+0xf63/0x2680
[ 160.634598][ T1187] ? __pfx_btrfs_finish_one_ordered+0x10/0x10
[ 160.634611][ T1187] ? do_raw_spin_lock+0x12b/0x2f0
[ 160.634622][ T1187] ? lock_acquire+0x106/0x350
[ 160.634636][ T1187] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 160.634650][ T1187] btrfs_work_helper+0x38b/0xc20
[ 160.634666][ T1187] ? process_scheduled_works+0xa70/0x1860
[ 160.634679][ T1187] process_scheduled_works+0xb5d/0x1860
[ 160.634703][ T1187] ? __pfx_process_scheduled_works+0x10/0x10
[ 160.634716][ T1187] ? assign_work+0x3d5/0x5e0
[ 160.634729][ T1187] worker_thread+0xa53/0xfc0
[ 160.634752][ T1187] kthread+0x388/0x470
[ 160.634765][ T1187] ? __pfx_worker_thread+0x10/0x10
[ 160.635870][ T1187] ? __pfx_kthread+0x10/0x10
[ 160.635891][ T1187] ret_from_fork+0x514/0xb70
[ 160.635907][ T1187] ? __pfx_ret_from_fork+0x10/0x10
[ 160.635917][ T1187] ? __switch_to+0xc79/0x1410
[ 160.635934][ T1187] ? __pfx_kthread+0x10/0x10
[ 160.635948][ T1187] ret_from_fork_asm+0x1a/0x30
[ 160.635969][ T1187] </TASK>
[ 160.635975][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...
It means we added a delayed iput after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.
This happens when using autodefrag and is more likely to happen if we use
flushoncommit too. The steps are the following:
1) Unmount starts, all delalloc is flushed and we enter close_ctree();
2) In close_ctree() we park the cleaner kthread, but while we wait for it
to park, it's in:
   btrfs_run_defrag_inodes()
     btrfs_run_defrag_inode()
       btrfs_defrag_file()
         defrag_one_cluster()
           defrag_one_range()
             defrag_one_locked_target()
And dirties some folios from an inode;
3) The cleaner kthread parks and we proceed in close_ctree(), waiting
for all ordered extents, running delayed iputs and setting the flag
BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;
4) Later in close_ctree() we call btrfs_commit_super(), which commits the
current transaction. Because we are mounted with flushoncommit, the
transaction commit flushes delalloc and waits for the resulting ordered
extents to complete;
5) The ordered extents from the flushed delalloc created by autodefrag
complete and create delayed iputs, triggering the warning:
WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));
in btrfs_add_delayed_iput()
6) Further below in close_ctree() we will hit the following assertion:
ASSERT(list_empty(&fs_info->delayed_iputs));
Since we don't expect any more delayed iputs.
Fix this by flushing delalloc and waiting for the ordered extents right
after we park the cleaner kthread and wait for autodefrag in
close_ctree().
Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
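The sequence in steps 1) to 6) above can be sketched as a small state model. It is a simplification under stated assumptions: struct model_fs and the model_*() functions are hypothetical names, each flushed dirty folio is assumed to produce exactly one delayed iput, and the flush_after_park parameter stands in for the ordering change described in the fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the unmount sequence; this is not btrfs code. */
struct model_fs {
	int dirty_folios;     /* delalloc left behind by autodefrag */
	int delayed_iputs;    /* delayed iputs currently queued */
	bool no_delayed_iput; /* models BTRFS_FS_STATE_NO_DELAYED_IPUT */
	bool flushoncommit;   /* models the flushoncommit mount option */
	int warnings;         /* times the modeled WARN_ON_ONCE() would fire */
};

static void model_add_delayed_iput(struct model_fs *fs)
{
	if (fs->no_delayed_iput)
		fs->warnings++;
	fs->delayed_iputs++;
}

/* Flush delalloc and wait: each completed ordered extent queues an iput. */
static void model_flush_and_wait(struct model_fs *fs)
{
	while (fs->dirty_folios > 0) {
		fs->dirty_folios--;
		model_add_delayed_iput(fs);
	}
}

static void model_run_delayed_iputs(struct model_fs *fs)
{
	fs->delayed_iputs = 0;
}

/* Returns the number of warnings hit during the modeled unmount. */
int model_close_ctree(struct model_fs *fs, bool flush_after_park)
{
	/* Step 2: autodefrag dirties a folio before the cleaner parks. */
	fs->dirty_folios += 1;

	/* The fix: flush and wait right after the cleaner is parked. */
	if (flush_after_park)
		model_flush_and_wait(fs);

	/* Step 3: run delayed iputs and forbid new ones. */
	model_run_delayed_iputs(fs);
	fs->no_delayed_iput = true;

	/* Steps 4 and 5: the commit flushes delalloc under flushoncommit. */
	if (fs->flushoncommit)
		model_flush_and_wait(fs);

	/* Step 6 expects fs->delayed_iputs == 0 at this point. */
	return fs->warnings;
}
```

Calling model_close_ctree() with flush_after_park set to false reproduces the late delayed iput, while passing true leaves nothing for the commit to flush after the flag is set.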
There's no need to have one loop to defrag each subrange and then another one to free each subrange (struct defrag_target_range). We can do it in a single loop, freeing each subrange after defragging it, and there is no need to delete each subrange from the list since we immediately free it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
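A minimal user-space sketch of that single-pass pattern follows. The list type is a stand-in for the kernel's struct list_head, struct model_range is a stand-in for struct defrag_target_range, and all function names are hypothetical. Summing the lengths stands in for the actual defrag work. Resetting the list head at the end is a choice made for this model so the head never points at freed memory; the commit message does not describe that step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modeled on struct list_head. */
struct model_node {
	struct model_node *prev, *next;
};

/* Stand-in for struct defrag_target_range. */
struct model_range {
	struct model_node list;
	unsigned long start;
	unsigned long len;
};

void model_list_init(struct model_node *head)
{
	head->prev = head;
	head->next = head;
}

static void model_list_add_tail(struct model_node *n, struct model_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

int model_add_range(struct model_node *head, unsigned long start,
		    unsigned long len)
{
	struct model_range *r = malloc(sizeof(*r));

	if (!r)
		return -1;
	r->start = start;
	r->len = len;
	model_list_add_tail(&r->list, head);
	return 0;
}

/*
 * Process and free every entry in one pass. The next pointer is saved
 * before the entry is freed, and entries are not unlinked one by one
 * because the whole list is being discarded.
 */
unsigned long model_process_and_free(struct model_node *head)
{
	unsigned long total = 0;
	struct model_node *pos = head->next;

	while (pos != head) {
		struct model_node *next = pos->next;
		struct model_range *r = (struct model_range *)
			((char *)pos - offsetof(struct model_range, list));

		total += r->len; /* stand-in for defragging the subrange */
		free(r);
		pos = next;
	}
	model_list_init(head); /* model choice: leave an empty, valid head */
	return total;
}
```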
Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of them in a very specific error path.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
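The idea of freeing an array automatically at scope exit can be sketched in user space with the GCC/Clang cleanup attribute. The definition of AUTO_KFREE() is not shown on this page, so MODEL_AUTO_FREE() below is a hypothetical analogue and not the kernel macro. The counter exists only so the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how often the cleanup handler ran; for illustration only. */
static int model_free_calls;

/* The cleanup handler receives a pointer to the annotated variable. */
static void model_auto_free(void *p)
{
	void **pp = p;

	model_free_calls++;
	free(*pp); /* free(NULL) is a no-op, so early exits are safe */
}

/* Hypothetical analogue of a scope-based "free on exit" declaration. */
#define MODEL_AUTO_FREE(name) \
	*name __attribute__((cleanup(model_auto_free))) = NULL

int model_fill_array(size_t nr, int fail_early)
{
	int MODEL_AUTO_FREE(arr);

	arr = calloc(nr, sizeof(*arr));
	if (!arr)
		return -1;
	if (fail_early)
		return -2; /* no explicit free needed on this error path */

	arr[0] = 1;
	return 0; /* and none needed on the success path either */
}
```

Both return paths release the array through the same handler, which is what removes the need for separate free calls in each exit path.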
When freeing the entries from the list there is no need to initialize the list member in an entry, since we are immediately freeing it. So use simple list_del() instead of list_del_init().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
There's no need to call list_del_init() against each entry when freeing the list, as the list is local and we are freeing the entry.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>