Fix a copy-paste error in check_extent_data_ref(): we're printing root as in the message above, we should be printing objectid. Fixes: f333a3c ("btrfs: tree-checker: validate dref root and objectid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Fix the superblock offset mismatch error message in btrfs_validate_super(): we changed it so that it considers all the superblocks, but the message still assumes we're only looking at the first one. The change from %u to %llu is because we're changing from a constant to a u64. Fixes: 069ec95 ("btrfs: Refactor btrfs_check_super_valid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commit b471965 fixed the comparison in scrub_verify_one_metadata to use metadata_uuid rather than fsid, but left the warning as it was. Fix it so it matches what we're doing. Fixes: b471965 ("btrfs: fix replace/scrub failure with metadata_uuid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to COW the root node of the subvolume we are snapshotting because we then call btrfs_copy_root(), which creates a copy of the root node and sets its generation to the current transaction. So remove this redundant COW right before calling btrfs_copy_root(), saving one extent allocation, memory allocation, copying things, etc, and making the code less confusing. Also rename the extent buffer variable from "old" to "root_eb" since that name no longer makes any sense after removing the unnecessary COW operation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, the following ASSERT() can be triggered:
assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
------------[ cut here ]------------
kernel BUG at inode.c:9991!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 5 UID: 0 PID: 6787 Comm: btrfs Tainted: G OE 6.19.0-rc8-custom+ #1 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
pc : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
lr : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
Call trace:
btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs] (P)
btrfs_do_write_iter+0x1d8/0x208 [btrfs]
btrfs_ioctl_encoded_write+0x3c8/0x6d0 [btrfs]
btrfs_ioctl+0xeb0/0x2b60 [btrfs]
__arm64_sys_ioctl+0xac/0x110
invoke_syscall.constprop.0+0x64/0xe8
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x1b8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1a4/0x1a8
Code: 91180021 90001080 9111a000 94039d54 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
After commit e1bc83f ("btrfs: get rid of compressed_folios[] usage for encoded writes"), the encoded write is changed to copy the content from the iov into a folio, and queue the folio into the compressed bio.
However we always queue the full folio into the compressed bio, which can make the compressed bio larger than the on-disk extent, if the folio size is larger than the fs block size.
Although we have an ASSERT() to catch such a problem, for kernels without CONFIG_BTRFS_ASSERT, such a larger-than-expected bio will just be submitted, possibly overwriting the next data extent and causing data corruption.
[FIX]
Instead of blindly queuing the full folio into the compressed bio, only queue the rounded up range, which is the old behavior before that offending commit.
This also means we no longer need to zero the trailing range until the folio end (but still to the block boundary), as such a range will not be submitted anyway.
And since we're here, add a final ASSERT() into btrfs_submit_compressed_write() as the last safety net for kernels with btrfs assertions enabled.
Fixes: e1bc83f ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
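The "only queue the rounded up range" idea in the commit message above can be illustrated with a small userspace sketch. This is not the kernel code: round_up_to_block() and bytes_to_queue() are hypothetical names, and the sketch only assumes that the fs block size is a power of two and that the compressed data starts at offset 0 of the folio.

```c
#include <assert.h>
#include <stdint.h>

/* Round @len up to the next multiple of @blocksize (assumed power of two). */
static uint32_t round_up_to_block(uint32_t len, uint32_t blocksize)
{
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	return (len + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Hypothetical helper: number of bytes of a folio to queue into the bio
 * when only the first @len bytes hold compressed data. Queuing the whole
 * folio (@folio_size) is what made the bio larger than the on-disk extent.
 */
static uint32_t bytes_to_queue(uint32_t len, uint32_t blocksize,
			       uint32_t folio_size)
{
	uint32_t queued = round_up_to_block(len, blocksize);

	/* The rounded up range must still fit inside the folio. */
	assert(queued <= folio_size);
	return queued;
}
```

With a 64K folio and a 4K block size, 5000 bytes of compressed data would queue 8192 bytes rather than the full 65536.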
[BUG]
When running btrfs/284, the following ASSERT() will be triggered with 64K page size and 4K fs block size:
assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476
------------[ cut here ]------------
kernel BUG at subpage.c:476!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 2313 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
lr : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
Call trace:
btrfs_subpage_clear_writeback+0x148/0x160 [btrfs] (P)
btrfs_folio_clamp_clear_writeback+0xb4/0xd0 [btrfs]
end_compressed_writeback+0xe0/0x1e0 [btrfs]
end_bbio_compressed_write+0x1e8/0x218 [btrfs]
btrfs_bio_end_io+0x108/0x258 [btrfs]
simple_end_io_work+0x68/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
The offending bio is from an encoded write, where the compressed data is directly written as a data extent, without touching the page cache.
However the encoded write still utilizes the regular buffered write path for compressed data, by setting the compressed_bio::writeback flag.
When that flag is set, at end_bbio_compressed_write() btrfs will clear the writeback flag of the folios in the page cache.
However for bs < ps cases, the subpage helper has one extra check to make sure the folio has a writeback flag set in the first place.
But since it's an encoded write, we never go through the page cache, thus the folio has no writeback flag and triggers the ASSERT().
[FIX]
Do not set compressed_bio::writeback flag for encoded writes, and change the ASSERT() in btrfs_submit_compressed_write() to make sure that flag is not set.
Fixes: e1bc83f ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it crashes with the following ASSERT() triggered:
assertion failed: folio_size(fi.folio) == blocksize :: 0, in fs/btrfs/zstd.c:603
------------[ cut here ]------------
kernel BUG at fs/btrfs/zstd.c:603!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 1183 Comm: kworker/u35:4 Not tainted 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : zstd_decompress_bio+0x4f0/0x508 [btrfs]
lr : zstd_decompress_bio+0x4f0/0x508 [btrfs]
Call trace:
zstd_decompress_bio+0x4f0/0x508 [btrfs] (P)
end_bbio_compressed_read+0x260/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 1914b94 ("btrfs: zstd: use folio_iter to handle zstd_decompress_bio()") added the ASSERT() to make sure the folio size matches the fs block size.
But the check is wrong: the original intention was to make sure that, for bs > ps cases, we always get a large folio that covers a full fs block.
However for bs < ps cases, a folio can never be smaller than page size, and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never be smaller than PAGE_SIZE, and still covers bs > ps cases.
Fixes: 1914b94 ("btrfs: zstd: use folio_iter to handle zstd_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it crashes with the following ASSERT() triggered:
BTRFS info (device dm-3): use lzo compression, level 1
assertion failed: folio_size(fi.folio) == sectorsize :: 0, in lzo.c:450
------------[ cut here ]------------
kernel BUG at lzo.c:450!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 329 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : lzo_decompress_bio+0x61c/0x630 [btrfs]
lr : lzo_decompress_bio+0x61c/0x630 [btrfs]
Call trace:
lzo_decompress_bio+0x61c/0x630 [btrfs] (P)
end_bbio_compressed_read+0x2a8/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
Code: 912a2021 b0000e00 91246000 940244e9 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 37cc07c ("btrfs: lzo: use folio_iter to handle lzo_decompress_bio()") added the ASSERT() to make sure the folio size matches the fs block size.
But the check is wrong: the original intention was to make sure that, for bs > ps cases, we always get a large folio that covers a full fs block.
However for bs < ps cases, a folio can never be smaller than page size, and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never be smaller than PAGE_SIZE, and still covers bs > ps cases.
Fixes: 37cc07c ("btrfs: lzo: use folio_iter to handle lzo_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
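A userspace sketch of why the old folio size check in the zstd and lzo decompression paths fails for bs < ps and why comparing against a minimum folio size does not. The helper names are hypothetical, and two things are assumptions rather than facts from the commit messages: that the minimum folio size equals max(page size, block size), and that the new check is an equality comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the minimum folio size is max(page size, fs block size). */
static uint32_t min_folio_size(uint32_t page_size, uint32_t blocksize)
{
	assert(page_size != 0 && blocksize != 0);
	return blocksize > page_size ? blocksize : page_size;
}

/* Old check: folio size must equal the fs block size. */
static bool old_check(uint32_t folio_size, uint32_t blocksize)
{
	return folio_size == blocksize;
}

/* New check (assumed form): folio size must equal the minimum folio size. */
static bool new_check(uint32_t folio_size, uint32_t page_size,
		      uint32_t blocksize)
{
	return folio_size == min_folio_size(page_size, blocksize);
}
```

For a 64K page with 4K blocks, a 64K folio fails old_check() but passes new_check(). For bs > ps, for example 4K pages with 16K blocks, both checks accept a 16K folio.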
There are several functions that take pointer arguments but don't need to modify the objects they point to, so add the const qualifiers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
It's useless to print the result of the condition, it's always 0 if the assertion is triggered, so it doesn't provide any useful information.
Examples:
assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476
So stop printing that, it's always ":: 0" for any assertion triggered (except for conditions that are just an identifier).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
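A small userspace sketch of the message shape with the result suffix dropped. The exact format string used by the kernel after this change is an assumption here (the examples above with " :: 0" removed), and format_assert_msg() and has_result_suffix() are hypothetical helpers, not btrfs functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: build the assertion message without the
 * " :: <result>" part. The format string is an assumption derived from
 * the example messages, not copied from the kernel.
 */
static int format_assert_msg(char *buf, size_t size, const char *cond,
			     const char *file, int line)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "assertion failed: %s, in %s:%d",
			cond, file, line);
}

/* Returns true if @msg still carries the old " :: <result>" suffix. */
static bool has_result_suffix(const char *msg)
{
	return strstr(msg, " :: ") != NULL;
}
```

Formatting the second example with this helper gives the same text as before minus the ":: 0" part.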
When logging that an inode exists, as part of logging a new name or logging new dir entries for a directory, we always set the generation of the logged inode item to 0. This is to signal during log replay (in overwrite_item()), that we should not set the i_size since we only logged that an inode exists, so the i_size of the inode in the subvolume tree must be preserved (as when we log new names or that an inode exists, we don't log extents).
This works fine except when we have already logged an inode in full mode or it's the first time we are logging an inode created in a past transaction, that inode has a new i_size of 0 and then we log a new name for the inode (due to a new hardlink or a rename), in which case we log an i_size of 0 for the inode and a generation of 0, which causes the log replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when we first log the inode, when we fsync file foo, we log its current i_size of 0, and then when we create a hard link we log again the inode in exists mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we add to the log tree, so during log replay overwrite_item() sees that the generation is 0 and i_size is 0 so we skip updating the inode's i_size from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real generation of the inode if it was logged in the current transaction with the i_size we logged before. Also if an inode created in a previous transaction is logged in exists mode only, make sure we log the i_size stored in the inode item located from the commit root, so that if we log multiple times that the inode exists we get the correct i_size.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All internal functions should be given a btrfs_inode for consistency and not a VFS inode. So pass a btrfs_inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Fix the error message in btrfs_delete_subvolume() if we can't delete a subvolume because it has an active swapfile: we were printing the number of the parent rather than the target. Fixes: 60021bd ("btrfs: prevent subvol with swapfile from being deleted") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commit d7f67ac ("btrfs: relax block-group-tree feature dependency checks") introduced a regression when it comes to handling unsupported incompat or compat_ro flags. Beforehand we only printed the flags that we didn't recognize, afterwards we printed them all, which is less useful. Fix the error handling so it behaves like it used to. Fixes: d7f67ac ("btrfs: relax block-group-tree feature dependency checks") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Fix the unlikely added to btrfs_insert_one_raid_extent() by commit a929904 ("btrfs: add unlikely annotations to branches leading to transaction abort"): the exclamation point is in the wrong place, so we are telling the compiler that allocation failure is actually expected. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
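To illustrate the misplaced exclamation point, here is a userspace sketch. The unlikely() definition mirrors the common __builtin_expect() idiom (a GCC/Clang extension), and the function names are hypothetical. Both spellings branch identically at run time; what differs is the hint given to the compiler about which outcome is expected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Common definition of the branch hint (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Misplaced: hints that a non-NULL pointer is unlikely, i.e. failure is expected. */
static bool alloc_failed_misplaced(const void *p)
{
	return !unlikely(p);
}

/* Fixed: hints that a NULL pointer (allocation failure) is unlikely. */
static bool alloc_failed_fixed(const void *p)
{
	return unlikely(!p);
}

/* Both forms compute the same truth value; only the hint differs. */
static void check_same_truth_value(const void *p)
{
	assert(alloc_failed_misplaced(p) == alloc_failed_fixed(p));
}
```

Because the result is the same either way, a mistake like this changes no behavior and shows up only as a possibly worse code layout for the common path.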
…num_copies() Fix a chunk map leak in btrfs_map_block(): if we return early with -EINVAL, we're not freeing the chunk map that we've just looked up. Fixes: 0ae653f ("btrfs: reduce chunk_map lookups in btrfs_map_block()") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…remap() If the call to btrfs_translate_remap() in btrfs_map_block() returns an error code, we were leaking the chunk map. Fix it by jumping to out rather than returning directly. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125125830.2352988-1-clm@meta.com/ Fixes: 18ba649 ("btrfs: redirect I/O for remapped block groups") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
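The two chunk map leak fixes in btrfs_map_block() follow the usual "jump to a common exit label" pattern. A minimal userspace sketch of that pattern, using a hypothetical refcounted fake_map in place of the real chunk map and made-up function names:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted chunk map. */
struct fake_map {
	int refs;
};

static struct fake_map *fake_map_get(struct fake_map *m)
{
	m->refs++;
	return m;
}

static void fake_map_put(struct fake_map *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Leaky shape: the early return skips the put. */
static int map_block_leaky(struct fake_map *m, int bad_arg)
{
	struct fake_map *map = fake_map_get(m);

	if (bad_arg)
		return -EINVAL;	/* reference is never dropped */
	fake_map_put(map);
	return 0;
}

/* Fixed shape: every exit goes through the label that drops the reference. */
static int map_block_fixed(struct fake_map *m, int bad_arg)
{
	struct fake_map *map = fake_map_get(m);
	int ret = 0;

	if (bad_arg) {
		ret = -EINVAL;
		goto out;
	}
	/* ... the real mapping work would happen here ... */
out:
	fake_map_put(map);
	return ret;
}
```

After a failing call, the leaky variant leaves the reference count at 1 while the fixed variant returns it to 0.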
btrfs_abort_transaction(), unlike btrfs_commit_transaction(), doesn't also free the transaction handle. Fix the instances in btrfs_last_identity_remap_gone() where we're also leaking the transaction on abort. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125125129.2245240-1-clm@meta.com/ Fixes: 979e1dc ("btrfs: handle deletions from remapped block group") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Add a check in remove_range_from_remap_tree() after we call btrfs_lookup_block_group(), to check if it is NULL. This shouldn't happen, but if it does we at least get an error rather than a segfault. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125125129.2245240-1-clm@meta.com/ Fixes: 979e1dc ("btrfs: handle deletions from remapped block group") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Add the definitions for the remap tree to print-tree.c, so that we get more useful information if a tree is dumped to dmesg. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The nowait flag is always false in this context, making the conditional check unnecessary. Simplify the code by directly assigning -ENOTBLK. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held, as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was calling it after do_reclaim_sweep() returns, at which point space_info->lock is no longer held. Fix this by explicitly acquiring space_info->lock before clearing the periodic reclaim ready flag in btrfs_reclaim_sweep(). Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/ Fixes: 19eff93 ("btrfs: fix periodic reclaim condition") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
We have several call sites doing the same work to calculate the size of a bio:
struct bio_vec *bvec;
u32 bio_size = 0;
int i;
bio_for_each_bvec_all(bvec, bio, i)
	bio_size += bvec->bv_len;
We can use a common helper instead of open-coding it everywhere.
This also allows us to constify the @bio_size variables used in all the call sites.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
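A userspace sketch of what such a helper might look like. The commit message does not name the helper, so fake_bio_size() and struct fake_bvec are hypothetical stand-ins for the kernel's bio and bio_vec types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bio_vec: only the length matters here. */
struct fake_bvec {
	uint32_t bv_len;
};

/* Sum the lengths of all vectors, as the open-coded loops did. */
static uint32_t fake_bio_size(const struct fake_bvec *vecs, size_t nr)
{
	uint32_t size = 0;

	assert(vecs != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++)
		size += vecs[i].bv_len;
	return size;
}
```

With a helper returning the size, a call site can declare `const uint32_t bio_size = fake_bio_size(vecs, nr);` instead of accumulating into a mutable variable.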
The member compressed_bio::compressed_len can be replaced by the bio
size, as we always submit the full compressed data without any partial
read/write.
Furthermore we already have enough ASSERT()s making sure the bio size
matches the ordered extent or the extent map.
This saves 8 bytes from compressed_bio:
Before:
struct compressed_bio {
u64 start; /* 0 8 */
unsigned int len; /* 8 4 */
unsigned int compressed_len; /* 12 4 */
u8 compress_type; /* 16 1 */
bool writeback; /* 17 1 */
/* XXX 6 bytes hole, try to pack */
struct btrfs_bio * orig_bbio; /* 24 8 */
struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 32 304 */
/* XXX last struct has 1 bit hole */
/* size: 336, cachelines: 6, members: 7 */
/* sum members: 330, holes: 1, sum holes: 6 */
/* member types with bit holes: 1, total: 1 */
/* forced alignments: 1 */
/* last cacheline: 16 bytes */
} __attribute__((__aligned__(8)));
After:
struct compressed_bio {
u64 start; /* 0 8 */
unsigned int len; /* 8 4 */
u8 compress_type; /* 12 1 */
bool writeback; /* 13 1 */
/* XXX 2 bytes hole, try to pack */
struct btrfs_bio * orig_bbio; /* 16 8 */
struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 24 304 */
/* XXX last struct has 1 bit hole */
/* size: 328, cachelines: 6, members: 6 */
/* sum members: 326, holes: 1, sum holes: 2 */
/* member types with bit holes: 1, total: 1 */
/* forced alignments: 1 */
/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
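The pahole output above can be cross-checked with a reduced userspace model of the leading members. This sketch leaves out the embedded struct btrfs_bio, uses a void pointer in place of struct btrfs_bio *, and assumes a typical LP64 ABI (8-byte pointers, 4-byte unsigned int, 1-byte bool); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the layout before the change (bbio member omitted). */
struct cb_old {
	uint64_t start;
	unsigned int len;
	unsigned int compressed_len;
	uint8_t compress_type;
	bool writeback;
	void *orig_bbio;
};

/* Reduced model after dropping compressed_len. */
struct cb_new {
	uint64_t start;
	unsigned int len;
	uint8_t compress_type;
	bool writeback;
	void *orig_bbio;
};

/* Padding between writeback and orig_bbio in the old layout. */
static size_t hole_old(void)
{
	assert(sizeof(void *) == 8 && sizeof(bool) == 1);
	return offsetof(struct cb_old, orig_bbio) -
	       (offsetof(struct cb_old, writeback) + sizeof(bool));
}

/* Padding between writeback and orig_bbio in the new layout. */
static size_t hole_new(void)
{
	assert(sizeof(void *) == 8 && sizeof(bool) == 1);
	return offsetof(struct cb_new, orig_bbio) -
	       (offsetof(struct cb_new, writeback) + sizeof(bool));
}
```

Under those ABI assumptions the hole shrinks from 6 bytes to 2 and the modeled struct shrinks by 8 bytes, matching the numbers pahole reports: removing the 4-byte member also lets the pointer move up to the previous 8-byte boundary.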
…start
btrfs_zoned_reserve_data_reloc_bg() is called on each mount of a file system and allocates a new block-group, to assign it to be the dedicated relocation target, if no pre-existing usable block-group for this task is found.
If for some reason the transaction is aborted, btrfs_end_transaction() will wake up the transaction kthread. But the transaction kthread is not yet initialized at the time btrfs_zoned_reserve_data_reloc_bg() is called, leading to the following NULL-pointer dereference:
RSP: 0018:ffffc9000c617c98 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 000000000000073c RCX: 0000000000000002
RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000001
RBP: 0000000000000207 R08: ffffffff8223c71d R09: 0000000000000635
R10: ffff888108588000 R11: 0000000000000003 R12: 0000000000000003
R13: 000000000000073c R14: 0000000000000000 R15: ffff888114dd6000
FS: 00007f2993745840(0000) GS:ffff8882b508d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000073c CR3: 0000000121a82006 CR4: 0000000000770eb0
PKRU: 55555554
Call Trace:
<TASK>
try_to_wake_up (./include/linux/spinlock.h:557 kernel/sched/core.c:4106)
__btrfs_end_transaction (fs/btrfs/transaction.c:1115 (discriminator 2))
btrfs_zoned_reserve_data_reloc_bg (fs/btrfs/zoned.c:2840)
open_ctree (fs/btrfs/disk-io.c:3588)
btrfs_get_tree.cold (fs/btrfs/super.c:982 fs/btrfs/super.c:1944 fs/btrfs/super.c:2087 fs/btrfs/super.c:2121)
vfs_get_tree (fs/super.c:1752)
__do_sys_fsconfig (fs/fsopen.c:231 fs/fsopen.c:295 fs/fsopen.c:473)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)
RIP: 0033:0x7f299392740e
Move the call to btrfs_zoned_reserve_data_reloc_bg() after the transaction_kthread has been initialized to fix this problem.
Fixes: 694ce5e ("btrfs: zoned: reserve data_reloc block group on mount")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have recently observed a number of subvolumes with broken dentries.
ls-ing the parent dir looks like:
drwxrwxrwt 1 root root 16 Jan 23 16:49 .
drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
d????????? ? ? ? ? ? broken_subvol
and similarly stat-ing the file fails.
In this state, deleting the subvol fails with ENOENT, but attempting to
create a new file or subvol over it errors out with EEXIST and even
aborts the fs. Which leaves us a bit stuck.
dmesg contains a single notable error message reading:
"could not do orphan cleanup -2"
2 is ENOENT and the error comes from the failure handling path of
btrfs_orphan_cleanup(), with the stack leading back up to
btrfs_lookup().
btrfs_lookup
btrfs_lookup_dentry
btrfs_orphan_cleanup // prints that message and returns -ENOENT
After some detailed inspection of the internal state, it became clear
that:
- there are no orphan items for the subvol
- the subvol is otherwise healthy looking, it is not half-deleted or
anything, there is no drop progress, etc.
- the subvol was created a while ago and does the meaningful first
btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much
later.
- after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
which results in a negative dentry for the subvolume via
d_splice_alias(NULL, dentry), leading to the observed behavior. The
bug can be mitigated by dropping the dentry cache, at which point we
can successfully delete the subvolume if we want.
i.e.,
btrfs_lookup()
btrfs_lookup_dentry()
if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
btrfs_orphan_cleanup(sub_root)
test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
btrfs_search_slot() // finds orphan item for inode N
...
prints "could not do orphan cleanup -2"
if (inode == ERR_PTR(-ENOENT))
inode = NULL;
return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
on the root when it runs, so it cannot run more than once on a given
root, so something else must run concurrently. However, the obvious
routes to deleting an orphan when nlinks goes to 0 should not be able to
run without first doing a lookup into the subvolume, which should run
btrfs_orphan_cleanup() and set the bit.
The final important observation is that create_subvol() calls
d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
the dentry cache gets dropped, the next lookup into the subvolume will
make a real call into btrfs_orphan_cleanup() for the first time. This
opens up the possibility of concurrently deleting the inode/orphan items
but most typical evict() paths will be holding a reference on the parent
dentry (child dentry holds parent->d_lockref.count via dget in
d_alloc(), released in __dentry_kill()) and prevent the parent from
being removed from the dentry cache.
The one exception is delayed iputs. Ordered extent creation calls
igrab() on the inode. If the file is unlinked and closed while those
refs are held, iput() in __dentry_kill() decrements i_count but does
not trigger eviction (i_count > 0). The child dentry is freed and the
subvol dentry's d_lockref.count drops to 0, making it evictable while
the inode is still alive.
Since there are two races (the race between writeback and unlink and
the race between lookup and delayed iputs), and there are too many moving
parts, the following three diagrams show the complete picture.
(Only the second and third are races)
Phase 1:
Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
btrfs_mksubvol()
lookup_one_len()
__lookup_slow()
d_alloc_parallel()
__d_alloc() // d_lockref.count = 1
create_subvol(dentry)
// doesn't touch the bit..
d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
Phase 2:
Create a delayed iput for a file in the subvol but leave the subvol in
state where its dentry can be evicted (d_lockref.count == 0)
T1 (task) T2 (writeback) T3 (OE workqueue)
write() // dirty pages
btrfs_writepages()
btrfs_run_delalloc_range()
cow_file_range()
btrfs_alloc_ordered_extent()
igrab() // i_count: 1 -> 2
btrfs_unlink_inode()
btrfs_orphan_add()
close()
__fput()
dput()
finish_dput()
__dentry_kill()
dentry_unlink_inode()
iput() // 2 -> 1
--parent->d_lockref.count // 1 -> 0; evictable
finish_ordered_fn()
btrfs_finish_ordered_io()
btrfs_put_ordered_extent()
btrfs_add_delayed_iput()
Phase 3:
Once the delayed iput is pending and the subvol dentry is evictable,
the shrinker can free it, causing the next lookup to go through
btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
If the cleaner kthread processes the delayed iput concurrently, the
two race:
T1 (shrinker) T2 (cleaner kthread) T3 (lookup)
super_cache_scan()
prune_dcache_sb()
__dentry_kill()
// subvol dentry freed
btrfs_run_delayed_iputs()
iput() // i_count -> 0
evict() // sets I_FREEING
btrfs_evict_inode()
// truncation loop
btrfs_lookup()
btrfs_lookup_dentry()
btrfs_orphan_cleanup()
// first call (bit never set)
btrfs_iget()
// blocks on I_FREEING
btrfs_orphan_del()
// inode freed
// returns -ENOENT
btrfs_del_orphan_item()
// -ENOENT
// "could not do orphan cleanup -2"
d_splice_alias(NULL, dentry)
// negative dentry for valid subvol
The most straightforward fix is to ensure the invariant that a dentry
for a subvolume can exist if and only if that subvolume has
BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
orphans or ran btrfs_orphan_cleanup()).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
…tent_buffer() Call rcu_read_lock() before exiting the loop in try_release_subpage_extent_buffer() because there is a rcu_read_unlock() call past the loop. This has been detected by the Clang thread-safety analyzer. Fixes: ad580df ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: David Sterba <dsterba@suse.com>
Currently we zero out all the remaining bytes of the last folio of the compressed bio, then round the bio size to fs block boundary.
But that is done in two different functions, zero_last_folio() to zero the remaining bytes of the last folio, and round_up_last_block() to round up the bio to fs block boundary.
There are some minor problems:
- zero_last_folio() is zeroing ranges we won't submit
This mostly affects block size < page size cases, where we can have a large folio (e.g. 64K), but the fs block size is only 4K.
In that case, we may only want to submit the first 4K of the folio, the remaining range won't matter, but we still zero all of it.
This causes unnecessary CPU usage just to zero out some bytes we won't use.
- compressed_bio_last_folio() is called twice in two different functions
In theory we only need to call it once.
Enhance the situation by:
- Only zero out bytes up to the fs block boundary
Thus this will reduce some overhead for bs < ps cases.
- Move the folio_zero_range() call into round_up_last_block()
So that we can reuse the same folio returned by compressed_bio_last_folio().
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep this open, the build tests are hosted on github CI.