btrfs-progs: convert: fix a long bug that bgt feature never works #988

adam900710 · 2025-05-24T01:28:54Z

There is a long bug that, "btrfs-convert -O bgt" doesn't create a fs
with bgt feature at all.

In fact "mkfs.btrfs -O bgt" is no different than "mkfs.btrfs -O ^bgt".

The root cause is explained and fixed in the second last patch.

The first 7 patches are mostly preparation and cleanup for properly
intorduce block group tree at temprory fs creation time.

[BUG] When a device replace failed, e.g. try to replace a device on a RO mounted btrfs, the error message is incorrectly broken into two lines: [adam@btrfs-vm ~]$ sudo btrfs replace start -fB 1 /dev/test/scratch3 /mnt/btrfs/ Performing full device TRIM /dev/mapper/test-scratch3 (10.00GiB) ... ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Read-only file system [adam@btrfs-vm ~]$ Note the newline after the "Read-only file system" error message. [CAUSE] Inside cmd_replace_start(), if the ioctl failed we need to handle the error messages different depeneding on start_args.result. If the result is not BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT we will append extra info to the error message. But the initial error message is using error(), which implies a newline. This results the above incorrectly splitted error message. [FIX] Instead of manually appending an extra reason to the existing error message, just do different output depending on the start_args.result in the first place. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com>

It's a long known problem that direct IO can lead to data checksum mismatches if the user space is also modifying its buffer during writeback. Although it's fixed in the recent v6.15 release (and backported to older kernels), we still need a user friendly way to fix those problems. This patch introduce the dryrun version of "btrfs rescue fix-data-checksum", which reports the logical bytenr and corrupted mirrors. Signed-off-by: Qu Wenruo <wqu@suse.com>

@ioff

[BUG] There is a seldomly utilized function, btrfs_find_item(), which has no document and is not behaving correctly. Inside backref.c, iterate_inode_refs() and btrfs_ref_to_path() both utilize this function, to find the parent inode. However btrfs_find_item() will never return 0 if @ioff is passed as 0 for such usage, result early failure for all kinds of inode iteration functions. [CAUSE] Both functions pass 0 as the @ioff parameter initially, e.g: We have the following fs tree root: item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 9 size 6 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 index 0 namelen 2 name: .. item 2 key (256 DIR_ITEM 2507850652) itemoff 16078 itemsize 33 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 3 name: foo item 3 key (256 DIR_INDEX 2) itemoff 16045 itemsize 33 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 3 name: foo item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 generation 9 transid 9 size 16384 nbytes 16384 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 4 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 generation 9 type 1 (regular) extent data disk byte 13631488 nr 16384 extent data offset 0 nr 16384 ram 16384 extent compression 0 (none) Then we call paths_from_inode() with: - @Inum = 257 - ipath = {.fs_root = 5} Then we got the following sequence: iterate_irefs(257, fs_root, inode_to_path) |- iterate_inode_refs() |- inode_ref_info() |- btrfs_find_item(257, 0, fs_root) | Returned 1, with @found_key updated to | (257, INODE_REF, 256). This makes iterate_irefs() exit immediately, but obviously that btrfs_find_item() call is to find any INODE_REF, not to find the exact match. [FIX] If btrfs_find_item() found an item matching the objectid and type, then it should return 0 other than 1. Fix it and keep the behavior the same across btrfs-progs and the kernel. Since we're here, also add some comments explaining the function. Signed-off-by: Qu Wenruo <wqu@suse.com>

Previously "btrfs rescue fix-data-checksum" only show affected logical bytenr, which is not helpful to determine which files are affected. Enhance the output by also outputting the affected subvolumes (in numeric form), and the file paths inside that subvolume. The example looks like this: logical=13631488 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar logical=13635584 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar Although the end result is still not perfect, it's still much easier to find out which files are affected. Signed-off-by: Qu Wenruo <wqu@suse.com>

This mode will ask user for how to fix each block. User input can match the first letter or the whole action name to specify given action, the input is verified case insensitive. If no user input is provided, the default action is to ignore the corrupted block. If the input matches no action, a warning is outputted and user must retry until a valid input is provided. Signed-off-by: Qu Wenruo <wqu@suse.com>

This adds a new group of action in the interactive mode to fix a csum mismatch. The output looks like this: logical=13631488 corrtuped mirrors=1 affected files: (subvolume 5)/foo (subvolume 5)/dir/bar <<I>>gnore/<1>:1 Csum item for logical 13631488 updated using data from mirror 1 In the interactive mode, the update-csum-item behavior is outputted as all available mirror numbers. Considering all the existing (and future) action should starts with an alphabet, it's pretty easy to distinguish mirror number from other actions. The update-csum-item action itself is pretty straight-forward, just read out the data from specified mirror, then calculate a new checksum, and update the corresponding csum item in csum tree. Signed-off-by: Qu Wenruo <wqu@suse.com>

This option allows "btrfs rescue fix-data-checksum" to use the specified mirror number to update checksum item for all corrupted mirrors. If the specified number is larger than the max number of mirrors, the real mirror number will be `num % (num_mirrors + 1)`. Signed-off-by: Qu Wenruo <wqu@suse.com>

Block group tree requires no_holes and free-space-tree features, add such check just like mkfs. Signed-off-by: Qu Wenruo <wqu@suse.com>

The bytenr sequence of all roots are controlled by our code, so if something went wrong with the sequence, it's a bug. A UASSERT() is more suitable for this case. Signed-off-by: Qu Wenruo <wqu@suse.com>

@slot

The function requires parameters @slot and @itemoff to record where the next item should land. But this is overkilled, as after inserting an item, the temporary extent buffer will have its header nritems and the item pointer updated. We can use that header nritems and item pointer to get where the next item should land. This removes the external counter to record @slot and @itemoff. Signed-off-by: Qu Wenruo <wqu@suse.com>

@slot

…_chunk_item() These functions require parameters @slot and @itemoff to record where the next item should land. But this is overkilled, as after inserting an item, the temporary extent buffer will have its header nritems and the item pointer updated. We can use that header nritems and item pointer to get where the next item should land. This removes the external counter to record @slot and @itemoff. Signed-off-by: Qu Wenruo <wqu@suse.com>

@slot

This function requires parameters @slot and @itemoff to record where the next item should land. But this is overkilled, as after inserting an item, the temporary extent buffer will have its header nritems and the item pointer updated. We can use that header nritems and item pointer to get where the next item should land. This removes the external counter to record @slot and @itemoff. Signed-off-by: Qu Wenruo <wqu@suse.com>

@slot

…emp_block_group() These functions require parameters @slot and @itemoff to record where the next item should land. But this is overkilled, as after inserting an item, the temporary extent buffer will have its header nritems and the item pointer updated. We can use that header nritems and item pointer to get where the next item should land. This removes the external counter to record @slot and @itemoff. Signed-off-by: Qu Wenruo <wqu@suse.com>

…tree() Both fs and csum trees are empty at make_convert_btrfs(), no need to use two different functions to do that. Merge them into a common setup_temp_empty_tree() instead. Signed-off-by: Qu Wenruo <wqu@suse.com>

Previously there are some problems related to btrfs-convert bgt support, that it doesn't work at all, caused by the following reasons: - We never update the super block with extra compat ro flags Even if we set "-O bgt" flags, it will not set the compat ro flags, and everything just go non-bgt routine. Meanwhile other compat ro flags are for free-space-tree, and free-space-tree is rebuilt after the full convert is done. Thus this bug won't cause any problem for fst features, but only affecting bgt so far. - No extra handling to create block group tree Fix above problems by: - Set the proper compat RO flag for the temporary super block We should only set the compat RO flags except the two FST related bits. As FST is handled after conversion, we should not set the flag at that timing. - Add block group tree root item and its backrefs So the initial temporary fs will have a proper block group tree. The only tricky part is for the extent tree population, where we have to put all block group items into the block group tree other than the extent tree. With these two points addressed, now block group tree can be properly enabled for btrfs-convert. Signed-off-by: Qu Wenruo <wqu@suse.com>

Previously "btrfs-convert -O bgt" will not cause any error, but the results fs has no block-group-tree feature at all, making it no different than "btrfs-convert -O ^bgt". This is a big bug that never caught by our existing convert runs. 001-ext2-basic and 003-ext4-basic all tested bgt feature, but doesn't really check if the resulted fs really have bgt flags set. To patch the hole, add a new test case, which will do the regular bgt convert, but at the end also do a super block dump and verify the BLOCK_GROUP_TREE flag is properly set. Signed-off-by: Qu Wenruo <wqu@suse.com>

kdave · 2025-05-30T20:10:02Z

Merged to devel, thanks.

adam900710 added 16 commits May 14, 2025 09:47

btrfs-progs: convert: add feature dependency checks for bgt

f85108f

Block group tree requires no_holes and free-space-tree features, add such check just like mkfs. Signed-off-by: Qu Wenruo <wqu@suse.com>

btrfs-progs: convert: replace the bytenr check with a UASSERT()

40dcedd

The bytenr sequence of all roots are controlled by our code, so if something went wrong with the sequence, it's a bug. A UASSERT() is more suitable for this case. Signed-off-by: Qu Wenruo <wqu@suse.com>

kdave force-pushed the devel branch 6 times, most recently from ef43ce6 to 3eff852 Compare May 30, 2025 14:14

kdave added this to the v6.15 milestone May 30, 2025

kdave closed this May 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

btrfs-progs: convert: fix a long bug that bgt feature never works #988

btrfs-progs: convert: fix a long bug that bgt feature never works #988

Uh oh!

adam900710 commented May 24, 2025

Uh oh!

kdave commented May 30, 2025

Uh oh!

Uh oh!

btrfs-progs: convert: fix a long bug that bgt feature never works #988

btrfs-progs: convert: fix a long bug that bgt feature never works #988

Uh oh!

Conversation

adam900710 commented May 24, 2025

Uh oh!

kdave commented May 30, 2025

Uh oh!

Uh oh!