CVE-2022-49999
Publication date: 18/06/2025
In the Linux kernel, the following vulnerability has been resolved:

btrfs: fix space cache corruption and potential double allocations

When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
   space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
   and also marked as free in the free space tree, ranges that were
   marked as allocated twice in the extent tree, or ranges that were
   marked as free twice in the free space tree. If the latter made it
   onto disk, the next reboot would hit the BUG_ON() in
   add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
   in-memory space cache (dumped with drgn) disagreed with the free
   space tree.

All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
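
As a rough illustration of how last_byte_to_unpin is meant to act as a guard, the sketch below models the unpin-side check in plain userspace C. It is a simplified assumption built from the description in this entry, not the kernel source: toy_bytes_to_return() and its parameters are hypothetical names, and the locking used by the real code is omitted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, NOT kernel code. Given a pinned extent [start, start + len) and
 * the block group's last_byte_to_unpin snapshot, return how many bytes of the
 * extent the unpin path would hand back to the in-memory space cache.
 * Assumption: bytes at or beyond the snapshot are left to the caching thread.
 */
uint64_t toy_bytes_to_return(uint64_t start, uint64_t len,
                             uint64_t last_byte_to_unpin)
{
    uint64_t room;

    if (start >= last_byte_to_unpin)
        return 0; /* the caching thread is expected to find this range */

    room = last_byte_to_unpin - start;
    return len < room ? len : room;
}
```

In this model, a snapshot of U64_MAX returns the whole extent, which is only safe if the caching thread has not already added the same range to the space cache.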

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
   ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
   add_to_free_space_tree() adds the deleted extent back to the free
   space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
   btrfs_cache_block_group() queues up the block group to get cached.
   block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
   switch_commit_roots(). It sets block_group->last_byte_to_unpin to
   block_group->progress, which is block_group->start because the block
   group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
   were already switched, load_free_space_tree() sees the deleted extent
   as free and adds it to the space cache. It finishes caching and sets
   block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
   TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
   transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
   commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
    switch_commit_roots(). This time, the block group has already been
    cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
    btrfs_finish_extent_commit(), which calls unpin_extent_range() for
    the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
    transaction B!), so it adds the deleted extent to the space cache
    again!
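
The sequence above can be replayed in a small single-threaded userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the space cache is reduced to a short array of ranges, and the threads are replaced by calling the steps in the order listed. It is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy userspace model, NOT kernel code. Every toy_* name is hypothetical. */

#define TOY_MAX_RANGES 8

struct toy_range {
    uint64_t start;
    uint64_t len;
};

struct toy_block_group {
    uint64_t start;              /* first byte of the block group */
    uint64_t progress;           /* how far caching has advanced */
    uint64_t last_byte_to_unpin; /* snapshot of progress taken at commit */
    struct toy_range free_ranges[TOY_MAX_RANGES]; /* in-memory space cache */
    int nr_free;
};

/* Add a free range; an overlap with an existing range yields -EEXIST. */
static int toy_add_free_space(struct toy_block_group *bg,
                              uint64_t start, uint64_t len)
{
    for (int i = 0; i < bg->nr_free; i++) {
        const struct toy_range *r = &bg->free_ranges[i];

        if (start < r->start + r->len && r->start < start + len)
            return -EEXIST;
    }
    if (bg->nr_free == TOY_MAX_RANGES)
        return -ENOSPC;
    bg->free_ranges[bg->nr_free].start = start;
    bg->free_ranges[bg->nr_free].len = len;
    bg->nr_free++;
    return 0;
}

/* Steps 6 and 10: a committing transaction snapshots the caching progress. */
static void toy_switch_commit_roots(struct toy_block_group *bg)
{
    bg->last_byte_to_unpin = bg->progress;
}

/* Step 7: the caching thread sees the deleted extent as free and adds it. */
static int toy_cache_block_group(struct toy_block_group *bg,
                                 uint64_t ext_start, uint64_t ext_len)
{
    int ret = toy_add_free_space(bg, ext_start, ext_len);

    bg->progress = UINT64_MAX;
    return ret;
}

/* Step 11: unpin returns only the part below last_byte_to_unpin. */
static int toy_unpin_extent(struct toy_block_group *bg,
                            uint64_t start, uint64_t len)
{
    uint64_t room;

    if (start >= bg->last_byte_to_unpin)
        return 0;
    room = bg->last_byte_to_unpin - start;
    return toy_add_free_space(bg, start, len < room ? len : room);
}

/*
 * Replay the sequence. If transaction_b_interleaves is nonzero, transaction B
 * switches commit roots (step 10) before transaction A unpins (step 11).
 * Returns the result of transaction A's unpin.
 */
int toy_replay(int transaction_b_interleaves)
{
    struct toy_block_group bg = { .start = 1ULL << 20 };
    const uint64_t ext_start = bg.start + 4096, ext_len = 8192;

    bg.progress = bg.start;                 /* step 5: caching is queued */
    toy_switch_commit_roots(&bg);           /* step 6: transaction A */
    if (toy_cache_block_group(&bg, ext_start, ext_len)) /* step 7 */
        return -EINVAL;
    if (transaction_b_interleaves)
        toy_switch_commit_roots(&bg);       /* step 10: transaction B */
    return toy_unpin_extent(&bg, ext_start, ext_len);   /* step 11 */
}
```

In this model, toy_replay(0) returns 0 because transaction A's unpin skips the extent, while toy_replay(1) returns -EEXIST because transaction B has overwritten the snapshot and the extent is added a second time.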

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free
  space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
  and 11, then step 11 will silently re-add that space to the space
  cache as free even though it is actually allocated. Then, if that
  space is allocated *again*, the free space tree will be corrupted
  (namely, the wrong item will be deleted).
* If we don't catch this free space tree corr
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification: 14/11/2025