CVE-2025-38242
Publication date: 09/07/2025
In the Linux kernel, the following vulnerability has been resolved:<br />
<br />
mm: userfaultfd: fix race of userfaultfd_move and swap cache<br />
<br />
This commit fixes two kinds of races, which may have different results:<br />
<br />
Barry reported a BUG_ON in commit c50f8e6053b0. The same BUG_ON may be<br />
hit if the filemap lookup returns NULL and a folio is added to the swap<br />
cache after that.<br />
<br />
If the other kind of race is triggered (the folio changes after the lookup),<br />
the RSS counters may be corrupted:<br />
<br />
[ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1<br />
[ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1<br />
<br />
This happens because the folio is accounted to the wrong VMA.<br />
<br />
I'm not sure whether there will be any data corruption, though it seems not.<br />
The issues above are critical already.<br />
<br />
<br />
On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache<br />
lookup, and tries to move the found folio to the faulting vma. Currently,<br />
it relies on checking the PTE value to ensure that the moved folio still<br />
belongs to the src swap entry and that no new folio has been added to the<br />
swap cache, which turns out to be unreliable.<br />
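<br />
A minimal sketch of why a value-only PTE comparison cannot detect such changes. The type toy_pte, the constants and the helper names are illustrative assumptions, not kernel definitions. If the PTE leaves S1 and later returns to S1, the comparison against the earlier snapshot still passes:<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy type; not the kernel's pte_t or swp_entry_t. */
typedef struct { uint64_t val; } toy_pte;

/* Value-only stability check: compares the current PTE with the copy
 * taken before the lockless lookup. */
static bool toy_pte_unchanged(const toy_pte *now, toy_pte snapshot)
{
    return now->val == snapshot.val;
}

/* Replays an S1 -> present -> S1 sequence on one PTE slot and reports
 * whether the value-only check fails to notice it. */
static bool toy_aba_goes_unnoticed(void)
{
    const uint64_t SWAP_S1 = 0x1000;   /* stands for swap entry S1 */
    const uint64_t PRESENT_A = 0x2001; /* stands for a mapping of folio A */

    toy_pte pte = { .val = SWAP_S1 };
    toy_pte snapshot = pte;            /* reader samples the PTE */

    pte.val = PRESENT_A;               /* swap-in: S1 is released */
    assert(!toy_pte_unchanged(&pte, snapshot)); /* visible only right now */

    pte.val = SWAP_S1;                 /* swap-out: S1 is reused */

    return toy_pte_unchanged(&pte, snapshot);
}
```

In this toy model toy_aba_goes_unnoticed() returns true: the check passes although the slot held a different value in between, so anything looked up during that window may be stale.<br />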
<br />
While working on and reviewing the swap table series with Barry, the following<br />
existing races were observed and reproduced [1]:<br />
<br />
In the example below, move_pages_pte is moving src_pte to dst_pte, where<br />
src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the<br />
swap cache:<br />
<br />
CPU1: userfaultfd_move<br />
CPU1: move_pages_pte()<br />
CPU1: entry = pte_to_swp_entry(orig_src_pte);<br />
CPU1: // Here it got entry = S1<br />
CPU1: ... ...<br />
<br />
CPU2: // folio A is a newly allocated folio<br />
CPU2: // and gets installed into src_pte<br />
<br />
CPU2: // src_pte now points to folio A, S1<br />
CPU2: // has swap count == 0, it can be freed<br />
CPU2: // by folio_free_swap or swap<br />
CPU2: // allocator's reclaim.<br />
<br />
CPU2: // folio B is a folio in another VMA.<br />
<br />
CPU2: // S1 is freed, folio B can use it<br />
CPU2: // for swap out with no problem.<br />
CPU2: ...<br />
CPU1: folio = filemap_get_folio(S1)<br />
CPU1: // Got folio B here !!!<br />
CPU1: ... ...<br />
<br />
CPU2: // Now S1 is free to be used again.<br />
<br />
CPU2: // Now src_pte is a swap entry PTE<br />
CPU2: // holding S1 again.<br />
CPU1: folio_trylock(folio)<br />
CPU1: move_swap_pte<br />
CPU1: double_pt_lock<br />
CPU1: is_pte_pages_stable<br />
CPU1: // Check passed because src_pte == S1<br />
CPU1: folio_move_anon_rmap(...)<br />
CPU1: // Moved invalid folio B here !!!<br />
<br />
The race window is very short and requires several rare events to<br />
coincide, so it is very unlikely to happen in practice. With a deliberately<br />
constructed reproducer and a widened time window, however, it can be<br />
reproduced easily.<br />
<br />
This can be fixed by checking, after acquiring the folio lock, whether the<br />
folio returned by filemap is still the valid swap cache folio for the entry.<br />
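<br />
The idea can be sketched in a single-threaded userspace toy model that replays the interleaving above. Everything named toy_* or TOY_* is a hypothetical stand-in, not a kernel type or API, and the model assumes that folio A sits in the swap cache after being swapped out again. The sketch only illustrates the shape of the check (the folio must still be in the swap cache and still back the entry read from src_pte); it is not the actual patch:<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these are kernel types or APIs. */
struct toy_folio {
    bool in_swapcache;       /* stands in for the folio's swap cache flag */
    unsigned long swap_val;  /* swap entry the folio currently backs */
    int owner_vma;           /* toy id of the VMA the folio belongs to */
};

#define TOY_PRESENT 0UL      /* src_pte maps a page, no swap entry */
#define TOY_S1      1UL      /* stands for swap entry S1 */
#define TOY_SLOTS   2UL

static void toy_cache_add(struct toy_folio **cache, unsigned long entry,
                          struct toy_folio *folio)
{
    assert(entry < TOY_SLOTS && cache[entry] == NULL);
    cache[entry] = folio;
    folio->in_swapcache = true;
    folio->swap_val = entry;
}

static void toy_cache_del(struct toy_folio **cache, unsigned long entry)
{
    assert(entry < TOY_SLOTS && cache[entry] != NULL);
    cache[entry]->in_swapcache = false;
    cache[entry] = NULL;
}

/* The added check: is this folio still the swap cache folio of `entry`? */
static bool toy_folio_matches(const struct toy_folio *folio,
                              unsigned long entry)
{
    return folio->in_swapcache && folio->swap_val == entry;
}

/*
 * Replays the interleaving in program order. Returns the owner VMA of the
 * folio that would be moved, or -1 if the move is refused so the caller
 * can retry.
 */
static int toy_replay(bool revalidate_folio)
{
    struct toy_folio *cache[TOY_SLOTS] = { NULL };
    struct toy_folio a = { .owner_vma = 1 };  /* backs src_pte */
    struct toy_folio b = { .owner_vma = 2 };  /* lives in another VMA */
    unsigned long src_pte = TOY_S1;

    unsigned long entry = src_pte;            /* CPU1: entry = S1 */

    src_pte = TOY_PRESENT;                    /* CPU2: folio A swapped in, S1 freed */
    toy_cache_add(cache, TOY_S1, &b);         /* CPU2: folio B swapped out using S1 */

    struct toy_folio *folio = cache[entry];   /* CPU1: lockless lookup finds B */

    toy_cache_del(cache, TOY_S1);             /* CPU2: folio B swapped in, S1 freed */
    toy_cache_add(cache, TOY_S1, &a);         /* CPU2: folio A swapped out using S1 */
    src_pte = TOY_S1;                         /* src_pte holds S1 again */

    /* CPU1: the folio is considered locked from here on. */
    if (src_pte != entry)                     /* existing PTE check: passes */
        return -1;
    if (revalidate_folio && !toy_folio_matches(folio, entry))
        return -1;                            /* stale folio B is rejected */
    return folio->owner_vma;
}
```

In this model toy_replay(false) returns 2, meaning folio B from the other VMA would be moved, while toy_replay(true) returns -1 because folio B no longer backs S1 once the lock is held.<br />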
<br />
Another similar race is possible: filemap_get_folio may return NULL, but<br />
folio (A) could be swapped in and then swapped out again using the same<br />
swap entry after the lookup. In such a case, folio (A) may remain in the<br />
swap cache, so it must be moved too:<br />
<br />
CPU1: userfaultfd_move<br />
CPU1: move_pages_pte()<br />
CPU1: entry = pte_to_swp_entry(orig_src_pte);<br />
CPU1: // Here it got entry = S1, and S1 is not in swap cache<br />
CPU1: folio = filemap_get<br />
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification: 09/07/2025