CVE-2023-52934
Publication date: 27/03/2025
In the Linux kernel, the following vulnerability has been resolved:

mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

In commit 34488399fa08 ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we made the following change to find_pmd_or_thp_or_none():

-       if (!pmd_present(pmde))
-               return SCAN_PMD_NULL;
+       if (pmd_none(pmde))
+               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where
MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
khugepaged race in, free the pte table, and clear the pmd. Such codepaths
include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to identify that case separately
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the
existing pmd entry is either a genuine pte-table-mapping-pmd, or the
none-pmd.

However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds. Swap entries are one example that could leak
through. The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
their treatment by various architectures.

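The hole and the intended check order can be sketched in a small userspace model. This is an illustration only, not the kernel function: `struct pmd_model` and both `classify_*` names are hypothetical, each boolean stands in for the result of the matching kernel predicate on a single pmd value, and the relative order of the huge and bad checks as well as the `SCAN_PMD_MAPPED` and `SCAN_SUCCEED` result names are assumptions not stated in the text above. A swap-like entry (not none, not present, not huge, not bad) is accepted as a pte table by the first classifier and rejected by the second.

```c
#include <stdbool.h>

/* Userspace model only: each field is the assumed result of the
 * corresponding kernel predicate evaluated on one pmd value. */
struct pmd_model {
    bool none;        /* pmd_none() */
    bool present;     /* pmd_present() */
    bool trans_huge;  /* pmd_trans_huge() */
    bool bad;         /* pmd_bad() */
};

enum scan_result {
    SCAN_SUCCEED,     /* treated as a genuine pte-table-mapping pmd */
    SCAN_PMD_NULL,    /* no usable pmd: failure path */
    SCAN_PMD_NONE,    /* the none-pmd */
    SCAN_PMD_MAPPED   /* treated as a huge pmd */
};

/* Check order with only the pmd_none() test: anything that is
 * !none && !huge && !bad falls through to SCAN_SUCCEED. */
enum scan_result classify_without_present_check(struct pmd_model p)
{
    if (p.none)
        return SCAN_PMD_NONE;
    if (p.trans_huge)
        return SCAN_PMD_MAPPED;
    if (p.bad)
        return SCAN_PMD_NULL;
    return SCAN_SUCCEED;
}

/* Same order with the !pmd_present() test put back directly below
 * the pmd_none() test, as the text describes. */
enum scan_result classify_with_present_check(struct pmd_model p)
{
    if (p.none)
        return SCAN_PMD_NONE;
    if (!p.present)
        return SCAN_PMD_NULL;
    if (p.trans_huge)
        return SCAN_PMD_MAPPED;
    if (p.bad)
        return SCAN_PMD_NULL;
    return SCAN_SUCCEED;
}
```
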
The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present() and
pmd_trans_huge() still need to return true while the pmd is in this
transitory state. For example, x86's pmd_present() also checks the
_PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
also checks a PMD_PRESENT_INVALID bit.

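A minimal userspace sketch of the x86 behaviour described above, assuming three flag bits whose positions are made up for this model (they are not copied from kernel headers) and ignoring devmap. The `M_*` macros and `model_*` functions are hypothetical names. It shows that a huge pmd whose PRESENT bit has been cleared during a split still reads as present and huge, because the PSE bit is consulted as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions for this model only. */
#define M_PRESENT  (UINT64_C(1) << 0)
#define M_PSE      (UINT64_C(1) << 1)
#define M_PROTNONE (UINT64_C(1) << 2)

/* Modelled pmd_present(): true if PRESENT, PROTNONE or PSE is set. */
bool model_pmd_present(uint64_t v)
{
    return (v & (M_PRESENT | M_PROTNONE | M_PSE)) != 0;
}

/* Modelled pmd_trans_huge(): true if PSE is set (devmap ignored). */
bool model_pmd_trans_huge(uint64_t v)
{
    return (v & M_PSE) != 0;
}

/* Modelled transitory state during a split: PRESENT is cleared,
 * every other bit is left untouched. */
uint64_t model_split_transient(uint64_t v)
{
    return v & ~M_PRESENT;
}
```
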
Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page(). The danger here
      is that we proceed as if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd. So, what is the risk of this
      danger?

      The only relevant path is:

        madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed. This is fine, since split could
      happen immediately after an (actually) successful
      madvise_collapse(). So, it should be safe to just assume huge-pmd
      here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap. This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
   Not possible.

4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present

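The four cases can be checked exhaustively in a small userspace model. This is a sketch under stated assumptions: the three `BIT_*` flags use made-up positions, devmap is ignored, the predicates follow the x86 description in this text (present if any of PRESENT, PROTNONE or PSE is set; huge if PSE is set), and all `sketch_*` names are hypothetical. Counting the eight possible flag combinations per case shows that, in this model, no combination lands in case 3.

```c
#include <stdbool.h>

/* Illustrative bit positions for this model only. */
enum { BIT_PRESENT = 1 << 0, BIT_PSE = 1 << 1, BIT_PROTNONE = 1 << 2 };

/* Modelled pmd_present(): any of the three bits set. */
bool sketch_present(unsigned v)
{
    return (v & (BIT_PRESENT | BIT_PROTNONE | BIT_PSE)) != 0;
}

/* Modelled pmd_trans_huge(): PSE set (devmap ignored). */
bool sketch_trans_huge(unsigned v)
{
    return (v & BIT_PSE) != 0;
}

/* Map one modelled pmd value to the case number (1..4) of the list above. */
int sketch_case(unsigned v)
{
    bool present = sketch_present(v);
    bool huge = sketch_trans_huge(v);

    if (present && huge)
        return 1;
    if (present && !huge)
        return 2;
    if (!present && huge)
        return 3;
    return 4;
}

/* Count how many of the 8 flag combinations fall into the given case. */
int sketch_count_case(int which)
{
    int n = 0;

    for (unsigned v = 0; v < 8; v++)
        if (sketch_case(v) == which)
            n++;
    return n;
}
```
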
I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since
!pmd_present() always takes the failure path).

Also, add a comment above find_pmd_or_thp_or_none()
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification: 28/10/2025