CVE-2021-47209
Publication date:
10/04/2024
In the Linux kernel, the following vulnerability has been resolved:

sched/fair: Prevent dead task groups from regaining cfs_rq's

Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we have
live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object,
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in the form of a #GP,
which made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") creates the preconditions for the
UAF by re-adding cfs_rq's also to task groups that have no more
running tasks, i.e. also to dead ones. His analysis, however, misses
the real root cause, and it cannot be seen from the crash backtrace
alone, as the real offender is tg_unthrottle_up() getting called via
sched_cfs_period_timer() via the timer interrupt at an inconvenient
time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs, or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'d while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

  CPU1:                                     CPU2:
    :                                       timer IRQ:
    :                                         do_sched_cfs_period_timer():
    :                                           :
    :                                           distribute_cfs_runtime():
    :                                             rcu_read_lock();
    :                                             :
    :                                             unthrottle_cfs_rq():
  sched_offline_group():                            :
    :                                               walk_tg_tree_from(…,tg_unthrottle_up,…):
    list_del_rcu(&tg->list);                          :
(1)   :                                               list_for_each_entry_rcu(child, &parent->children, siblings)
    :                                                 :
(2)   list_del_rcu(&tg->siblings);                    :
    :                                                 tg_unthrottle_up():
    unregister_fair_sched_group():                      struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
      :                                                 :
      list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);            :
      :                                                 :
      :                                                 if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3)     :                                                 list_add_leaf_cfs_rq(cfs_rq);
      :                                                 :
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification:
27/03/2025