CVE-2026-46223
Publication date:
28/05/2026
In the Linux kernel, the following vulnerability has been resolved:

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.

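The circular wait described above can be pictured as a cycle in a wait-for graph. The following is a minimal Python sketch of that reasoning, not kernel code: the event names (`rmdir_returns`, `dying_pids_freed`, `pid1_reaps_zombie`) and the helper `find_cycle` are hypothetical labels chosen for illustration, and the model collapses the pidns details into three events.

```python
def find_cycle(waits_for, start):
    """Follow 'event X cannot happen until event Y' edges from start.

    Returns the path if it leads back to start (a self-deadlock), else None.
    """
    path = [start]
    seen = {start}
    node = waits_for.get(start)
    while node is not None:
        path.append(node)
        if node == start:
            return path
        if node in seen:
            return None  # some other loop, not one through start
        seen.add(node)
        node = waits_for.get(node)
    return None


# Toy model WITH the cgroup_drain_dying() wait from [3].
with_wait = {
    "rmdir_returns": "dying_pids_freed",      # rmdir sleeps until the pids free
    "dying_pids_freed": "pid1_reaps_zombie",  # the pids free only once reaped
    "pid1_reaps_zombie": "rmdir_returns",     # PID 1 is stuck inside rmdir
}

# Toy model WITHOUT the wait: rmdir no longer depends on the dying pids.
without_wait = {
    "dying_pids_freed": "pid1_reaps_zombie",
}

deadlock = find_cycle(with_wait, "rmdir_returns")
no_deadlock = find_cycle(without_wait, "rmdir_returns")
```

In this toy graph the cycle exists only because of the first edge, which matches the observation that the wait itself, not any lock ordering, is the problem.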
The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.

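The ordering the fix aims for can be sketched as a small state machine. This is an illustrative Python model under stated assumptions, not the kernel implementation: `ToyCgroup`, its fields, and its method names are hypothetical, and the `"css_offline ran"` event merely stands in for the asynchronous chain that ends in ->css_offline().

```python
class ToyCgroup:
    """Toy model: rmdir returns at once; offline runs only once depopulated."""

    def __init__(self, dying_tasks):
        # Tasks past exit_signals() that are still linked to the csets.
        self.dying_tasks = set(dying_tasks)
        self.rmdir_done = False   # user-visible removal has completed
        self.offlined = False     # stands in for ->css_offline() having run
        self.events = []

    def rmdir(self):
        # Never sleeps waiting for dying tasks; just records the removal.
        self.rmdir_done = True
        self.events.append("rmdir returned")
        self._maybe_start_kill()

    def task_dead(self, task):
        # A task drops its link after its final context switch.
        self.dying_tasks.discard(task)
        self.events.append(f"{task} left")
        self._maybe_start_kill()

    def _maybe_start_kill(self):
        # Start the kill chain only when removed AND fully drained.
        if self.rmdir_done and not self.dying_tasks and not self.offlined:
            self.offlined = True
            self.events.append("css_offline ran")


cg = ToyCgroup(["dying-task"])
cg.rmdir()
offlined_right_after_rmdir = cg.offlined
cg.task_dead("dying-task")
```

The point of the sketch is that the caller of `rmdir()` is never blocked, yet the offline step still happens strictly after the last task has left.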
Verified with the original reproducer (pidns teardown + zombie reaper, run
under vng), which hangs on a vanilla kernel and succeeds with this patch,
and with per-commit deterministic repros for [2], [3], [4], and [5] using a
boot parameter that widens the post-exit_signals() window so each state is
reliably reachable. Some stress tests were run on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

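The difference between calling the two phases back-to-back and deferring the second one can be sketched as follows. This is a hedged Python toy, not the kernel code: only the names kill_css_sync() and kill_css_finish() come from the description above, while what each phase does internally, the dictionary layout, and the helpers `new_css`, `disable_back_to_back`, `disable_deferred`, and `task_left` are assumptions made for illustration.

```python
def new_css(dying_tasks):
    # linked_at_offline records how many tasks were still linked when the
    # offline chain was started (None means it has not started yet).
    return {"dying_tasks": list(dying_tasks), "killed": False,
            "linked_at_offline": None}


def kill_css_sync(css):
    # Assumed phase 1: the synchronous part done in the caller's context.
    css["killed"] = True


def kill_css_finish(css):
    # Assumed phase 2: starts the chain that ends in ->css_offline().
    css["linked_at_offline"] = len(css["dying_tasks"])


def disable_back_to_back(css):
    # Behaviour preserved at the subtree_control call site: no gap.
    kill_css_sync(css)
    kill_css_finish(css)


def disable_deferred(css):
    # Hypothetical shape of the follow-up: finish now only if already drained.
    kill_css_sync(css)
    if not css["dying_tasks"]:
        kill_css_finish(css)


def task_left(css, task):
    # Hypothetical per-css trigger: the last task to leave starts phase 2.
    css["dying_tasks"].remove(task)
    if css["killed"] and not css["dying_tasks"] and css["linked_at_offline"] is None:
        kill_css_finish(css)


racy = new_css(["t1", "t2"])
disable_back_to_back(racy)

deferred = new_css(["t1", "t2"])
disable_deferred(deferred)
pending = deferred["linked_at_offline"]
task_left(deferred, "t1")
task_left(deferred, "t2")
```

In the back-to-back case the toy offline step fires with two tasks still linked, which is the race the description says remains at that call site; in the deferred case it fires with none.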
This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
wasn't actually broken (ordered cgroup_offline_wq + queue_work order
in cgroup_task_dead() saved it) but the explicit ref removes the
dependency on those non-obvious invariants. Also note the
pre-existing cgroup_apply_control_disable() race in the description;
a follow-up will defer kill_css_finish() there.
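The v2 change follows the common pattern of pinning an object with a reference for as long as deferred work that uses it is queued. A minimal Python sketch of that pattern follows, assuming a simple integer refcount and a FIFO queue: all `toy_*` names are hypothetical stand-ins and do not reproduce the signatures of cgroup_get(), cgroup_put(), or queue_work().

```python
from collections import deque


class ToyCgrp:
    def __init__(self):
        self.refcnt = 1        # base reference held by the creator
        self.freed = False
        self.destroyed = False


def toy_get(cgrp):
    cgrp.refcnt += 1


def toy_put(cgrp):
    cgrp.refcnt -= 1
    if cgrp.refcnt == 0:
        cgrp.freed = True      # last reference gone, object released


toy_workqueue = deque()


def toy_queue_destroy_work(cgrp):
    toy_get(cgrp)              # pin before queueing the work
    toy_workqueue.append(cgrp)


def toy_destroy_work_fn(cgrp):
    assert not cgrp.freed, "work ran on a freed object"
    cgrp.destroyed = True
    toy_put(cgrp)              # unpin once the work is done


cg = ToyCgrp()
toy_queue_destroy_work(cg)
toy_put(cg)                    # the base reference is dropped early
alive_while_queued = not cg.freed
while toy_workqueue:
    toy_destroy_work_fn(toy_workqueue.popleft())
```

With the explicit pin, the object stays alive while the work is pending regardless of the order in which other references are dropped, which is the sense in which the explicit ref removes the dependency on ordering invariants.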
Severity: Pending analysis
Last modified:
28/05/2026