Instituto Nacional de Ciberseguridad (INCIBE). INCIBE-CERT section

CVE-2026-46223

Severity:
Pending analysis
Type:
Not available / Other type
Publication date:
28/05/2026
Last modified:
28/05/2026

Description

In the Linux kernel, the following vulnerability has been resolved:

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead.

v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.
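
The ordering described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not kernel code: the class ToyCgroup, its methods, and run_pending() are hypothetical names invented for illustration, and the single queued callback stands in for the whole percpu_ref_kill_and_confirm() / css_killed_work_fn() chain. It shows rmdir returning without waiting once no live task is visible, while the offline step is queued only after the last lingering task has left.

```python
from collections import deque


class ToyCgroup:
    """Hypothetical toy model; names are illustrative, not kernel APIs."""

    def __init__(self, work_queue):
        self.work_queue = work_queue
        self.tasks = set()        # every task still linked, exiting or not
        self.exiting = set()      # tasks past the exit point but still linked
        self.dying = False        # set once rmdir has succeeded
        self.offline_queued = False
        self.offlined = False

    def attach(self, task):
        self.tasks.add(task)

    def mark_exiting(self, task):
        self.exiting.add(task)

    def procs(self):
        # User-visible listing: exiting tasks are filtered out, as in [2].
        return sorted(self.tasks - self.exiting)

    def rmdir(self):
        # Refuse while user-visible tasks remain; otherwise return at once
        # instead of sleeping until the exiting tasks are gone.
        if self.procs():
            raise OSError("cgroup still has live tasks")
        self.dying = True
        self._maybe_queue_offline()

    def task_dead(self, task):
        # Called when a task finally unlinks from the cgroup.
        self.tasks.discard(task)
        self.exiting.discard(task)
        self._maybe_queue_offline()

    def _maybe_queue_offline(self):
        # The offline work starts only when the cgroup is dying AND empty.
        if self.dying and not self.tasks and not self.offline_queued:
            self.offline_queued = True
            self.work_queue.append(self._offline)

    def _offline(self):
        assert not self.tasks, "offline must not run with tasks still linked"
        self.offlined = True


def run_pending(work_queue):
    # Stand-in for an asynchronous worker draining queued callbacks.
    while work_queue:
        work_queue.popleft()()


def demo():
    wq = deque()
    cg = ToyCgroup(wq)
    cg.attach("t1")
    cg.mark_exiting("t1")   # t1 lingers but is hidden from procs()
    cg.rmdir()              # returns immediately, nothing queued yet
    run_pending(wq)
    before = cg.offlined
    cg.task_dead("t1")      # last task leaves, offline work is queued
    run_pending(wq)
    after = cg.offlined
    return before, after
```

In this model the caller of rmdir() never blocks, so it stays free to do other work such as reaping the lingering task, which is the property the synchronous wait in [3] lacked.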

Impact