CVE-2026-46223

Published: May 28, 2026 Last Modified: May 28, 2026

Description

In the Linux kernel, the following vulnerability has been resolved:

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ('cgroup: Defer task cgroup unlink until after the task is done switching out')
[2] a72f73c4dd9b ('cgroup: Don't expose dead tasks in cgroup')
[3] 1b164b876c36 ('cgroup: Wait for dying tasks to leave on rmdir')
[4] 4c56a8ac6869 ('cgroup: Fix cgroup_drain_dying() testing the wrong condition')
[5] 13e786b64bd3 ('cgroup: Increment nr_dying_subsys_* from rmdir context')

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.
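
The cycle described above can be written down as a tiny wait-for model. This is only an illustrative userspace sketch: the actor indices and the helper model_waits_on_itself() are hypothetical and do not correspond to any kernel data structure. Each actor is blocked on at most one other actor, and the helper follows that chain to see whether it leads back to the starting actor.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker: the actor is not blocked on anything. */
enum { MODEL_NONE = -1 };

/*
 * waits_for[i] is the index of the actor that actor i is blocked on,
 * or MODEL_NONE. Returns true if following the chain from 'start'
 * leads back to 'start', i.e. the actor is indirectly waiting on itself.
 */
static bool model_waits_on_itself(const int *waits_for, int n, int start)
{
    int cur = start;

    assert(start >= 0 && start < n);
    for (int steps = 0; steps < n; steps++) {
        cur = waits_for[cur];
        if (cur == MODEL_NONE)
            return false;   /* the chain ends, someone can make progress */
        assert(cur >= 0 && cur < n);
        if (cur == start)
            return true;    /* the chain closed on the starting actor */
    }
    return false;           /* a cycle may exist, but 'start' is not on it */
}
```

With actor 0 standing for PID 1 inside rmdir and actor 1 for the zombie pids, the blocking wait corresponds to waits_for = {1, 0}, which the helper reports as a self-wait. Once rmdir no longer waits, actor 0 maps to MODEL_NONE and the chain ends.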

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.
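
A minimal userspace sketch of that ordering, under stated assumptions: struct model_cgroup and the model_*() helpers are invented for illustration and are not the kernel's types or functions. The kill_started flag merely stands in for the point where percpu_ref_kill_and_confirm() would be called. The sketch shows rmdir returning without sleeping, and the kill chain starting only when the last linked task goes away.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cgroup; not the kernel's struct cgroup. */
struct model_cgroup {
    int  nr_tasks;      /* tasks still linked, including exiting ones */
    bool rmdir_done;    /* rmdir(2) has already returned to userspace */
    bool kill_started;  /* stands in for percpu_ref_kill_and_confirm() */
};

/* Start the asynchronous offline chain; it must start at most once. */
static void model_start_css_kill(struct model_cgroup *cg)
{
    assert(cg->nr_tasks == 0);  /* never start while tasks remain */
    cg->kill_started = true;
}

/* rmdir never sleeps: it marks the cgroup removed and returns. */
static void model_rmdir(struct model_cgroup *cg)
{
    cg->rmdir_done = true;
    if (cg->nr_tasks == 0)
        model_start_css_kill(cg);  /* already empty, nothing to defer */
}

/* Called when a task's final context switch unlinks it from the cgroup. */
static void model_task_dead(struct model_cgroup *cg)
{
    assert(cg->nr_tasks > 0);
    cg->nr_tasks--;
    if (cg->rmdir_done && cg->nr_tasks == 0 && !cg->kill_started)
        model_start_css_kill(cg);  /* last task gone, run the deferred kill */
}
```

In this model, calling model_rmdir() on a cgroup with two lingering tasks returns immediately with kill_started still false; the flag flips only on the second model_task_dead() call.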

Verified with the original reproducer (pidns teardown + zombie reaper, run
under vng), which hangs on vanilla and succeeds with this patch, and with
per-commit deterministic repros for [2], [3], [4], [5] using a boot
parameter that widens the post-exit_signals() window so each state is
reliably reachable. Some stress tests were also run on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
wasn't actually broken (ordered cgroup_offline_wq + queue_work order
in cgroup_task_dead() saved it) but the explicit ref removes the
dependency on those non-obvious invariants. Also note the
pre-existing cgroup_apply_control_disable() race in the description;
a follow-up will defer kill_css_finish() there.
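
The v2 pinning pattern can be sketched in userspace as follows. This is an assumption-laden model, not the patch itself: model_get()/model_put() stand in for cgroup_get()/cgroup_put(), the single-slot struct model_work stands in for a queued work item, and the freed flag replaces real memory release so the object can still be inspected afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy refcounted object; 'freed' is a flag instead of a real free(). */
struct model_cgroup {
    int  refcnt;
    bool freed;
    bool destroy_ran;
};

static void model_get(struct model_cgroup *cg)
{
    assert(!cg->freed);
    cg->refcnt++;
}

static void model_put(struct model_cgroup *cg)
{
    assert(cg->refcnt > 0);
    if (--cg->refcnt == 0)
        cg->freed = true;
}

/* Hypothetical single-slot stand-in for a work item on a workqueue. */
struct model_work {
    struct model_cgroup *cg;
    bool pending;
};

/* Take a reference before queueing so the object outlives the queue. */
static void model_queue_destroy(struct model_work *w, struct model_cgroup *cg)
{
    model_get(cg);
    w->cg = cg;
    w->pending = true;
}

/* The work function drops the reference taken at queue time. */
static void model_run_work(struct model_work *w)
{
    if (!w->pending)
        return;
    w->pending = false;
    assert(!w->cg->freed);  /* guaranteed by the pin, not by queue order */
    w->cg->destroy_ran = true;
    model_put(w->cg);
}
```

Even if the caller drops its own reference between queueing and execution, the object stays alive until model_run_work() has finished with it, so correctness no longer depends on how the queue happens to be ordered.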


https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0

https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b
