CVE-2024-53219
Publication date:
27/12/2024
In the Linux kernel, the following vulnerability has been resolved:

virtiofs: use pages instead of pointer for kernel direct IO

When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
Modules linked in:
CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
RIP: 0010:__alloc_pages+0x2bf/0x380
......
Call Trace:

? __warn+0x8e/0x150
? __alloc_pages+0x2bf/0x380
__kmalloc_large_node+0x86/0x160
__kmalloc+0x33c/0x480
virtio_fs_enqueue_req+0x240/0x6d0
virtio_fs_wake_pending_and_unlock+0x7f/0x190
queue_request_and_unlock+0x55/0x60
fuse_simple_request+0x152/0x2b0
fuse_direct_io+0x5d2/0x8c0
fuse_file_read_iter+0x121/0x160
__kernel_read+0x151/0x2d0
kernel_read+0x45/0x50
kernel_read_file+0x1a9/0x2a0
init_module_from_file+0x6a/0xe0
idempotent_init_module+0x175/0x230
__x64_sys_finit_module+0x5d/0xb0
x64_sys_call+0x1c3/0x9e0
do_syscall_64+0x3d/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
......

---[ end trace 0000000000000000 ]---

The warning is triggered as follows:

1) The finit_module() syscall handles the module insertion and first
invokes kernel_read_file() to read the content of the module.

2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().

3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). Currently, the maximal read size for a kvec iter is
limited only by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().

4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because virtiofs needs a DMA-able address,
virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
all fuse args, copies these args into the bounce buffer, and passes the
physical address of the bounce buffer to virtiofsd. The total length of
these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter and
it triggers the warning in __alloc_pages():

if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;

5) virtio_fs_enqueue_req() retries the memory allocation in a
kworker, but the retry cannot succeed: kmalloc() always returns NULL
for the oversized request, so finit_module() hangs forever.
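
The size check in step 4 can be worked through with a small userspace sketch. It assumes 4 KiB pages and MAX_PAGE_ORDER = 10, which are common x86-64 defaults but are build-time settings in a real kernel. The `sketch_` names are hypothetical helpers written for this illustration, not kernel functions.

```c
#include <stddef.h>

/* Assumptions for illustration: 4 KiB pages and MAX_PAGE_ORDER = 10.
 * Both values are configuration dependent in a real kernel. */
#define SKETCH_PAGE_SHIFT      12
#define SKETCH_MAX_PAGE_ORDER  10

/* Smallest order such that (1 << order) pages cover `size` bytes.
 * Userspace re-implementation for illustration only. */
static unsigned int sketch_get_order(size_t size)
{
    size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
    size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Returns 1 if a physically contiguous allocation of `size` bytes would
 * need an order above the assumed maximum, 0 otherwise. */
static int sketch_exceeds_max_order(size_t size)
{
    return sketch_get_order(size) > SKETCH_MAX_PAGE_ORDER;
}
```

Under these assumptions a 10MB request spans 2560 pages and needs order 12, which is above the limit of 10 (4MB), so the allocation can never succeed no matter how often it is retried.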

One feasible solution is to limit the value of max_read for virtio-fs,
which bounds the length passed to kmalloc(). However, that would also
reduce the maximal read size for normal reads. Writes initiated from the
kernel through virtio-fs have a similar problem, and there is currently
no way to limit fc->max_write in the kernel.

So instead of limiting both max_read and max_write in the kernel,
introduce use_pages_for_kvec_io in fuse_conn and set it to true in
virtiofs. When use_pages_for_kvec_io is enabled, fuse uses pages
instead of a pointer to pass the KVEC_IO data.
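
The idea of passing pages instead of one pointer can be sketched in userspace as splitting a buffer range into per-page pieces, so that no single piece needs more than one page of contiguous memory. This is only an illustration of the concept, not the code from the fix: the 4 KiB page size is an assumption, and `struct sketch_page_desc` and `sketch_split_into_pages()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Hypothetical descriptor: one page-sized piece of a kernel buffer. */
struct sketch_page_desc {
    uintptr_t page_addr;  /* page-aligned start address */
    unsigned int offset;  /* offset of the data within the page */
    unsigned int length;  /* number of data bytes in this page */
};

/* Split [addr, addr + len) into per-page pieces.
 * Returns the number of descriptors written, or 0 if the range is empty
 * or `max` descriptors are not enough to describe it. */
static size_t sketch_split_into_pages(uintptr_t addr, size_t len,
                                      struct sketch_page_desc *out, size_t max)
{
    size_t n = 0;

    while (len > 0) {
        unsigned int off = (unsigned int)(addr & (SKETCH_PAGE_SIZE - 1));
        size_t chunk = SKETCH_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        if (n == max)
            return 0;  /* caller's descriptor array is too small */
        out[n].page_addr = addr - off;
        out[n].offset = off;
        out[n].length = (unsigned int)chunk;
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

A buffer that starts 100 bytes into a page and is 8192 bytes long yields three pieces of 3996, 4096, and 100 bytes. A 10MB buffer yields 2560 such pieces, each of which fits in a single page.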

After switching to pages for KVEC_IO data, these pages will be used for
DMA through virtio-fs. If these pages are backed by vmalloc(),
{flush|invalidate}_kernel_vmap_range() are necessary to flush or
invalidate the cache before the DMA operation. So add two new fields in
fuse_args_pages to record the base address of the vmalloc area and the
condition indicating whether invalidation is needed. Perform the flush
in fuse_get_user_pages() for write operations and the invalidation in
fuse_release_user_pages() for read operations.
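
The flush and invalidate ordering described above can be modeled with a small userspace sketch. Everything here is an assumption made for illustration: the struct, its field names, and the `sketch_` functions are hypothetical, and counters stand in for the real cache maintenance calls, which cannot run outside the kernel.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two fields the text describes. */
struct sketch_args_pages {
    void *vmap_base;      /* base of the vmalloc-backed buffer, or NULL */
    int invalidate_vmap;  /* nonzero if invalidation is needed on release */
};

/* Counters replace the real cache maintenance calls in this sketch. */
static int sketch_flush_calls;
static int sketch_invalidate_calls;

/* Before DMA: a write flushes immediately; a read defers the
 * invalidation until the pages are released. */
static void sketch_get_pages(struct sketch_args_pages *ap, void *buf,
                             int is_vmalloc, int is_write)
{
    ap->vmap_base = NULL;
    ap->invalidate_vmap = 0;
    if (!is_vmalloc)
        return;  /* no vmap alias, so no extra cache maintenance */
    ap->vmap_base = buf;
    if (is_write)
        sketch_flush_calls++;       /* stands in for flush_kernel_vmap_range() */
    else
        ap->invalidate_vmap = 1;
}

/* After DMA: invalidate only if the read path requested it. */
static void sketch_release_pages(struct sketch_args_pages *ap)
{
    if (ap->vmap_base && ap->invalidate_vmap)
        sketch_invalidate_calls++;  /* stands in for invalidate_kernel_vmap_range() */
}
```

In this model a read from a vmalloc-backed buffer triggers exactly one invalidation on release, a write triggers exactly one flush up front, and a buffer that is not vmalloc-backed triggers neither.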

It may seem necessary to introduce another fie
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification:
27/12/2024