CVE-2024-40927
Publication date:
12/07/2024
In the Linux kernel, the following vulnerability has been resolved:

xhci: Handle TD clearing for multiple streams case

When multiple streams are in use, multiple TDs might be in flight when
an endpoint is stopped. We need to issue a Set TR Dequeue Pointer for
each, to ensure everything is reset properly and the caches cleared.
Change the logic so that any N>1 TDs found active for different streams
are deferred until after the first one is processed, calling
xhci_invalidate_cancelled_tds() again from xhci_handle_cmd_set_deq() to
queue another command until we are done with all of them. Also change
the error/"should never happen" paths to ensure we at least clear any
affected TDs, even if we can't issue a command to clear the hardware
cache, and complain loudly with an xhci_warn() if this ever happens.

This problem case dates back to commit e9df17eb1408 ("USB: xhci: Correct
assumptions about number of rings per endpoint.") early on in the XHCI
driver's life, when stream support was first added.
It was then identified but not fixed nor made into a warning in commit
674f8438c121 ("xhci: split handling halted endpoints into two steps"),
which added a FIXME comment for the problem case (without materially
changing the behavior as far as I can tell, though the new logic made
the problem more obvious).

Then later, in commit 94f339147fc3 ("xhci: Fix failure to give back some
cached cancelled URBs."), it was acknowledged again.

[Mathias: commit 94f339147fc3 ("xhci: Fix failure to give back some cached
cancelled URBs.") was a targeted regression fix to the previously mentioned
patch. Users reported issues with usb stuck after unmounting/disconnecting
UAS devices. This rolled back the TD clearing of multiple streams to its
original state.]

Apparently the commit author was aware of the problem (yet still chose
to submit it): It was still mentioned as a FIXME, an xhci_dbg() was
added to log the problem condition, and the remaining issue was mentioned
in the commit description. The choice of making the log type xhci_dbg()
for what is, at this point, a completely unhandled and known broken
condition is puzzling and unfortunate, as it guarantees that no actual
users would see the log in production, thereby making it nigh
undebuggable (indeed, even if you turn on DEBUG, the message doesn't
really hint at there being a problem at all).

It took me *months* of random xHC crashes to finally find a reliable
repro and be able to do a deep dive debug session, which could all have
been avoided had this unhandled, broken condition been actually reported
with a warning, as it should have been as a bug intentionally left in
unfixed (never mind that it shouldn't have been left in at all).

> Another fix to solve clearing the caches of all stream rings with
> cancelled TDs is needed, but not as urgent.

3 years after that statement and 14 years after the original bug was
introduced, I think it's finally time to fix it. And maybe next time
let's not leave bugs unfixed (that are actually worse than the original
bug), and let's actually get people to review kernel commits please.

Fixes xHC crashes and IOMMU faults with UAS devices when handling
errors/faults. Easiest repro is to use `hdparm` to mark an early sector
(e.g. 1024) on a disk as bad, then `cat /dev/sdX > /dev/null` in a loop.
At least in the case of JMicron controllers, the read errors end up
having to cancel two TDs (for two queued requests to different streams)
and the one that didn't get cleared properly ends up faulting the xHC
entirely when it tries to access DMA pages that have since been unmapped,
referred to by the stale TDs. This normally happens quickly (after two
or three loops). After this fix, I left the `cat` in a loop running
overnight and experienced no xHC failures, with all read errors
recovered properly. Repro'd and tested on an Apple M1 Mac Mini
(dwc3 host).

On systems without an IOMMU, this bug would instead silently corrupt
freed memory, making this a
---truncated---
Severity CVSS v4.0: Pending analysis
Last modification:
03/11/2025