Oh I've debugged this before. Native memory allocator had a scavenge function which suspended all other threads. Managed language runtime had a stop the world phase which suspended all mutator threads. They ran at about the same time and ended up suspending each other. To fix this you need to enforce some sort of hierarchy or mutual exclusion for suspension requests.
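A minimal sketch of that fix (C with pthreads; the suspend/resume hooks are hypothetical): every runtime routes its stop-the-world through a single process-wide lock, taken before the first thread is suspended and released only after the last one is resumed, so the two suspension phases can never interleave.

    #include <pthread.h>

    /* Hypothetical per-runtime hooks. */
    void suspend_all_other_threads(void);
    void resume_all_other_threads(void);

    /* One process-wide lock shared by every component that suspends
       threads. Held across the whole suspend/resume window, so two
       stop-the-world phases can never suspend each other midway. */
    static pthread_mutex_t g_suspend_lock = PTHREAD_MUTEX_INITIALIZER;

    void stop_the_world(void) {
        pthread_mutex_lock(&g_suspend_lock);
        suspend_all_other_threads();
    }

    void restart_the_world(void) {
        resume_all_other_threads();
        pthread_mutex_unlock(&g_suspend_lock);
    }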
> Why you should never suspend a thread in your own process.
This sounds like a good general principle, but suspending threads in your own process is kind of necessary for e.g. many GC algorithms. Now imagine multiple of those runtimes running in the same process.
> suspending threads in your own process is kind of necessary for e.g. many GC algorithms
I think this is typically done by having the compiler/runtime insert safepoints, which cooperatively yield at specified points to allow the GC to run without mutator threads being active. Done correctly, this shouldn't be subject to the problem the original post highlighted, because it doesn't rely on the OS's ability to suspend threads when they aren't expecting it.
This is a good approach but can be tricky. E.g. what if your thread spends a lot of time in a tight loop, e.g. doing a big inlined matmul kernel? Since you never hit a function call you don't get safepoints that way -- you can add them to the back-edge of every loop, but that can be a bit unappetizing from a performance perspective.
If you don’t create any GC-able objects in the loop, why would you need to call the GC? And if you are, that should involve a function call.
And if you do need to call the GC, you could manually insert function calls every x loop iterations.
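A minimal sketch of that amortized poll (C11 atomics; safepoint_poll and the flag name are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Set by the GC thread when it wants all mutators parked. */
    static atomic_bool g_safepoint_requested = false;

    void safepoint_poll(void);  /* illustrative: parks until the GC finishes */

    void big_matmul_kernel(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n * n; i++) {
            c[i] += a[i] * b[i];  /* stand-in for the real inner work */
            /* Manual back-edge safepoint, amortized over 4096 iterations
               so the check is nearly free in the hot path. */
            if ((i & 4095) == 0 &&
                atomic_load_explicit(&g_safepoint_requested,
                                     memory_order_acquire))
                safepoint_poll();
        }
    }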
> suspending threads in your own process is kind of necessary for e.g. many GC algorithms
True. Maybe the more precise rule is “only suspend threads for a short amount of time and don’t acquire any locks while doing it”?
The way the .NET runtime follows this rule is that it only suspends threads for a very short time. After suspending, the thread is immediately resumed if it is not running managed code (it's in some random native library or a syscall). If the thread is running managed code, it is hijacked by replacing either the instruction pointer or the return address with the address of a function that will wait for the GC to finish. The thread is then immediately resumed. See the details here:
https://github.com/dotnet/runtime/blob/main/docs/design/core...
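The hijack step can be sketched with plain Win32 calls; this is a conceptual illustration for x64, not the actual CLR code, and gc_wait_trampoline is a hypothetical function:

    #include <windows.h>

    void gc_wait_trampoline(void);  /* hypothetical: parks until the GC is done */

    /* Conceptual only: redirect a briefly-suspended thread so that it
       runs the trampoline when resumed, then resume it immediately. */
    void hijack_thread(HANDLE thread) {
        SuspendThread(thread);

        CONTEXT ctx = {0};
        ctx.ContextFlags = CONTEXT_CONTROL;
        GetThreadContext(thread, &ctx);

        /* A real runtime saves the original Rip so the trampoline can
           return to it, and hijacks only where it can describe the
           thread's stack to the GC. */
        ctx.Rip = (DWORD64)(ULONG_PTR)gc_wait_trampoline;  /* x64 */
        SetThreadContext(thread, &ctx);

        ResumeThread(thread);
    }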
> Now imagine multiple of those runtimes running in the same process.
Can that possibly reliably work? Sounds messy.
Who are these customers that get developer support from Microsoft engineering teams?
It's expensive. Really expensive. I remember a major bank calling me and my buddy's 2-man consultancy team and telling me they had spent a small fortune on whatever the top-level access to MS developers is, to get some outdated MS COM component to interface with .NET, and MS had failed.
(We charged ~$20K and estimated two weeks. We had it working in two hours.)
I gotta ask, did you spend a week sucking your teeth after that, or did you hand it to them and say "hey, you're paying for expertise and we got it to you faster than we estimated"?
The correct way is to send the customer the almost-final version and wait for the bug report. This way you show how quickly you can tackle the problem but don't make the task look too easy.
Yeah, exactly a week. There was no way we could send it immediately, even though holding it back was ethically dubious.
This sounds pretty challenging. Could you please elaborate on your experience?
This was back in 2004 (?), so too long ago to remember the details. I remember the phone call though, because the chap that called us said he'd been told we never used the word "impossible."
I worked on a team that did. We had a monthly call with an MS rep and access to devs working on the platform features we were working on (for MS Teams specifically). It is probably more common than you think.
I remember being able to file support cases just by buying one for a couple of hundred dollars. They'd also promise that if it turned out to be a bug in the product the fee would be refunded.
(My case wasn't solved. It was something about variable delays in getting packets off the network and into userspace but we never got to the bottom of it).
I worked for a small shop that provided something MS couldn’t/wouldn’t, but which was essential for their international business anyway. So we too had engineering support.
On Linux you'd do this by sending a signal to the thread you want to analyze, and then the signal handler would take the stack trace and send it back to the watchdog.
The tricky part is ensuring that the signal handler code is async-signal-safe (which pretty much boils down to "ensure you're not acquiring any locks and be careful about reentrant code"), but at least that only has to be verified for a self-contained small function.
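A sketch of that watchdog pattern on Linux (C; the names are illustrative). One caveat is flagged in the comments: glibc's backtrace() is not formally async-signal-safe, so real implementations prime it once at startup or hand-roll the frame walk.

    #include <execinfo.h>
    #include <pthread.h>
    #include <semaphore.h>
    #include <signal.h>

    #define MAX_FRAMES 64
    static void *g_frames[MAX_FRAMES];
    static int   g_nframes;
    static sem_t g_done;  /* sem_post is async-signal-safe */

    static void trace_handler(int sig) {
        (void)sig;
        /* Caveat: glibc's backtrace() can lazily load libgcc on first
           use, which is not async-signal-safe. Call it once at startup
           to pre-load, or walk frame pointers by hand instead. */
        g_nframes = backtrace(g_frames, MAX_FRAMES);
        sem_post(&g_done);
    }

    /* Watchdog side: point SIGUSR1 at the handler, poke the stuck
       thread, then wait for it to publish its stack. */
    void capture_stack_of(pthread_t target) {
        struct sigaction sa = { .sa_handler = trace_handler };
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        sem_init(&g_done, 0, 0);
        pthread_kill(target, SIGUSR1);
        sem_wait(&g_done);  /* g_frames/g_nframes are now valid */
    }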
Is there anything similar to signals on Windows?
The closest thing is a special APC enqueued via QueueUserAPC2 [1], but that's relatively new functionality in user-mode.
[1] https://learn.microsoft.com/en-us/windows/win32/api/processt...
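Roughly what that looks like (C; assumes a recent Windows SDK and build, and a thread handle opened with appropriate access):

    #include <windows.h>

    /* Runs on the target thread. With the SPECIAL flag the kernel
       interrupts the thread even if it never enters an alertable wait. */
    static VOID CALLBACK apc_routine(ULONG_PTR param) {
        (void)param;
        /* e.g. capture a stack here with CaptureStackBackTrace */
    }

    BOOL poke_thread(HANDLE thread) {
        return QueueUserAPC2(apc_routine, thread, 0,
                             QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC);
    }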
The 2 implies an older API; its predecessor QueueUserAPC has been around since the XP days.
The older API is less like signals and more like cooperative scheduling, in that it waits for the target thread to be in an "alertable" state before it runs (i.e., the thread is executing an alertable sleep or wait).
> The 2 implies an older API; its predecessor QueueUserAPC has been around since the XP days.
I wasn’t implying that APCs were new, I was implying that the ability to enqueue special (as opposed to normal) APCs from user-mode is new. And of course, that has always been possible from kernel-mode with NT.
Or SetThreadContext() if you want to be hardcore. (not recommended)
Why not recommended? As far as things close to signals go, this is how you implement signals in user land on Windows (along with suspend/resume thread). You can even take locks later during the process, as long as you also took them before sending the signal (exactly the same restrictions as fork, actually; but unfortunately atfork hooks are not accessible, and in my experience with all the popular libcs they are often full of fork-unsafe data races and deadlock bugs themselves).
I’ve implemented them as you describe, but it’s still a bit hacky due to lots of corner cases — what if your target thread is currently executing in the kernel?
The special APC is nicer because the OS is then aware of what you're doing; it will perform the user-mode stack changes while transitioning back to user-mode and handle cleanup once the APC queue is drained.
I knew from seeing a title like that on microsoft.com that it was going to be a Raymond Chen post! He writes fascinating stuff.
I thought the same thing. It’s usually content that’s well outside my areas of familiarity, often even outside my areas of interest. But I usually find his writing interesting enough to read through anyway, and clear enough that I can usually follow it even without familiarity with the subject matter.
I had the same thought. I imagine the percentage of hacker news links to microsoft.com that are Raymond Chen links is high.
I had the same thought too. I wonder if this is his role at Microsoft now? Kind of a human institutional knowledge repository, plus a kind of brand ambassador to the developer community, plus mentor to younger engineers, plus chronicler.
I hope he keeps going, no doubt he could choose to finish up whenever he wants to.
Reminds me of a hang in the Settings UI that happened because it got stuck on an RPC call to some service.
Why was the service holding things up? Because it was waiting on acquiring a lock held by one of its other threads.
What was that other thread doing? It was deadlocked because it tried to recursively acquire an exclusive srwlock (exactly what the docs say will happen if you try).
Why was it even trying to reacquire said lock? Ultimately because of a buffer overrun that ended up overwriting some important structures.
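For reference, that middle step reproduces in a few lines, since SRW locks are documented as non-reentrant:

    #include <windows.h>

    int main(void) {
        SRWLOCK lock = SRWLOCK_INIT;
        AcquireSRWLockExclusive(&lock);
        /* Undefined behavior per the docs; in practice the thread
           blocks forever waiting on itself. */
        AcquireSRWLockExclusive(&lock);
        return 0;  /* never reached */
    }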
I had a support issue once at a well-known, big US defense firm. We consistently got hangs in kernel space triggered by normal user-level code. Crazy shit. I opened a support issue which eventually got closed because we used an old compiler. Fun times.
An old compiler that was…miscompiling the kernel? It's hard to imagine any other situation that would be a valid reason to close the bug.
Although I understand nothing from these posts, reading Raymond's posts somehow always "tranquils" my inner struggles.
Just curious, is this customer a game studio? I have never done any serious systems programming, but the gist feels like one.
I would guess it's something corporate. They can afford to pause the UI and ship debugging traces home more than a real-time game might.
Suspending threads is generally not that expensive, especially if you don't do it very often. Like, it's not free, and don't do it every frame, but even if it takes a millisecond (wildly overestimated), that's fine if you don't do it very often. Even if you're hitting a 120 Hz deadline.
Stack tracing, however, can be very costly. At least if you trace every thread.
(well, when I did it, it was in Python on Python threads, which I have to assume is a lot worse. Not sure about native threads.)
It's like a hundred dereferences for native threads; that's not very much.
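That estimate squares with a frame-pointer walk, which costs about two loads per frame. A sketch in C (GCC/Clang builtin; assumes the binary keeps frame pointers):

    #include <stddef.h>

    /* Classic frame record: saved caller fp, then return address. */
    struct frame { struct frame *next; void *ret; };

    /* Walk our own stack: roughly two dereferences per frame. */
    size_t capture(void **out, size_t max) {
        struct frame *fp = __builtin_frame_address(0);
        size_t n = 0;
        while (fp && n < max) {
            out[n++] = fp->ret;  /* load 1: return address */
            fp = fp->next;       /* load 2: caller's frame pointer */
        }
        return n;
    }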
I'd actually expect a customer-facing program more. Corporate software wouldn't care that the UI hung, you're getting paid to sit there and look at it.
> Corporate software wouldn't care that the UI hung, you're getting paid to sit there and look at it.
The article says the thread had been hung for 5 hours. And if you understand the root cause, once it entered the hung state, then absent some rather dramatic intervention (e.g. manually resuming the suspended UI thread), it would remain hung indefinitely.
The proper solution, as Raymond Chen notes, is to move the monitoring thread into a separate process; that would avoid this deadlock.
The banker trying to close a deal isn't paid by the hour.
Unless the user's boss complained to the programmer's boss
> If you want to suspend a thread and capture stacks from it, you’ll have to do it from another process, so that you don’t deadlock with the thread you suspended.
Unfortunately sometimes you don't have the luxury of being able to do this (e.g. on iOS, especially pre-MetricKit). We shipped one such implementation in the Twitter app (which was still there last I checked) and as far as I can tell it's safe, but mostly by accident: I didn't want to pause things for very long, so the code just suspends the thread, grabs register state, then writes the backtrace to a stack buffer before resuming. I originally wanted to grab traces without suspending the process, which is something you can actually "do" because getting register state doesn't require suspension and you need to put guards on your frame decoding anyway ("is this address I am about to dereference actually in the stack?"). But after thinking about it I added the suspension back, because trying to collect a trace from a running thread could give you a fragmented backtrace as the thread modifies its stack out from under you.
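A sketch of that suspend/capture/resume shape on Darwin (arm64, frame pointers assumed; an illustration of the approach, not the actual Twitter code):

    #include <mach/mach.h>
    #include <pthread.h>
    #include <stdint.h>

    #define MAX_FRAMES 64

    int capture(thread_act_t target, uintptr_t *out) {
        thread_suspend(target);  /* keep this window as short as possible */

        arm_thread_state64_t st;
        mach_msg_type_number_t cnt = ARM_THREAD_STATE64_COUNT;
        thread_get_state(target, ARM_THREAD_STATE64,
                         (thread_state_t)&st, &cnt);

        /* Stack bounds for the dereference guard below. */
        pthread_t pt = pthread_from_mach_thread_np(target);
        uintptr_t hi = (uintptr_t)pthread_get_stackaddr_np(pt);
        uintptr_t lo = hi - pthread_get_stacksize_np(pt);

        int n = 0;
        uintptr_t fp = (uintptr_t)st.__fp;  /* arm64 frame pointer */
        while (n < MAX_FRAMES && fp >= lo && fp + 16 <= hi) {
            /* "Is this address I'm about to dereference in the stack?" */
            out[n++] = ((uintptr_t *)fp)[1];  /* saved return address */
            fp = ((uintptr_t *)fp)[0];        /* caller's frame record */
        }

        thread_resume(target);
        return n;
    }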
Reminds me of a bug that would bluescreen Windows if I stopped Visual Studio debugging while it was in the middle of calling the native Ping from C#.
I've been able to get managed code to BSOD my machine by simply having a lot of thread instances that are aggressively communicating with each other (e.g., via Channel<T>). It's probably more of a hardware thing than a software thing. My Spotify fails to keep the audio buffer filled when I've got it fully saturated. I feel like the kernel occasionally panics when something doesn't resolve fast enough with regard to threads across core complexes.
I have the weirdest hunch that the customer in question was Valve :D
Such a clean breakdown. "Don’t suspend your own threads" should be tattooed on every Windows dev’s arm at this point
Looking at the title, at first I thought “uh?”, but then I saw microsoft and it made sense.
>Naturally, a suspended UI thread is going to manifest itself as a hang.
The correct terminology is 'stopped responding', Raymond. You need to consult the style guide.
Can this happen with Grand Central Dispatch?
This is a complicated question. If you "suspend" a GCD queue using the traditional APIs then it will happen between block execution, which is unlikely to cause problems, because people do not typically take locks between different items. But if you suspend the thread that backs the queue (using thread_suspend) you will definitely run into problems unless you're really careful.
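A sketch of the safe, queue-level variant (plain C via dispatch_async_f, so no blocks needed): dispatch_suspend only stops the queue from starting new items, so it always lands between items, never mid-item.

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void work(void *ctx)  { printf("item %ld\n", (long)ctx); }
    static void drain(void *ctx) { (void)ctx; }

    int main(void) {
        /* "com.example.q" is an arbitrary label for this illustration. */
        dispatch_queue_t q = dispatch_queue_create("com.example.q", NULL);

        dispatch_async_f(q, (void *)1, work);
        dispatch_suspend(q);                   /* between items, never mid-item */
        dispatch_async_f(q, (void *)2, work);  /* queued, but won't start yet */
        dispatch_resume(q);                    /* now item 2 can run */

        dispatch_sync_f(q, NULL, drain);       /* wait for the queue to empty */
        dispatch_release(q);
        return 0;
    }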
did... did you understand what the bug was?