C and C++ developers requiring high performance often employ various concurrent computing techniques in which several threads of execution operate simultaneously.
Recent Modernes C++ blog posts explain how to avoid concurrency bugs in multithreaded C++ systems by using latches, semaphores, barriers, and atomic smart pointers.
These practices are important because if one does not pay careful attention to thread locking and memory access, there's a risk of introducing new failure modes (e.g. nondeterministic behavior, race conditions, etc.), which can make the application very difficult to debug.
In this article, we will take a closer look at concurrency bugs and explain how Time Travel Debugging (a.k.a. reverse debugging) can be used to resolve them more quickly.
What are concurrency bugs?
Concurrency bugs are nondeterministic defects that arise when the execution of a thread disrupts the behavior of other threads running at the same time.
Nondeterministic factors such as thread-switching or external data can affect the order or timing of thread execution, resulting in unpredictable application behavior such as miscalculations, crashes, or hangs. Below are a few concurrency defect examples.
If the execution of Thread 1 below is interrupted by Thread 2 immediately after passing the if test, then Thread 1 will crash with a memory access violation.
foo->bar = nullptr;
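One common remedy for this kind of check-then-use bug is to make the test and the dereference a single critical section. The following is only a minimal sketch under assumptions: the types `Foo` and `Bar`, the member `value`, the mutex `m`, and the helper functions are hypothetical names chosen for illustration; only `foo->bar` comes from the example above.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical types for illustration only.
struct Bar { int value = 42; };

struct Foo {
    Bar* bar = nullptr;  // pointer that Thread 2 may reset at any time
    std::mutex m;        // assumption: one mutex guards every access to `bar`
};

// Thread 1: the if test and the dereference run under one lock, so `bar`
// cannot be cleared between the check and the use.
int read_bar(Foo* foo) {
    std::lock_guard<std::mutex> guard(foo->m);
    if (foo->bar != nullptr) {
        return foo->bar->value;
    }
    return -1;  // sentinel: `bar` had already been cleared
}

// Thread 2: clears the pointer under the same lock.
void clear_bar(Foo* foo) {
    std::lock_guard<std::mutex> guard(foo->m);
    foo->bar = nullptr;
}

// Runs both threads once and returns what Thread 1 observed.
int run_atomicity_demo() {
    Bar bar;
    Foo foo;
    foo.bar = &bar;
    int seen = 0;
    std::thread t1([&] { seen = read_bar(&foo); });
    std::thread t2([&] { clear_bar(&foo); });
    t1.join();
    t2.join();
    return seen;
}
```

Whichever thread wins, Thread 1 either sees a valid `bar` or sees that it is already null; it never dereferences a pointer that was cleared halfway through.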
It is possible for Thread 1 to wait indefinitely for Thread 2 to unlock L2, while Thread 2 is waiting for Thread 1 to release L1.
std::lock_guard<std::mutex> lock1(L1); std::lock_guard<std::mutex> lock2(L2); // Thread 1
std::lock_guard<std::mutex> lock2(L2); std::lock_guard<std::mutex> lock1(L1); // Thread 2
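A typical way to avoid this lock-ordering deadlock is to acquire both mutexes in one step with `std::lock`, which uses a deadlock-avoidance algorithm regardless of the order in which the mutexes are named. The sketch below assumes a hypothetical shared counter `shared_total` and hypothetical function names; only `L1` and `L2` come from the example above.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex L1;
std::mutex L2;
int shared_total = 0;  // hypothetical data protected by both mutexes

// Thread 1 names the mutexes in one order...
void thread1_work() {
    for (int i = 0; i < 1000; ++i) {
        std::lock(L1, L2);  // acquires both without risking deadlock
        std::lock_guard<std::mutex> g1(L1, std::adopt_lock);
        std::lock_guard<std::mutex> g2(L2, std::adopt_lock);
        ++shared_total;
    }
}

// ...and Thread 2 names them in the opposite order.
void thread2_work() {
    for (int i = 0; i < 1000; ++i) {
        std::lock(L2, L1);
        std::lock_guard<std::mutex> g2(L2, std::adopt_lock);
        std::lock_guard<std::mutex> g1(L1, std::adopt_lock);
        ++shared_total;
    }
}

// Runs both threads to completion and returns the final counter value.
int run_deadlock_free_demo() {
    shared_total = 0;
    std::thread t1(thread1_work);
    std::thread t2(thread2_work);
    t1.join();
    t2.join();
    return shared_total;
}
```

With C++17, `std::scoped_lock both(L1, L2);` expresses the same idea in a single line. Agreeing on one global lock order is another option when the locks cannot be taken together.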
A race condition is a type of software defect that occurs when separate threads interact in an unforeseen way and disrupt the expected timing and ordering of operations.
An example is where two threads try to change shared data at the same time, leading to unpredictable system behavior. That is, multiple threads are in a race and different threads might win the race depending on nondeterministic events.
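To make the shared-data case concrete, here is a small sketch in which two threads increment the same counter. The function name `count_with_atomic` is hypothetical. With a plain `int`, the two read-modify-write sequences could interleave and updates could be lost (and the program would formally have a data race); with `std::atomic<int>`, each increment is indivisible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two threads each add 1 to a shared counter `iterations` times.
// std::atomic makes every increment a single indivisible read-modify-write,
// so no update can be lost to the other thread.
int count_with_atomic(int iterations) {
    std::atomic<int> counter{0};
    auto work = [&counter, iterations] {
        for (int i = 0; i < iterations; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```

Calling `count_with_atomic(100000)` returns 200000 on every run, whereas a non-atomic counter would typically come up short by a different amount each time.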
The hard way to find and fix concurrency defects
If you have a defective multithreaded program, your starting point is likely to be the program entering a confused state or just crashing. From this starting point, you might find yourself following the steps below to find the cause of a concurrency defect and correct it:
- Recreate the buggy behavior
- Hypothesize a cause
- Log application state extensively to revise or validate the hypothesis
- Identify the data structure being affected by the concurrency defect
- Search code for the parts of the program that change the data structure (often painstaking!)
- Step through code and breakpoints to find the defect happening
- Correct the code
These steps can take days or weeks. What's worse is that, since concurrency defects can be so hard to reproduce, causes often go undiscovered and uncorrected.
The easier way: Time Travel Debugging
Brian Kernighan famously wrote:
"Debugging involves backward reasoning, like solving murder mysteries. Something impossible occurred, and the only solid information is that it really did occur. So we must think backward from the result to discover the reasons."
In other words, we really want the debugger to be able to tell us what happened in the past. We need to know what the program actually did, as opposed to what we expected it to do. This is why debugging typically involves reproducing the bug many times, slowly teasing out more and more information until we pin it down.
Time Travel Debugging takes away all that guesswork and trial and error; the debugger can tell you directly what just happened.
Time Travel Debugging is the ability of a debugger to stop after a failure in a program has been observed and to go back into the history of the execution to uncover the reason for the failure.
For hard-to-reproduce concurrency bugs like race conditions, Time Travel Debugging allows you to start from the point of failure and step backward to find the cause. This is a very different approach from the typical process of running and rerunning a program again and again, adding logging logic as you go until you find the cause.
Let's look at the impact Time Travel Debugging has on debugging a race condition.
You start by loading and running a program (race) in a debugger that supports time travel (such as UDB). In our example below, you can see the program crashes:
race: race.cpp:34: void s_threadfn(): Assertion `s_value == old_value + a' failed.
[New Thread 3441.3452]
Program received signal SIGABRT, Aborted.
[Switching to Thread 3441.3452]
In our example, the error is all about the
Assertion `s_value == old_value + a' failed
We want to know why one thread expects one value but gets another. We then use reverse-finish until we arrive at a line in our program:
79 raise (SIGABRT);
92 abort ();
101 __assert_fail_base (_("%s%s%s:%u: %s%sAssertion `%s' failed.\n%n"),
34 assert( s_value == old_value + a);
This brings us to the line where our program aborted, but it doesn't tell us why. We know that when this line executed, s_value was not equal to old_value + a. The few lines leading up to the assert add `a` to s_value, yet by the time we hit the assert the sum no longer holds, so another thread must have changed s_value in between. In other words, this looks like a race condition, and we want to find out which other part of the program altered the value.
int a = s_random_int(5);
s_value += a;
assert( s_value == old_value + a);
We find this out by putting a watchpoint on s_value:
(udb) watch s_value
Now we're paused at the point just before the program aborted, and we know that the problem involves s_value. Remember the earlier approach: wading through the code to find references to s_value. Contrast this with what we do next, which is to run the program in reverse (automatically, not manually), watching for where s_value is changed:
[New Thread 8792.8820]
[Switching to Thread 8792.8820]
Hardware watchpoint 1: s_value
Old value = 236250
New value = 236249
0x0000000000400b86 in s_threadfn2 () at race.cpp:48
48 s_value += 1; /* Unsafe. */
Now, the "Unsafe" comment isn't going to be in a real system, but you get the idea. We have arrived at the line that causes the race condition: two different threads are both modifying s_value without any synchronization.
44 if (it % (100) == 0)
46 std::cout << __FUNCTION__ << ": it=" << it << "\n";
48 s_value += 1; /* Unsafe. */
And that's it! We've found the offending line.
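Once the offending line is known, one possible fix is to guard s_value with a mutex. The sketch below is not the original race.cpp; it is a hedged reconstruction under assumptions. The mutex `s_mutex`, the function names, and the stand-in for `s_random_int(5)` are hypothetical. Note that making `s_value` atomic would not be enough here, because the assertion needs the read, the update, and the check to happen as one unit.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

int s_value = 0;
std::mutex s_mutex;  // hypothetical: guards every access to s_value

// Modeled on s_threadfn: the lock is held across the read, the update and
// the check, so no other thread can modify s_value in between.
void checked_add(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(s_mutex);
        int old_value = s_value;
        int a = i % 5;  // stand-in for s_random_int(5)
        s_value += a;
        assert(s_value == old_value + a);
    }
}

// Modeled on s_threadfn2: the increment now takes the same lock.
void plain_increment(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(s_mutex);
        s_value += 1;
    }
}

// Runs both threads to completion and returns the final value of s_value.
int run_fixed_race(int iterations) {
    s_value = 0;
    std::thread t1(checked_add, iterations);
    std::thread t2(plain_increment, iterations);
    t1.join();
    t2.join();
    return s_value;
}
```

With both threads taking `s_mutex`, the assertion holds on every iteration, because the other thread can only modify s_value before or after the guarded sequence, never in the middle of it.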
Of course, recording necessarily affects timings; ultimately, it's impossible to avoid the Schrödinger Effect. But in practice, recording is just as likely to make a race condition occur more often as it is to make it occur less often. Some recording systems even have features that deliberately cause more thread switching, making race conditions more likely to appear while recording (e.g. LiveRecorder's Thread Fuzzing and rr's Chaos Mode).
As application architectures become more complex, your debugging technology has to keep pace. You can try Time Travel Debugging with UDB for free. And there are a number of free and commercial debuggers that feature Time Travel Debugging listed here.
I'm happy to announce that when you mention my name, Rainer Grimm, while buying a UDB license, you get a 30% discount.