Undo

Resolving C/C++ Concurrency Bugs More Efficiently with Time Travel Debugging

I’m happy to announce a guest post about Time Travel Debugging with UDB. At the end of the post, I have a bonus.


 

C and C++ developers requiring high performance often employ various concurrent computing techniques in which several threads of execution operate simultaneously.

Recent Modernes C++ blog posts explain how to avoid concurrency bugs in multithreaded C++ systems by using Latches, Semaphores, and Barriers and Atomic Smart Pointers.

These practices are important because if one does not pay careful attention to thread locking and memory access, there’s a risk of introducing new failure modes (e.g. nondeterministic behaviour and race conditions), which can make the application very difficult to debug.

In this article, we will take a closer look at concurrency bugs and explain how Time Travel Debugging (a.k.a. reverse debugging) can be used to resolve them more quickly.

What are concurrency bugs?

Concurrency bugs are nondeterministic defects that arise when the execution of a thread disrupts the behaviour of other threads running at the same time.

 


     

    Nondeterministic factors such as thread-switching or external data can affect the order or timing of thread execution, resulting in unpredictable application behaviour such as miscalculations, crashes, or hangs. Below are a few concurrency defect examples.

    Atomicity violation

    If Thread 2 runs immediately after Thread 1 passes the if test, Thread 1 then calls do_something with a null pointer and will typically crash with a memory access violation.

    Thread 1:

    if (foo->bar)

    {

        do_something(foo->bar);

    }

    Thread 2:

    foo->bar = nullptr;
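One way to close this window is to make the test and the use a single atomic unit. The following is a minimal sketch, not code from the original program: it assumes that foo->bar can be guarded by a mutex, and the Foo and Bar types, the mutex member m, and the helpers use_bar and clear_bar are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical types standing in for whatever foo and bar are in a real program.
struct Bar {
    int value = 0;
};

struct Foo {
    std::mutex m;        // assumption: every access to bar takes this mutex
    Bar* bar = nullptr;
};

int do_something(const Bar* bar) {
    assert(bar != nullptr);  // the invariant the racy version could violate
    return bar->value;
}

// Thread 1: the test and the use happen under one lock, so Thread 2
// cannot clear the pointer in between.
int use_bar(Foo& foo) {
    std::lock_guard<std::mutex> guard(foo.m);
    if (foo.bar) {
        return do_something(foo.bar);
    }
    return -1;  // bar was already cleared
}

// Thread 2: clearing the pointer takes the same lock.
void clear_bar(Foo& foo) {
    std::lock_guard<std::mutex> guard(foo.m);
    foo.bar = nullptr;
}
```

Holding the lock across the call keeps the sketch simple. If do_something is slow, a real program might prefer a different design, such as shared ownership of the pointed-to object.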

    Deadlock

    It is possible for Thread 1 to wait indefinitely for Thread 2 to unlock L2, while Thread 2 is waiting for Thread 1 to release L1.

    Thread 1:

    std::lock_guard<std::mutex> lock1(L1);

    std::lock_guard<std::mutex> lock2(L2);

    Thread 2:

    std::lock_guard<std::mutex> lock2(L2);

    std::lock_guard<std::mutex> lock1(L1);
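A common remedy is to acquire both mutexes in a single step. The sketch below assumes nothing beyond C++11: std::lock locks all of its arguments using a deadlock-avoidance algorithm, and std::adopt_lock hands the already-locked mutexes to lock guards so that they are released automatically. The counters and the function names are hypothetical and exist only to make the example observable.

```cpp
#include <mutex>
#include <thread>

std::mutex L1;
std::mutex L2;
int hits_thread1 = 0;  // hypothetical shared state, guarded by L1 and L2
int hits_thread2 = 0;

void thread1_step() {
    std::lock(L1, L2);  // acquires both without risking deadlock
    std::lock_guard<std::mutex> guard1(L1, std::adopt_lock);
    std::lock_guard<std::mutex> guard2(L2, std::adopt_lock);
    ++hits_thread1;
}

void thread2_step() {
    std::lock(L2, L1);  // opposite order is still safe with std::lock
    std::lock_guard<std::mutex> guard2(L2, std::adopt_lock);
    std::lock_guard<std::mutex> guard1(L1, std::adopt_lock);
    ++hits_thread2;
}

// Runs both steps concurrently and returns the total number of steps taken.
int run_lockers(int iterations) {
    hits_thread1 = 0;
    hits_thread2 = 0;
    std::thread t1([iterations] {
        for (int i = 0; i < iterations; ++i) thread1_step();
    });
    std::thread t2([iterations] {
        for (int i = 0; i < iterations; ++i) thread2_step();
    });
    t1.join();
    t2.join();
    return hits_thread1 + hits_thread2;
}
```

With a C++17 compiler, std::scoped_lock expresses the same idea in one line. Alternatively, agreeing on a fixed lock order across all threads also removes the deadlock.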

    Race condition

    A race condition is a type of software defect that occurs when separate threads interact in an unforeseen way and disrupt the expected timing and order of operations.

    An example is where two threads try to change shared data at the same time, leading to unpredictable system behaviour. That is, multiple threads are in a race and different threads might win the race depending on non-deterministic events.
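For the simplest case, a shared counter, one possible fix is to make each update an indivisible operation. This is a minimal sketch with a hypothetical function name, count_with_two_threads. With a plain int the two threads could overwrite each other's updates, whereas std::atomic guarantees that no increment is lost.

```cpp
#include <atomic>
#include <thread>

// Two threads each add 1 to a shared counter per_thread times.
int count_with_two_threads(int per_thread) {
    std::atomic<int> counter{0};
    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            counter.fetch_add(1);  // one indivisible read-modify-write
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```

An atomic only protects a single operation. When several steps must appear as one unit, a mutex around the whole sequence is usually the better tool.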

    The hard way to find and fix concurrency defects

    If you have a defective multithreaded program, your starting point is likely to be the program entering a confused state or just crashing. From this starting point, you might find yourself following the steps below to find the cause of a concurrency defect and correct it:

    1. Recreate the buggy behaviour
    2. Hypothesize a cause
    3. Log application state extensively to revise or validate the hypothesis
    4. Identify the data structure being affected by the concurrency defect
    5. Search code for the parts of the program that change the data structure (often painstaking!)
    6. Step through code and breakpoints to find the defect happening
    7. Correct the code

    These steps can take days or weeks. What’s worse is that, since concurrency defects can be so hard to reproduce, causes often go undiscovered and uncorrected.

    The easier way: Time Travel Debugging

    Brian Kernighan famously wrote:

    “Debugging involves backward reasoning, like solving murder mysteries. Something impossible occurred, and the only solid information is that it really did occur. So we must think backwards from the result to discover the reasons.”

    In other words, we really want the debugger to be able to tell us what happened in the past. You need to know what your program actually did, as opposed to what you expected it was going to do. This is why debugging typically involves reproducing the bug many times, slowly teasing out more and more information until you pin it down.

    Time Travel Debugging takes away all that guesswork and trial and error; the debugger can tell you directly what just happened.

    Time Travel Debugging is the ability of a debugger to stop after a failure in a program has been observed and to go back into the history of the execution to uncover the reason for the failure.

    For hard-to-reproduce concurrency bugs like race conditions, Time Travel Debugging allows you to start from the point of failure and step backwards to find the cause. This is a very different approach from the typical process of running and rerunning a program, again and again, adding logging logic as you go until you find the cause.

    Let’s look at the impact Time Travel Debugging has on debugging a race condition.

    You start by loading and running a program (race) in a debugger that supports time travel (such as UDB). In our example below, you can see the program crashes:

    ./udb race

    (udb) run

    s_threadfn: it=110000

    s_threadfn: it=70000

    race: race.cpp:34: void s_threadfn(): Assertion `s_value == old_value + a' failed.

    [New Thread 3441.3452]

    Program received signal SIGABRT, Aborted.

    [Switching to Thread 3441.3452]

    In our example, the error is the failed assertion:

    Assertion `s_value == old_value + a' failed

    We want to know why one thread expects one value but gets another. We then use reverse-finish until we arrive at a line in our program:

    (udb) rf

    79           raise (SIGABRT);

    (udb) rf

    92       abort ();

    (udb) rf

    101      __assert_fail_base (_("%s%s%s:%u: %s%sAssertion `%s' failed.\n%n"),

    (udb) rf

    34            assert( s_value == old_value + a);

    This brings us to the line where our program aborted, but it doesn’t tell us why. We know that when this line executed, s_value was not equal to old_value + a, so we want to find out what other part of the program altered the value. The few lines leading up to the assert add `a` to s_value, yet by the time we hit the assert the sum no longer holds, so another thread must have changed s_value in between. That points to a race condition.

    int a = s_random_int(5);

    s_value += a;

    assert( s_value == old_value + a);

    We find this out by putting a watchpoint on s_value:

    (udb) watch s_value

    Now we’re paused at the point before the program aborted and we know that the cause of the problem is s_value. Remember the approach before – wading through the code to find references to s_value. Contrast this with what we do next, which is to run the program in reverse (automatically, not manually) watching for where s_value is changed:

    (udb) reverse-continue

    Continuing.

    [New Thread 8792.8820]

    [Switching to Thread 8792.8820]

    Hardware watchpoint 1: s_value

    Old value = 236250

    New value = 236249

    0x0000000000400b86 in s_threadfn2 () at race.cpp:48

    48            s_value += 1;  /* Unsafe. */

    Now, the “Unsafe” comment isn’t going to be in a real system, but you get the idea. We have arrived at the line that causes the race condition: two different threads both modify s_value without any synchronization.

    (udb) list

    43         {

    44             if (it % (100) == 0)

    45             {

    46                 std::cout << __FUNCTION__ << ": it=" << it << "\n";

    47             }

    48             s_value += 1;   /* Unsafe. */

    49             usleep(10*1000);

    50         }

    And that’s it! We’ve found the offending line.
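Having found the two writers, one possible fix is to make each update of s_value, together with the check that depends on it, a critical section. This is only a sketch under assumptions: the full race.cpp is not shown in this article, so s_mutex and the functions checked_add, plain_increment, and run_both are hypothetical stand-ins for the loop bodies of s_threadfn and s_threadfn2.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

static int s_value = 0;
static std::mutex s_mutex;  // hypothetical: guards every access to s_value

// Stand-in for one iteration of s_threadfn.
void checked_add(int a) {
    std::lock_guard<std::mutex> guard(s_mutex);
    int old_value = s_value;
    s_value += a;
    assert(s_value == old_value + a);  // holds: no other writer can interleave
}

// Stand-in for one iteration of s_threadfn2.
void plain_increment() {
    std::lock_guard<std::mutex> guard(s_mutex);
    s_value += 1;
}

// Runs both kinds of update concurrently and returns the final value.
int run_both(int iterations) {
    s_value = 0;
    std::thread t1([iterations] {
        for (int i = 0; i < iterations; ++i) checked_add(2);
    });
    std::thread t2([iterations] {
        for (int i = 0; i < iterations; ++i) plain_increment();
    });
    t1.join();
    t2.join();
    return s_value;
}
```

Because the read of old_value, the addition, and the assertion all happen while the lock is held, the increment from the other thread can only land before or after that sequence, never in the middle of it.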

    Schroedinger effect

    Of course, recording a program necessarily affects its timing; ultimately it’s impossible to avoid the Schroedinger effect. In practice, though, recording is just as likely to make a race condition occur more often as it is to make it occur less often. Some recording systems have features that deliberately cause more thread switching, which makes race conditions even more likely to show up while recording (e.g. LiveRecorder’s Thread Fuzzing and rr’s Chaos Mode).

    Summary

    As application architectures become more complex, your debugging technology has to keep pace. You can try Time Travel Debugging with UDB for free. A number of other free and commercial debuggers also feature Time Travel Debugging.

    Bonus

    I’m happy to announce that when you mention my name, Rainer Grimm, while buying a UDB license, you get a 30% discount.

