Thursday, February 28, 2013

Signal Handlers Plus Locking Equals Evil Squared

Threads Are Evil


I'm not going to go into too much detail on the reasons why threads are evil. This topic has already been covered by more experienced authors than myself hereherehere, and perhaps most importantly here. I think I had better clarify that or I'm going to get nasty letters with inappropriate references to my mother. Threads are evil, but sometimes a necessary evil.

The basic problem with threading is one of race conditions on shared resources and non-atomic operations. If thread A begins an operation on resource R, and a context switch occurs while the operation is in progress, thread B can come in and step on its toes and corrupt resource R resulting in undefined and (usually) undesirable behavior.

There are techniques for preventing these kind of problems, which generally involve some kind of carefully-synchronized locking mechanism. Used incorrectly, these locks can lead to freezing deadlocks as each thread is waiting for the other to complete its work. Employed appropriately, these techniques can enable concurrent processing without interfering with the stability of the program.

A Story From The Real World


Join me in my sorrow as I relive the pain of recent events or, more likely, snicker and laugh at me as I run around in a panic trying to put out fires.

At my work, we run a live service application that has been deployed in production for many months. I won't go into too many details for privacy reasons, but suffice it to say that hundreds of devices connect to this system daily and expect it to work around the clock. It has had, as with all non-trivial software, its share of bugs. Most of them have been minor and reasonably straightforward to reproduce and fix. This one added several new twists.

This service, which had been running for some time now, suddenly stopped frozen and unresponsive. Anything short of the all-powerful kill -9 call was ineffective in terminating the deadlocked process. Fortunately, we have monitoring software in place that detects periods of inactivity and alerts us via email.

Browsing the server logs showed nothing out of the ordinary. The service was under reasonable load, but certainly not pushing the limits of the hardware. A few basic attempts to reproduce the issue came up fruitless. My next course of action was to create a new, massively active stress test. Even running this stress test failed to reproduce the issue. In the meantime, we saw the problem in the production a second time. It was clearly not just somebody's imagination.

The breakthrough moment came when I enhanced the stress test application to randomly break connections and ignore messages. After running the stress test for a few hours, I finally found a frozen process in the test environment. With a little further analysis, I quickly realized that the problem was in the service's signal-handling code.

The following C++ code snippet is a rough simplification of the code being used to process signals:

static std::queue< int > signalQueue;

static void enqueueSignal( int sigNum ) {

    signalQueue.push( sigNum );
    signal( sigNum );

}

int main( int argc, char** argv ) {

    signal( SIGTERM, enqueueSignal );
    signal( SIGCHLD, enqueueSignal );
    ...

    while( true ) {

        if( !signalQueue.empty() ) {

            int sigNum = signalQueue.front();
            processSignal( sigNum ); // Defined elsewhere
            signalQueue.pop();
        }

        ...

    }

}


Deep within the dark and disturbing mysteries of the standard template library lies a secret terror. That terror is designed to deal with the issues of threaded race conditions as I described above. Unfortunately, it can also occasionally ensnare a program that uses STL containers inside of a signal handler.

Signal handlers are special, as they are not run on a separate thread, but still can interrupt the main thread at any point in its processing. They can also interrupt themselves only to be popped off the stack at a later time. This can lead to deadlocks where the signal handler is waiting on a lock that will never be released by the code below it. This is clearly a case of evil squared.

A Way Out


The solution to this problem: use the sig_atomic_t variable type. This special variable type is intended especially for use inside of a signal handler. It is intended to guarantee an atomic global resource that can safely be updated by a signal handler.

Make sig_atomic_t your new best friend when writing C++ signal handler functions. This unfortunately means that you cannot implement particularly complex logic within a signal handler function. My recommendation is to implement a new signal queue as an array of sig_atomic_t values with a cycling index and size counter. These values can then be accessed by the main thread once the signal handler returns. From there you can do whatever complex processing may be necessary for your system.

Good luck with your ventures into the dangerous world of concurrency and signal processing.

Cheers,

Joshua Ganes

3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. My guess is that this was not an STL issue, but a malloc issue. You can't safely call malloc (or any functions that call malloc) in a signal handler, because if the program was interrupted in main() after a call to malloc, the signal handler COULD deadlock the lock that malloc uses to protect the data structures it uses for heap accounting.

    The man pages describe what functions are safe in signal handlers, and what are not. Readers should definitely read the manual.

    Lastly, a colleague of mine pointed out that using signalfd on Linux (or the kqueue equivalent method on BSD) is a far superior method to handling signals than using a shared atomic integer.

    The reason it is superior is that it allows you to process the signal while in "normal" excecution context rather than in "signal" context, where significant restrictions apply.

    ReplyDelete
  3. Yes, this was indeed a malloc() issue. The program was getting deadlocked on the malloc call waiting on a lock that would never be released.

    The man pages do cover this, and that's always a good place to start. I don't always buy the standard RTFM response, however. Man pages can be confusing and exceptionally technical, especially to people new to the topic. Those are the same people most likely to need to reference the man pages.

    ReplyDelete